<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4030">
  <Title>Nearly-Automated Metadata Hierarchy Creation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Method
</SectionTitle>
    <Paragraph position="0"> WordNet is a manually built lexical system where words are organized into synonym sets (synsets) linked by different relations (Fellbaum, 1998). It can be viewed as a huge graph, where the synsets are the nodes and the relations are the links. Our algorithm for converting it into metadata categories for information organization and browsing consists of the following steps:  1. Select representative words from the collection.</Paragraph>
    <Paragraph position="1"> 2. Get the WordNet hypernym paths for one sense of each selected word.</Paragraph>
    <Paragraph position="2"> 3. Build a tree from the hypernym paths.</Paragraph>
    <Paragraph position="3"> 4. Compress the tree.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Select Representative Words
</SectionTitle>
      <Paragraph position="0"> To make the hierarchy size manageable, we select only a subset of the words that are intended to best reflect the topics covered in the documents (although in principle the method can be used on all of the words in the collection).</Paragraph>
      <Paragraph position="1"> The criterion for choosing the target words is information gain (Mitchell, 1997). Define the set W to be all the unique words in the document set D. Let the distribution of a word w be the number of documents in D that the word occurs in. Initially, the words in W are ordered according to their distribution in the entire collection D.</Paragraph>
      <Paragraph position="2"> At each iteration, the highest-scoring word w is added to an initially-empty set S and removed from W, and the documents covered by w are removed from D. The process repeats until no more documents are left in D.</Paragraph>
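      <Paragraph position="3"> The greedy loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score of a word is the number of still-uncovered documents containing it, recomputed at every iteration, and that ties are broken alphabetically. The function name select_representative_words and the input layout (a dict from document id to a set of words) are hypothetical.

```python
from collections import defaultdict


def select_representative_words(docs):
    """Greedily pick words until every document is covered.

    docs maps a document id to the set of words in that document.
    Assumption: a word's score is its document frequency among the
    documents that are still uncovered.
    """
    remaining = {doc_id: set(words) for doc_id, words in docs.items()}
    selected = []
    while remaining:
        # Map each word to the uncovered documents it occurs in.
        coverage = defaultdict(set)
        for doc_id, words in remaining.items():
            for word in words:
                coverage[word].add(doc_id)
        if not coverage:
            break  # only empty documents remain; no word can cover them
        # Highest-scoring word; alphabetical tie-break keeps runs repeatable.
        best = min(coverage, key=lambda w: (-len(coverage[w]), w))
        selected.append(best)
        # Remove the documents covered by the chosen word.
        for doc_id in coverage[best]:
            del remaining[doc_id]
    return selected
```
      </Paragraph>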
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Get Hypernym Paths
</SectionTitle>
      <Paragraph position="0"> For every word in S, we obtain the hypernym path of the word from WordNet. In the current implementation, we take the hypernym for the first sense of the word only, which is usually the most general. (In the future, we plan to explore how to disambiguate between senses based on the context in which the word appears in the document; see Discussion.) Figures 1(a) and 1(b) show the hypernym paths for words red and blue. [Figure 1 caption, partially recovered: ... paths of words red and blue, (d) the uncompacted tree for words red, blue and green, (e) the path after eliminating parents with fewer than two children, and (f) after eliminating children with name included in parent's name.]</Paragraph>
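      <Paragraph position="1"> As a rough illustration of what a hypernym path is, the sketch below walks parent links upward from a word. It assumes a hand-written child-to-parent map (TOY_HYPERNYMS, hypothetical) in place of a real WordNet lookup, so the chain it returns is toy data rather than actual WordNet output.

```python
# Toy child-to-parent map standing in for WordNet (illustrative data only).
TOY_HYPERNYMS = {
    "red": "chromatic color",
    "blue": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
}


def hypernym_path(word, parent_of):
    """Follow parent links upward and return the path from the top down.

    Assumption: parent_of contains no cycles and each node has one parent,
    mirroring the choice of a single sense per word.
    """
    path = [word]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    path.reverse()  # most general category first, the word itself last
    return path
```
      </Paragraph>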
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Build the Tree
</SectionTitle>
      <Paragraph position="0"> Next we take the union of the hypernym paths of all words in set S, obtaining a tree, as shown in Figure 1(c).</Paragraph>
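      <Paragraph position="1"> Taking the union of paths amounts to merging shared prefixes. A minimal sketch, assuming each path is a list of node names ordered from the most general category down to the word, and that the tree is stored as nested dicts (both representation choices are assumptions made here for illustration):

```python
def build_tree(paths):
    """Merge top-down paths into a nested dict of the form {name: {child: ...}}.

    Paths that share a prefix share the corresponding nodes, so the
    result is the union of all paths.
    """
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the existing child if present, otherwise create it.
            node = node.setdefault(name, {})
    return tree
```
      </Paragraph>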
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Compress the Tree
</SectionTitle>
      <Paragraph position="0"> The hypernym path length varies widely in WordNet, so we compress the tree using three rules:  1. Eliminate selected top-level (very general) categories, like abstraction, entity.</Paragraph>
      <Paragraph position="1"> 2. Starting from the leaves, eliminate a parent that has fewer than n children, unless the parent is the root. 3. Eliminate a child whose name appears within the parent's.  For example, consider the tree in Figure 1(d) and assume that n = 2 (eliminate parents that have fewer than two children). Starting from the leaves, by applying Rule 2, nodes red, redness, blue, blueness, and green, greenness, are eliminated since they have only one child. Figure 1(e) shows the resulting tree. Next, by applying Rule 3, node chromatic color is eliminated, since it contains the word color which also appears in the name of its parent. The final tree presented in Figure 1(f) produces a structure that is likely to be a good level of description for an information architecture.</Paragraph>
      <Paragraph position="2"> Mihalcea and Moldovan (2001) describe a sophisticated method for simplifying WordNet, focusing on combining synsets with very similar meanings or dropping rarely used synsets. Their rules include what we define above as Rule 3. However, they focus on simplifying WordNet in general, rather than tailoring it to a specific collection, and focus on NLP applications that are likely to make use of every sense of a WordNet word. Nevertheless, it may be useful to explore using their simplified version of WordNet in future.</Paragraph>
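      <Paragraph position="3"> Rules 2 and 3 can be sketched on a nested-dict tree as shown below. Several points are assumptions rather than details given in the text: an eliminated node's children are promoted to its parent, leaves (the selected words) are never eliminated, and Rule 3 is read as in the worked example, that is, a child is dropped when the parent's name occurs as a whole word in the child's name. The function names are hypothetical, and the caller is expected to keep the root itself.

```python
def apply_rule_2(children, n=2):
    """Rule 2 on the subtrees below a node; every node here is a non-root.

    children is a nested dict {name: {child: ...}}. Subtrees are processed
    first, so elimination proceeds from the leaves upward.
    """
    result = {}
    for name, sub in children.items():
        sub = apply_rule_2(sub, n)
        if sub and n > len(sub):
            result.update(sub)  # too few children: promote them, drop the node
        else:
            result[name] = sub  # leaf, or a parent with at least n children
    return result


def apply_rule_3(parent_name, children):
    """Rule 3: drop a non-leaf child whose name contains the parent's name."""
    result = {}
    for child, sub in children.items():
        if sub and parent_name in child.split():
            # Dropped child: its children are re-checked against the same parent.
            result.update(apply_rule_3(parent_name, sub))
        else:
            result[child] = apply_rule_3(child, sub)
    return result
```

      On a toy tree rooted at color, with chromatic color below it and one single-child node per word, applying Rule 2 and then Rule 3 to the root's children leaves red, blue and green directly under color.</Paragraph>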
    </Section>
  </Section>
</Paper>