<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4030">
  <Title>Nearly-Automated Metadata Hierarchy Creation</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There has been surprisingly little work on precisely the problem that we tackle in this paper. The literature on automated text categorization is enormous, but assumes that a set of categories has already been created, whereas the problem here is to determine the categories of interest.</Paragraph>
    <Paragraph position="1"> There has also been extensive work on finding synonymous terms and word associations, as well as automatic acquisition of IS-A (or genus-head) relations from dictionary definitions and glosses (Klavans and Whitman, 2001) and from free text (Hearst, 1992; Caraballo, 1999).</Paragraph>
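The acquisition of IS-A relations from free text cited above is typically done with lexico-syntactic patterns. The sketch below illustrates just one such pattern ("NP such as NP, NP and NP") on pre-tokenized text with spaces around punctuation; it is a rough single-pattern approximation for illustration, not Hearst's full pattern set, and the regex and function names are ours.

```python
import re

# One Hearst-style pattern: "<hypernym> such as <hyponym> (, <hyponym>)* (and|or <hyponym>)?".
# Assumes pre-tokenized input (spaces around commas) and single-word noun phrases.
PATTERN = re.compile(
    r"(\w+) (?:, )?such as (\w+(?: , \w+)*(?: (?:and|or) \w+)?)"
)

def isa_pairs(sentence):
    """Extract (hyponym, hypernym) pairs matching the pattern, if any."""
    m = PATTERN.search(sentence)
    if not m:
        return []
    hypernym = m.group(1)
    hyponyms = re.split(r" , | and | or ", m.group(2))
    return [(h, hypernym) for h in hyponyms]

print(isa_pairs("injuries such as bruises , wounds and fractures"))
```

Multiword hyponyms ("broken bones") would need noun-phrase chunking rather than `\w+`, which is one reason pattern-based harvesters are usually run over parsed or chunked text.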
    <Paragraph position="2"> Sanderson and Croft (1999) propose a method called subsumption for building a hierarchy for a set of documents retrieved for a query. For two terms x and y, x is said to subsume y if the following conditions hold: P(x|y) &gt;= 0.8 and P(y|x) &lt; 1. The evaluation consisted of asking people to define the relation that holds between the pairs of words shown; only 23% of the pairs were found to hold a parent-child relation, while 49% fell into a more general related-to category. For a set of medical texts, the top level consisted of the terms disease, post polio, serious disease, dengue, infection control, immunology, etc. This kind of listing is not systematic enough to appear on a navigation page for a website.</Paragraph>
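The subsumption test can be estimated directly from document co-occurrence counts. A minimal sketch, assuming each term is represented by the set of documents it occurs in (the function and variable names are ours):

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """Return True if term x subsumes term y under the Sanderson-Croft test:
    P(x|y) >= threshold and P(y|x) < 1, with the conditional probabilities
    estimated as fractions of co-occurring documents.

    docs_x, docs_y: sets of IDs of the documents containing each term.
    """
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)  # fraction of y's documents that also contain x
    p_y_given_x = both / len(docs_x)  # fraction of x's documents that also contain y
    return p_x_given_y >= threshold and p_y_given_x < 1.0

# Toy example: "disease" occurs in five documents, "dengue" in a subset of two,
# so "disease" subsumes "dengue" but not vice versa.
disease = {1, 2, 3, 4, 5}
dengue = {1, 2}
print(subsumes(disease, dengue), subsumes(dengue, disease))
```

The `p_y_given_x < 1` condition keeps a term from subsuming another term with an identical document set, so the relation stays asymmetric.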
    <Paragraph position="3"> Lawrie et al. (2001) use language models to produce summaries of text collections. The results are also associational; for example, the top level for a query on &quot;Abuses of Email&quot; consists of abuses, human, States Act, and Nursing Home Abuses, and the second level under abuses is e-mail, send, Money, Fax, account, address, Internet, etc. These again are too scattered to be appropriate for a human-readable index into a document collection.</Paragraph>
    <Paragraph position="4"> Hofmann (1999) uses probabilistic document clustering to impose topic hierarchies. For a collection of articles from the journal Machine Learning, the top level cluster is labeled learn, paper, base, model, new, train and the second level clusters are labeled process, experi, knowledge, develop, inform, design and algorithm, function, present, result, problem, model. We would prefer something more like the ACM classification hierarchy.</Paragraph>
    <Paragraph position="5"> The Word Space algorithm (Schütze, 1993) uses linear regression on term co-occurrence statistics to create groups of semantically related words. For every word, a context vector is computed for every position at which it occurs in the text. A vector is defined as the sum of all four-grams in a window of 1001 four-grams centered around the word. Cosine distance is used to compute the similarity between word vectors.</Paragraph>
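The core comparison step can be sketched as follows. For simplicity this uses neighbouring tokens and a tiny window as the context features rather than Schütze's 1001-four-gram windows; the window size, feature choice, and function names are simplifications of ours.

```python
import math
from collections import Counter

def context_vector(tokens, position, window=2):
    """Sum of context-feature counts around one occurrence of a word.
    Stand-in features: the neighbouring tokens within the window."""
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    return Counter(tokens[lo:position] + tokens[position + 1:hi])

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)  # Counter returns 0 for missing keys
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokens = "the bank approved the loan at the bank branch".split()
v1 = context_vector(tokens, 1)  # first occurrence of "bank"
v2 = context_vector(tokens, 7)  # second occurrence of "bank"
print(round(cosine(v1, v2), 3))
```

Aggregating such per-occurrence vectors over a corpus (and reducing their dimensionality) is what lets Word Space group semantically related words.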
    <Paragraph position="6"> Probably the closest work to that described here is the SONIA system (Sahami et al., 1998) which used a combination of unsupervised and supervised methods to organize a set of documents. The unsupervised method (document clustering) imposes an initial organization on a personal information collection which the user can then modify. The resulting organization is then used to train a supervised text categorization algorithm which automatically classifies new documents.</Paragraph>
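The two-stage organization can be sketched in miniature: an unsupervised pass groups the initial collection, and the resulting groups then serve as training labels for routing new documents. Both stages below (bag-of-words, greedy single-pass clustering, nearest-centroid classification) are deliberately simple stand-ins of ours, not SONIA's actual algorithms.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words count vector for a document."""
    return Counter(text.lower().split())

def cos(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(docs, threshold=0.2):
    """Unsupervised stage: a document joins the first cluster whose
    centroid it resembles, otherwise it starts a new cluster."""
    clusters = []  # list of (centroid Counter, member list)
    for d in docs:
        v = bow(d)
        for centroid, members in clusters:
            if cos(v, centroid) >= threshold:
                centroid.update(v)
                members.append(d)
                break
        else:
            clusters.append((Counter(v), [d]))
    return clusters

def classify(new_doc, clusters):
    """Supervised stage: route a new document to the closest cluster."""
    v = bow(new_doc)
    scores = [cos(v, centroid) for centroid, _ in clusters]
    return scores.index(max(scores))

docs = [
    "stock market trading prices",
    "market prices fall on trading floor",
    "genome dna sequencing research",
    "dna research on the human genome",
]
groups = cluster(docs)
print(len(groups), classify("genome sequencing breakthrough", groups))
```

In SONIA the user can edit the clusters before the supervised stage is trained, which is the step this toy pipeline omits.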
  </Section>
</Paper>