<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1401">
<Title>Disambiguating Noun Compounds with Latent Semantic Indexing</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Previous Work </SectionTitle>
<Paragraph position="0"> The majority of corpus statistical approaches to compound disambiguation use a variation of what Lauer (1995) refers to as the adjacency algorithm. This algorithm was originally proposed by Marcus (1980), and essentially operates by comparing the acceptability of immediately adjacent noun pairs. Specifically, given a sequence of three nouns n1 n2 n3, if</Paragraph>
<Paragraph position="1"> (n2 n3) is more acceptable than (n1 n2), then build (n1 (n2 n3)); else build ((n1 n2) n3).</Paragraph>
<Paragraph position="2"> There remains the question of how "acceptability" is to be determined computationally. Several researchers (e.g., Barker, 1998; Evans and Zhai, 1996; Pohlmann and Kraaij, 1997; Pustejovsky et al., 1993; Strzalkowski and Vauthey, 1992) collect statistics on the occurrence frequency of structurally unambiguous two-noun compounds to inform the analysis of the ambiguous compound. For example, given the compound "computer data bases", the structure (computer (data bases)) would be preferred if (data bases) occurred more frequently than (computer data) in the corpus. However, by assuming that sufficient examples of sub-components exist in the training corpus, all the above approaches risk falling foul of the sparse data problem. Most noun-noun compounds are rare, and statistics based on such infrequent events may lead to an unreliable estimation of the acceptability of particular modifier-head pairs.</Paragraph>
<Paragraph position="3"> The work of Resnik (1993) goes some way towards alleviating this problem. Rather than collecting statistics on individual words, he instead counts co-occurrences of concepts (as represented by WordNet synsets). He uses these statistics to derive a measure, motivated by information theory, called selectional association (see Resnik (1993) for full details). "Acceptability" in the adjacency algorithm is then measured in terms of the selectional association between a modifier and head. Selectional associations were calculated by training on approximately 15,000 noun-noun compounds from the Wall Street Journal corpus in the Penn Treebank. On a sample of 156 three-noun compounds drawn from the corpus, Resnik's method achieved 72.6% disambiguation accuracy.</Paragraph>
<Paragraph position="4"> Lauer (1995) similarly generalises from individual nouns to semantic classes or concepts; however, his classes are derived from semantic categories in Roget's Thesaurus. Like the approaches discussed above, Lauer extracts a training set of approximately 35,000 unambiguous noun-noun modifier-head compounds to estimate the degree of association between Roget categories. He calls this measure conceptual association, and uses it to calculate the acceptability of noun pairs for the disambiguation of three-noun compounds. However, his approach differs from most others in that he does not use the adjacency algorithm, instead using a dependency algorithm which operates as follows: given a three-noun compound n1 n2 n3, if (n1 n3) is more acceptable than (n1 n2), then build (n1 (n2 n3)); else build ((n1 n2) n3). Lauer tested both the dependency and adjacency algorithms on a set of 244 three-noun compounds extracted from Grolier's Encyclopedia and found that the dependency algorithm consistently outperformed the adjacency algorithm, achieving a maximum of 81% accuracy on the task. Overall, he found that estimating the parameters of his probabilistic model from the distribution of concepts rather than that of individual nouns resulted in superior performance, thus providing further evidence of the effectiveness of conceptual association in noun compound disambiguation.</Paragraph>
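To make the two bracketing procedures concrete, the following Python sketch (not taken from the paper; the function names and frequency counts are invented for illustration) implements both the adjacency and dependency algorithms over a toy table of modifier-head frequencies, using raw pair frequency as a stand-in for acceptability. Resnik and Lauer would instead substitute selectional association and conceptual association scores, respectively.

from collections import Counter

# Hypothetical modifier-head frequencies harvested from structurally
# unambiguous two-noun compounds in a training corpus; the numbers
# are invented purely for demonstration.
pair_freq = Counter({
    ("data", "bases"): 12,
    ("computer", "data"): 3,
    ("computer", "bases"): 1,
})

def acceptability(modifier, head):
    # Simplest possible measure: raw corpus frequency of the pair.
    # Resnik (selectional association) and Lauer (conceptual association)
    # replace this with concept-level statistics.
    return pair_freq[(modifier, head)]

def adjacency(n1, n2, n3):
    # Adjacency algorithm (Marcus, 1980): compare the two adjacent pairs.
    if acceptability(n2, n3) > acceptability(n1, n2):
        return (n1, (n2, n3))    # right-branching: (n1 (n2 n3))
    return ((n1, n2), n3)        # left-branching:  ((n1 n2) n3)

def dependency(n1, n2, n3):
    # Dependency algorithm (Lauer, 1995): compare the two pairs in which
    # n1 is the modifier.
    if acceptability(n1, n3) > acceptability(n1, n2):
        return (n1, (n2, n3))
    return ((n1, n2), n3)

print(adjacency("computer", "data", "bases"))
# -> ('computer', ('data', 'bases')), since (data bases) outscores (computer data)
print(dependency("computer", "data", "bases"))
# -> (('computer', 'data'), 'bases'), since (computer bases) does not outscore
#    (computer data) under these invented counts; the two algorithms consult
#    different pairs and can disagree on the same statistics
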
<Paragraph position="5"> All these approaches rely on a variation of finding subconstituents elsewhere in the corpus and using these to decide how the longer, ambiguous compounds are structured. However, there is always the possibility that these systems will encounter modifier-head pairs in testing which never occurred in training, forcing the system to "back off" to some default strategy. This problem is alleviated somewhat in the work of Resnik and Lauer, where statistics are collected on pairs of concepts rather than pairs of noun tokens. However, the methods of Resnik and Lauer both depend on hand-crafted knowledge sources; the applicability of their approaches is therefore limited by the coverage of these resources. Thus their methods would almost certainly perform less well when applied to more technical domains, where much of the vocabulary would not be available in either WordNet or Roget's Thesaurus. Knowledge sources such as these would have to be manually augmented each time the system was ported to a new domain. It would therefore be preferable to have a method of measuring conceptual association which is less domain-dependent and which does not rely on the presence of unambiguous subconstituents in training; we investigated whether Latent Semantic Indexing might satisfy these requirements.</Paragraph>
</Section>
</Paper>