File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-1401_metho.xml
Size: 10,637 bytes
Last Modified: 2025-10-06 14:08:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1401"> <Title>Disambiguating Noun Compounds with Latent Semantic Indexing</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Latent Semantic Indexing </SectionTitle> <Paragraph position="0"> Latent Semantic Indexing (LSI) is an extension of the vector-space approach to information retrieval. It takes as input a collection of documents, from which it constructs an m × n word-document matrix A; cell a_ij of the matrix denotes the frequency with which term i occurs in document j. At the core of LSI is singular value decomposition (SVD), a mathematical technique closely related to eigenvector decomposition and factor analysis. SVD factors the matrix A into the product of three matrices: A = USV^T. U and V contain the left and right singular vectors of A, respectively, while S is a diagonal matrix containing the singular values of A in descending order. By retaining only the k largest singular values (k must be determined empirically, and will depend on the particular application) and setting the remaining smaller ones to zero, a new diagonal matrix S_k is obtained; the product US_kV^T is then the m × n matrix A_k, which is only approximately equal to A. This truncated SVD re-represents the word-document relationships in A using only the axes of greatest variation, in effect compressing and smoothing the data in A. It is this compression step which is said to capture important regularities in the patterns of word co-occurrence while ignoring smaller variations that may be due to idiosyncrasies in the word usage of individual documents. The result of condensing the matrix in this way is that words which occur in similar documents will be represented by similar vectors, even if these words never actually co-occur in the same document. Thus it is claimed that LSI captures deeper associative relationships than mere word-word co-occurrences.</Paragraph> <Paragraph position="1"> See Berry et al. (1995) and Deerwester et al. (1990) for more thorough discussions of SVD and its application to information retrieval.</Paragraph> <Paragraph position="2"> Because word vectors are originally based on their distribution of occurrence across documents, each vector can be interpreted as a summary of a word's contextual usage; words are thus similar to the extent that they occur in similar contexts. Of interest for our purposes is the fact that a measure of the similarity or association between pairs of words can be calculated geometrically, typically by computing the cosine of the angle between word vectors. Any two words, which may or may not occur adjacently in text, can be compared in this way; this frees us from the restriction of relying on unambiguous subconstituents in training to inform the analysis of ambiguous compounds in testing.</Paragraph>
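To make the preceding description concrete, here is a minimal sketch of the truncated SVD and the cosine comparison in Python with numpy. The toy term-document matrix and the choice of k are illustrative assumptions, not data or settings from this study.

```python
# Minimal LSI sketch: truncated SVD of a term-document matrix,
# then cosine association between term vectors.
import numpy as np

# m x n term-document matrix A; cell a_ij is the frequency of term i
# in document j (toy values, for illustration only).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 1.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U S V^T

k = 2                                        # retain the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A

# Each term is represented by a row of U_k S_k; the association between
# two terms is the cosine of the angle between their vectors.
term_vectors = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(term_vectors[0], term_vectors[1]))
```

Note that two terms can receive a high cosine in the compressed space even if they never co-occur in any document, which is the property exploited in the experiments below.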
<Paragraph position="3"> There is a growing body of literature indicating that distributional information of the kind captured by LSI plays an important role in various aspects of human cognition. For the work reported here, the most interesting aspect of distributional information is its purported ability to model conceptual categorisation. Several studies (Burgess and Lund, 1999; Laham, 1997; Landauer et al., 1998; Levy and Bullinaria, 2001) have shown that similarity between concepts can be measured quite successfully using simple vectors of contextual usage; results show that the performance of such systems correlates well with that of humans on the same tasks. These results are all the more impressive when we consider that such systems use no hand-coded semantic knowledge; the conceptual representations are derived automatically from training corpora.</Paragraph> <Paragraph position="4"> Noun compound disambiguation appears to be an NLP application for which such measures of conceptual association would be useful. Both the adjacency and dependency algorithms described above in Section 2 rely on some measure of the &quot;acceptability&quot; of pairs of nouns to disambiguate noun compounds. Techniques such as LSI offer a simple, robust, and domain-independent way in which concepts and the associations between them can be represented. In the next section, we describe an experiment exploring the efficacy of LSI's conceptual representations in disambiguating noun compounds.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 LSI and Noun Compound Disambiguation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle> <Paragraph position="0"> We used four corpora in our study: The Lotus Ami Pro Word Processor for Windows User's Guide Release 3, a software manual (AmiPro); document abstracts in library science (CISI); document abstracts on aeronautics (CRAN); and articles from Time magazine (Time). We first ran the LSI software on the corpora to create the word-by-document matrices. The software then performed singular value decomposition on the resulting matrices. Stop-words were not excluded, as previous experience had shown that doing so degraded performance slightly.</Paragraph> <Paragraph position="1"> We used Brill's (1994) tagger to identify three-noun sequences in each of the corpora.</Paragraph> <Paragraph position="2"> Tagging was imperfect, and sequences which were not true three-noun compounds were discarded. The remaining noun compounds were bracketed manually and constituted the test sets for each corpus; some examples are shown in Table 1. [Table 1: ... of a manually bracketed compound, along with its branching direction.] Table 2 summarises the datasets used in our study.</Paragraph> <Paragraph position="3"> Both the adjacency and dependency models were investigated (see Section 2). Recall that the adjacency algorithm operates by comparing the acceptability of the subcomponents (n1 n2) and (n2 n3), whereas the dependency algorithm compares the acceptability of (n1 n2) and (n1 n3). &quot;Acceptability&quot; in our approach was measured by calculating the cosine of the angle between each pair of word vectors. The cosine ranges from −1.0 to 1.0; a higher cosine indicates a stronger association between the two words of a pair. In the case of a tie, a left-branching analysis was preferred, as the literature suggests that this is the more common structure (Lauer, 1995; Resnik, 1993). A default strategy of always guessing a left-branching analysis therefore served as the baseline in this study. Each of the corpora contained terms not covered by WordNet or Roget's; it was therefore not possible to use the techniques of Resnik (1993) and Lauer (1995) as baselines.</Paragraph> <Paragraph position="4"> As we could not tell beforehand what the optimal value of k would be (see Section 3 above), we used a range of factor values, from 2 up to the total number of documents in each collection. For each factor value, we obtained the percentage accuracy of both the adjacency and dependency models.</Paragraph> </Section>
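As an illustration of the procedure just described, the sketch below shows how the two bracketing decisions and the sweep over factor values might look in code. It assumes a cosine measure over LSI term vectors such as the one sketched in Section 3; all function and variable names here are illustrative, not taken from the authors' software.

```python
# Bracket a three-noun compound (n1 n2 n3) under the adjacency or
# dependency model; ties default to a left-branching analysis.
def bracket(cos, n1, n2, n3, model="adjacency"):
    if model == "adjacency":
        left, right = cos(n1, n2), cos(n2, n3)  # (n1 n2) vs (n2 n3)
    else:                                       # dependency model
        left, right = cos(n1, n2), cos(n1, n3)  # (n1 n2) vs (n1 n3)
    return "left" if left >= right else "right"

# Percentage accuracy against a manually bracketed test set:
# a list of ((n1, n2, n3), gold) pairs, gold being "left" or "right".
def accuracy(cos, test_set, model):
    correct = sum(bracket(cos, n1, n2, n3, model) == gold
                  for (n1, n2, n3), gold in test_set)
    return 100.0 * correct / len(test_set)

# Sweep over SVD factor values, scoring both models at each k.
# make_cosine(k) is a hypothetical helper returning a cosine measure
# computed from k-factor LSI term vectors.
# for k in range(2, num_documents + 1):
#     cos_k = make_cosine(k)
#     print(k, accuracy(cos_k, test_set, "adjacency"),
#              accuracy(cos_k, test_set, "dependency"))
```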
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Results and Discussion </SectionTitle> <Paragraph position="0"> The results of the experiment are summarised in Table 3 and Figure 1. [Table 3: ... accuracy of always choosing a left-branching analysis. Highest accuracies for the Adjacency and Dependency algorithms are shown, with the corresponding number of SVD factors in parentheses.] In most cases the performance rises quickly as the number of SVD factors used increases, and then tends to level off.</Paragraph> <Paragraph position="1"> The best performance was 84% for the AmiPro collection, obtained using the adjacency algorithm and 280 SVD factors. As the task involved choosing the best binary bracketing for a noun compound, we would expect an accuracy of 50% by chance. These results compare favourably with those of Resnik (1993) and Lauer (1995) (73% and 81%, respectively), but as their studies were conducted on different corpora, it would be imprudent to make direct comparisons at this stage. Results for the other collections were less impressive; however, above-baseline performance was obtained in each case.</Paragraph> <Paragraph position="2"> Substantial differences in the performance of the adjacency and dependency algorithms were only observed for the AmiPro collection, suggesting that the superior performance of the dependency algorithm in Lauer's (1995) study was largely corpus-dependent. This is reinforced by the considerably superior performance of the adjacency algorithm on the AmiPro data set.</Paragraph> <Paragraph position="3"> Another interesting finding was that there were more right-branching (52%) than left-branching (48%) compounds in the Time collection. This contrasts with previous studies, which discuss the predominance of left-branching compounds, and suggests that the choice of default branching must be corpus-dependent (see Barker (1998) for similar findings).</Paragraph> <Paragraph position="4"> There also appears to be a positive relationship between performance and the token-type ratio. The number of tokens per type in the AmiPro collection was 46.3; the worst performance was found for the Time collection, which had only 11.5 tokens per type. There are at least two possible explanations for this relationship between performance and token-type ratio. First, there were more samples of each word type in the AmiPro collection; this may have helped LSI construct vectors which were more representative of each word's contextual usage, thus leading to the superior performance on the AmiPro compounds.</Paragraph> <Paragraph position="5"> Second, LSI constructs a single vector for each word type; if a particular word is polysemous in the text then its vector will be a &quot;noisy&quot; amalgamation of its senses, a factor often contributing to poor performance. However, due to the controlled language and vocabulary used in the software manual domain, few if any of the words in the AmiPro collection are used to convey more than one sense; once again, this may have resulted in &quot;cleaner&quot;, more accurate vectors, leading to the superior disambiguation performance on the AmiPro compounds.</Paragraph>
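The token-type ratio discussed above is straightforward to compute; the snippet below is a minimal sketch, with naive whitespace tokenisation as a simplifying assumption (the study's exact tokenisation is not specified here).

```python
# Tokens-per-type ratio: total word tokens divided by distinct word types.
def tokens_per_type(text: str) -> float:
    tokens = text.lower().split()  # naive whitespace tokenisation (assumption)
    return len(tokens) / len(set(tokens))
```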
<Paragraph position="6"> These points lead us to the tentative suggestion that our approach is most suitable for technical writing such as software manuals.</Paragraph> <Paragraph position="7"> As usual, however, this is a matter for future investigation.</Paragraph> </Section> </Section> </Paper>