<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-3026">
<Title>A Practical Solution to the Problem of Automatic Word Sense Induction</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Algorithm </SectionTitle>
<Paragraph position="0"> As in previous work (Rapp, 2002), our computations are based on a partially lemmatized version of the British National Corpus (BNC) from which the function words have been removed. Starting from the list of 12 ambiguous words provided by Yarowsky (1995), shown in Table 2, we created a concordance for each word, with each concordance line covering a context window of ±20 words. From the concordances we computed 12 term/context matrices (analogous to Table 1) whose binary entries indicate whether or not a word occurs in a particular context. Assuming that the amount of information a context word provides depends on its association strength with the ambiguous word, in each matrix we removed all words that are not among the top 30 first-order associations of the ambiguous word. These top 30 associations were computed fully automatically on the basis of the log-likelihood ratio. We used the procedure described in Rapp (2002), with the only modification being that the log-likelihood values are multiplied by a triangular function of the logarithm of a word's frequency.</Paragraph>
<Paragraph position="1"> In this way, preference is given to words in the middle of the frequency range. Figures 1 to 3 are based on the association lists for the words palm and poach.</Paragraph>
<Paragraph position="2"> Because our term/context matrices are very sparse and each individual entry is somewhat arbitrary, it is necessary to detect the regularities underlying these patterns. For this purpose we applied SVD to each of the matrices, reducing each matrix to its three main dimensions. This number of dimensions may seem low. However, it turned out that with our relatively small matrices (the size of a matrix is the occurrence frequency of the ambiguous word times the number of associations considered) it was sometimes not possible to compute more than three singular values, as there are dependencies in the data. We therefore decided to use three dimensions for all matrices.</Paragraph>
<Paragraph position="3"> The last step in our procedure is to apply a clustering algorithm to the 30 words in each matrix. For our condensed matrices of 3 rows and 30 columns this is a rather simple task. We decided to use the hierarchical clustering algorithm readily available in the MATLAB (MATrix LABoratory) programming language. After some testing with various similarity functions and linkage types, we opted for the cosine coefficient and single linkage, the combination that appeared to give the best results.</Paragraph>
<Paragraph position="4"> Table 2: The twelve ambiguous words (Yarowsky, 1995) and their two senses:
axes: grid/tools
bass: fish/music
crane: bird/machine
drug: medicine/narcotic
duty: tax/obligation
motion: legal/physical
palm: tree/hand
plant: living/factory
poach: steal/boil
sake: benefit/drink
space: volume/outer
tank: vehicle/container</Paragraph>
</Section>
</Paper>
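To make the association step concrete, the following is a minimal sketch in Python (chosen for illustration only; the paper publishes no code). It scores candidate context words with a Dunning-style log-likelihood ratio, multiplies each score by a triangular weight on the logarithm of the word's corpus frequency, and keeps the 30 best-scoring words. The contingency counts, the exact shape of the triangular function, and the helper names (log_likelihood, triangular_weight, top_associations) are assumptions, since the paper only states that the log-likelihood values are multiplied by such a function.

    import math
    from collections import Counter

    def log_likelihood(k11, k12, k21, k22):
        # Dunning-style log-likelihood ratio for a 2x2 contingency table.
        def ll(k, n, p):
            s = 0.0
            if k > 0:
                s += k * math.log(p)
            if n - k > 0:
                s += (n - k) * math.log(1.0 - p)
            return s
        n1, n2 = k11 + k12, k21 + k22
        p1, p2, p = k11 / n1, k21 / n2, (k11 + k21) / (n1 + n2)
        return 2.0 * (ll(k11, n1, p1) + ll(k21, n2, p2)
                      - ll(k11, n1, p) - ll(k21, n2, p))

    def triangular_weight(freq, log_min, log_max):
        # Assumed shape: peaks in the middle of the log-frequency range,
        # falls off linearly towards both ends.
        mid = (log_min + log_max) / 2.0
        half = max((log_max - log_min) / 2.0, 1e-9)
        return max(0.0, 1.0 - abs(math.log(freq) - mid) / half)

    def top_associations(contexts, corpus_freq, corpus_size, n_top=30):
        # contexts: one set of (lemmatized, content) words per occurrence
        # of the ambiguous word, i.e. per concordance line.
        cooc = Counter(w for ctx in contexts for w in ctx)
        logs = [math.log(f) for f in corpus_freq.values()]
        lo, hi = min(logs), max(logs)
        scores = {}
        for w, k11 in cooc.items():
            k12 = len(contexts) - k11            # target contexts without w
            k21 = corpus_freq[w] - k11           # w outside the target contexts
            k22 = corpus_size - k11 - k12 - k21  # everything else
            scores[w] = log_likelihood(k11, k12, k21, k22) * \
                        triangular_weight(corpus_freq[w], lo, hi)
        return sorted(scores, key=scores.get, reverse=True)[:n_top]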
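The dimensionality-reduction step can be sketched as follows (again Python/NumPy for illustration). The construction of the binary matrix and the scaling of the word vectors by the singular values are assumptions, as the paper does not spell out these details; each of the 30 association words ends up as one column of a condensed matrix with at most three rows.

    import numpy as np

    def condensed_matrix(contexts, vocab, n_dims=3):
        # Binary term/context matrix: one row per concordance line, one column
        # per association word; entry 1 if the word occurs in that context.
        X = np.array([[1.0 if w in ctx else 0.0 for w in vocab] for ctx in contexts])
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        # Dependencies in the data can leave fewer than three usable singular values.
        k = min(n_dims, int(np.count_nonzero(S > 1e-10)))
        # Scale the right singular vectors by the singular values: the result has
        # k rows and len(vocab) columns, i.e. one k-dimensional vector per word.
        return S[:k, None] * Vt[:k, :]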
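Finally, a sketch of the clustering step. The paper uses the hierarchical clustering routines available in MATLAB; the SciPy calls below merely mirror the reported choices (cosine distance, single linkage). Cutting the dendrogram into two clusters, one per sense, is an assumption made here for illustration.

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    def cluster_words(condensed, vocab, n_clusters=2):
        # One row per association word (the condensed matrix stores words as columns).
        vectors = condensed.T
        distances = pdist(vectors, metric='cosine')   # pairwise cosine distances
        tree = linkage(distances, method='single')    # single-linkage hierarchy
        labels = fcluster(tree, t=n_clusters, criterion='maxclust')
        clusters = {}
        for word, label in zip(vocab, labels):
            clusters.setdefault(int(label), []).append(word)
        return clusters

For palm, for example, one would expect the tree-related and hand-related context words to fall into different clusters, in line with the senses listed in Table 2.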