<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1654">
  <Title>Random Indexing using Statistical Weight Functions</Title>
  <Section position="5" start_page="457" end_page="458" type="metho">
    <SectionTitle>
3 Weights
</SectionTitle>
    <Paragraph position="0"> Our initial experiments using Random Indexing to extract synonymy relations produced worse results than those using full vector measures, such as JACCARD (Curran, 2004), when the full vector is weighted. We experiment using weight functions with Random Indexing.</Paragraph>
    <Paragraph position="1"> Only a linear weighting scheme can be applied while maintaining incremental sampling. While incremental sampling is part of the rationale behind its development, it is not required for Random Indexing to work as a dimensionality reduction technique.</Paragraph>
    <Paragraph position="2"> To this end, we revise Random Indexing to enable us to use weight functions. For each unique</Paragraph>
    <Paragraph position="4"> context attribute, a d-length index vector will be generated. The context vector of a term w is then created as the weighted sum of each of its attributes. The results of the original Random Indexing algorithm are reproduced using frequency weighting (FREQ).</Paragraph>
    <Paragraph position="5"> Weights are generated using the frequency distribution of each term and its contexts. This increases the overhead, as we must store the context attributes for each term. Rather than the context vector being generated by adding each individual context, it is generated by adding the index vector for each unique context multiplied by its weight.</Paragraph>
    <Paragraph position="6"> The time to calculate the weight of all attributes of all terms is negligible. The original technique scales to O(dnm) in construction, for n terms and m unique attributes. Our new technique scales to O(d(a + nm)) for a non-zero context attributes per term, which, since a ≤ nm, is also O(dnm).</Paragraph>
    <Paragraph position="7"> Following the notation of Curran (2004), a context relation is defined as a tuple (w,r,wprime) where w is a term, which occurs in some grammatical relation r with another word wprime in some sentence. We refer to the tuple (r,wprime) as an attribute of w. For example, (dog, direct-obj, walk) indicates that dog was the direct object of walk in a sentence.</Paragraph>
    <Paragraph position="8"> An asterisk indicates the set of all existing values of that component in the tuple.</Paragraph>
    <Paragraph position="9"> (w, *, *) = {(r,wprime) | ∃(w,r,wprime)} The frequency of a tuple, that is the number of times a word appears in a context, is f(w,r,wprime).</Paragraph>
    <Paragraph position="10"> f(w, , ) is the instance or token frequency of the contexts in which w appears. n(w, , ) is the type frequency. This is the number of attributes of w.</Paragraph>
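The token and type frequencies defined above can be illustrated with a minimal sketch; the (dog, direct-obj, walk) tuple comes from the text, the remaining tuples and counts are hypothetical.

```python
from collections import Counter

# Hypothetical (w, r, w') context tuples with their corpus counts.
tuples = Counter({
    ("dog", "direct-obj", "walk"): 3,
    ("dog", "subject", "bark"): 2,
    ("cat", "direct-obj", "feed"): 1,
})

def token_freq(w):
    """f(w, *, *): total number of context instances w appears in."""
    return sum(c for (t, _, _), c in tuples.items() if t == w)

def type_freq(w):
    """n(w, *, *): number of distinct attributes (r, w') of w."""
    return sum(1 for (t, _, _) in tuples if t == w)
```

Here `token_freq("dog")` is 5 (3 + 2 instances) while `type_freq("dog")` is 2 (two distinct attributes).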
    <Paragraph position="12"> Most experiments limited weights to the positive range; those evaluated with an unrestricted range are marked with a ± suffix. Some weights were also evaluated with an extra log2(f(w,r,wprime) + 1) factor to promote the influence of higher frequency attributes, indicated by a LOG suffix. Alternative functions are marked with a dagger.</Paragraph>
    <Paragraph position="13"> The context vector of each term w is thus:</Paragraph>
    <Paragraph position="15"> where vector(r,wprime) is the index vector of the context (r,wprime). The weight functions we evaluate are those from Curran (2004) and are given in Table 1.</Paragraph>
  </Section>
  <Section position="6" start_page="458" end_page="459" type="metho">
    <SectionTitle>
4 Semantic Similarity
</SectionTitle>
    <Paragraph position="0"> The first use of Random Indexing was to measure semantic similarity using distributional similarity.</Paragraph>
    <Paragraph position="1"> Kanerva et al. (2000) used Random Indexing to find the best synonym match in the Test of English as a Foreign Language (TOEFL). TOEFL was used by Landauer and Dumais (1997), who reported an accuracy of 36% using un-normalised vectors, which was improved to 64% using LSA. Kanerva et al.</Paragraph>
    <Paragraph position="2"> (2000) produced an accuracy of 48-51% using the same type of document based contexts and Random Indexing, which improved to 62-70% using narrow context windows. Karlgren and Sahlgren (2001) improved this to 72% using lemmatisation and POS tagging.</Paragraph>
    <Section position="1" start_page="459" end_page="459" type="sub_section">
      <SectionTitle>
4.1 Distributional Similarity
</SectionTitle>
      <Paragraph position="0"> Measuring distributional similarity first requires the extraction of context information for each of the vocabulary terms from raw text. The contexts for each term are collected together and counted, producing a vector of context attributes and their frequencies in the corpus. These terms are then compared for similarity using a nearest-neighbour search based on distance calculations between the statistical descriptions of their contexts.</Paragraph>
      <Paragraph position="1"> The simplest algorithm for finding synonyms is a k-nearest-neighbour search, which involves pair-wise vector comparison of the context vector of the target term with the context vector of every other term in the vocabulary.</Paragraph>
      <Paragraph position="2"> We use two types of context extraction to produce both high and low quality context descriptions. The high quality contexts were extracted from grammatical relations produced by the SEXTANT relation extractor (Grefenstette, 1994) and are lemmatised. This is the same data used in Curran (2004).</Paragraph>
      <Paragraph position="3"> The low quality contexts were extracted taking a window of one word to the left and right of the target term. The context is marked as to whether it preceded or followed the term. Curran (2004) found this extraction technique to provide reasonable results on the non-speech portion of the BNC when the data was lemmatised. We do not lemmatise, which produces noisier data.</Paragraph>
    </Section>
    <Section position="2" start_page="459" end_page="459" type="sub_section">
      <SectionTitle>
4.2 Bilingual Lexicon Acquisition
</SectionTitle>
      <Paragraph position="0"> A variation on the extraction of synonymy relations is the extraction of bilingual lexicons. This is the task of finding, for a word in one language, words of similar meaning in a second language.</Paragraph>
      <Paragraph position="1"> The results of this can be used to aid manual construction of resources or directly aid translation.</Paragraph>
      <Paragraph position="2"> This task was first approached as a distributional similarity-like problem by Brown et al.</Paragraph>
      <Paragraph position="3"> (1988). Their approach uses aligned corpora in two or more languages: the source language, from which we are translating, and the target language, to which we are translating. For each aligned segment, they measure co-occurrence scores between each word in the source segment and each word in the target segment. These co-occurrence scores are used to measure the similarity between source and target language terms. Sahlgren and Karlgren's approach models the problem as a distributional similarity problem using the paragraph as context. In Table 2, the source language is limited to the words a, b and c and the target language to the words x, y and z. Three paragraphs in each of these languages are presented as pairs of translations labelled as a context: aaabbc is translated as xxyzzz and labelled context I. The frequency weighted context vector for a is {I:3, III:2} and for x is {I:2, II:1, III:1}.</Paragraph>
      <Paragraph position="4"> A translation candidate for a term in the source language is found by measuring the similarity between its context vector and the context vectors of each of the terms in the target language. The most similar target language term is the most likely translation candidate.</Paragraph>
      <Paragraph position="5"> Sahlgren and Karlgren (2005) use Random Indexing to produce the context vectors for the source and target languages. We re-implement their system and apply weighting functions in an attempt to achieve improved results.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="459" end_page="460" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> For the experiments extracting synonymy relations, high quality contexts were extracted from the non-speech portion of the British National Corpus (BNC) as described above. This represents 90% of the BNC, or 90 million words.</Paragraph>
    <Paragraph position="1"> Comparisons between low frequency terms are less accurate than between high frequency terms as there is less evidence describing them (Curran and Moens, 2002). This is compounded in randomised vector techniques because the randomised nature of the representation means that a low frequency term may have a similar context vector to a high frequency term while not sharing many contexts. A frequency cut-off of 100 was found to balance this inaccuracy with the reduction in vocabulary size. This reduces the original 246,046 word vocabulary to 14,862 words. Experiments showed d = 1000 and ε = 10 to provide a balance between speed and accuracy.</Paragraph>
    <Paragraph position="2"> Low quality contexts were extracted from portions of the entire BNC. These formed corpora of 100,000, 500,000, 1 million, 5 million, 10 million, 50 million and 100 million words, chosen from random documents. This allowed us to test the effect of both corpus size and context quality. This produced vocabularies of between 10,380 and 522,163 words in size. Because a high cut-off would remove too many terms from the smallest corpora for a fair test, a cut-off of 5 was applied. The values d = 1000 and ε = 6 were used.</Paragraph>
    <Paragraph position="3"> For our experiments in bilingual lexicon acquisition we follow Sahlgren and Karlgren (2005). We use the Spanish-Swedish and the English-German portions of the Europarl corpora (Koehn, 2005).1 These consist of 37,379 aligned paragraphs in Spanish-Swedish and 45,556 in English-German. The text was lemmatised using Connexor Machinese (Tapanainen and Järvinen, 1997)2 producing vocabularies of 42,671 Spanish terms, 100,891 Swedish terms, 40,181 English terms and 70,384 German terms. We use d = 600 and ε = 6 and apply a frequency cut-off of 100.</Paragraph>
  </Section>
  <Section position="8" start_page="460" end_page="461" type="metho">
    <SectionTitle>
6 Evaluation Measures
</SectionTitle>
    <Paragraph position="0"> The simplest method for evaluation is the direct comparison of extracted synonyms with a manually created gold standard (Grefenstette, 1994).</Paragraph>
    <Paragraph position="1"> To reduce the problem of limited coverage, our evaluation of the extraction of synonyms combines three electronic thesauri: the Macquarie, Roget's and Moby thesauri.</Paragraph>
    <Paragraph position="2"> We follow Curran (2004) and use two performance measures: direct matches (DIRECT) and inverse rank (INVR). DIRECT is the number of returned synonyms found in the gold standard.</Paragraph>
    <Paragraph position="3"> INVR is the sum of the inverse rank of each matching synonym, e.g. matches at ranks 3, 5 and 28 give an inverse rank score of 1/3 + 1/5 + 1/28. With at most 100 matching synonyms, the maximum INVR is 5.187. This is more fine-grained than DIRECT as it incorporates both the number of matches and their ranking.</Paragraph>
    <Paragraph position="4"> The same 300 single word nouns were used for evaluation as used by Curran (2004) for his large scale evaluation. These were chosen randomly from WordNet such that they covered a range over the following properties: frequency, number of senses, specificity and concreteness. On average each evaluation term had 301 gold-standard synonyms. For each of these terms, the closest 100 terms and their similarity scores were extracted. For the evaluation of bilingual lexicon acquisition we use two online lexical resources used by Sahlgren and Karlgren (2005) as gold standards: Lexin's online Swedish-Spanish lexicon3 and TU Chemnitz' online English-German dictionary.4 Each of the elements in a compound or multi-word expression is treated as a potential translation. The German abblendlicht (low beam light) is treated as a translation candidate for low, beam and light separately.</Paragraph>
    <Paragraph position="5"> Low coverage is more of a problem than in our thesaurus task as we have not used combined resources. There are an average of 19 translations for each of the 3,403 Spanish terms and 197 translations for each of the 4,468 English terms. The English-German translation count is skewed by the presence of connectives in multi-word expressions, such as of and on, producing mistranslations. Sahlgren and Karlgren (2005) provide good commentary on the evaluation of this task.</Paragraph>
    <Paragraph position="6"> Spanish and English are used as the source languages. The 200 closest terms in the target language are found for all terms in both the source vocabulary and the gold-standards.</Paragraph>
    <Paragraph position="7"> We measure the DIRECT score and INVR as above. In addition we measure the precision of the closest translation candidate, as used in Sahlgren and Karlgren (2005).</Paragraph>
  </Section>
</Paper>