<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1030"> <Title>Scaling Context Space</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Automatic Thesaurus Extraction </SectionTitle> <Paragraph position="0"> Thesauri have traditionally been used in information retrieval tasks to expand words in queries with synonymous terms (e.g. Ruge, (1997)). More re-Computational Linguistics (ACL), Philadelphia, July 2002, pp. 231-238. Proceedings of the 40th Annual Meeting of the Association for cently, semantic resources have also been used in collocation discovery (Pearce, 2001), smoothing and model estimation (Brown et al., 1992; Clark and Weir, 2001) and text classi cation (Baker and Mc-Callum, 1998). Unfortunately, thesauri are very expensive and time-consuming to produce manually, and tend to suffer from problems of bias, inconsistency, and lack of coverage. In addition, thesaurus compilers cannot keep up with constantly evolving language use and cannot afford to build new thesauri for the many subdomains that information extraction and retrieval systems are being developed for. There is a clear need for methods to extract thesauri automatically or tools that assist in the manual creation and updating of these semantic resources.</Paragraph> <Paragraph position="1"> Most existing work on thesaurus extraction and word clustering is based on the general observation that related terms will appear in similar contexts.</Paragraph> <Paragraph position="2"> The differences tend to lie in the way context is de ned and in the way similarity is calculated. Most systems extract co-occurrence and syntactic information from the words surrounding the target term, which is then converted into a vector-space representation of the contexts that each target term appears in (Brown et al., 1992; Pereira et al., 1993; Ruge, 1997; Lin, 1998b). Other systems take the whole document as the context and consider term co-occurrence at the document level (Crouch, 1988; Sanderson and Croft, 1999). Once these contexts have been de ned, these systems then use clustering or nearest neighbour methods to nd similar terms.</Paragraph> <Paragraph position="3"> Finally, some systems extract synonyms directly without extracting and comparing contextual representations for each term. Instead, these systems recognise terms within certain linguistic patterns (e.g. X, Y and other Zs) which associate synonyms and hyponyms (Hearst, 1992; Caraballo, 1999).</Paragraph> <Paragraph position="4"> Thesaurus extraction is a good task to use to experiment with scaling context spaces. The vector-space model with nearest neighbour searching is simple, so we needn't worry about interactions between the contexts we select and a learning algorithm (such as independence of the features). But also, thesaurus extraction is a task where success has been limited when using small corpora (Grefenstette, 1994); corpora of the order of 300 million words have already been shown to be more successful at this task (Lin, 1998b).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> Vector-space thesaurus extraction can be separated into two independent processes. 
<Paragraph position="2"> The context extraction systems used for these experiments are described in the following section.</Paragraph>
<Paragraph position="3"> The second step in thesaurus extraction performs clustering or nearest-neighbour analysis to determine which terms are similar based on their context vectors. Our second component is similar to Grefenstette's SEXTANT system, which performs nearest-neighbour calculations for each pair of potential thesaurus terms. For nearest-neighbour measurements we must define a function to judge the similarity between two context vectors (e.g. the cosine measure) and a function to combine the raw instance frequencies for each context relation into weighted vector components.</Paragraph>
<Paragraph position="4"> SEXTANT uses a generalisation of the Jaccard measure to measure similarity. The Jaccard measure is the cardinality ratio of the intersection and union of attribute sets (atts(wn) is the attribute set for wn): |atts(w1) ∩ atts(w2)| / |atts(w1) ∪ atts(w2)|.</Paragraph>
<Paragraph position="6"> The generalised Jaccard measure allows each relation to have a significance weight (based on word, attribute and relation frequencies) associated with it:</Paragraph>
<Paragraph position="8"> where f(wi, aj) is the frequency of the relation and n(aj) is the number of different words aj appears in. However, we have found that using the t-test between the joint and independent distributions of a word and its attribute:</Paragraph>
<Paragraph position="10"> gives superior performance (Curran and Moens, 2002) and is therefore used for our experiments.</Paragraph> </Section>
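The equations for the weighted measure and the t-test are not reproduced in this excerpt, so the sketch below should be read as an approximation under common assumptions: it uses the min/max form of the generalised (weighted) Jaccard measure, and a standard t-test weight that compares the joint probability of a word and its attribute with the product of their marginals. The weight function may differ in detail from the one used in the paper, and all names are illustrative.

from math import sqrt

def ttest_weight(f_wa, f_w, f_a, n):
    # Standard t-test comparing the joint distribution p(w, a) with the
    # product of the marginals p(w)p(a); n is the total relation count.
    if f_wa == 0:
        return 0.0
    p_wa, p_w, p_a = f_wa / n, f_w / n, f_a / n
    return (p_wa - p_w * p_a) / sqrt(p_wa / n)

def weighted_jaccard(wgt1, wgt2):
    # Generalised Jaccard over weighted components: the ratio of the sums of
    # component-wise minima and maxima (non-negative weights assumed).
    attrs = set(wgt1) | set(wgt2)
    num = sum(min(wgt1.get(a, 0.0), wgt2.get(a, 0.0)) for a in attrs)
    den = sum(max(wgt1.get(a, 0.0), wgt2.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0

With unweighted, binary components this reduces to the plain Jaccard measure; replacing the weights changes only the vector components, not the comparison itself.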
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Context Extractors </SectionTitle>
<Paragraph position="0"> We have experimented with a number of different systems for extracting the contexts for each word.</Paragraph>
<Paragraph position="1"> These systems show a wide range in complexity of method and implementation, and hence in development effort and execution time.</Paragraph>
<Paragraph position="2"> The simplest method we implemented extracts the occurrence counts of words within a particular window surrounding the thesaurus term. These window extractors are very easy to implement and run very quickly. The window geometries used in this experiment are listed in Table 1. Extractors marked with an asterisk, for example W(L1R1*), do not distinguish (within the relation type) between different positions of the word w' in the window.</Paragraph>
<Paragraph position="3"> At a greater level of complexity, we have two shallow NLP systems which provide extra syntactic information in the extracted contexts. The first system is based on the syntactic relation extractor from SEXTANT, with a different POS tagger and chunker.</Paragraph>
<Paragraph position="4"> The SEXTANT-based extractor we developed uses a very simple Naïve Bayes POS tagger and chunker. This is very simple to implement and is extremely fast, since it optimises the tag selection locally at the current word rather than performing beam or Viterbi search over the entire sentence. After the raw text has been POS tagged and chunked, the SEXTANT relation extraction algorithm is run over the text. This consists of five passes over each sentence that associate each noun with the modifiers and verbs from the syntactic contexts it appears in.</Paragraph>
<Paragraph position="5"> The second shallow parsing extractor we used was the CASS parser (Abney, 1996), which uses cascaded finite-state transducers to produce a limited-depth parse of POS tagged text. We used the output of the Naïve Bayes POS tagger as input to CASS. The context relations were extracted directly by the tuples program (using the demo grammar) included in the CASS distribution.</Paragraph>
<Paragraph position="6"> The FST parsing algorithm is very efficient, so CASS also ran very quickly. The times reported below include the Naïve Bayes POS tagging time. The final, most sophisticated extractor used was the MINIPAR parser (Lin, 1998a), which is a broad-coverage principle-based parser. The context relations were extracted directly from the full parse tree. Although fast for a full parser, MINIPAR was no match for the simpler extractors.</Paragraph>
<Paragraph position="7"> For this experiment we needed a large quantity of text which we could group into a range of corpus sizes. We combined the BNC and the Reuters corpus to produce a 300 million word corpus. The respective sizes of each are shown in Table 2. The sentences were randomly shuffled together to produce a single homogeneous corpus. This corpus was split into two 150M word corpora, over which the main experimental results are averaged. We then created smaller corpora ranging from 1/2 down to 1/64th of each 150M corpus.</Paragraph>
<Paragraph position="8"> The next section describes the method of evaluating each thesaurus created by the combination of a given context extraction system and corpus size.</Paragraph> </Section>
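As a rough illustration of the simplest extractors above, the sketch below produces window-based context relations. The exact window geometries of Table 1 are not included in this excerpt, so the parameters and relation labels here are assumptions, not the authors' implementation.

def window_contexts(tokens, left=1, right=1, keep_positions=True):
    # Emit (w, r, w') tuples for every word within the window around w; the
    # relation encodes the offset unless positions are collapsed, as in the
    # asterisked W(...) extractors.
    relations = []
    for i, term in enumerate(tokens):
        for offset in range(-left, right + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            rel = f"pos{offset:+d}" if keep_positions else "window"
            relations.append((term, rel, tokens[j]))
    return relations

# e.g. window_contexts("the dog chased the cat".split(), left=1, right=1)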
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> For the purposes of evaluation, we selected 70 single-word noun terms for thesaurus extraction. To avoid sample bias, the words were randomly selected from WordNet such that they covered a range of values for the following word properties: occurrence frequency, based on frequency counts from the Penn Treebank, BNC and Reuters; number of senses, based on the number of WordNet synsets and Macquarie Thesaurus entries; generality/specificity, based on the depth of the term in the WordNet hierarchy; and abstractness/concreteness, based on even distribution across all WordNet subtrees.</Paragraph>
<Paragraph position="1"> Table 3 shows some of the selected terms with frequency and synonym set data. For each term we extracted a thesaurus entry with 200 potential synonyms and their weighted Jaccard scores.</Paragraph>
<Paragraph position="2"> The most difficult aspect of thesaurus extraction is evaluating the quality of the result. The simplest method of evaluation is direct comparison of the extracted thesaurus with a manually created gold standard (Grefenstette, 1994). However, on smaller corpora, direct matching alone is often too coarse-grained, and thesaurus coverage is a problem.</Paragraph>
<Paragraph position="3"> Our experiments use a combination of three thesauri available in electronic form: the Macquarie Thesaurus (Bernard, 1990), Roget's Thesaurus (Roget, 1911), and the Moby Thesaurus (Ward, 1996).</Paragraph>
<Paragraph position="4"> Each thesaurus is structured differently: Roget's and the Macquarie are topic ordered, and the Moby thesaurus is head-term ordered. Roget's is quite dated and has low coverage; it contains a deep hierarchy (depth up to seven) with terms grouped in 8696 small synonym sets at the leaves of the hierarchy. The Macquarie consists of 812 large topics (often in antonym-related pairs), each of which is separated into small synonym sets, 21174 in total. Roget's and the Macquarie provide sense distinctions by placing terms in multiple synonym sets. The Moby thesaurus consists of 30259 head terms with large synonym lists which conflate all the head-term senses. The extracted thesaurus does not distinguish between different head senses. Therefore, we convert Roget's and the Macquarie thesaurus into head-term ordered format by combining each small sense set that the head term appears in.</Paragraph>
<Paragraph position="5"> We create a gold-standard thesaurus containing the union of the synonym lists from each thesaurus, giving a total of 23207 synonyms for the 70 terms.</Paragraph>
<Paragraph position="6"> With these gold-standard resources in place, it is possible to use precision and recall measures to calculate the performance of the thesaurus extraction systems. To help overcome the problems of coarse-grained direct comparisons, we use three different types of measure to evaluate thesaurus quality:
1. Direct Match (DIRECT)
2. Precision of the n top ranked synonyms (P(n))
3. Inverse Rank (INVR)
A match is an extracted synonym that appears in the corresponding gold-standard synonym list. The direct match score is the number of such matches for each term. Precision of the top n is the percentage of matches in the top n extracted synonyms. In these experiments, we calculate this for n = 1, 5 and 10. The inverse rank score is the sum of the inverse ranks of each match. For example, if matching synonyms appear in the extracted synonym list at ranks 3, 5 and 28, then the inverse rank score is 1/3 + 1/5 + 1/28 = 0.569. The maximum inverse rank score is 5.878 for a synonym list of 200 terms. Inverse rank is a good measure of subtle differences in ranked results. Each measure is averaged over the extracted synonym lists for all 70 thesaurus terms.</Paragraph> </Section> </Paper>
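For reference, the three measures described in Section 5 can be computed for a single term's ranked synonym list as in the following sketch. This is illustrative code rather than the authors' evaluation scripts; function names are hypothetical.

def direct_match(extracted, gold):
    # DIRECT: number of extracted synonyms found in the gold-standard list.
    return sum(1 for s in extracted if s in gold)

def precision_at_n(extracted, gold, n):
    # P(n): percentage of the top n extracted synonyms that are matches.
    return 100.0 * sum(1 for s in extracted[:n] if s in gold) / n

def inverse_rank(extracted, gold):
    # INVR: sum of the inverse ranks of each matching synonym.
    return sum(1.0 / rank for rank, s in enumerate(extracted, start=1) if s in gold)

# Matches at ranks 3, 5 and 28 give 1/3 + 1/5 + 1/28 = 0.569; the maximum
# for a 200-synonym list is the 200th harmonic number, about 5.878.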