<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2104">
  <Title>Experiments in Automated Lexicon Building for Text Searching</Title>
  <Section position="4" start_page="719" end_page="721" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> Tile focus of onr experiment was on units of text in which the constituents must fit together in order for the discourse to be coherent. We made the assumption that the documents in our corpus were coherent and reasoned that if we had enough text, covering a broad range of topics, we could pick out domain-independent associations. For example, testimony can be about virtually anything, since anything can wind up in a court dispute. But over a large enough collection of text, the terms that directly relate to tile &amp;quot;who,&amp;quot; &amp;quot;what&amp;quot; and &amp;quot;where&amp;quot; of testimony per se should appear in segments with testimony more frequently than chance.</Paragraph>
    <Paragraph position="1"> These associations do not necessarily appear in a dictionary or thesaurus. When huntans explain all unfamiliar word, they often use scenarios and analogies. null We divided the experiments in two groups: one group that looks at co-occurrences within a single unit, and another that looks at a sequence of units.</Paragraph>
    <Paragraph position="2"> In the first group of experinmnts, we considered paragraphs, sentences and clauses, each with and without prepositional phrases.</Paragraph>
    <Paragraph position="3">  * Single paragraphs with/without PP * Single sentences with/without PP * Single clauses with/without PP 720 \]in the second group, we considered two clauses and sequences of subject 110un phrases from two to six chmses. Ill this group, we had: ,, Two clauses with/without Pl) ,, A sequence of subject NPs fl'onl 2 clauses A sequence of subject NPs Dora 3 clauses ,, A sequence of subject NPs from 4 clauses * A sequence of subject NPs fi'om 5 clauses ,, A sequence of subject NPs from 6 clauses  The intuition for the second groul) is that a topic flows from one granmm.tical unit to another so that the salient nouns, l)articularly the surface subjects, in successive clauses should reveal the associations we are seeldng.</Paragraph>
    <Paragraph position="4"> '\[lo illustrate the method, consider the three-clause configuration: Say that ~vordi apl)ears in clausc,~. We maintain a table of all word pairs and increment the entries for O,,o,'(h , ',,,o,'d~ ), where ,,0,% is a sub-ject noun in cla'usc,~, clauscn+~, or ell'use,+2. No effort was made to resolve pronomial references, and these were skipped.</Paragraph>
    <Paragraph position="5"> We used nollnS Olfly' because l)reliminary tests showed that pairings between nouns seemed to stand out. V~Te included tokens that were tagged as 1)roper nallleS when they also have have conlnlon nleanings. For example, consider the Linguistic Data Consorl;ium at the University of Pennsylvania. Data, Consortium and University wouM be on tile list used to build the table of nmtchul)s with other nouns, \])lit l)emlsylvania would not. V~To also collected noun modifiers as well as head nouns as they can carry more information than the surface heads, such as &amp;quot;business group&amp;quot;, '&amp;quot;.science class&amp;quot; or &amp;quot;crinm scene.&amp;quot; The corpus consisted of all tile general-interest articles from the New York Tinms newswire in 1996 in the North American News Corlms , and (lid not include either st)orts or l)usiness news. We tirst removed dul)licate articles. The data fl'om 1996 was too slmrse for the sequence-of-subjects contiguralions. '\]'o l)alance the expcrinmnts better, we added another year's worth of newswire articles, from 1995, tbr the sequence-of subject configurations so that we had more than one million matchups for each configuration (Table 1).</Paragraph>
    <Paragraph position="6"> The I)roeess is flflly automatic, requiring no su1)ervision or training examples. The corpus was tagged with a decision-tree tagger (Schmid, 1994) and parsed with a finite-state parser (Abney, 1996) using a specially written context-fi'ee-grannnar that focused on locating clause boundaries. The grammar also identified extended noun l)hrases in tile sub-ject position, verb l)hrases and other noun l)hrases and prepositional 1)hrases. The nouns in the tagged, parsed corl)uS were reduced to their syntactic roots (removing l)lurals from nouns) with a lookup table created t'rom Wordnet (Miller, 1990) and CELEX (1995). We. performed this last step mainly to address the sparse data problem. There were a substantial nunfl)er of paMngs that occurred only once. We elinfinated from considerat;ion all such singletons, although it did not al)peal to have much etfect on the overall outcome.</Paragraph>
    <Paragraph position="7">  notes the inclusion of 1995 data There were about 1.2 million paragraphs, 2.2 million sentences and 3.4 million clauses in the selected portions of the 1996 COl'pus. The total number of words was 57 million. Table 2 shows the nmnl)er of distinct nouns.</Paragraph>
    <Paragraph position="8">  To score the nmtchups in our initial exlmriments , we used the Dice Coeliicient, which l)roduces values i'ronl 0 to 1, to measure the association between pairs of words and then produced an ordered association list fl'om the co-occurrence table, ranked according to the scores of the entries.</Paragraph>
    <Paragraph position="10"> One 1)roblem was immediately a l)parent: The quality of tile association lists wxried greatly. Tile scoring was doing an acceptable job in ranking the words within each list, but tile scores varied greatly from one list to another. Our initial strategy was to choose a cutoff, which we set at 21 tbr each list, and we tried several alternatives to weed out weak associations.</Paragraph>
    <Paragraph position="11">  In one method, we filtered the association lists by cross-referencing, removing from the association list for wordi any wordj that failed to reciprocate and to give a high rank to wordi on its association list. Another similar approach was to try to con&gt; bine evidence fl'om different experiments by taking the results fl'om two configurations into consideration. A third strategy was to calculate the mutual information between the target word and the other words on its association list.</Paragraph>
    <Paragraph position="13"> Using the mutual information computation provided an way of using a single measure that was able to compare matchups across lists. We set a threshold of lxl.0 -6 for all matchups. Thus these association lists vary in length, depending on the distributions for the words, allowing them to grow up to 40, while some ended up with only one or two words.</Paragraph>
  </Section>
  <Section position="5" start_page="721" end_page="723" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> The evaluation of a system like ours is problematic.</Paragraph>
    <Paragraph position="1"> The judgments we made to determine correctness were not only highly subjective but time-consunfing.</Paragraph>
    <Paragraph position="2"> We had 12 large lexicons fl'om the different configurations. We had chosen a random sample of 10 percent of the 2,700 words that occurred at least 100 times in tile corpus, and manually constructed an answer key, which ended up with ahnost 30,000 entries.</Paragraph>
    <Paragraph position="3"> From the resulting 270 words, we discarded 15 of those that coincided with common names of people, such as &amp;quot;carter,&amp;quot; which could refer to the former American president, Chris Carter (creator of tile television show &amp;quot;X-Files&amp;quot;), among others. We thought it better to delay making decisions on how to handle such cases, especially since it would require distinguishing one Carter fl'om another. Such words presented several difficulties. Unless the individuals involved were well-known, it was often impossible to distinguish whether the system was making errors or whether the resulting descriptive terms were intbrmative. null Tables 3 and 4 show an example from the answer key tbr the word &amp;quot;faculty.&amp;quot; The overall results from the first stage of the process, before the cross-referencing filter are shown in Table 5, ranging from 73% to 80% correct. The configurations that included prepositional phrases and those that used sequences of subject noun phrases outperformed the configurations that relied on suhjects and objects in a single grammatical unit. These differences were statistically significant, with p &lt; 0.01 in all eases.</Paragraph>
    <Paragraph position="4"> The overall results after cross-referencing, in Table 6, showed improvements of 5 to 10 percentage enrollment hiring adnfinistrator  points, while the effect of the number of matchups was diminished. Here, the subject-sequence configurations showed a distinct advantage. While more noise might be expected when a large segment of text; is considered, these results support the notion that the nnderlying coherence of a discourse can be recovered with the prol)er selection of linguistic features. The improvements in each configuration over the corresponding configuration in the first stage were all statistically significant, with p &lt; 0.01. Likewise, the edge the sequence-of subjects configurations had over tile other configurations, was also statistically significant.</Paragraph>
    <Paragraph position="5"> The results fl'om combining the evidence from different configurations, in Table 7, showed a much higher accnrae&gt; but a sharp drop in the total nnmber of associated words found. The most fl'uitful pairs of experiments were those that combined distinct approaches, for example, tile five-subject configuration with either fifll paragraphs or with sentences with prepositional phrases. It will remain unclear until we conduct a task-based evaluation whether the smaller number of associations will be harnfful.</Paragraph>
    <Paragraph position="6"> The final experiment, computing the mutual information statistic tbr the matchul)s of a key word with co-occurring words was perhaps the most illteresting because it gave us the ability to apply a  the effort of performing the cross-retbrencing calm&gt; lations and providing a deeper assorl:ment in SOllle C~lSeS. lilt lnost of the configurations, lltlltllPl illfOrmat.ion gave 118 lllore \Vol'ds, and greater ln'ecision at; the sanle time, but nlost of all, gave us a reasonable threshold to apply throughout the exlicrinlent. While the accuracies in most of the configurations were close to one another, those that used only sing\]e units tended to be weaker than the multi-clause units. Note that the paragraI)h contiguration was tested with far more data than any of the others.</Paragraph>
    <Paragraph position="7"> Our system maD~s no eth)rt to aeCOllnt for lexical aml)iguil;y. The uses we intend for our lexicon should provide some insulation from the ett'ects of polysemy, since searches will be conducted on a nun&gt; l)er of terms, which should converge to one meaning.</Paragraph>
    <Paragraph position="8"> It is clear that in lists for key words with multiple senses, the donfinant sense where there is one, al)pears much lnore frequently, such as &amp;quot;faculty ,&amp;quot; where the meaning of &amp;quot;teacher&amp;quot; is more t:'re(tuent than the meaning of &amp;quot;al)ility.&amp;quot; Figure \] shows the top 21 words in the sequence-otCsix subjects, beibre the cross-referencing iilter was applied. Twenty of the 21. entries were scored aeceptal)le.</Paragraph>
    <Paragraph position="9"> After the cross-referencing is applied, doctorate,, education and revision were elinfinated.</Paragraph>
    <Paragraph position="10">  The results from the single clause configuration (Figure 2) were almost as strong, with three erroFs, and a fair amount of overlap between the two.</Paragraph>
    <Paragraph position="11"> The word &amp;quot;admiral&amp;quot; was more difficult %r the ex\])erilllellt ilSilig the l)ice coefficient. The. list shows some of l.he confusion arising from our strate.~y Oll prot)er nouns. Admiral would be expected to occur with many proper ll~tnles, illcluding some that axe st)elled liD; common 11o1111.q, bill the list h)r the single clause q pp configuratkm presented a lmzzling list (Figure 3).</Paragraph>
    <Paragraph position="12"> The sparseness of the data is also al)lmrent, but it was the dog reDxenees that al)peared quite strange at a ghulce: Inspection of the. articles showed that they callle froln all a.rticle on the pets of famous people. Note that the dogs did not al)l)ear in top ranks of the sequence of subjects configuration in the Dice experiment (Figure 4), nor were they in the results t'rom the experiments with cross-referencing, combining evidence and mutual information.</Paragraph>
    <Paragraph position="13"> After cross-reR;rencing, the much-shorter list for the Sub j-6 configuration had &amp;quot;aviator&amp;quot;, &amp;quot;break-up&amp;quot;, ';commander&amp;quot;, &amp;quot;decoration&amp;quot;, &amp;quot;equal-ot)portunity&amp;quot;, &amp;quot;tleet&amp;quot;, &amp;quot;merino&amp;quot;, &amp;quot;navf', &amp;quot;pearl&amp;quot;, &amp;quot;promotion&amp;quot;, &amp;quot;rear&amp;quot; ~ alia &amp;quot;short&amp;quot;.</Paragraph>
    <Paragraph position="14"> 'l'he combined-evidence list contained only eight words: &amp;quot;navy&amp;quot;, &amp;quot;short&amp;quot;, &amp;quot;aviator&amp;quot;, &amp;quot;merino&amp;quot;, &amp;quot;dishonor&amp;quot;, &amp;quot;decoration&amp;quot;, &amp;quot;sul)&amp;quot; and &amp;quot;break-ul)&amp;quot;. Using the mutual intbrlnation scoring, the list in the Subj-6 configuration tbr admiral had only</Paragraph>
    <Paragraph position="16"> ulty&amp;quot; from the Subj-6-Clause configuration before cross-referencing. The nmnbers in parentheses are the number of matchups and the real umnbers following are the scores. Errors are in</Paragraph>
    <Paragraph position="18"> ulty&amp;quot; under the single clause confignration. Errors are in bold.</Paragraph>
    <Paragraph position="19"> nine words: &amp;quot;navy&amp;quot;, &amp;quot;general&amp;quot;, &amp;quot;commander&amp;quot;, &amp;quot;vice&amp;quot;, &amp;quot;promotion&amp;quot;, &amp;quot;officer&amp;quot;, &amp;quot;fleet&amp;quot;, &amp;quot;military&amp;quot; and &amp;quot;smith.&amp;quot; Finally, the even-sparser mutual information list for the paragraph configuration lists only &amp;quot;navy&amp;quot; and &amp;quot;suicide.&amp;quot;</Paragraph>
  </Section>
  <Section position="6" start_page="723" end_page="723" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> Our results are encouraging. We were able to decipher a broad type of word association, and showed that our method of searching sequences of subjects outperformed the snore traditional approaches in finding collocations. We believe we can use tiffs technique to build a large-scale lexicon to help in difficult information retrieval and information extraction tasks like question answering.</Paragraph>
    <Paragraph position="1"> The most interesting aspect of&amp;quot; this work lies in the system's ability to look across several clauses and strengthen tile connections between associated words. We are able to deal with input that contains numerous errors from the tagging and shallow parsing processes. Local context has been studied extensively in recent years with sophisticated statistical tools and the availability of enormous amounts of text in digital form. Perhaps we can expand this perspective to look at a window of perhaps several sentences by extracting the correct linguistic units in order to explore a large range of language processing problems.</Paragraph>
    <Paragraph position="3"/>
    <Paragraph position="5"/>
  </Section>
class="xml-element"></Paper>