<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3063">
  <Title>Automatic Processing of Large Corpora for the Resolution of Anaphor References</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Statistical Approach
</SectionTitle>
    <Paragraph position="0"> According to the statistical model, cooccurrence patterns that were observed in the corpus are used as selection patterns. Whenever several alternatives are presented by an ambiguous construct, we prefer the one corresponding to more frequent patterns.</Paragraph>
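The preference described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tuple layout and the counts are hypothetical stand-ins for patterns observed in a parsed corpus.

```python
from collections import Counter

# Hypothetical cooccurrence counts gathered from a parsed corpus;
# keys are (relation, head, argument), values are corpus frequencies.
cooccurrence = Counter({
    ("verb-object", "drink", "coffee"): 120,
    ("verb-object", "drink", "committee"): 0,
})

def prefer(alternatives):
    """Among alternative readings of an ambiguous construct, pick the
    one whose pattern was observed most often in the corpus."""
    return max(alternatives, key=lambda p: cooccurrence[p])

best = prefer([("verb-object", "drink", "coffee"),
               ("verb-object", "drink", "committee")])
print(best)
```

A `Counter` returns 0 for unseen patterns, so readings never observed in the corpus lose to any attested alternative.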
    <Paragraph position="1"> When using selectional constraints for anaphora resolution, the referent must satisfy the constraints which are imposed on the anaphor. If the anaphor participates in a certain syntactic relation, like being an object of some verb, then the substitution of the anaphor with the referent must satisfy the selectional constraints. In the statistical model, we substitute each of the candidates with the anaphor and approve only those candidates which produce frequent cooccurrence patterns. Consider, for example, the following sentence, taken from the Hansard corpus of the proceedings of the Canadian parliament \[Brown et al. 1988\]: (1) They know full well that the companies held tax money aside for collection later on the basis that the government said it was going to collect it.</Paragraph>
    <Paragraph position="2"> There are two occurrences of &amp;quot;it&amp;quot; in this sentence. The first serves as the subject of &amp;quot;collect&amp;quot; and the second as its object. We gathered the statistics for three candidates which occur in the sentence: &amp;quot;collection&amp;quot;, &amp;quot;money&amp;quot; and &amp;quot;government&amp;quot;. According to the syntactic structure of the sentence, each of them may serve as the referent for each of the occurrences of the pronoun. The following table lists the patterns that were produced by substituting each candidate with the anaphor, and the number of times each of these patterns occurred in the corpus:

subject-verb  collection  collect       0
subject-verb  money       collect       5
subject-verb  government  collect     198
verb-object   collect     collection    0
verb-object   collect     money       149
verb-object   collect     government    0

According to these statistics &amp;quot;government&amp;quot; is preferred as the referent of the first &amp;quot;it&amp;quot;, and &amp;quot;money&amp;quot; of the second.</Paragraph>
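The substitution step can be sketched as follows. The counts are taken from the table for sentence (1) (reading the OCR-garbled subject-verb count for &amp;quot;money&amp;quot; as 5); the `resolve` function and the tuple layout are our own illustration, not the paper's code.

```python
from collections import Counter

# Counts for sentence (1); patterns are (relation, word1, word2),
# ordered as they appear in the relation name.
counts = Counter({
    ("subject-verb", "collection", "collect"): 0,
    ("subject-verb", "money", "collect"): 5,
    ("subject-verb", "government", "collect"): 198,
    ("verb-object", "collect", "collection"): 0,
    ("verb-object", "collect", "money"): 149,
    ("verb-object", "collect", "government"): 0,
})

candidates = ["collection", "money", "government"]

def resolve(relation, verb, candidates):
    """Substitute each candidate for the anaphor and prefer the
    candidate whose resulting pattern is most frequent."""
    if relation == "subject-verb":
        return max(candidates, key=lambda c: counts[(relation, c, verb)])
    return max(candidates, key=lambda c: counts[(relation, verb, c)])

print(resolve("subject-verb", "collect", candidates))  # government
print(resolve("verb-object", "collect", candidates))   # money
```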
    <Paragraph position="3"> This example demonstrates the case of definite semantic constraints which eliminate all but the correct alternative. In other cases, several alternatives may satisfy the selectional constraints, and may be observed in the corpus a significant number of times. In such cases the final selection between the approved candidates should be performed by other means, such as syntactic heuristics or asking the user. However, at this stage it is not clear to us how useful the statistical preference can be, and we use the statistics only relative to a certain threshold, approving any patterns that pass this threshold.</Paragraph>
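The threshold-based approval can be sketched like this; the threshold of 5 is the one reported in the experiment of Section 4, while the function and example counts are illustrative.

```python
THRESHOLD = 5  # occurrence threshold used in the paper's experiment

def approved(candidates, count_of):
    """Approve every candidate whose substituted pattern reaches the
    threshold; if several survive, the final choice is left to other
    means (syntactic heuristics, asking the user)."""
    return [c for c in candidates if count_of(c) >= THRESHOLD]

counts = {"collection": 0, "money": 149, "government": 0}
survivors = approved(["collection", "money", "government"], counts.get)
print(survivors)  # ['money']
```

A single survivor, as here, resolves the ambiguity outright; an empty list means the statistics were not meaningful for this example.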
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Implementing the Acquisition Phase
</SectionTitle>
    <Paragraph position="0"> The use of the statistical model involves two separate phases. The first is the acquisition phase, in which the corpus is processed and the statistical database is built. The second is the disambiguation phase, in which the statistical database is used to resolve ambiguities.</Paragraph>
    <Paragraph position="1"> The statistical database contains cooccurrence patterns for various syntactic relations. In the experiment reported here we have used constraints for the &amp;quot;subject-verb&amp;quot;, &amp;quot;verb-object&amp;quot; and &amp;quot;adjective-noun&amp;quot; relations. To locate these relations in the sentences of the corpus, each sentence is parsed by the PEG parser \[Jensen 1986\]. Then, a post-processing algorithm identifies the various relations in the parse tree. As was noted in \[Grishman et al. 1986\], the cooccurrence patterns reflect regularized or canonical structure. Therefore the post-processing algorithm has to map surface structures into the normalized relations. During our experiments we have used two different implementations for this algorithm \[Lappin et al. 1988\] \[Jensen 1989\], which take into account structures like passives, subclauses, questions and relative and infinitive clauses. The use of an automatic procedure for extracting information from a corpus that was not preprocessed manually raises a basic problem of circularity. Since the corpus was not disambiguated, it is not possible to distinguish the semantically correct patterns from the incorrect ones. Both types of ambiguity, syntactic and lexical, may cause the system to acquire or use inappropriate patterns. This problem is considered very important when dealing with a corpus: it was the reason for the substantial human intervention in the procedure of \[Grishman et al. 1986\], and it is the reason why other techniques use manually tagged corpora (e.g. \[Church 1988\]).</Paragraph>
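The acquisition phase described above (parse, normalize, count) can be sketched as follows. The `extract_relations` stand-in is purely hypothetical: the real system parses with PEG and normalizes the parse tree, which we fake here with a lookup table for illustration.

```python
from collections import Counter

def extract_relations(sentence):
    """Stand-in for the PEG parser plus the post-processing step that
    maps surface structure (passives, relatives, ...) to normalized
    (relation, word1, word2) triples. Toy lookup for illustration."""
    toy = {
        "The government collected the money.":
            [("subject-verb", "government", "collect"),
             ("verb-object", "collect", "money")],
    }
    return toy.get(sentence, [])

def build_database(corpus):
    """Acquisition phase: count every normalized cooccurrence pattern
    over the whole corpus."""
    db = Counter()
    for sentence in corpus:
        db.update(extract_relations(sentence))
    return db

db = build_database(["The government collected the money."] * 3)
print(db[("verb-object", "collect", "money")])  # 3
```

Counting over raw, undisambiguated parses is exactly where the circularity problem arises; the next paragraph explains why, in practice, valid patterns still dominate the counts.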
    <Paragraph position="2"> In practice, however, we have discovered that the problem is not so crucial: semantically valid patterns have occurred many more times in syntactically unambiguous constructs than in ambiguous ones. Thus, they could be identified without the need of first disambiguating the sentences. Semantically non-valid patterns indeed occurred in the inappropriate parses but they were too rare to pass the threshold. As for lexical ambiguities, the chance that one sense of a word will be confused with another during disambiguation seems to be very small, and it never happened in our experiment.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Experiment
</SectionTitle>
    <Paragraph position="0"> An experiment was performed to resolve references of the anaphor &amp;quot;it&amp;quot; in the Hansard corpus. The examples of the ambiguous sentences were selected in the following way: First, sentences containing the word &amp;quot;it&amp;quot; were extracted randomly from the corpus. Then, we manually filtered out sentences that were not relevant for the use of selectional constraints in resolving anaphoric references. Such cases were non-anaphoric occurrences of &amp;quot;it&amp;quot;, cases where the referent was not a noun phrase and cases where the anaphor was not involved in one of the three relations that we used. In addition, we have excluded cases where there was only one possible referent, so that our results will reflect correctly the performance of the disambiguation method. The filtering process eliminated about two thirds of the original sentences, and we proceeded with 59 examples. The alternative candidates for the referent (which satisfy definite syntactic constraints such as number, gender and requirements for reflexives) were identified manually in each example.1 The statistics were collected from part of the corpus, of about 28 million words. For 21 out of the 59 examples the statistics were not meaningful (we used a threshold of 5 occurrences for each of the alternative patterns). In these cases the algorithm cannot approve any of the candidates, getting a &amp;quot;coverage&amp;quot; of 38/59 (64%).</Paragraph>
    <Paragraph position="1"> As explained in Section 2, the output of the statistical model is the set of candidates that satisfy the selectional constraints. This is done by approving all patterns which appeared a significant number of times.

1 ...utive sentences. Therefore, we identified only candidates within the same sentence as the anaphor. To provide enough candidates, we examined occurrences of &amp;quot;it&amp;quot; after the 15th word of the sentence. The examples provided between 2 and 5 candidates, with an average of 2.8 candidates per anaphor.</Paragraph>
    <Paragraph position="2"> Therefore, the output is considered correct if the appropriate candidate is approved. This happened in 33 cases, yielding an &amp;quot;accuracy&amp;quot; of 33/38 (87%). In 18 of these cases, the appropriate candidate was the only one which was approved, yielding a complete resolution of the ambiguity.</Paragraph>
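The coverage and accuracy figures quoted above follow directly from the reported counts:

```python
total = 59    # examples remaining after manual filtering
covered = 38  # examples where some pattern passed the threshold of 5
correct = 33  # covered examples whose appropriate candidate was approved
unique = 18   # covered examples with exactly one approved candidate

coverage = covered / total
accuracy = correct / covered
print(f"coverage {coverage:.0%}, accuracy {accuracy:.0%}")
# coverage 64%, accuracy 87%
```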
    <Paragraph position="3"> This last result demonstrates the advantage of the statistical data over semantic constraints. While semantic constraints should approve any combination of arguments in a syntactic relation that may occur in the text, the statistics approve only those combinations that actually occur and reject others.</Paragraph>
    <Paragraph position="4"> Manual observation of the 18 sentences in which the statistics completely resolved the ambiguity showed that in only 7 cases could the ambiguity have been eliminated by traditional selectional constraints. This is consistent with the evaluation in \[Hobbs 1978\], where in only 12 out of 132 sentences was the ambiguity eliminated by selectional constraints.</Paragraph>
    <Paragraph position="5"> An additional note should be made concerning the technical methodology of the experiment. Within the limited resources of our research, it was not feasible to build the statistical database for the entire Hansard corpus, which contains about 60 million words. The expensive resources are the parsing time and the storage for the cooccurrence patterns and their statistics.2 However, it turns out that parsing the entire corpus is not necessary to evaluate the success of the statistical model! As the evaluation relates to a limited number of examples, it is sufficient to collect the statistics only for patterns that are relevant for the disambiguation of these examples. Therefore, we have extracted from the corpus only those sentences that contained at least one cooccurrence of words from a relevant pattern. This procedure allowed us to parse only 10,000 sentences.</Paragraph>
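The pre-filtering step can be sketched as a simple pass over the corpus; the function and the toy data are our own illustration (the paper does not specify matching details such as tokenization or morphology).

```python
def relevant_sentences(corpus, patterns):
    """Keep only sentences containing at least one cooccurrence of
    words from some relevant pattern; only these need to be parsed."""
    keep = []
    for sentence in corpus:
        words = set(sentence.lower().split())
        if any({w1, w2} <= words for (_rel, w1, w2) in patterns):
            keep.append(sentence)
    return keep

corpus = [
    "the government said it was going to collect it",
    "order please",
]
patterns = [("subject-verb", "government", "collect")]
print(len(relevant_sentences(corpus, patterns)))  # 1
```

Only sentences that could possibly contribute a relevant pattern reach the parser, which is how the experiment got by with parsing only 10,000 sentences instead of the full corpus.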
  </Section>
</Paper>