<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1046"> <Title>Using Parsed Corpora for Structural Disambiguation in the TRAINS Domain</Title> <Section position="3" start_page="0" end_page="345" type="intro"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"> KANKEI is a first attempt at a TRAINS disambiguation module. Like the systems in (Hindle and Rooth, 1993) and (Collins and Brooks, 1995), KANKEI records attachment statistics on information extracted from a corpus. This information consists of phrase head patterns around the possible locations of PP/adverb attachments. Figure 1 shows how the format of these patterns allows for combinations including a verb, an NP-head (the rightmost NP before the postmodifier), and either the preposition and head noun in the PP, or one or more adverbs (footnote 2). These patterns are similar to the ones used by the disambiguation systems in (Collins and Brooks, 1995) and (Brill and Resnik, 1994), except that Brill and Resnik form rules from these patterns, while KANKEI and the system of Collins and Brooks use the attachment statistics of multiple patterns. While KANKEI combines the statistics of multiple patterns to make a disambiguation decision, Collins and Brooks' model is a backed-off model that uses 4-gram statistics where possible, 3-gram statistics where possible if no 4-gram statistics are available, and bigram statistics. Most items in this specification are optional. The only requirement is that a pattern have at least two items: a preposition or adverb, and a verb or NP-head. The singular forms of nouns and the base forms of verbs are used. These patterns (with hyphens separating the items) form keys to two hash tables; one records attachments to NPs while the other records attachments to VPs. The count stored under each key records how often that pattern was seen in a VP or NP attachment, whether or not the attachment was ambiguous. 
Sentence 1 instantiates the longest possible pattern, a 4-gram that here consists of need, orange, in, and Elmira.</Paragraph> <Paragraph position="1"> 1) I need the oranges in Elmira.</Paragraph> <Paragraph position="2"> The TRAINS corpora are much too small for KANKEI to rely only on the full pattern of phrase heads around an ambiguous attachment. While searching for attachment statistics for sentence 1, KANKEI will check its hash tables for the key need-orange-in-Elmira. If KANKEI relied entirely on full patterns, it would have to guess the attachment at random whenever the pattern had not been seen. We refer to this technique as full matching. Normally KANKEI will do partial matching, i.e., if it cannot find a pattern such as need-orange-in-Elmira, it will look for smaller partial patterns, which here would be: need-in, orange-in, orange-in-Elmira, need-in-Elmira, and need-orange-in. The frequencies with which NP and VP attachment occur for these patterns are totaled to see if one attachment is preferred. (Footnote 2: Examples of trailing adverb pairs are first off and right now.)</Paragraph> <Paragraph position="3"> Currently, we count partial patterns equally, but in future refinements we would like to choose weights more judiciously. For instance, we would expect shorter patterns such as need-in to carry less weight than longer ones. The need to choose weights is a drawback of the approach. However, the intuition is that one source of evidence is insufficient for proper disambiguation. Future work needs to test this hypothesis further.</Paragraph> <Paragraph position="4"> The statistics used by KANKEI for partial or full matching can be obtained in various ways. One is to use the same kinds of full and partial pattern matching in training as are used in disambiguation. This is called comprehensive training. Another method, called raw training, is to record only full patterns for ambiguous and unambiguous attachments in the corpus. 
(Note that full patterns can be as small as bigrams, such as when an adverb follows an NP acting as a subject.) Although raw training only collects a subset of the data collected by comprehensive training, it still gives KANKEI some flexibility when disambiguating phrases. If the full pattern of an ambiguity has not been seen, KANKEI can test whether a partial pattern of this ambiguous attachment occurred as an unambiguous attachment in the training corpus.</Paragraph> <Paragraph position="5"> Like the disambiguation system of (Brill and Resnik, 1994), KANKEI can also use word classes for some of the words appearing in its patterns. The rudimentary set of noun word classes used in this project is composed of city and commodity classes and a train class including cars and engines.</Paragraph> </Section> </Paper>
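
The partial matching procedure described in the Methodology section can be illustrated with a minimal sketch. This is not KANKEI's actual implementation: the function names `pattern_keys` and `disambiguate`, the use of Python `Counter` objects as the two hash tables, the counts in the usage example, and the choice to return `None` when neither attachment is preferred are all assumptions made for illustration. The sketch covers only the preposition case (verb, NP-head, preposition, PP head noun) and weights all partial patterns equally, as the text states is currently done.

```python
from collections import Counter
from itertools import product


def pattern_keys(verb, np_head, prep, pp_noun):
    """Return hyphen-joined keys, full pattern first, then the partial patterns.

    Every key keeps the preposition and at least one of verb / NP-head,
    which is the minimum the text requires of a pattern.
    """
    keys = []
    for use_verb, use_np, use_noun in product((True, False), repeat=3):
        if not (use_verb or use_np):
            continue  # a pattern needs a verb or an NP-head
        items = []
        if use_verb:
            items.append(verb)
        if use_np:
            items.append(np_head)
        items.append(prep)
        if use_noun:
            items.append(pp_noun)
        keys.append("-".join(items))
    return keys


def disambiguate(verb, np_head, prep, pp_noun, np_counts, vp_counts):
    """Prefer the full pattern; otherwise total the equally weighted partials."""
    full, *partials = pattern_keys(verb, np_head, prep, pp_noun)
    # Fall back to partial patterns only if the full pattern was never seen.
    keys = [full] if (np_counts[full] or vp_counts[full]) else partials
    np_total = sum(np_counts[k] for k in keys)
    vp_total = sum(vp_counts[k] for k in keys)
    if np_total > vp_total:
        return "NP"
    if vp_total > np_total:
        return "VP"
    return None  # assumption: no preference is reported rather than guessed


# Hypothetical counts, invented purely for this example.
np_table = Counter({"orange-in": 3, "orange-in-Elmira": 1})
vp_table = Counter({"need-in": 2})
print(pattern_keys("need", "orange", "in", "Elmira"))
print(disambiguate("need", "orange", "in", "Elmira", np_table, vp_table))
```

For sentence 1, `pattern_keys` yields the full key need-orange-in-Elmira followed by the five partial patterns listed in the text. Replacing the equal weighting with per-length weights, as suggested for future refinements, would only change the two `sum` expressions.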