<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2004"> <Title>The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation</Title> <Section position="4" start_page="0" end_page="25" type="metho"> <SectionTitle> 2 Data Sets </SectionTitle> <Paragraph position="0"> The unlabeled corpus is the Reuters RCV1 corpus, about 80,000,000 words of newswire text (Lewis et al., 2004). Three different subsets, corresponding to roughly 10%, 50% and 100% of the corpus, were created for experiments related to the size of the unannotated corpus. (Two weeks after Aug 5, 1997, were set apart for future experiments.) The labeled corpus is the Penn Wall Street Journal treebank (Marcus et al., 1993). We created the 5 subsets shown in Table 1 for experiments related to the size of the annotated corpus.</Paragraph> <Paragraph position="1"> unlabeled Reuters (R) corpus for attachment statistics, labeled Penn treebank (WSJ) for training the Collins parser.</Paragraph> <Paragraph position="2"> The test set, sections 13-24, is larger than in most studies because a single section does not contain a sufficient number of RC attachment ambiguities for a meaningful evaluation.</Paragraph> <Paragraph position="3"> ties in the Penn Treebank. Number of instances with high attachment (highA), low attachment (lowA), verb attachment (verbA), and noun attachment (nounA) according to the gold standard.</Paragraph> <Paragraph position="4"> All instances of RC and PP attachments were extracted from development and test sets, yielding about 250 RC ambiguities and 12,000 PP ambiguities per set (Table 2). An RC attachment ambiguity was defined as a sentence containing the pattern NP1 Prep NP2 which. For example, the relative clause in Example 1 can either attach to mechanism or to System.</Paragraph> <Paragraph position="5"> (1) ... the exchange-rate mechanism of the European Monetary System, which links the major EC currencies.</Paragraph> <Paragraph position="6"> A PP attachment ambiguity was defined as a subtree matching either [VP [NP PP]] or [VP NP PP]. An example of a PP attachment ambiguity is Example 2 where either the approval or the transaction is performed by written consent. null (2) ...a majority ...have approved the transaction by written consent ...</Paragraph> <Paragraph position="7"> Both data sets are available for download (Web Appendix, 2006). We did not use the PP data set described by (Ratnaparkhi et al., 1994) because we are using more context than the limited context available in that set (see below).</Paragraph> </Section> <Section position="5" start_page="25" end_page="27" type="metho"> <SectionTitle> 3 Methods </SectionTitle> <Paragraph position="0"> Collins parser. Our baseline method for ambiguity resolution is the Collins parser as implemented by Bikel (Collins, 1997; Bikel, 2004). For each ambiguity, we check whether the attachment ambiguity is resolved correctly by the 5 parsers corresponding to the different training sets. If the attachment ambiguity is not recognized (e.g., because parsing failed), then the corresponding ambiguity is excluded for that instance of the parser. As a result, the size of the effective test set varies from parser to parser (see Table 4).</Paragraph> <Paragraph position="1"> Minipar. The unannotated corpus is analyzed using minipar (Lin, 1998), a partial dependency parser. 
<Paragraph position="1"> Minipar. The unannotated corpus is analyzed using minipar (Lin, 1998), a partial dependency parser. The corpus is parsed and all extracted dependencies are stored for later use.</Paragraph> <Paragraph position="2"> Dependencies in ambiguous PP attachments (those corresponding to [VP NP PP] and [VP [NP PP]] subtrees) are not indexed. An experiment with indexing both alternatives for ambiguous structures yielded poor results. For example, indexing both alternatives would create a large number of spurious verb attachments of the preposition of, which in turn would result in incorrect high attachments by our disambiguation algorithm.</Paragraph> <Paragraph position="3"> For relative clauses, no such filtering is necessary: spurious subject-verb dependencies due to RC ambiguities are rare compared to the large number of subject-verb dependencies that can be extracted reliably.</Paragraph> <Paragraph position="4"> Inverted index. Dependencies extracted by minipar are stored in an inverted index (Witten et al., 1999), implemented in Lucene (Lucene, 2006). For example, &quot;john subj buy&quot;, the analysis returned by minipar for John buys, is stored as &quot;john buy john<subj subj<buy john<subj<buy&quot;. All words, dependencies and partial dependencies of a sentence are stored together as one document.</Paragraph> <Paragraph position="5"> This storage mechanism enables fast on-line queries for lexical and dependency statistics, e.g., how many sentences contain the dependency &quot;john subj buy&quot;, how often john occurs as a subject, how often buy has john as a subject and car as an object, etc.</Paragraph> <Paragraph position="6"> Query results are approximate because double occurrences are only counted once and because structures giving rise to the same set of dependencies (a piece of a tile of a roof of a house vs. a piece of a roof of a tile of a house) cannot be distinguished. We believe that an inverted index is the most efficient data structure for our purposes; for example, we need not compute the expensive joins that a database implementation would require. Our long-term goal is to use this inverted index of dependencies as a versatile component of NLP systems, in analogy to the increasingly important role of search engines for association and word count statistics in NLP.</Paragraph> <Paragraph position="7"> A total of three inverted indexes were created, one each for the 10%, 50% and 100% Reuters subset.</Paragraph>
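The following Python sketch illustrates the storage scheme with a toy in-memory index; it is a stand-in for the Lucene index (in which each parsed sentence is stored as one document), and the helper names are our own.

    from collections import defaultdict

    index = defaultdict(set)  # token -> set of sentence ids (postings list)

    def add_sentence(sent_id, dependencies):
        """dependencies: minipar triples such as ("john", "subj", "buy").
        Each triple is expanded into word, partial-dependency and full-dependency
        tokens, e.g. "john subj buy" -> john, buy, john<subj, subj<buy, john<subj<buy,
        and all tokens of the sentence are indexed under the same document id."""
        for w1, rel, w2 in dependencies:
            tokens = [w1, w2, w1 + "<" + rel, rel + "<" + w2, w1 + "<" + rel + "<" + w2]
            for tok in tokens:
                index[tok].add(sent_id)

    def count(*tokens):
        """Number of sentences containing all given tokens. As in the real index,
        multiple occurrences within one sentence are counted only once."""
        postings = [index[t] for t in tokens]
        return len(set.intersection(*postings)) if postings else 0

    # Example queries: count("john<subj<buy") -> sentences with john as subject of buy;
    # count("john<subj") -> sentences in which john occurs as a subject.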
<Paragraph position="8"> Lattice-Based Disambiguation. Our disambiguation method is Lattice-Based Disambiguation (LBD; Atterer and Schütze, 2006). We formalize a possible attachment as a triple <R, i, X>, where X is (the parse of) a phrase with two or more possible attachment nodes in a sentence S, i is one of these attachment nodes, and R is (the relevant part of a parse of) S with X removed. For example, the two attachments in Example 2 are represented as the triples <approved_i1 the transaction_i2, i1, by consent> and <approved_i1 the transaction_i2, i2, by consent>. We decide between attachment possibilities based on pointwise mutual information, the well-known measure of how surprising it is to see R and X together given their individual frequencies:</Paragraph> <Paragraph position="9"> MI(<R,i,X>) = log_2 [ P(<R,i,X>) / ( P(R) P(X) ) ]</Paragraph> <Paragraph position="10"> where the probabilities of the dependency structures <R,i,X>, R and X are estimated on the unlabeled corpus by querying the inverted index. [Figure 1 caption: M: premodifying adjective or noun (upper or lower NP), N: head noun (upper or lower NP), p: preposition.]</Paragraph> <Paragraph position="11"> Unfortunately, these structures will often not occur in the corpus. If this is the case, we back off to generalizations of R and X. The generalizations form a lattice, as shown in Figure 1 for PP attachment. For example, MN:pMN corresponds to commercial transaction by unanimous consent, N:pM to transaction by unanimous, etc. For 0:p we compute the MI of the two events &quot;noun attachment&quot; and &quot;occurrence of p&quot;. Points in the lattice in Figure 1 are created by successive elimination of material from the complete context R:X.</Paragraph> <Paragraph position="12"> A child c directly dominated by a parent p is created by removing exactly one contextual element from p, either on the right side (the attachment phrase) or on the left side (the attachment node). For RC attachment, generalizations other than elimination are introduced, such as the replacement of a proper noun (e.g., Canada) by its category (country) (see below).</Paragraph> <Paragraph position="13"> The MI of each point in the lattice is computed. We then take the maximum over all MI values of the lattice as a measure of the affinity of attachment phrase and attachment node. The intuition is that we are looking for the strongest evidence available for the attachment. The strongest evidence is often not provided by the most specific context (MN:pMN in the example) since contextual elements like modifiers will only add noise to the attachment decision in some cases. The actual syntactic disambiguation is performed by computing the affinity (the maximum over the MI values in the lattice) for each possible attachment and selecting the attachment with the highest affinity. (The default attachment is selected if the two values are equal.) The second lattice for PP attachment, the lattice for attachment to the verb, has a structure identical to Figure 1, but the attachment node is SV instead of MN, where S denotes the subject and V the verb. So the supremum of that lattice is SV:pMN and the infimum is 0:p (which in this case corresponds to the MI of verb attachment and occurrence of the preposition).</Paragraph> <Paragraph position="14"> LBD is motivated by the desire to use as much context as possible for disambiguation.</Paragraph> <Paragraph position="15"> Previous work on attachment disambiguation has generally used less context than in this paper (e.g., modifiers have not been used for PP attachment). No change to LBD is necessary if the lattice of contexts is extended by adding additional contextual elements (e.g., the preposition between the two attachment nodes in RC, which we do not consider in this paper).</Paragraph> </Section> <Section position="6" start_page="27" end_page="28" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The Reuters corpus was parsed with minipar and all dependencies were extracted. Three inverted indexes were created, corresponding to 10%, 50% and 100% of the corpus.1 Five parameter sets for the Collins parser were created by training it on the WSJ training sets in Table 1. Sentences with attachment ambiguities in the WSJ corpus were parsed with minipar to generate Lucene queries. (We chose this procedure to ensure compatibility of query and index formats.) The Lucene queries were run on the three indexes. LBD disambiguation was then applied based on the statistics returned by the queries. LBD results are applied to the output of the Collins parser by simply replacing all attachment decisions with LBD decisions.</Paragraph>
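The core of LBD can be summarized in a few lines of Python. The sketch below assumes counts are obtained from the inverted index described above; the maximum-likelihood probability estimates, the zero-count handling and the function names are illustrative assumptions, not the exact implementation.

    import math

    def mutual_information(count_rix, count_r, count_x, n_sentences):
        """Pointwise MI = log2( P(<R,i,X>) / (P(R) * P(X)) ), with probabilities
        estimated as relative frequencies over the unlabeled corpus."""
        if count_rix == 0 or count_r == 0 or count_x == 0:
            return 0.0
        p_rix = count_rix / n_sentences
        p_r = count_r / n_sentences
        p_x = count_x / n_sentences
        return math.log2(p_rix / (p_r * p_x))

    def affinity(lattice_points, n_sentences):
        """lattice_points: (count_rix, count_r, count_x) for every generalization of
        the full context R:X. The affinity of attachment node and attachment phrase
        is the strongest evidence found, i.e. the maximum MI over the lattice."""
        return max(mutual_information(c_rix, c_r, c_x, n_sentences)
                   for c_rix, c_r, c_x in lattice_points)

    def choose_attachment(affinity_1, affinity_2, default=2):
        """Select the attachment possibility with the higher affinity;
        the default attachment is chosen when the two values are equal."""
        if affinity_1 > affinity_2:
            return 1
        if affinity_2 > affinity_1:
            return 2
        return default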
<Section position="1" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 4.1 RC attachment </SectionTitle> <Paragraph position="0"> The lattice for LBD in RC attachment is shown in Figure 2. When disambiguating an RC attachment, two instances of the lattice are formed, one for NP1 and one for NP2 in NP1 Prep NP2 RC. (Footnote 1: In fact, two different sets of inverted indexes were created, one each for PP and RC disambiguation. The RC index indexes all dependencies, including ambiguous PP dependencies. Computing the RC statistics on the PP index should not affect the RC results presented here, but we did not have time to confirm this experimentally for this paper.)</Paragraph> <Paragraph position="1"> Figure 2 shows the maximum possible lattice. If contextual elements are not present in a context (e.g., a modifier), then the lattice will be smaller. The supremum of the lattice corresponds to a query that includes the entire NP (including modifying adjectives and nouns)2, the verb and its object. [Figure 2 caption: ... modifying adjective or noun, Nf: head noun with lexical modifiers, N: head noun only, n: head noun in lower case, C: class of NP, V: verb in relative clause, O: object of verb in the relative clause.]</Paragraph> <Paragraph position="2"> To generalize contexts in the lattice, the following generalization operations are employed: elimination of contextual elements, downcasing of the head noun, and replacement of a proper-noun NP by its semantic class (see below). The most important dependency for disambiguation is the noun-verb link, but the other dependencies also improve the accuracy of disambiguation (Atterer and Schütze, 2006). For example, light verbs like make and have only provide disambiguation information when their objects are also considered.</Paragraph> <Paragraph position="3"> Downcasing and hypernym generalizations were used because proper nouns often cause sparse data problems. Named entity classes were identified with LingPipe (LingPipe, 2006). Named entities identified as companies or organizations are replaced with company in the query. Locations are replaced with country. Persons block RC attachment because which-clauses do not attach to person names, resulting in an attachment of the RC to the other NP.</Paragraph> <Paragraph position="4"> Table 3 shows queries and mutual information values for Example 1. The highest values are 12.2 for high attachment (mechanism) and 3 for low attachment (System). The algorithm therefore selects high attachment.</Paragraph> <Paragraph position="5"> The value 3 for low attachment is the default value for the empty context. This value reflects the bias for low attachment: the majority of relative clauses are attached low. If all MI values are zero or otherwise low, this procedure will automatically result in low attachment.3 (Footnote 3: We experimented with a number of values (2, 3, and 4) on the development set. Accuracy of the algorithm was best for a value of 3. The results presented here differ slightly from those in Atterer and Schütze (2006) due to a coding error.)</Paragraph> <Paragraph position="6"> Decision list. For increased accuracy, LBD is embedded in the following decision list (a schematic rendering in code is given after the list).</Paragraph> <Paragraph position="7"> 1. If minipar has already chosen high attachment, choose high attachment (this only occurs if NP1 Prep NP2 is a named entity).</Paragraph> <Paragraph position="8"> 2. If there is agreement between the verb and only one of the NPs, attach to this NP.</Paragraph> <Paragraph position="9"> 3. If one of the NPs is in a list of person entities, attach to the other NP.4 4. If possible, use LBD.</Paragraph> <Paragraph position="10"> 5. If none of the above strategies was successful (e.g., in the case of parsing errors), attach low.</Paragraph>
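A compact Python rendering of this decision list; the boolean parameters and the "high"/"low" labels are our own shorthand for the checks described above, not names from the actual system.

    def rc_attachment(minipar_chose_high, np1_agrees, np2_agrees,
                      np1_is_person, np2_is_person, lbd_decision):
        """Decide between high attachment ("high" = NP1) and low attachment
        ("low" = NP2) for an ambiguity NP1 Prep NP2 which-clause. lbd_decision is
        "high", "low", or None when LBD could not be applied."""
        # 1. Trust minipar if it already attached high (only for named entities).
        if minipar_chose_high:
            return "high"
        # 2. Agreement between the relative-clause verb and exactly one of the NPs.
        if np1_agrees != np2_agrees:
            return "high" if np1_agrees else "low"
        # 3. which-clauses do not attach to person names: attach to the other NP.
        if np1_is_person != np2_is_person:
            return "low" if np1_is_person else "high"
        # 4. Otherwise use lattice-based disambiguation if it produced a decision.
        if lbd_decision is not None:
            return lbd_decision
        # 5. Fall back to the majority attachment (low).
        return "low"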
</Section> <Section position="2" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 4.2 PP attachment </SectionTitle> <Paragraph position="0"> The two lattices for LBD applied to PP attachment were described in Section 3 and Figure 1. The only generalization operation used in these two lattices is elimination of contextual elements (in particular, there is no downcasing and no named entity recognition). Note that in RC attachment, we compare affinities of two instances of the same lattice (the one shown in Figure 2). In PP attachment, we compare affinities of two different lattices, since the two attachment points (verb vs. noun) are different. The basic version of LBD (with the untuned default value 0 and without decision lists) was used for PP attachment.</Paragraph> </Section> </Section> </Paper>