File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/p05-1011_metho.xml
Size: 19,244 bytes
Last Modified: 2025-10-06 14:09:41
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1011"> <Title>Probabilistic disambiguation models for wide-coverage HPSG parsing</Title> <Section position="4" start_page="0" end_page="83" type="metho"> <SectionTitle> 2 Disambiguation models for HPSG </SectionTitle> <Paragraph position="0"> Discriminative log-linear models are now becoming a de facto standard for probabilistic disambiguation models for deep parsing (Johnson et al., 1999; Riezler et al., 2002; Geman and Johnson, 2002; Miyao and Tsujii, 2002; Clark and Curran, 2004b; Kaplan et al., 2004). Previous studies on probabilistic models for HPSG (Toutanova and Manning, 2002; Baldridge and Osborne, 2003; Malouf and van Noord, 2004) also adopted log-linear models. HPSG exploits feature structures to represent linguistic constraints. Such constraints are known to introduce inconsistencies into probabilistic models estimated using simple relative frequency (Abney, 1997). Log-linear models are therefore required for credible probabilistic models, and they are also beneficial for incorporating various overlapping features.</Paragraph> <Paragraph position="1"> This study follows previous studies on probabilistic models for HPSG. The probability $p(t|s)$ of producing the parse result $t$ from a given sentence $s$ is defined as $$p(t|s) = \frac{1}{Z_s}\, p_0(t|s) \exp\Big(\sum_i \lambda_i f_i(t,s)\Big), \qquad Z_s = \sum_{t' \in T(s)} p_0(t'|s) \exp\Big(\sum_i \lambda_i f_i(t',s)\Big),$$ where $p_0(t|s)$ is a reference distribution (usually assumed to be a uniform distribution) and $T(s)$ is the set of parse candidates assigned to $s$. The feature function $f_i(t,s)$ represents a characteristic of $t$ and $s$, while the corresponding model parameter $\lambda_i$ is its weight. Model parameters that maximize the log-likelihood of the training data are computed using a numerical optimization method (Malouf, 2002).</Paragraph> <Paragraph position="2"> Estimation of the above model requires a set of pairs $\langle t_s, T(s) \rangle$, where $t_s$ is the correct parse for sentence $s$. While $t_s$ is provided by a treebank, $T(s)$ is computed by parsing each $s$ in the treebank. Previous studies assumed that $T(s)$ could be enumerated; however, this assumption is impractical because the size of $T(s)$ is exponentially related to the length of $s$. The problem of exponential explosion is inevitable in the wide-coverage parsing of real-world texts because many parse candidates are produced to support various constructions in long sentences.</Paragraph>
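As a concrete illustration of this definition, the following minimal Python sketch computes $p(t|s)$ over an explicitly enumerated candidate set $T(s)$, the very enumeration that Section 3 avoids; all identifiers are illustrative assumptions, not the actual implementation.

```python
import math

def loglinear_prob(t, candidates, features, weights, p0=None):
    """Conditional log-linear probability p(t|s) over an enumerated
    candidate set T(s), following the definition above.

    features(t) -> dict mapping feature names to values f_i(t, s)
    weights     -> dict of lambda_i
    p0(t)       -> reference probability (uniform when omitted)
    All names here are hypothetical, for illustration only."""
    if p0 is None:
        p0 = lambda u: 1.0 / len(candidates)  # uniform reference distribution
    def score(u):
        return p0(u) * math.exp(sum(weights.get(name, 0.0) * v
                                    for name, v in features(u).items()))
    z = sum(score(u) for u in candidates)  # partition function Z_s
    return score(t) / z
```

The sketch makes the estimation problem visible: the partition function $Z_s$ ranges over all of $T(s)$, which is why the packed representation of the next section is needed.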
</Section> <Section position="7" start_page="83" end_page="84" type="metho"> <SectionTitle> 3 Packed representation of HPSG parse trees </SectionTitle> <Paragraph position="0"> To avoid exponential explosion, we represent $T(s)$ in a packed form of HPSG parse trees. A parse tree of HPSG is represented as a set of tuples $\langle m, l, r \rangle$, where $m$, $l$, and $r$ are the signs of the mother, left daughter, and right daughter, respectively. In chart parsing, partial parse candidates are stored in a chart, in which phrasal signs are identified and packed into an equivalence class if they are determined to be equivalent and dominate the same word sequence. A set of parse trees is then represented as a set of relations among equivalence classes.</Paragraph> <Paragraph position="1"> Figure 1 shows a chart for parsing "he saw a girl with a telescope", where the modifiee ("saw" or "girl") of "with" is ambiguous. Each feature structure expresses an equivalence class, and the arrows represent immediate-dominance relations. The phrase "saw a girl with a telescope" has two trees (A in the figure). Since the signs of the topmost nodes are equivalent, they are packed into an equivalence class. The ambiguity is represented as two pairs of arrows that come out of the node.</Paragraph> <Paragraph position="2"> Formally, a set of HPSG parse trees is represented in a chart as a tuple $\langle E, E_r, \alpha \rangle$, where $E$ is a set of equivalence classes, $E_r \subseteq E$ is a set of root nodes, and $\alpha: E \to 2^{E \times E}$ is a function to represent immediate-dominance relations. Our representation of the chart can be interpreted as an instance of a feature forest (Miyao and Tsujii, 2002; Geman and Johnson, 2002). A feature forest is an "and/or" graph to represent exponentially many tree structures in a packed form. If $T(s)$ is represented in a feature forest, $p(t|T(s))$ can be estimated using dynamic programming without unpacking the chart. A feature forest is formally defined as a tuple $\langle C, D, R, \gamma, \delta \rangle$, where $C$ is a set of conjunctive nodes, $D$ is a set of disjunctive nodes, $R \subseteq C$ is a set of root nodes (for ease of explanation, the definition of a root node is slightly different from the original), $\gamma: D \to 2^C$ is a disjunctive daughter function, and $\delta: C \to 2^D$ is a conjunctive daughter function. The feature functions $f_i$ are assigned to conjunctive nodes.</Paragraph> <Paragraph position="3"> The simplest way to map a chart of HPSG parse trees into a feature forest is to map each equivalence class $e \in E$ to a conjunctive node $c \in C$. However, in HPSG parsing, important features for disambiguation are combinations of a mother and its daughters, i.e., $\langle m, l, r \rangle$. Hence, we map the tuple $\langle e_m, e_l, e_r \rangle$, which corresponds to $\langle m, l, r \rangle$, into a conjunctive node.</Paragraph> <Paragraph position="4"> Figure 2 shows (a part of) the HPSG parse trees in Figure 1 represented as a feature forest. Square boxes are conjunctive nodes, dotted lines express the disjunctive daughter function, and solid arrows represent the conjunctive daughter function.</Paragraph> <Paragraph position="5"> The mapping is formally defined as follows: $$C = \{\langle e_m, e_l, e_r \rangle \mid e_m \in E,\ \langle e_l, e_r \rangle \in \alpha(e_m)\},$$ $$D = E,$$ $$R = \{\langle e, e_l, e_r \rangle \mid e \in E_r,\ \langle e_l, e_r \rangle \in \alpha(e)\},$$ $$\gamma(e) = \{\langle e, e_l, e_r \rangle \mid \langle e_l, e_r \rangle \in \alpha(e)\},$$ $$\delta(\langle e_m, e_l, e_r \rangle) = \{e_l, e_r\}.$$</Paragraph>
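The dynamic programming behind the claim that $p(t|T(s))$ can be estimated without unpacking the chart can be sketched as follows, under assumed interfaces: `score(c)` returns the local weight $\exp(\sum_i \lambda_i f_i(c))$ of a conjunctive node, while `delta` and `gamma` are the daughter functions defined above. All names are illustrative.

```python
def inside(c, score, delta, gamma, memo=None):
    """Inside value of conjunctive node c: its local weight
    exp(sum_i lambda_i * f_i(c)) (supplied by `score`) times, for each
    disjunctive daughter, the sum of the inside values of its
    conjunctive alternatives. Memoization over shared nodes is what
    makes the computation polynomial in the size of the packed forest."""
    if memo is None:
        memo = {}
    if c not in memo:
        v = score(c)
        for d in delta(c):                       # conjunctive daughter function
            v *= sum(inside(c2, score, delta, gamma, memo)
                     for c2 in gamma(d))         # disjunctive daughter function
        memo[c] = v
    return memo[c]

def partition(roots, score, delta, gamma):
    """Z_s: the sum of inside values over the root conjunctive nodes R."""
    memo = {}
    return sum(inside(c, score, delta, gamma, memo) for c in roots)
```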
</Section> <Section position="9" start_page="84" end_page="85" type="metho"> <SectionTitle> 4 Filtering by preliminary distribution </SectionTitle> <Paragraph position="0"> The above method allows for the tractable estimation of log-linear models on exponentially many HPSG parse trees. However, despite the development of methods to improve HPSG parsing efficiency (Oepen et al., 2002a), the exhaustive parsing of all sentences in a treebank is still expensive.</Paragraph> <Paragraph position="1"> Our idea is that we can omit the computation of parse trees with low probabilities in the estimation stage, because $T(s)$ can be approximated with the parse trees of high probability. To achieve this, we first prepared a preliminary probabilistic model whose estimation did not require the parsing of a treebank. The preliminary model was used to reduce the search space for parsing a training treebank.</Paragraph> <Paragraph position="2"> The preliminary model in this study is a unigram model, $\tilde{p}(t|s) = \prod_{w \in s} p(l|w)$, where $w$ is a word in the sentence $s$ and $l$ is the lexical entry assigned to $w$. This model can be estimated without parsing a treebank.</Paragraph> <Paragraph position="3"> Given this model, we restrict the number of lexical entries used to parse a treebank. With a threshold $n$ for the number of lexical entries and a threshold $\xi$ for the probability, lexical entries are assigned to a word in descending order of probability, until the number of assigned entries exceeds $n$ or the accumulated probability exceeds $\xi$. If the lexical entry necessary to produce the correct parse is not assigned, it is additionally assigned to the word.</Paragraph> <Paragraph position="4"> Figure 3 shows an example of filtering the lexical entries assigned to "saw". With $\xi = 0.95$, four lexical entries are assigned. Although the lexicon includes other lexical entries, such as a verbal entry taking a sentential complement ($p = 0.01$ in the figure), they are filtered out. This method reduces the time for parsing a treebank, while the approximation causes a bias in the training data and results in lower accuracy. The trade-off between the parsing cost and the accuracy will be examined experimentally.</Paragraph>
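A minimal sketch of this filtering step, under the assumption that the candidate lexical entries of a word come with their unigram probabilities $p(l|w)$; the function name and data layout are illustrative, not the actual implementation.

```python
def filter_entries(candidates, correct=None, n=10, xi=0.95):
    """candidates: list of (lexical_entry, probability) pairs for one word.
    Entries are kept in descending order of probability until n entries
    are kept or the accumulated probability exceeds xi; the entry needed
    for the correct parse (if known) is always added back, so the
    treebank parse stays reachable."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for entry, prob in ranked:
        if len(kept) >= n or mass > xi:
            break
        kept.append(entry)
        mass += prob
    if correct is not None and correct not in kept:
        kept.append(correct)
    return kept
```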
<Paragraph position="5"> We have several ways to integrate $\tilde{p}$ with the estimated model $p(t|T(s))$. In the experiments, we will empirically compare the following methods in terms of accuracy and estimation time.</Paragraph> <Paragraph position="6"> Filtering only: the unigram probability $\tilde{p}$ is used only for filtering.</Paragraph> <Paragraph position="7"> Product: the probability is defined as the product of $\tilde{p}$ and the estimated model $p$.</Paragraph> <Paragraph position="8"> Reference distribution: $\tilde{p}$ is used as the reference distribution of $p$.</Paragraph> <Paragraph position="9"> Feature function: $\log \tilde{p}$ is used as a feature function of $p$. This method was shown to be a generalization of the reference distribution method (Johnson and Riezler, 2000).</Paragraph> </Section> <Section position="10" start_page="85" end_page="86" type="metho"> <SectionTitle> 5 Features </SectionTitle> <Paragraph position="0"> Feature functions in the log-linear models are designed to capture the characteristics of $\langle e_m, e_l, e_r \rangle$. In this paper, we investigate combinations of the atomic features listed in Table 1.</Paragraph> <Paragraph position="1"> Table 1: Atomic features.
RULE: the name of the applied schema.
DIST: the distance between the head words of the daughters.
COMMA: whether a comma exists between daughters and/or inside daughter phrases.
SPAN: the number of words dominated by the phrase.
SYM: the symbol of the phrasal category (e.g., NP, VP).
WORD: the surface form of the head word.
POS: the part-of-speech of the head word.
LE: the lexical entry assigned to the head word.</Paragraph> <Paragraph position="2"> The following combinations are used for representing the characteristics of binary and unary schema applications: $$f_{binary} = \langle \mathrm{RULE}, \mathrm{DIST}, \mathrm{COMMA}, \mathrm{SPAN}_l, \mathrm{SYM}_l, \mathrm{WORD}_l, \mathrm{POS}_l, \mathrm{LE}_l, \mathrm{SPAN}_r, \mathrm{SYM}_r, \mathrm{WORD}_r, \mathrm{POS}_r, \mathrm{LE}_r \rangle,$$ $$f_{unary} = \langle \mathrm{RULE}, \mathrm{SYM}, \mathrm{WORD}, \mathrm{POS}, \mathrm{LE} \rangle.$$ In addition, the following is for expressing the condition of the root node of the parse tree: $$f_{root} = \langle \mathrm{SYM}, \mathrm{WORD}, \mathrm{POS}, \mathrm{LE} \rangle.$$</Paragraph> <Paragraph position="3"> For example, $f_{root}$ is for the root node, in which the phrase symbol is S and the surface form, part-of-speech, and lexical entry of the lexical head are "saw", VBD, and a transitive verb, respectively. $f_{binary}$ is for the binary rule application to "saw a girl" and "with a telescope", in which the applied schema is the Head-Modifier Schema, the left daughter is a VP headed by "saw", and the right daughter is a PP headed by "with", whose part-of-speech is IN and whose lexical entry is a VP-modifying preposition.</Paragraph> <Paragraph position="4"> In an actual implementation, some of the atomic features are abstracted (i.e., ignored) for smoothing. Table 2 shows the full set of templates of combined features used in the experiments. Each row represents a template of a feature function: a check means the atomic feature is incorporated, while a hyphen means the feature is ignored. [Table 2: feature templates over the atomic features RULE, DIST, COMMA, SPAN, SYM, WORD, POS, and LE.]</Paragraph> <Paragraph position="5"> Restricting the domain of feature functions to $\langle e_m, e_l, e_r \rangle$ seems to limit the flexibility of feature design. Although this is true to some extent, it does not necessarily make it impossible to incorporate features on nonlocal dependencies into the model, because a feature forest model does not assume probabilistic independence of conjunctive nodes. This means that we can unpack a part of the forest without changing the model. In fact, in our previous study (Miyao et al., 2003), we successfully developed a probabilistic model including features on nonlocal predicate-argument dependencies. However, since we could not observe significant improvements from incorporating nonlocal features, this paper investigates only the features described above.</Paragraph>
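To make the templates concrete, the following hypothetical sketch shows how one instance of $f_{binary}$ could be assembled from the atomic features of Table 1; the `Phrase` record and all names and values are assumptions for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phrase:
    span: int    # SPAN: number of words dominated by the phrase
    sym: str     # SYM: phrasal category symbol, e.g. "VP"
    word: str    # WORD: surface form of the head word
    pos: str     # POS: part-of-speech of the head word
    le: str      # LE: lexical entry assigned to the head word

def f_binary(rule: str, dist: int, comma: bool, left: Phrase, right: Phrase):
    """One instance of the f_binary template. The abstracted variants of
    Table 2 would replace some fields with None before counting."""
    return (rule, dist, comma,
            left.span, left.sym, left.word, left.pos, left.le,
            right.span, right.sym, right.word, right.pos, right.le)

# A hypothetical rendering of the Head-Modifier application from the
# running example "saw a girl" + "with a telescope":
example = f_binary("head_mod", 1, False,
                   Phrase(3, "VP", "saw", "VBD", "transitive"),
                   Phrase(3, "PP", "with", "IN", "vp_modifying_prep"))
```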
</Section> <Section position="13" start_page="86" end_page="88" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> We used an HPSG grammar derived from Penn Treebank (Marcus et al., 1994) Sections 02-21 (39,832 sentences) by our method of grammar development (Miyao et al., 2004). The training data was the HPSG treebank derived from the same portion of the Penn Treebank. (The programs to make the grammar and the treebank from the Penn Treebank are available at http://www-tsujii.is.s.u-tokyo.ac.jp/enju/.) For the training, we eliminated sentences with 40 or more words and those for which the parser could not produce the correct parse. The resulting training set consisted of 33,574 sentences. The treebanks derived from Sections 22 and 23 were used as the development set (1,644 sentences) and the final test set (2,299 sentences).</Paragraph> <Paragraph position="1"> We measured the accuracy of the predicate-argument dependencies output by the parser. A dependency is defined as a tuple $\langle \sigma, w_h, a, w_a \rangle$, where $\sigma$ is the predicate type (e.g., adjective, intransitive verb), $w_h$ is the head word of the predicate, $a$ is the argument label (MODARG, ARG1, ..., ARG4), and $w_a$ is the head word of the argument. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified by the parser, while unlabeled precision/recall (UP/UR) is the ratio of $w_h$ and $w_a$ correctly identified regardless of $\sigma$ and $a$. The F-score is the harmonic mean of LP and LR.</Paragraph>
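A minimal sketch of these evaluation metrics over gold and predicted dependency sets, with tuples laid out as $\langle \sigma, w_h, a, w_a \rangle$; the function names are illustrative.

```python
def prf(gold, pred):
    """Labeled precision/recall/F over dependency tuples
    (sigma, w_h, a, w_a). gold, pred: sets of tuples for the test set."""
    correct = len(gold & pred)
    lp = correct / len(pred) if pred else 0.0
    lr = correct / len(gold) if gold else 0.0
    f = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f

def unlabel(deps):
    """Drop sigma and the argument label to score UP/UR."""
    return {(wh, wa) for (_, wh, _, wa) in deps}
```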
<Paragraph position="2"> The accuracy was measured by parsing the test sentences with the part-of-speech tags provided by the treebank. A Gaussian prior was used for smoothing (Chen and Rosenfeld, 1999), and its hyper-parameter was tuned for each model to maximize the F-score on the development set. The optimization algorithm was the limited-memory BFGS method (Nocedal and Wright, 1999). All the following experiments were conducted on AMD Opteron servers with a 2.0-GHz CPU and 12-GB memory.</Paragraph> <Paragraph position="3"> Table 3 shows the accuracy for the development/test sets. Features occurring more than twice were included in the model (598,326 features). Filtering was done by the reference distribution method with $n = 10$ and $\xi = 0.95$. The unigram model for filtering was a log-linear model with two feature templates, $\langle \mathrm{WORD}, \mathrm{POS}, \mathrm{LE} \rangle$ and $\langle \mathrm{POS}, \mathrm{LE} \rangle$ (24,847 features). Our results cannot be strictly compared with those of other grammar formalisms because each formalism represents predicate-argument dependencies differently; for reference, our results are competitive with the corresponding measures reported for Combinatory Categorial Grammar (CCG) (LP/LR = 86.6/86.3) (Clark and Curran, 2004b). Unlike the results for CCG and PCFG (Collins, 1999; Charniak, 2000), the recall was clearly lower than the precision. This results from the HPSG grammar having stricter feature constraints and from the parser being unable to produce parse results for around one percent of the sentences. To improve recall, we need techniques for robust processing with HPSG.</Paragraph> <Paragraph position="4"> Table 4 compares the integration methods introduced in Section 4. In all of the following experiments, we show the accuracy for the test set ($\leq$ 40 words) only. Table 4 revealed that our simple method of filtering caused a fatal bias in the training data when the preliminary distribution was used only for filtering. However, the models combined with the preliminary model achieved sufficient accuracy. The reference distribution method achieved higher accuracy at lower cost. The feature function method achieved lower accuracy in our experiments; a possible reason is that the hyper-parameter of the prior was set to the same value for all the features, including the feature of the preliminary distribution.</Paragraph> <Paragraph position="5"> Table 5 shows the results of changing the filtering thresholds, from which we can determine the correlation between the estimation/parsing cost and accuracy. In our experiment, $n \geq 10$ and $\xi \geq 0.90$ seem necessary to preserve an F-score over 85.0.</Paragraph> <Paragraph position="6"> Figure 5 shows the accuracy for each sentence length. It is apparent from this figure that the accuracy was significantly higher for shorter sentences ($\leq$ 10 words). This implies that experiments with only short sentences overestimate the performance of parsers. Sentences with at least 10 words are necessary to properly evaluate the performance of parsing real-world texts.</Paragraph> <Paragraph position="7"> Figure 6 shows the learning curve. The feature set was fixed, while the parameter of the prior was optimized for each model. High accuracy was attained even with small data, and the accuracy appeared to saturate. This indicates that we cannot further improve the accuracy simply by increasing the training data; the exploration of new types of features is necessary for higher accuracy.</Paragraph> <Paragraph position="8"> Table 6 shows the accuracy with different feature sets, measured by removing some of the atomic features from the final model. The last row denotes the accuracy attained by the preliminary model. The numbers in bold type indicate that the difference from the final model was significant according to stratified shuffling tests (Cohen, 1995) with p-value $<$ 0.05. The results indicate that the DIST, COMMA, SPAN, WORD, and POS features contributed to the final accuracy, although the differences were slight. In contrast, the RULE, SYM, and LE features did not affect the accuracy on their own. However, when any of them was removed together with another feature, the accuracy decreased drastically. This implies that these features carry overlapping information.</Paragraph> <Paragraph position="9"> Table 7 shows a manual classification of the causes of errors in 100 sentences randomly chosen from the development set. In our evaluation, one error source may cause multiple dependency errors: for example, if a wrong lexical entry was assigned to a verb, all the argument dependencies of the verb are counted as errors. The numbers in the table include such double-counting. The major causes were classified into three types: argument/modifier distinction, attachment ambiguity, and lexical ambiguity. While attachment and lexical ambiguities are well-known causes of errors, the first is peculiar to deep parsing. Most of the errors cannot be resolved by the features investigated in this study, and the design of other features is crucial for further improvements.</Paragraph> </Section> </Paper>