<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0623"> <Title>Exploiting Diversity in Natural Language Processing: Combining Parsers</Title> <Section position="3" start_page="0" end_page="191" type="metho"> <SectionTitle> 2 Techniques for Combining Parsers </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="187" type="sub_section"> <SectionTitle> 2.1 Parse Hybridization </SectionTitle> <Paragraph position="0"> We are interested in combining the substructures of the input parses to produce a better parse. We call this approach parse hybridization. The substructures that are unanimously hypothesized by the parsers should be preserved after combination, and the combination technique should not foolishly create substructures for which there is no supporting evidence. These two principles guide experimentation in this framework, and together with the evaluation measures help us decide which specific type of substructure to combine.</Paragraph>
<Paragraph position="1"> The precision and recall measures (described in more detail in Section 3) used in evaluating Treebank parsing treat each constituent as a separate entity, a minimal unit of correctness. Since our goal is to perform well under these measures we will similarly treat constituents as the minimal substructures for combination.</Paragraph>
<Paragraph position="2"> One hybridization strategy is to let the parsers vote on constituents' membership in the hypothesized set. If enough parsers suggest that a particular constituent belongs in the parse, we include it. We call this technique constituent voting. We include a constituent in our hypothesized parse if it appears in the output of a majority of the parsers. In our particular case the majority requires the agreement of only two parsers because we have only three. This technique has the advantage of requiring no training, but it has the disadvantage of treating all parsers equally even though they may have differing accuracies or may specialize in modeling different phenomena.</Paragraph>
<Paragraph position="3"> Another technique for parse hybridization is to use a naive Bayes classifier to determine which constituents to include in the parse. The development of a naive Bayes classifier involves learning how much each parser should be trusted for the decisions it makes. Our original hope in combining these parsers is that their errors are independently distributed.</Paragraph>
<Paragraph position="4"> This is equivalent to the assumption used in probability estimation for naive Bayes classifiers, namely that the attribute values are conditionally independent when the target value is given. For this reason, naive Bayes classifiers are well-matched to this problem.</Paragraph>
<Paragraph position="5"> In Equations 1 through 3 we develop the model for constructing our parse using naive Bayes classification. $C$ is the union of the sets of constituents suggested by the parsers. $\pi(c)$ is a binary function returning $t$ (for true) precisely when the constituent $c \in C$ should be included in the hypothesis. $M_i(c)$ is a binary function returning $t$ when parser $i$ (from among the $k$ parsers) suggests constituent $c$ should be in the parse. The hypothesized parse is then the set of constituents that are likely ($P > 0.5$) to be in the parse according to this model.</Paragraph>
<Paragraph position="6"> $$\hat{\pi}(c) = \operatorname*{argmax}_{\pi(c)} P(\pi(c) \mid M_1(c), \ldots, M_k(c)) \quad (1)$$ $$= \operatorname*{argmax}_{\pi(c)} \frac{P(M_1(c), \ldots, M_k(c) \mid \pi(c))\, P(\pi(c))}{P(M_1(c), \ldots, M_k(c))} \quad (2)$$ $$= \operatorname*{argmax}_{\pi(c)} P(\pi(c)) \prod_{i=1}^{k} P(M_i(c) \mid \pi(c)) \quad (3)$$ </Paragraph>
<Paragraph position="7"> The estimation of the probabilities in the model is carried out as shown in Equation 4: $$\hat{P}(\pi(c)) = \frac{N(\pi(c))}{N(c \in C)}, \qquad \hat{P}(M_i(c) \mid \pi(c)) = \frac{N(M_i(c), \pi(c))}{N(\pi(c))} \quad (4)$$ Here $N(\cdot)$ counts the number of hypothesized constituents in the development set that match the binary predicate specified as an argument.</Paragraph>
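To make the constituent-level combination concrete, the following is a minimal Python sketch of both techniques. It assumes constituents are encoded as (label, start, end) tuples (as in Section 3) and uses unsmoothed relative-frequency counts for Equation 4; the function names and the count encoding are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def constituent_voting(parser_outputs, min_votes=None):
    """Majority vote over constituents; parser_outputs is a list of
    constituent sets, one per parser."""
    k = len(parser_outputs)
    if min_votes is None:
        min_votes = k // 2 + 1          # strict majority, e.g. 2 of 3
    votes = defaultdict(int)
    for output in parser_outputs:
        for c in output:
            votes[c] += 1
    return {c for c, v in votes.items() if v >= min_votes}

def train_naive_bayes(dev_constituents):
    """Collect the N(.) counts of Equation 4.  dev_constituents holds one
    (m, in_gold) pair per constituent c in C: m is the tuple of parser
    votes (M_1(c), ..., M_k(c)) and in_gold is pi(c)."""
    n, total = defaultdict(int), 0
    for m, in_gold in dev_constituents:
        total += 1
        n[('pi', in_gold)] += 1              # N(pi(c) = v)
        for i, mi in enumerate(m):
            n[('M', i, mi, in_gold)] += 1    # N(M_i(c) = m_i, pi(c) = v)
    return n, total

def naive_bayes_include(m, n, total):
    """Equations 1-3: include c iff P(pi(c)=t | M_1(c)..M_k(c)) > 0.5,
    i.e. the joint score for t exceeds the joint score for f
    (smoothing omitted for brevity)."""
    joint = {}
    for v in (True, False):
        p = n[('pi', v)] / total                            # prior
        for i, mi in enumerate(m):
            p *= n[('M', i, mi, v)] / max(n[('pi', v)], 1)  # likelihoods
        joint[v] = p
    return joint[True] > joint[False]
```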
<Paragraph position="9"> Under certain conditions the constituent voting and naive Bayes constituent combination techniques are guaranteed to produce sets of constituents with no crossing brackets: there are simply not enough votes remaining to allow any of the crossing structures to enter the hypothesized constituent set.</Paragraph>
<Paragraph position="10"> Lemma: If the number of votes required by constituent voting is greater than half of the parsers under consideration, the resulting structure has no crossing constituents.</Paragraph>
<Paragraph position="11"> Proof: Assume a pair of crossing constituents appears in the output of the constituent voting technique using $k$ parsers. Call the crossing constituents $A$ and $B$. $A$ receives $a$ votes, and $B$ receives $b$ votes. Each of the constituents must have received at least $\lceil \frac{k+1}{2} \rceil$ votes from the $k$ parsers, so $a \geq \lceil \frac{k+1}{2} \rceil$ and $b \geq \lceil \frac{k+1}{2} \rceil$. Let $s = a + b$. None of the parsers produce parses with crossing brackets, so none of them votes for both of the assumed constituents. Hence $s \leq k$. But by addition of the votes on the two constituents, $s \geq 2\lceil \frac{k+1}{2} \rceil > k$, a contradiction. $\blacksquare$ Similarly, when the naive Bayes classifier is configured such that the constituents require estimated probabilities strictly larger than 0.5 to be accepted, there is not enough probability mass remaining on crossing brackets for them to be included in the hypothesis.</Paragraph>
</Section> <Section position="2" start_page="187" end_page="188" type="sub_section"> <SectionTitle> 2.2 Parser Switching </SectionTitle>
<Paragraph position="0"> In general, the lemma of the previous section does not ensure that all the productions in the combined parse are found in the grammars of the member parsers. There is a guarantee of no crossing brackets but there is no guarantee that a constituent in the tree has the same children as it had in any of the three original parses. One can trivially create situations in which strictly binary-branching trees are combined to create a tree with only the root node and the terminal nodes, a completely flat structure.</Paragraph>
<Paragraph position="1"> This drastic tree manipulation is not appropriate for situations in which we want to assign particular structures to sentences. For example, we may have semantic information (e.g. database query operations) associated with the productions in a grammar. If the parse contains productions from outside our grammar the machine has no direct method for handling them (e.g. the resulting database query may be syntactically malformed).</Paragraph>
<Paragraph position="2"> We have developed a general approach for combining parsers when preserving the entire structure of a parse tree is important. The combining algorithm is presented with the candidate parses and asked to choose which one is best. The combining technique must act as a multi-position switch indicating which parser should be trusted for the particular sentence.</Paragraph>
<Paragraph position="3"> We call this approach parser switching. Once again we present both a non-parametric and a parametric technique for this task.</Paragraph>
<Paragraph position="4"> First we present the non-parametric version of parser switching, similarity switching (a brief sketch in code follows the description): 1. From each candidate parse, $\pi_i$, for a sentence, create the constituent set $S_i$ by converting each constituent into its tuple representation.</Paragraph>
<Paragraph position="5"> 2. The score for $\pi_i$ is $\sum_{j \neq i} |S_i \cap S_j|$, where $j$ ranges over the candidate parses for the sentence.</Paragraph>
<Paragraph position="6"> 3. Switch to (use) the parser with the highest score for the sentence. Ties are broken arbitrarily.</Paragraph>
<Paragraph position="7"> The intuition for this technique is that we can measure a similarity between parses by counting the constituents they have in common. We pick the parse that is most similar to the other parses by choosing the one with the highest sum of pairwise similarities. This is the parse that is closest to the centroid of the observed parses under the similarity metric.</Paragraph>
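The following is a minimal sketch of similarity switching under the same (label, start, end) constituent encoding; the function name is ours.

```python
def similarity_switch(candidate_sets):
    """Pick the parse closest to the centroid of the candidates: for each
    candidate set S_i, score sum_{j != i} |S_i & S_j| and return the index
    of the highest-scoring parser (ties broken arbitrarily)."""
    scores = [sum(len(s_i & s_j) for j, s_j in enumerate(candidate_sets)
                  if j != i)
              for i, s_i in enumerate(candidate_sets)]
    return max(range(len(scores)), key=scores.__getitem__)
```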
<Paragraph position="8"> The probabilistic version of this procedure is straightforward: We once again assume independence among our various member parsers. Furthermore, we know one of the original parses will be the hypothesized parse, so the direct method of determining which one is best is to compute the probability of each of the candidate parses using the probabilistic model we developed in Section 2.1. We model each parse as the decisions made to create it, and model those decisions as independent events. Each decision determines the inclusion or exclusion of a candidate constituent. The set of candidate constituents comes from the union of all the constituents suggested by the member parsers. This is summarized in Equation 5. The computation of $P(\pi_i(c) \mid M_1(c), \ldots, M_k(c))$ has been sketched before in Equations 1 through 4. In this case we are interested in finding the maximum probability parse, $\hat{\pi}$, and $M_i$ is the set of relevant (binary) parsing decisions made by parser $i$. $\hat{\pi}$ is a parse selected from among the outputs of the individual parsers. It is chosen such that the decisions it made in including or excluding constituents are most probable under the models for all of the parsers. $$\hat{\pi} = \operatorname*{argmax}_{\pi_i} \prod_{c \in C} P(\pi_i(c) \mid M_1(c), \ldots, M_k(c)) \quad (5)$$ </Paragraph>
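A minimal sketch of this Bayes switching rule, reusing the counts from the naive Bayes sketch above (function names ours; assumes nonzero counts, smoothing omitted):

```python
import math

def bayes_switch(parser_outputs, n, total):
    """Equation 5: choose the candidate parse whose include/exclude
    decisions over the union C of suggested constituents are jointly
    most probable under the Section 2.1 naive Bayes model."""
    union = set().union(*parser_outputs)

    def log_posterior(c, include):
        m = tuple(c in out for out in parser_outputs)   # M_1(c)..M_k(c)
        joint = {}
        for v in (True, False):
            lp = math.log(n[('pi', v)] / total)
            for i, mi in enumerate(m):
                lp += math.log(n[('M', i, mi, v)] / n[('pi', v)])
            joint[v] = lp
        top = max(joint.values())                       # stable log-sum-exp
        norm = top + math.log(sum(math.exp(x - top) for x in joint.values()))
        return joint[include] - norm

    scores = [sum(log_posterior(c, c in parse) for c in union)
              for parse in parser_outputs]
    return max(range(len(parser_outputs)), key=scores.__getitem__)
```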
</Section> </Section> <Section position="4" start_page="187" end_page="191" type="metho"> <SectionTitle> 3 Experiments </SectionTitle>
<Paragraph position="0"> The three parsers were trained and tuned by their creators on various sections of the WSJ portion of the Penn Treebank, leaving only sections 22 and 23 completely untouched during the development of any of the parsers. We used section 23 as the development set for our combining techniques, and section 22 only for final testing. The development set contained 44088 constituents in 2416 sentences and the test set contained 30691 constituents in 1699 sentences. A sentence was withheld from section 22 because its extreme length was troublesome for a couple of the parsers; the sentence included nested quotes and parenthetical expressions.</Paragraph>
<Paragraph position="1"> The standard measures for evaluating Penn Treebank parsing performance are precision and recall of the predicted constituents. Each parse is converted into a set of constituents represented as tuples: (label, start, end). The set is then compared with the set generated from the Penn Treebank parse to determine the precision and recall. Precision is the portion of hypothesized constituents that are correct, and recall is the portion of the Treebank constituents that are hypothesized. For our experiments we also report the mean of precision and recall, which we denote by (P + R)/2, and F-measure, the harmonic mean of precision and recall, 2PR/(P + R). It is closer to the smaller value of precision and recall when there is a large skew in their values.</Paragraph>
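As a concrete reference for these measures, a minimal sketch over the tuple encoding (the function name is ours):

```python
def evaluate(hypothesis, gold):
    """Precision, recall, and F-measure over sets of
    (label, start, end) constituent tuples."""
    correct = len(hypothesis & gold)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f
```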
<Paragraph position="2"> We performed three experiments to evaluate our techniques. The first shows that constituent features and context do not help in deciding which parser to trust. We then show that the combining techniques presented above give better parsing accuracy than any of the individual parsers. Finally we show that the combining techniques degrade very little when a poor parser is added to the set.</Paragraph>
<Section position="3" start_page="188" end_page="189" type="sub_section"> <SectionTitle> 3.1 Context </SectionTitle>
<Paragraph position="0"> It is possible that one could produce better models by introducing features describing constituents and their contexts, because one parser could be much better than the majority of the others in particular situations. For example, one parser could be more accurate at predicting noun phrases than the other parsers. None of the models we have presented utilize features associated with a particular constituent (i.e. the label, span, parent label, etc.) to influence parser preference. This is not an oversight. Features and context were initially introduced into the models, but they failed to offer any gains in performance. While we cannot prove there are no such useful features on which one should condition trust, we can give some insight into why the features we explored offered no gain.</Paragraph>
<Paragraph position="1"> Because we are working with only three parsers, the only situation in which context will help us is when it can indicate we should choose to believe a single parser that disagrees with the majority hypothesis instead of the majority hypothesis itself. This is the only important case, because otherwise the simple majority combining technique would pick the correct constituent. One side of the decision making process is when we choose to believe a constituent should be in the parse, even though only one parser suggests it. We call such a constituent an isolated constituent. If we were working with more than three parsers we could investigate minority constituents, those constituents that are suggested by at least one parser, but which the majority of the parsers do not suggest.</Paragraph>
<Paragraph position="2"> Adding the isolated constituents to our hypothesis parse could increase our expected recall, but in the cases we investigated it would invariably hurt our precision more than we would gain on recall. Consider, for a set of constituents, the isolated constituent precision metric: the portion of isolated constituents that are correctly hypothesized. When this metric is less than 0.5, we expect to incur more errors than we will remove by adding those constituents to the parse.</Paragraph>
<Paragraph position="3"> We show the results of three of the experiments we conducted to measure isolated constituent precision under various partitioning schemes. In Table 1 we see with very few exceptions that the isolated constituent precision is less than 0.5 when we use the constituent label as a feature. The counts represent portions of the approximately 44000 constituents hypothesized by the parsers in the development set. In the cases where isolated constituent precision is larger than 0.5, the affected portion of the hypotheses is negligible.</Paragraph>
<Paragraph position="4"> Similarly, Figures 1 and 2 show how the isolated constituent precision varies by sentence length and the size of the span of the hypothesized constituent. In each figure the upper graph shows the isolated constituent precision and the bottom graph shows the corresponding number of hypothesized constituents. Again we notice that the isolated constituent precision is larger than 0.5 only in those partitions that contain very few samples. From this we see that a finer-grained model for parser combination, at least for the features we have examined, will not give us any additional power.</Paragraph> </Section>
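The isolated-constituent precision used in the preceding analysis can be sketched as follows; the function name, the sentence-id augmentation of the tuples, and the partitioning key are ours.

```python
from collections import defaultdict

def isolated_constituent_precision(parser_outputs, gold, key=lambda c: c[1]):
    """Portion of isolated constituents (proposed by exactly one parser)
    that are correct, reported per partition.  Constituents are encoded
    here as (sentence_id, label, start, end) so tuples from different
    sentences never collide; the default partition key is the label, as
    in Table 1.  parser_outputs holds one pooled constituent set per
    parser; gold is the set of Treebank constituents."""
    votes = defaultdict(int)
    for output in parser_outputs:
        for c in output:
            votes[c] += 1
    hits, totals = defaultdict(int), defaultdict(int)
    for c, v in votes.items():
        if v == 1:                      # isolated: suggested by one parser
            totals[key(c)] += 1
            hits[key(c)] += c in gold
    return {k: hits[k] / totals[k] for k in totals}
```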
<Section position="4" start_page="189" end_page="189" type="sub_section"> <SectionTitle> 3.2 Performance Testing </SectionTitle>
<Paragraph position="0"> The results in Table 2 were achieved on the development set. The first two rows of the table are baselines. The first row represents the average accuracy of the three parsers we combine. The second row is the accuracy of the best of the three parsers. The next two rows are results of oracle experiments.</Paragraph>
<Paragraph position="1"> The parser switching oracle is the upper bound on the accuracy that can be achieved on this set in the parser switching framework. It is the performance we could achieve if an omniscient observer told us which parser to pick for each of the sentences. The maximum precision row is the upper bound on accuracy if we could pick exactly the correct constituents from among the constituents suggested by the three parsers. Another way to interpret this is that less than 5% of the correct constituents are missing from the hypotheses generated by the union of the three parsers. The maximum precision oracle is an upper bound on the possible gain we can achieve by parse hybridization.</Paragraph>
<Paragraph position="2"> We do not aim to compare the performance of the individual parsers, nor do we want to bias further research by giving the individual parser results for the test set.</Paragraph>
<Paragraph position="3"> We do not show the numbers for the Bayes models in Table 2 because the parameters involved were established using this set. The precision and recall of similarity switching and constituent voting are both significantly better than the best individual parser, and constituent voting is significantly better than parser switching in precision. (All significance claims are made based on a binomial hypothesis test of equality with an $\alpha < 0.01$ confidence level.) Constituent voting gives the highest accuracy for parsing the Penn Treebank reported to date.</Paragraph>
<Paragraph position="4"> Table 3 contains the results for evaluating our systems on the test set (section 22). All of these systems were run on data that was not seen during their development. The difference in precision between the similarity and Bayes switching techniques is significant, but the difference in recall is not. This is the first set that gives us a fair evaluation of the Bayes models, and the Bayes switching model performs significantly better than its non-parametric counterpart. The constituent voting and naive Bayes techniques are equivalent because the parameters learned in the training set did not sufficiently discriminate between the three parsers.</Paragraph>
<Paragraph position="5"> Table 4 shows how much the Bayes switching technique uses each of the parsers on the test set. Parser 3, the most accurate parser, was chosen 71% of the time, and Parser 1, the least accurate parser, was chosen 16% of the time. Ties are rare in Bayes switching because the models are fine-grained: many estimated probabilities are involved in each decision.</Paragraph>
</Section> <Section position="5" start_page="189" end_page="191" type="sub_section"> <SectionTitle> 3.3 Robustness Testing </SectionTitle>
<Paragraph position="0"> In the interest of testing the robustness of these combining techniques, we added a fourth, simple non-lexicalized PCFG parser. The PCFG was trained from the same sections of the Penn Treebank as the other three parsers. It was then tested on section 22 of the Treebank in conjunction with the other parsers.</Paragraph>
<Paragraph position="1"> The results of this experiment can be seen in Table 5. The entries in this table can be compared with those of Table 3 to see how the performance of the combining techniques degrades in the presence of an inferior parser. As seen from the drop in the average individual parser performance baseline, the introduced parser does not perform very well. The average individual parser accuracy was reduced by more than 5% when we added this new parser, but the precision of the constituent voting technique was the only result that decreased significantly. The Bayes models were able to achieve significantly higher precision than their non-parametric counterparts. We see from these results that the behavior of the parametric techniques is robust in the presence of a poor parser. Surprisingly, the non-parametric switching technique also exhibited robust behavior in this situation.</Paragraph>
</Section> </Section> <Section position="4" start_page="191" end_page="193" type="metho"> <SectionTitle> 4 Conclusion </SectionTitle>
<Paragraph position="0"> We have presented two general approaches to studying parser combination: parser switching and parse hybridization. For each experiment we gave a non-parametric and a parametric technique for combining parsers. All four of the techniques studied result in parsing systems that perform better than any previously reported system. Both of the switching techniques, as well as the parametric hybridization technique, were also shown to be robust when a poor parser was introduced into the experiments. Through parser combination we have reduced the precision error rate by 30% and the recall error rate by 6% compared to the best previously published result.</Paragraph>
<Paragraph position="1"> Combining multiple highly-accurate independent parsers yields promising results.
We plan to explore more powerful techniques for exploiting the diversity of parsing methods.</Paragraph> </Section> <Section position="5" start_page="193" end_page="193" type="metho"> <SectionTitle> 5 Acknowledgements </SectionTitle>
<Paragraph position="0"> We would like to thank Eugene Charniak, Michael Collins, and Adwait Ratnaparkhi for enabling all of this research by providing us with their parsers and helpful comments.</Paragraph>
<Paragraph position="1"> This work was funded by NSF grant IRI-9502312.</Paragraph>
<Paragraph position="2"> Both authors are members of the Center for Language and Speech Processing at Johns Hopkins University.</Paragraph> </Section> </Paper>