<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1518">
  <Title>Improving Parsing Accuracy by Combining Diverse Dependency Parsers</Title>
  <Section position="5" start_page="171" end_page="172" type="metho">
    <SectionTitle>
3 Component parsers
</SectionTitle>
    <Paragraph position="0"> The parsers involved in our experiments are summarized in Table 1. Most of them use unique strategies, the exception being thl and thr, which differ only in the direction in which they process the sentence.</Paragraph>
    <Paragraph position="1"> The table also shows individual parser accuracies on our Test data. There are two state-of-the art parsers, four not-so-good parsers, and one quite poor parser. We included the two best parsers (ec+mc) in all our experiments, and tested the contributions of various selections from the rest.</Paragraph>
    <Paragraph position="2"> The necessary assumption for a meaningful combination is that the outputs of the individual parsers are sufficiently uncorrelated, i.e. that the parsers do not produce the same errors. If some  A maximum-entropy inspired parser, home in constituency-based structures. English version described in Charniak (2000), Czech adaptation 2002 - 2003, unpublished.</Paragraph>
    <Paragraph position="3"> 83.6 85.0 mc Michael Collins Uses a probabilistic context-free grammar, home in constituency-based structures. Described in (Haji et al., 1998; Collins et al., 1999).</Paragraph>
    <Paragraph position="4">  individual parsers. Higher numbers in the right column reflect just the fact that the Test part is slightly easier to parse.</Paragraph>
    <Paragraph position="5">  parsers produced too similar results, there would be the danger that they push all their errors through, blocking any meaningful opinion of the other parsers.</Paragraph>
    <Paragraph position="6"> To check the assumption, we counted (on the Tune data set) for each parser in a given parser selection the number of dependencies that only this parser finds correctly. We show the results in Table 2. They demonstrate that all parsers are independent on the others at least to some extent.</Paragraph>
  </Section>
  <Section position="6" start_page="172" end_page="173" type="metho">
    <SectionTitle>
4 Combining techniques
</SectionTitle>
    <Paragraph position="0"> Each dependency structure consists of a number of dependencies, one for each word in the sentence.</Paragraph>
    <Paragraph position="1"> Our goal is to tell for each word, which parser is the most likely to pick its dependency correctly.</Paragraph>
    <Paragraph position="2"> By combining the selected dependencies we aim at producing a better structure. We call the complex system (of component parsers plus the selector) the superparser.</Paragraph>
    <Paragraph position="3"> Although we have shown how different strategies lead to diversity in the output of the parsers, there is little chance that any parser will be able to push through the things it specializes in. It is very difficult to realize that a parser is right if most of the others reject its proposal. Later in this section we assess this issue; however, the real power is in majority of votes.</Paragraph>
    <Section position="1" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
4.1 Voting
</SectionTitle>
      <Paragraph position="0"> The simplest approach is to let the member parsers vote. At least three parsers are needed. If there are exactly three, only the following situations really matter: 1) two parsers outvote the third one; 2) a tie: each parser has got a unique opinion. It would be democratic in the case of a tie to select randomly. However, that hardly makes sense once we know the accuracy of the involved parsers on the Tune set. Especially if there is such a large gap between the parsers' performance, the best parser (here ec) should get higher priority whenever there  individual parsers' strategies and with their contributions to the overall success. The &amp;quot;at least&amp;quot; rows give clues about what can be got by majority voting (if the number represents over 50 % of parsers compared) or by hypothetical oracle selection (if the number represents 50 % of the parsers or less, an oracle would generally be needed to point to the parsers that know the correct attachment).  is no clear majority of votes. Van Halteren et al.</Paragraph>
      <Paragraph position="1"> (1998) have generalized this approach for higher number of classifiers in their TotPrecision voting method. The vote of each classifier (parser) is weighted by their respective accuracy. For instance, mc + zz would outvote ec + thr, as 81.7 + 74.3 = 156 &gt; 154.6 = 83.6 + 71.0.</Paragraph>
    </Section>
    <Section position="2" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
4.2 Stacking
</SectionTitle>
      <Paragraph position="0"> If the world were ideal, we would have an oracle, able to always select the right parser. In such situation our selection of parsers would grant the accuracy as high as 95.8 %. We attempt to imitate the oracle by a second-level classifier that learns from the Tune set, which parser is right in which situations. Such technique is usually called classifier stacking. Parallel to (van Halteren et al., 1998), we ran experiments with two stacked classifiers, Memory-Based, and Decision-Tree-Based. This approach roughly corresponds to (Henderson and Brill, 1999)'s Naive Bayes parse hybridization.</Paragraph>
    </Section>
    <Section position="3" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
4.3 Unbalanced combining
</SectionTitle>
      <Paragraph position="0"> For applications preferring precision to recall, unbalanced combination -- introduced by Brill and Hladka in (Haji et al., 1998) -- may be of interest. In this method, all dependencies proposed by at least half of the parsers are included. The term unbalanced reflects the fact that now precision is not equal to recall: some nodes lack the link to their parents. Moreover, if the number of member parsers is even, a node may get two parents.</Paragraph>
    </Section>
    <Section position="4" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
4.4 Switching
</SectionTitle>
      <Paragraph position="0"> Finally, we develop a technique that considers the whole dependency structure rather than each dependency alone. The aim is to check that the resulting structure is a tree, i.e. that the dependencyselecting procedure does not introduce cycles.1 Henderson and Brill prove that under certain conditions, their parse hybridization approach cannot 1 One may argue that &amp;quot;treeness&amp;quot; is not a necessary condition for the resulting structure, as the standard accuracy measure does not penalize non-trees in any way (other than that there is at least one bad dependency). Interestingly enough, even some of the component parsers do not produce correct trees at all times. However, non-trees are both linguistically and technically problematic, and it is good to know how far we can get with the condition in force.</Paragraph>
      <Paragraph position="1"> introduce crossing brackets. This might seem an analogy to our problem of introducing cycles -but unfortunately, no analogical lemma holds. As a workaround, we have investigated a crossbreed approach between Henderson and Brill's Parser Switching, and the voting methods described above. After each step, all dependencies that would introduce a cycle are banned. The algorithm is greedy -- we do not try to search the space of dependency combinations for other paths. If there are no allowed dependencies for a word, the whole structure built so far is abandoned, and the structure suggested by the best component parser is used instead.2</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="173" end_page="176" type="metho">
    <SectionTitle>
5 Experiments and results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="173" end_page="174" type="sub_section">
      <SectionTitle>
5.1 Voting
</SectionTitle>
      <Paragraph position="0"> We have run several experiments where various selections of parsers were granted the voting right.</Paragraph>
      <Paragraph position="1"> In all experiments, the TotPrecision voting scheme of (van Halteren et al., 1998) has been used. The voting procedure is only very moderately affected by the Tune set (just the accuracy figures on that set are used), so we present results on both the Test and the Tune sets.</Paragraph>
      <Paragraph position="2">  According to the results, the best voters pool consists of the two best parsers, accompanied by 2 We have not encountered such situation in our test data. However, it indeed is possible, even if all the component parsers deliver correct trees, as can be seen from the following example. Assume we have a sentence #ABCD and parsers P1 (85 votes), P2 (83 votes), P3 (76 votes). P1 suggests the tree A-D-B-C-#, P2 suggests B-D-A-C-#, P3 suggests B-D-A-#, C-#. Then the superparser P gradually introduces the following dependencies: 1. A-D; 2. B-D;  3. C-#; 4. D-A or D-B possible but both lead to a cycle.  the two average parsers. The table also suggests that number of diverse strategies is more important than keeping high quality standard with all the parsers. Apart from the worst parser, all the other together do better than just the first two and the fourth. (On the other hand, the first three parsers are much harder to beat, apparently due to the extreme distance of the strategy of zz parser from all the others.) Even the worst performing parser combination (all seven parsers) is significantly3 better than the best component parser alone.</Paragraph>
      <Paragraph position="3"> We also investigated some hand-invented voting schemes but no one we found performed better than the ec+mc+zz+dz combination above.</Paragraph>
      <Paragraph position="4"> Some illustrative results are given in the Table 4. Votes were not weighted by accuracy in these experiments, but accuracy is reflected in the priority given to ec and mc by the human scheme inventor.</Paragraph>
    </Section>
    <Section position="2" start_page="174" end_page="175" type="sub_section">
      <SectionTitle>
5.2 Stacking - using context
</SectionTitle>
      <Paragraph position="0"> We explored several ways of using context in pools of three parsers.4 If we had only three parsers we could use context to detect two kinds of situations: null  more parsers as well. However, the number of possible features is much higher and the data sparser. We were not able to gain more accuracy on context-sensitive combination of more parsers.</Paragraph>
      <Paragraph position="1">  1. Each parser has its own proposal and a parser other than ec shall win.</Paragraph>
      <Paragraph position="2"> 2. Two parsers agree on a common pro- null posal but even so the third one should win. Most likely the only reasonable instance is that ec wins over mc + the third one.</Paragraph>
      <Paragraph position="3"> &amp;quot;Context&amp;quot; can be represented by a number of features, starting at morphological tags and ending up at complex queries on structural descriptions. We tried a simple memory-based approach, and a more complex approach based on decision trees. Within the memory-based approach, we use just the core features the individual parsers themselves train on: the POS tags (morphological tags or m-tags in PDT terminology). We consider the m-tag of the dependent node, and the m-tags of the governors proposed by the individual parsers. We learn the context-based strengths and weaknesses of the individual parsers on their performance on the Tune data set. In the following table, there are some examples of contexts in which ec is better than the common opinion of mc + dz.</Paragraph>
      <Paragraph position="4">  J^ are coordination conjunctions, # is the root, V* are verbs, Nn are nouns in case n, R* are prepositions, Z* are punctuation marks, An are adjectives. For the experiment with decision trees, we used the C5 software package, a commercial version of the well-known C4.5 tool (Quinlan, 1993). We considered the following features: For each of the four nodes involved (the dependent and the three governors suggested by the three component parsers):  * 12 attributes derived from the morphological tag (part of speech, subcategory, gender, number, case, inner gender, inner number, person, degree of comparison, negativeness, tense and voice) * 4 semantic attributes (such as Proper-Name, Geography etc.) For each of the three governor-dependent pairs involved: * mutual position of the two nodes (Left-Neighbor, RightNeighbor, LeftFar, RightFar) * mutual position expressed numerically * for each parser pair a binary flag whether they do or do not share opinions null The decision tree was trained only on situations where at least one of the three parsers was right and at least one was wrong.</Paragraph>
    </Section>
    <Section position="3" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
Table: context-sensitive combination results
</SectionTitle>
      <Paragraph position="0">
  Voters     Scheme          Accuracy
  ec+mc+dz   context-free    86.2
  ec+mc+dz   memory-based    86.3
  ec+mc+zz   context-free    86.7
  ec+mc+zz   decision tree   86.9
Caption (beginning lost in extraction): ... on the Tune data set, accuracy figures apply to the Test data set. Context-free results are given for the sake of comparison.</Paragraph>
      <Paragraph position="1"> It turns out that there is very low potential in the context to improve the accuracy (the improvement is significant, though). The behavior of the parsers is too noisy as to the possibility of formulating some rules for prediction, when a particular parser is right. C5 alone provided a supporting evidence for that hypothesis, as it selected a very simple tree from all the features, just 5 levels deep (see Figure 1).</Paragraph>
      <Paragraph position="2"> Henderson and Brill (1999) also reported that context did not help them to outperform simple voting. Although it is risky to generalize these observations for other treebanks and parsers, our environment is quite different from that of Henderson and Brill, so the similarity of the two observations is at least suspicious.</Paragraph>
    </Section>
    <Section position="4" start_page="175" end_page="175" type="sub_section">
      <SectionTitle>
5.3 Unbalanced combining
</SectionTitle>
      <Paragraph position="0"> Finally we compare the balanced and unbalanced methods. Expectedly, precision of the unbalanced combination of odd number of parsers rose while recall dropped slightly. A different situation is observed if even number of parsers vote and more than one parent can be selected for a node. In such case, precision drops in favor of recall.</Paragraph>
    </Section>
    <Section position="5" start_page="175" end_page="176" type="sub_section">
      <SectionTitle>
Method Precision Recall F-measure
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> learned by C5. Besides pairwise agreement between the parsers, only morphological case and negativeness matter.</Paragraph>
      <Paragraph position="3">  runs ignored the context. Evaluated on the Test data set.</Paragraph>
    </Section>
    <Section position="6" start_page="176" end_page="176" type="sub_section">
      <SectionTitle>
5.4 Switching
</SectionTitle>
      <Paragraph position="0"> Out of the 3,673 sentences in our Test set, 91.6 % have been rendered as correct trees in the balanced decision-tree based stacking of ec+mc+zz+dz (our best method).</Paragraph>
      <Paragraph position="1"> After we banned cycles, the accuracy dropped from 97.0 to 96.9 %.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="176" end_page="176" type="metho">
    <SectionTitle>
6 Comparison to related work
</SectionTitle>
    <Paragraph position="0"> Brill and Hladka in (Haji et al., 1998) were able to improve the original accuracy of the mc parser on PDT 0.5 e-test data from 79.1 to 79.9 (a nearly 4% reduction of the error rate). Their unbalanced5 voting pushed the F-measure from 79.1 to 80.4 (6% error reduction). We pushed the balanced accuracy of the ec parser from 85.0 to 87.0 (13% error reduction), and the unbalanced F-measure from 85.0 to 87.7 (18% reduction). Note however that there were different data and component parsers (Haji et al. found bagging the best parser better than combining it with other that-time-available parsers). This is the first time that several strategically different dependency parsers have been combined.</Paragraph>
    <Paragraph position="1"> (Henderson and Brill, 1999) improved their best parser's F-measure of 89.7 to 91.3, using their naive Bayes voting on the Penn TreeBank constituent structures (16% error reduction). Here, even the framework is different, as has been explained above.</Paragraph>
  </Section>
  <Section position="9" start_page="176" end_page="176" type="metho">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have tested several approaches to combining of dependency parsers. Accuracy-aware voting of the four best parsers turned out to be the best method, as it significantly improved the accuracy of the best component from 85.0 to 87.0 % (13 % error</Paragraph>
  </Section>
  <Section position="10" start_page="176" end_page="176" type="metho">
    <SectionTitle>
5 Also alternatively called unrestricted.
</SectionTitle>
    <Paragraph position="0"> rate reduction). The unbalanced voting lead to the precision as high as 90.2 %, while the F-measure of 87.3 % outperforms the best result of balanced voting (87.0).</Paragraph>
    <Paragraph position="1"> At the same time, we found that employing context to this task is very difficult even with a well-known and widely used machine-learning approach. null The methods are language independent, though the amount of accuracy improvement may vary according to the performance of the available parsers. null Although voting methods are themselves not new, as far as we know we are the first to propose and evaluate their usage in full dependency parsing. null</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML