<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1021">
  <Title>Multilingual Dependency Parsing using Bayes Point Machines</Title>
  <Section position="7" start_page="163" end_page="165" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> Table 2 presents the accuracy of the dependency parsers. Dependency accuracy indicates for how many tokens we identified the correct head. Root accuracy, i.e. for how many sentences we identified the correct root or roots, is reported as an F1 measure, since sentences in the Czech and Arabic corpora can have multiple roots and since the parsing algorithms can identify multiple roots. Complete match indicates how many sentences were a complete match with the oracle dependency parse.</Paragraph>
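The three metrics above can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code; it assumes each sentence is represented as a list of head indices with 0 marking a root, and it computes root accuracy as an F1 over the sets of predicted and gold root tokens, as the text describes.

```python
# Sketch of the Table 2 metrics: dependency accuracy, root F1, complete match.
# Assumed representation: a sentence is a list of head indices (0 = root).

def evaluate(gold_sents, pred_sents):
    tok_correct = tok_total = 0
    root_tp = root_pred = root_gold = 0
    complete = 0
    for gold, pred in zip(gold_sents, pred_sents):
        # Dependency accuracy: tokens whose predicted head matches the gold head.
        tok_correct += sum(g == p for g, p in zip(gold, pred))
        tok_total += len(gold)
        # Root F1: sentences may have several roots, and the parser may
        # predict several, so precision and recall both matter.
        g_roots = {i for i, h in enumerate(gold) if h == 0}
        p_roots = {i for i, h in enumerate(pred) if h == 0}
        root_tp += len(g_roots & p_roots)
        root_pred += len(p_roots)
        root_gold += len(g_roots)
        # Complete match: the whole predicted parse equals the gold parse.
        complete += gold == pred
    dep_acc = tok_correct / tok_total
    p = root_tp / root_pred if root_pred else 0.0
    r = root_tp / root_gold if root_gold else 0.0
    root_f1 = 2 * p * r / (p + r) if p + r else 0.0
    return dep_acc, root_f1, complete / len(gold_sents)
```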
    <Paragraph position="1"> A convention appears to have arisen when reporting dependency accuracy to give results for English excluding punctuation (i.e., ignoring punctuation tokens in the output of the parser) and to report results for Czech including punctuation. In order to facilitate comparison of the present results with previously published results, we present measures including and excluding punctuation for all four languages. We hope that by presenting both sets of measurements, we also simplify one dimension along which published results of parse accuracy differ. A direct comparison of parse results across languages remains difficult because the languages themselves, the corpora, and the standards of linguistic detail annotated all differ, but comparing parsers for two different languages where both results include punctuation is at least preferable to comparing results including punctuation to results excluding punctuation.</Paragraph>
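The two scoring conventions differ only in whether punctuation tokens enter the denominator. A minimal sketch, assuming a hypothetical token representation of (form, gold head, predicted head) triples and treating a token as punctuation when every character is a punctuation symbol:

```python
# Sketch of scoring including vs. excluding punctuation.
# Tokens are hypothetical (form, gold_head, pred_head) triples.
import string

def dep_accuracy(tokens, exclude_punct=False):
    if exclude_punct:
        # Drop tokens that consist entirely of punctuation characters.
        tokens = [t for t in tokens
                  if not all(c in string.punctuation for c in t[0])]
    correct = sum(g == p for _, g, p in tokens)
    return correct / len(tokens)
```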
    <Paragraph position="2"> The results reported here for English and Czech are comparable to the previous best published numbers in (McDonald et al., 2005a), as Table 3 shows.</Paragraph>
    <Paragraph position="3"> This table compares McDonald et al.'s results for an averaged perceptron trained for ten iterations with no check for convergence (Ryan McDonald, pers. comm.), MIRA, a large margin classifier, and the current Bayes Point Machine results.</Paragraph>
    <Paragraph position="4"> To determine statistical significance we used confidence intervals for p=0.95. For the comparison of English dependency accuracy excluding punctuation, MIRA and BPM are both statistically significantly better than the averaged perceptron result reported in (McDonald et al., 2005a). MIRA is significantly better than BPM when measuring dependency accuracy and root accuracy, but BPM is significantly better when measuring sentences that match completely.</Paragraph>
    <Paragraph position="5"> From the fact that neither MIRA nor BPM clearly outperforms the other, we conclude that we have successfully replicated the results reported in (McDonald et al., 2005a) for English.</Paragraph>
    <Paragraph position="6"> For Czech we also determined significance using confidence intervals for p=0.95 and compared results including punctuation. For both dependency accuracy and root accuracy, MIRA is statistically significantly better than averaged perceptron, and BPM is statistically significantly better than MIRA.</Paragraph>
    <Paragraph position="7"> Measuring the number of sentences that match completely, BPM is statistically significantly better than averaged perceptron, but MIRA is significantly better than BPM. Again, since neither MIRA nor BPM outperforms the other on all measures, we conclude that the results constitute a validation of the results reported in (McDonald et al., 2005a).</Paragraph>
    <Paragraph position="8"> For every language, the dependency accuracy of the Bayes Point Machine was greater than the accuracy of the best individual perceptron that contributed to that Bayes Point Machine, as Table 4 shows. As previously noted, when measuring against the development test set, we used human-annotated part-of-speech labels for English and Chinese. Although the Prague Czech Dependency Treebank is much larger than the English Penn Treebank, all measurements are lower than the corresponding measurements for English. This reflects the fact that Czech has considerably more inflectional morphology than English, leading to data sparsity for the lexical features.</Paragraph>
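The core of the Bayes Point Machine construction is training several perceptrons on differently shuffled data and averaging their weight vectors, which is what makes the comparison with the best individual perceptron meaningful. The sketch below illustrates the idea on a generic binary task; the feature extraction and the actual parsing algorithm are elided, and `train_perceptron` is a hypothetical stand-in for the paper's per-run training.

```python
# Sketch of the Bayes Point Machine idea: average the weight vectors of
# several perceptrons, each trained on a differently shuffled data stream.
# Data here are hypothetical (sparse_features, label) pairs with label +/-1.
import random

def train_perceptron(data, dim, epochs=10, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)  # a different shuffle per seed gives diverse runs
        for x, y in data:  # x: {feature_index: value}, y in {-1, +1}
            score = sum(w[i] * v for i, v in x.items())
            if y * score <= 0:  # mistake-driven update
                for i, v in x.items():
                    w[i] += y * v
    return w

def bayes_point_machine(data, dim, n_perceptrons=10):
    runs = [train_perceptron(list(data), dim, seed=s)
            for s in range(n_perceptrons)]
    # The BPM weight vector is the mean of the individual weight vectors.
    return [sum(ws) / n_perceptrons for ws in zip(*runs)]
```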
    <Paragraph position="9"> The results reported here for Arabic are, to our knowledge, the first published numbers for dependency parsing of Arabic. Similarly, the results for Chinese are the first published results for the dependency parsing of the Chinese Treebank 5.0.</Paragraph>
    <Paragraph position="10">  Since the Arabic and Chinese numbers are well short of the numbers for Czech and English, we attempted to determine what impact the smaller corpora used for training the Arabic and Chinese parsers might have. We performed data reduction experiments, training the parsers on five random samples at each size smaller than the entire training set. Figure 2 shows the dependency accuracy measured on the complete development test set when training with samples of the data. The graph shows the average dependency accuracy for five runs at each sample size up to 5,000 sentences. ((Wang et al., 2005) report numbers for undirected dependencies on the Chinese Treebank 3.0; we cannot meaningfully compare those numbers to the numbers here.)</Paragraph>
    <Paragraph position="11">  English and Chinese accuracies in this graph use oracle part-of-speech tags. At all sample sizes, the dependency accuracy for English exceeds the dependency accuracy of the other languages. This difference is perhaps partly attributable to the use of oracle part-of-speech tags. However, we suspect that the major contributor to this difference is the part-of-speech tag set. The tags used in the English Penn Treebank encode traditional lexical categories such as noun, preposition, and verb. They also encode morphological information such as person (the VBZ tag, for example, is used for verbs that are third person, present tense, typically with the suffix -s), tense, number and degree of comparison. The part-of-speech tag sets used for the other languages encode lexical categories, but do not encode morphological information. (For Czech and Arabic we followed the convention established in previous parsing work on the Prague Czech Dependency Treebank of using the major and minor part-of-speech tags but ignoring other morphological information annotated on each node.) With small amounts of data, the perceptrons do not encounter sufficient instances of each lexical item to calculate reliable weights. The perceptrons are therefore forced to rely on the part-of-speech information.</Paragraph>
    <Paragraph position="12"> It is surprising that the results for Arabic and Chinese should be so close as we vary the size of the training data (Figure 2), given that Arabic has rich morphology and Chinese very little. One possible explanation for the similarity in accuracy is that the rather poor root accuracy in Chinese indicates parses that have gone awry. Anecdotal inspection of parses suggests that when the root is not correctly identified, there are usually cascading related errors. Czech, a morphologically complex language in which root identification is far from straightforward, exhibits the worst performance at small sample sizes. But (not shown) as the sample size increases, the accuracy of Czech and Chinese converge.</Paragraph>
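The data-reduction protocol behind Figure 2 (five random samples per size, accuracies averaged) can be sketched as below. `train_and_eval` is a hypothetical stand-in for the full train/parse/score pipeline, which is not shown here.

```python
# Sketch of the data-reduction experiments: for each sample size, draw five
# random samples of the training set, train and evaluate on each, and
# average the resulting dependency accuracies into one learning-curve point.
import random

def learning_curve(train_sents, sizes, train_and_eval, runs=5, seed=0):
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        accs = [train_and_eval(rng.sample(train_sents, n))
                for _ in range(runs)]
        curve[n] = sum(accs) / runs
    return curve
```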
  </Section>
</Paper>