<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1012"> <Title>Online Large-Margin Training of Dependency Parsers</Title> <Section position="5" start_page="94" end_page="96" type="evalu"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> We tested our methods experimentally on the English Penn Treebank (Marcus et al., 1993) and on the Czech Prague Dependency Treebank (Hajič, 1998).</Paragraph> <Paragraph position="1"> All experiments were run on a dual 64-bit AMD Opteron 2.4GHz processor.</Paragraph> <Paragraph position="2"> To create dependency structures from the Penn Treebank, we used the extraction rules of Yamada and Matsumoto (2003), which are an approximation to the lexicalization rules of Collins (1999). We split the data into three parts: sections 02-21 for training, section 22 for development, and section 23 for evaluation. Currently the system has 6,998,447 features. Each instance uses only a tiny fraction of these features, making sparse vector calculations possible.</Paragraph> <Paragraph position="3"> Our system assumes POS tags as input and uses the tagger of Ratnaparkhi (1996) to provide tags for the development and evaluation sets.</Paragraph> <Paragraph position="4"> Table 2 shows the performance of the systems that were compared. Y&M2003 is the SVM shift-reduce parsing model of Yamada and Matsumoto (2003), N&S2004 is the memory-based learner of Nivre and Scholz (2004), and MIRA is the system we have described. We also implemented an averaged perceptron system (Collins, 2002), another online learning algorithm, for comparison. This table compares only pure dependency parsers that do not exploit phrase structure. [Table 2 caption: Accuracy is the percentage of words that correctly identified their parent in the tree. Root is the number of trees in which the root word was correctly identified; for Czech this is f-measure, since a sentence may have multiple roots. Complete is the number of sentences for which the entire dependency tree was correct.]</Paragraph> <Paragraph position="5"> We ensured that the gold-standard dependencies of all systems compared were identical.</Paragraph> <Paragraph position="6"> Table 2 shows that the model described here performs as well as or better than previous comparable systems, including that of Yamada and Matsumoto (2003). Their method has the potential advantage that SVM batch training takes into account the constraints from all training instances in the optimization, whereas online training considers constraints from only one instance at a time. However, they are fundamentally limited by their approximate search algorithm. In contrast, our system searches the entire space of dependency trees and most likely benefits greatly from this. The difference is amplified when looking at the percentage of trees whose root word is correctly identified. The models that search the entire space do not suffer from bad approximations made early in the search and are thus more likely to identify the correct root, whereas the approximate algorithms are prone to error propagation, which culminates in the attachment decisions at the top of the tree. When comparing the two online learning models, we see that MIRA outperforms the averaged perceptron. This difference is statistically significant, p < 0.005 (McNemar test on head selection accuracy).</Paragraph>
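For concreteness, the following is a minimal sketch of the single-best MIRA step underlying this comparison. The paper's implementation is in Java and generalizes to k-best constraints (see Section 3.2); the function name and the use of NumPy vectors here are our own, and the loss is assumed to be the number of words with incorrect heads.

```python
import numpy as np

def mira_update(w, feats_gold, feats_pred, loss):
    """One single-best MIRA step: change w as little as possible while
    making the gold tree outscore the predicted tree by a margin of at
    least `loss` (e.g., the number of words with the wrong head).

    With a single constraint the quadratic program has the closed-form
    solution below: a hinge step scaled by the squared feature distance.
    """
    delta = feats_gold - feats_pred        # feature-vector difference
    margin = w.dot(delta)                  # score(gold) - score(pred)
    norm_sq = delta.dot(delta)
    if norm_sq == 0.0:                     # identical features: nothing to do
        return w
    tau = max(0.0, (loss - margin) / norm_sq)  # Lagrange multiplier / step size
    return w + tau * delta
```

The averaged perceptron baseline differs only in that its step size is fixed at 1, with the final weights averaged over all updates rather than margin-scaled.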
<Paragraph position="7"> In our Czech experiments, we used the dependency trees annotated in the Prague Treebank and the predefined training, development, and evaluation sections of this data. The number of sentences in this data set is nearly twice that of the English treebank, leading to a very large number of features: 13,450,672. But again, each instance uses just a handful of these features. For POS tags we used the automatically generated tags in the data set. Though we made no language-specific model changes, we did need to make some data-specific changes. In particular, we used the method of Collins et al. (1999) to simplify the part-of-speech tags, since the rich tagset used for Czech would have led to a large number of rarely seen POS features.</Paragraph> <Paragraph position="8"> The model based on MIRA also performs well on Czech, again slightly outperforming the averaged perceptron. Unfortunately, we do not know of any other parsing systems tested on the same data set. The Czech parser of Collins et al. (1999) was run on a different data set, and most other dependency parsers are evaluated on English. Learning a model from the Czech training data is somewhat problematic, since it contains some crossing dependencies, which cannot be parsed by the Eisner algorithm. One trick is to rearrange the words in the training set so that all trees are nested (see the projectivity check sketched below). This at least allows the training algorithm to obtain reasonably low error on the training set. We found that this did improve performance slightly, to 83.6% accuracy.</Paragraph>
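To make the projectivity constraint concrete, here is a small, hypothetical check for crossing dependencies. The head-array encoding (heads[i] gives the parent of token i+1, with 0 as the artificial root) and the function name are our own, not from the paper.

```python
def is_projective(heads):
    """Return True if a dependency tree contains no crossing arcs.

    Tokens are numbered 1..n and heads[i] is the parent of token i+1
    (0 denotes the artificial root). Two arcs cross when exactly one
    endpoint of one arc lies strictly inside the span of the other.
    The Eisner algorithm can only produce trees that pass this test.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[i + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# Tokens 1 and 2 attach across each other: arcs (1,3) and (2,4) cross.
print(is_projective([3, 4, 0, 3]))  # -> False
```

Training sentences failing this check are the ones whose words would have to be rearranged so that all arcs nest, which is the trick described above.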
<Section position="1" start_page="95" end_page="96" type="sub_section"> <SectionTitle> 3.1 Lexicalized Phrase Structure Parsers </SectionTitle> <Paragraph position="0"> It is well known that dependency trees extracted from lexicalized phrase structure parsers (Collins, 1999; Charniak, 2000) are typically more accurate than those produced by pure dependency parsers (Yamada and Matsumoto, 2003). We compared our system to the Bikel re-implementation of the Collins parser (Bikel, 2004; Collins, 1999), trained with the same head rules as our system. There are two ways to extract dependencies from lexicalized phrase structure. The first is to use the automatically generated dependencies that are explicit in the lexicalization of the trees; we call this system Collins-auto. The second is to take just the phrase structure output of the parser and run the automatic head rules over it to extract the dependencies; we call this system Collins-rules. Table 3 shows the results comparing our system, MIRA-Normal, to the Collins parser for English. [Table 3 caption: Complexity is the computational complexity of each parser and Time the CPU time to parse section 23 of the Penn Treebank.] All systems are implemented in Java and run on the same machine.</Paragraph> <Paragraph position="1"> Interestingly, the dependencies that are automatically produced by the Collins parser are worse than those extracted statically using the head rules. Arguably, this displays the artificiality of English dependency parsing using dependencies automatically extracted from treebank phrase-structure trees. Our system falls in between: better than the automatically generated dependency trees and worse than the head-rule extracted trees.</Paragraph> <Paragraph position="2"> Since the dependencies returned by our system are better than those actually learnt by the Collins parser, one could argue that our model is actually learning to parse dependencies more accurately. However, phrase structure parsers are built to maximize the accuracy of the phrase structure and use lexicalization as just an additional source of information. Thus it is not too surprising that the dependencies output by the Collins parser are not as accurate as those of our system, which is trained and built to maximize accuracy on dependency trees. In complexity and run-time, our system is a huge improvement over the Collins parser.</Paragraph> <Paragraph position="3"> The final system in Table 3 takes the output of Collins-rules and adds a feature to MIRA-Normal that indicates, for a given edge, whether the Collins parser believed this dependency actually exists; we call this system MIRA-Collins. This is a well-known discriminative training trick: using the suggestions of a generative system to influence decisions. This system can essentially be considered a corrector of the Collins parser, and it represents a significant improvement over it. However, there is an added complexity with such a model, as it requires the output of the O(n^5) Collins parser.</Paragraph> </Section> <Section position="2" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 3.2 k-best MIRA Approximation </SectionTitle> <Paragraph position="0"> One question that can be asked is how justifiable the k-best MIRA approximation is. Table 4 reports test accuracy and training time for models with k = 1, 2, 5, 10, 20 on the English data set. Even though the parsing algorithm is proportional to O(k log k), empirically the training times scale linearly with k. Peak performance is achieved very early, with a slight degradation around k = 20. The most likely reason for this phenomenon is that the model is overfitting by ensuring that even unlikely trees are separated from the correct tree in proportion to their loss.</Paragraph>
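A sketch of how such a k-best step can be organized follows. The exact update solves a small quadratic program over the k constraints; the cyclic dual coordinate ascent loop here (in the spirit of Hildreth's procedure) stands in for that solver, and the function name, iteration count, and loss convention are our own assumptions, not the paper's Java implementation.

```python
import numpy as np

def kbest_mira_update(w, feats_gold, kbest_feats, kbest_losses, iters=50):
    """One k-best MIRA step: change w as little as possible while the
    gold tree outscores each of the k highest-scoring trees by a margin
    of at least its loss (number of words with the wrong head).

    The dual solution has the form w + sum_j alpha_j * delta_j with
    alpha_j >= 0; we optimize the alphas one at a time in a cycle.
    """
    deltas = [feats_gold - f for f in kbest_feats]
    alphas = np.zeros(len(deltas))
    for _ in range(iters):
        for j, (delta, loss) in enumerate(zip(deltas, kbest_losses)):
            norm_sq = delta.dot(delta)
            if norm_sq == 0.0:
                continue
            # Move this constraint's dual variable toward satisfying
            # margin >= loss, clipping at zero to stay feasible.
            new_alpha = max(0.0, alphas[j] + (loss - w.dot(delta)) / norm_sq)
            w = w + (new_alpha - alphas[j]) * delta
            alphas[j] = new_alpha
    return w
```

With k = 1 this reduces to the closed-form update sketched in Section 3, which is consistent with the observation that small k already reaches peak accuracy.

</Section> </Section> </Paper>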