<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1021"> <Title>Multilingual Dependency Parsing using Bayes Point Machines</Title> <Section position="4" start_page="160" end_page="161" type="metho"> <SectionTitle> 3 Parser Architecture </SectionTitle> <Paragraph position="0"> We take as our starting point a re-implementation of McDonald's state-of-the-art dependency parser (McDonald et al., 2005a). Given a sentence x, the goal of the parser is to find the highest-scoring parse \hat{y} among all possible parses y \in Y:

\hat{y} = \arg\max_{y \in Y} s(x, y) \quad (1)

(The files in each partition of the Chinese and Arabic data are given at http://research.microsoft.com/~simonco/HLTNAACL2006.)</Paragraph> <Paragraph position="1"> For a given parse y, its score is the sum of the scores of all its dependency links (i, j) \in y:

s(x, y) = \sum_{(i,j) \in y} d(i, j) \quad (2)

where the link (i, j) indicates a head-child dependency between the token at position i and the token at position j. The score d(i, j) of each dependency link (i, j) is further decomposed as the weighted sum of its features f(i, j): d(i, j) = w \cdot f(i, j).</Paragraph> <Paragraph position="2"> This parser architecture naturally consists of three modules: (1) a decoder that enumerates all possible parses y and computes the argmax; (2) a training algorithm for adjusting the weights w given the training data; and (3) a feature representation f(i, j). Two decoders will be discussed here; the training algorithm and feature representation are discussed in the following sections.</Paragraph> <Paragraph position="3"> A good decoder should satisfy several properties: ideally, it should be able to search through all valid parses of a sentence and compute the parse scores efficiently. Efficiency is a significant issue, since the number of possible parses for a given sentence is usually exponential in its length, and the discriminative training methods we describe later require repeated decoding at each training iteration. We re-implemented Eisner's decoder (Eisner, 1996), which searches among all projective parse trees, and the Chu-Liu-Edmonds decoder (Chu and Liu, 1965; Edmonds, 1967), which searches the space of both projective and non-projective parses. (A projective tree is a parse with no crossing dependency links.) For the English and Chinese data, the head-finding rules for converting from Penn Treebank analyses to dependency analyses create trees that are guaranteed to be projective, so Eisner's algorithm suffices. For the Czech and Arabic corpora, a non-projective decoder is necessary. Both algorithms are O(N^3), where N is the number of words in a sentence. (The Chu-Liu-Edmonds decoder, which is based on a maximum spanning tree algorithm, can run in O(N^2), but our simpler O(N^3) implementation was sufficient.)</Paragraph> <Paragraph position="4"> Refer to (McDonald et al., 2005b) for a detailed treatment of both algorithms.</Paragraph> </Section>
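To make the edge-factored scoring above concrete, here is a minimal Python sketch, not the authors' implementation: the weight vector is a plain dictionary, the feature extractor is a deliberately simplified placeholder (the templates actually used are described in Section 5), and the argmax of Eq. 1 would be supplied by a separate Eisner or Chu-Liu-Edmonds decoder, which is omitted here.

```python
# Sketch of edge-factored dependency scoring; illustrative only.

def edge_features(sentence, head, child):
    """Hypothetical, simplified feature extractor for a candidate head->child link.
    `sentence` is a list of {'word': ..., 'pos': ...} tokens (index 0 = synthetic root)."""
    h, c = sentence[head], sentence[child]
    direction = "R" if head < child else "L"
    return [
        f"hw={h['word']}", f"hp={h['pos']}",               # head word / POS
        f"cw={c['word']}", f"cp={c['pos']}",               # child word / POS
        f"hp|cp|dir={h['pos']}|{c['pos']}|{direction}",    # a small conjunction
    ]

def edge_score(weights, sentence, head, child):
    """d(i, j) = w . f(i, j): weighted sum of the link's features."""
    return sum(weights.get(f, 0.0) for f in edge_features(sentence, head, child))

def parse_score(weights, sentence, parse):
    """s(x, y) = sum of d(i, j) over all links in the parse.
    `parse` maps each child index to its head index."""
    return sum(edge_score(weights, sentence, head, child)
               for child, head in parse.items())
```

A decoder then returns the tree maximizing `parse_score` over the projective (Eisner) or unrestricted (Chu-Liu-Edmonds) search space.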
<Section position="5" start_page="161" end_page="162" type="metho"> <SectionTitle> 4 Training: The Bayes Point Machine </SectionTitle> <Paragraph position="0"> In this section, we describe an online learning algorithm for training the weights w. First, we argue why an online learner is more suitable than a batch learner such as a Support Vector Machine (SVM) for this task. We then review some standard online learners (e.g. the perceptron) before presenting the Bayes Point Machine (BPM) (Herbrich et al., 2001; Harrington et al., 2003).</Paragraph> <Section position="1" start_page="161" end_page="162" type="sub_section"> <SectionTitle> 4.1 Online Learning </SectionTitle> <Paragraph position="0"> An online learner differs from a batch learner in that it adjusts w incrementally as each input sample is revealed. Although the training data for our parsing problem exists as a batch (i.e. all input samples are available during training), we can apply online learning by presenting the input samples in some sequential order. For large training sets, a batch learner may face computational difficulties, since each input sentence already has an exponential number of parses. Online learning is more tractable since it works with one input at a time.</Paragraph> <Paragraph position="1"> A popular online learner is the perceptron. It adjusts w by updating it with the feature vector whenever the current input sample is misclassified. It has been shown that such updates converge in a finite number of iterations if the data is linearly separable. The averaged perceptron (Collins, 2002) is a variant that averages w across all iterations; it has demonstrated good generalization, especially on data that is not linearly separable, as in many natural language processing problems.</Paragraph> <Paragraph position="2"> Recently, the good generalization properties of Support Vector Machines have prompted researchers to develop large margin methods for the online setting. Examples include the margin perceptron (Duda et al., 2001), ALMA (Gentile, 2001), and MIRA (which is used to train the parser in (McDonald et al., 2005a)). Conceptually, all of these methods attempt to achieve a large margin and approximate the maximum margin solution of SVMs.</Paragraph> </Section> <Section position="2" start_page="162" end_page="162" type="sub_section"> <SectionTitle> 4.2 Bayes Point Machines </SectionTitle> <Paragraph position="0"> The Bayes Point Machine (BPM) achieves generalization comparable to that of large margin methods, but is motivated by a very different philosophy of Bayesian learning, or model averaging. In the Bayesian learning framework, we assume a prior distribution over w. Observations of the training data revise our belief about w and produce a posterior distribution. The posterior distribution is used to create the final weight vector w_{bp}:

w_{bp} = E_{p(w|D)}[w] = \sum_{i=1}^{|V(D)|} p(w_i|D) \, w_i \quad (3)

where p(w|D) is the posterior distribution of the weights given the data D and E_{p(w|D)} is the expectation taken with respect to this distribution. The term |V(D)| is the size of the version space V(D), which is the set of weights w_i that are consistent with the training data (i.e. the set of w_i that classify the training data with zero error). This solution achieves the so-called Bayes Point, which is the best approximation to the Bayes optimal solution given finite training data.</Paragraph> <Paragraph position="1"> In practice, the version space may be large, so we approximate it with a finite sample of size I. Further, assuming a uniform prior over weights, we get the following equation:

w_{bp} \approx \frac{1}{I} \sum_{i=1}^{I} w_i \quad (4)

</Paragraph> <Paragraph position="2"> Equation 4 can be computed by a very simple algorithm: (1) train separate perceptrons on different random shuffles of the entire training data, obtaining a set of weight vectors w_i; (2) average these weight vectors. Perceptron training results in different weight vector solutions if the data samples are presented sequentially in different orders. Therefore, randomly shuffling the data and training a perceptron on each shuffle is effectively equivalent to sampling different models w_i from the version space. Note that this averaging operation should not be confused with ensemble techniques such as Bagging or Boosting: ensemble techniques average the output hypotheses, whereas the BPM averages the weights (models).</Paragraph> <Paragraph position="3"> [Figure 1: Pseudocode for the Bayes Point Machine. Input: a training set D = ((x_1, y_1), ..., (x_T, y_T)). The algorithm trains I perceptrons on random shuffles of D and returns the averaged weights w_{bp}.]</Paragraph> <Paragraph position="4"> The BPM pseudocode is given in Figure 1. The inner loop is simply a perceptron algorithm, so the BPM is very simple and fast to implement. The outer loop is easily parallelizable, allowing speed-ups in training the BPM. In our specific implementation for dependency parsing, the line of the pseudocode corresponding to [\hat{y}_t = \arg\max_{y} s(x_t, y)] is computed by Eq. 1, and updates are performed for each incorrect dependency link. Also, we chose to average each individual perceptron (Collins, 2002) prior to Bayesian averaging.</Paragraph> <Paragraph position="5"> Finally, it is important to note that the definition of the version space can be extended to include weights with non-zero training error, so the BPM can handle data that is not linearly separable. Also, although we only presented an algorithm for linear classifiers (parameterized by the weights), arbitrary kernels can be applied to the BPM to allow non-linear decision boundaries. Refer to (Herbrich et al., 2001) for a comprehensive treatment of BPMs.</Paragraph> </Section> </Section>
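The following Python sketch mirrors the training procedure just described: I averaged perceptrons (Collins, 2002) are trained on independent random shuffles of the data, and their weight vectors are averaged as in Eq. 4. It is a minimal illustration rather than the paper's implementation: the `decode` function (which would compute Eq. 1 with an Eisner or Chu-Liu-Edmonds decoder), the `edge_features` extractor, the data format (gold parses as child-to-head maps), and the hyperparameter values are all assumptions.

```python
import random
from collections import defaultdict

def averaged_perceptron(data, decode, edge_features, epochs=10):
    """Averaged perceptron on one ordering of `data`, a list of
    (sentence, gold_parse) pairs with gold_parse mapping child index -> head index.
    `decode(weights, sentence)` is assumed to return the highest-scoring parse
    under the current weights (Eq. 1)."""
    w = defaultdict(float)        # current weight vector
    w_sum = defaultdict(float)    # accumulated weights for averaging
    n_seen = 0
    for _ in range(epochs):
        for sentence, gold in data:
            pred = decode(w, sentence)
            # Update once for each incorrect dependency link, as described above.
            for child, gold_head in gold.items():
                if pred[child] != gold_head:
                    for f in edge_features(sentence, gold_head, child):
                        w[f] += 1.0
                    for f in edge_features(sentence, pred[child], child):
                        w[f] -= 1.0
            # Accumulate after every sentence (cf. the training setup in Section 5).
            n_seen += 1
            for f, v in w.items():
                w_sum[f] += v
    return {f: v / n_seen for f, v in w_sum.items()}

def bayes_point_machine(data, decode, edge_features, num_models=10, seed=0):
    """BPM by model averaging (Eq. 4): train one averaged perceptron per random
    shuffle of the training data, then average the resulting weight vectors."""
    rng = random.Random(seed)
    models = []
    for _ in range(num_models):
        shuffled = list(data)
        rng.shuffle(shuffled)
        models.append(averaged_perceptron(shuffled, decode, edge_features))
    w_bp = defaultdict(float)
    for w in models:              # w_bp ~= (1 / I) * sum_i w_i
        for f, v in w.items():
            w_bp[f] += v / num_models
    return dict(w_bp)
```

The outer loop over shuffles is embarrassingly parallel, which is what makes the speed-ups mentioned above straightforward to obtain.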
<Section position="6" start_page="162" end_page="163" type="metho"> <SectionTitle> 5 Features </SectionTitle> <Paragraph position="0"> Dependency parsers for all four languages were trained using the same set of feature types, essentially those described in (McDonald et al., 2005a). For a given pair of tokens, where one is hypothesized to be the parent and the other to be the child, we extract the word of the parent token, the part of speech of the parent token, the word of the child token, the part of speech of the child token, and the part of speech of certain adjacent and intervening tokens. Some of these atomic features are combined in feature conjunctions up to four long, with the result that the linear classifiers described above approximate polynomial kernels. For example, in addition to the atomic features extracted from the parent and child tokens, the feature [ParentWord, ParentPOS, ChildWord, ChildPOS] is also added to the feature vector representing the dependency between the two tokens. Additional features are created by conjoining each of these features with the direction of the dependency (i.e. whether the parent is to the left or right of the child) and a quantized measure of the distance between the two tokens. Every token has exactly one parent. The root of the sentence has a special synthetic token as its parent.</Paragraph> <Paragraph position="1"> Like McDonald et al., we add features that consider the first five characters of words longer than five characters. This truncated word crudely approximates stemming. For Czech and English, the addition of these features improves accuracy. For Chinese and Arabic, however, it is clear that we need a different backoff strategy.</Paragraph> <Paragraph position="2"> For Chinese, we truncate words longer than a single character to the first character. (There is a near 1:1 correspondence between characters and morphemes in contemporary Mandarin Chinese; however, most content words consist of more than one morpheme, typically two.) Experimental results on the development test set suggested that an alternative strategy, truncating words longer than two characters to the first two characters, yielded slightly worse results.</Paragraph> <Paragraph position="3"> The Arabic data is annotated with gold-standard morphological information, including information about stems. It is also annotated with the output of an automatic morphological analyzer, so that researchers can experiment with Arabic without first needing to build these components. For Arabic, we truncate words to the stem, using the value of the lemma attribute.</Paragraph> <Paragraph position="4"> All tokens are converted to lowercase, and numbers are normalized. In the case of English, Czech and Arabic, all numbers are normalized to a single token. In Chinese, months are normalized to a MONTH token, dates to a DATE token, and years to a YEAR token. All other numbers are normalized to a single NUMBER token.</Paragraph>
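As a concrete illustration of the templates and normalization just described, the sketch below builds the feature strings for one candidate dependency, elaborating on the placeholder extractor sketched after Section 3: atomic word/POS features, the four-way conjunction, conjunctions with direction and quantized distance, and a crude per-language backoff (5-character prefix for English and Czech, first character for Chinese, lemma for Arabic). The function names, distance buckets, token dictionary format, and the Chinese month/date/year detection are simplified assumptions rather than the paper's exact implementation, and the adjacent/intervening-POS features are omitted.

```python
import re

def normalize(word, language):
    """Lowercase and normalize numbers, roughly as described in the text."""
    word = word.lower()
    if language == "chinese":
        # Months, dates and years get their own tokens; the detection below
        # is a placeholder, not the paper's rule.
        if re.fullmatch(r"\d+月", word):
            return "MONTH"
        if re.fullmatch(r"\d+日", word):
            return "DATE"
        if re.fullmatch(r"\d+年", word):
            return "YEAR"
    if re.fullmatch(r"[\d.,]+", word):
        return "NUMBER"   # all other numbers collapse to a single token
    return word

def backoff(token, language):
    """Crude stemming backoff: 5-char prefix (English/Czech), first character
    (Chinese), or the annotated lemma (Arabic)."""
    word = token["word"]
    if language in ("english", "czech"):
        return word[:5] if len(word) > 5 else word
    if language == "chinese":
        return word[0] if len(word) > 1 else word
    if language == "arabic":
        return token.get("lemma", word)
    return word

def quantized_distance(head, child):
    """Coarse bucket for the distance between head and child positions."""
    d = abs(head - child)
    for bound in (1, 2, 3, 4, 5, 10):
        if d <= bound:
            return str(bound)
    return ">10"

def dependency_features(sentence, head, child, language):
    """Feature strings for a hypothesized head->child dependency.
    `sentence` is a list of {'word': ..., 'pos': ...} tokens (index 0 = synthetic root)."""
    h, c = sentence[head], sentence[child]
    hw, cw = normalize(h["word"], language), normalize(c["word"], language)
    direction = "R" if head < child else "L"
    dist = quantized_distance(head, child)
    feats = [
        f"PW={hw}", f"PP={h['pos']}", f"CW={cw}", f"CP={c['pos']}",
        f"PW,PP,CW,CP={hw},{h['pos']},{cw},{c['pos']}",        # four-way conjunction
        f"PStem={backoff(h, language)}", f"CStem={backoff(c, language)}",
    ]
    # Conjoin each feature with direction, and with direction + quantized distance.
    feats += [f"{f}&dir={direction}" for f in list(feats)]
    feats += [f"{f}&dist={dist}" for f in feats if "&dir=" in f]
    return feats
```

In the actual parser, instantiating such templates over the training data yields several million features, as reported below.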
<Paragraph position="5"> The feature types were instantiated using all oracle combinations of child and parent tokens from the training data. It should be noted that when the feature types are instantiated, we have considerably more features than McDonald et al. For example, for English we have 8,684,328 features, whereas they report 6,998,447. We suspect that this is mostly due to differences in the implementation of the features that back off to stems.</Paragraph> <Paragraph position="6"> The averaged perceptrons were trained on the one-best parse, updating the perceptron for every edge and averaging the accumulated perceptrons after every sentence. Experiments in which we updated the perceptron based on k-best parses tended to produce worse results. The Chu-Liu-Edmonds algorithm was used for Czech. Experiments with the development test set suggested that the Eisner decoder gave better results for Arabic than the Chu-Liu-Edmonds decoder. We therefore used the Eisner decoder for Arabic, Chinese and English.</Paragraph> </Section> </Paper>