<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2926">
<Title>A Pipeline Model for Bottom-Up Dependency Parsing</Title>
<Section position="3" start_page="0" end_page="187" type="metho">
<SectionTitle> 1 System Description </SectionTitle>
<Section position="1" start_page="0" end_page="186" type="sub_section">
<SectionTitle> 1.1 Parsing as a Pipeline </SectionTitle>
<Paragraph position="0"> Pipeline computation is a common strategy in natural language processing, where a task is decomposed into several stages that are solved sequentially. For example, a semantic role labeling program may start by using a part-of-speech tagger, then apply a shallow parser to chunk the sentence into phrases, and continue by identifying predicates and arguments and then classifying them.</Paragraph>
<Paragraph position="1"> (Yamada and Matsumoto, 2003) proposed a bottom-up dependency parsing algorithm in which local actions, chosen from among Shift, Left, and Right, are used to generate a dependency tree via a shift-reduce parsing approach. Moreover, they used SVMs to learn the parsing decisions between pairs of consecutive words in the sentence (a pair of words may become consecutive after the words between them become the children of these two words). This is a true pipeline approach in that the classifiers are trained on individual decisions rather than on the overall quality of the parser, and are chained to yield the global structure. It suffers from the limitations of pipeline processing, such as the accumulation of errors, but nevertheless yields very competitive parsing results.</Paragraph>
<Paragraph position="2"> We devise two natural principles for enhancing pipeline models. First, inference procedures should be incorporated to make robust predictions at each stage. Second, the number of predictions should be minimized to prevent error accumulation. Following these two principles, we propose an improved pipeline framework for multilingual dependency parsing that aims at addressing the limitations of pipeline processing. Specifically, (1) we use local search, a look-ahead policy, to improve the accuracy of the predicted actions, and (2) we argue that the parsing algorithm we use minimizes the number of actions (Chang et al., 2006).</Paragraph>
<Paragraph position="3"> We use the set of actions Shift, Left, Right, WaitLeft, and WaitRight for the parsing algorithm. The pure Wait action was suggested in (Yamada and Matsumoto, 2003). Here, however, we arrive at these five actions by separating the action Left into (real) Left and WaitLeft, and Right into (real) Right and WaitRight. Predicting these turns out to be easier due to the finer granularity. We then use local search over consecutive actions to better exploit the dependencies among them. The parsing algorithm is a modified shift-reduce parser (Aho et al., 1986) that makes use of the actions described above and applies them in a left-to-right manner on consecutive word pairs (a, b) (a < b) in the word list T. T is initialized as the full sentence; later, the actions will change the contents of T. The actions are used as follows:</Paragraph>
<Paragraph position="4">
Shift: there is no relation between a and b.
Right: b is the parent of a.
Left: a is the parent of b.
WaitLeft: a is the parent of b, but it is possible that b is a parent of other nodes. The action is deferred.</Paragraph>
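To make these action semantics concrete, the sketch below reads a gold action for a consecutive pair (a, b) off an annotated tree. This is only an illustration under stated assumptions: the `gold_head` / `pending` bookkeeping and the requirement that a word collect all of its children before being attached are choices made for the example, not code from the paper.

```python
# Illustrative oracle for the action semantics of a consecutive pair (a, b).
# gold_head[x] is the gold parent of token x; pending[x] counts gold children
# of x that are not yet attached.  Hypothetical helper, not the paper's code.

SHIFT, LEFT, RIGHT, WAIT_LEFT = "S", "L", "R", "WL"

def gold_action(a, b, gold_head, pending):
    if gold_head[b] == a:
        # a is the parent of b; defer (WaitLeft) while b still has children to collect
        return LEFT if pending[b] == 0 else WAIT_LEFT
    if gold_head[a] == b and pending[a] == 0:
        # b is the parent of a
        return RIGHT
    return SHIFT
```

The WaitLeft case directly mirrors the definition above: the Left attachment is deferred until b has collected its own children.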
<Paragraph position="5"> The actions control the procedure of building trees. When Left or Right is performed, the algorithm has found a parent and a child. The function deleteWord is then called to eliminate the child word, and the procedure is repeated until the tree is built. In projective languages, we discovered that the action WaitRight is not needed; therefore, for projective languages, we need only four actions.</Paragraph>
<Paragraph position="6"> In order to complete the description of the algorithm, we need to describe which pair of consecutive words is considered once an action has been taken. We describe this via the notion of the focus point, which represents the index of the current word in T. In fact, the choice of focus point does not affect the correctness of the algorithm: it is easy to show that any pair of consecutive words in the sentence can be considered next. If the correct action is chosen for the corresponding pair, this will eventually yield the correct tree (but may necessitate multiple cycles through the sentence).</Paragraph>
<Paragraph position="7"> In practice, however, the chosen actions will be noisy, and a wasteful focus point policy will result in a large number of actions, and thus in error accumulation. To minimize the number of actions taken, we want to find a good focus point placement policy.</Paragraph>
<Paragraph position="8"> There are many natural placement policies that we could consider (Chang et al., 2006). In this paper, according to the policy we use, after Shift and WaitLeft the focus point moves one word to the right; after Left or Right we adopt the policy Step Back, and the focus moves back one word to the left. Although the focus placement policy here is similar to that of (Yamada and Matsumoto, 2003), they did not explain why they made this choice. In (Chang et al., 2006) we show that the placement policy used here minimizes the number of actions during the parsing procedure. We can also show that the algorithm can parse a sentence with projective relationships in only one round.</Paragraph>
<Paragraph position="9"> Once the parsing algorithm, along with the focus point policy, is determined, we can train the action classifiers. Given an annotated corpus, the parsing algorithm is used to determine the action taken for each consecutive pair; this is used to train a classifier to predict one of the four actions. The details of the classifier and the features are given in Section 3.</Paragraph>
<Paragraph position="13"> When we apply the trained model to new data, the sentence is processed from left to right to produce the predicted dependency tree. The evaluation process is somewhat more involved, since the action classifier is not used as is, but rather via a local search inference step. This is described in Section 2.</Paragraph>
<Paragraph position="14"> Algorithm 1 depicts the pseudo code of our parsing algorithm.
Algorithm 1: Pseudo code of the dependency parsing algorithm. getFeatures extracts the features describing the currently considered pair of words; getAction determines the appropriate action for the pair; assignParent assigns the parent for the child word based on the action; and deleteWord deletes the word which becomes the child once the action is taken.
Let t represent a word and its part of speech
For sentence T = {t1, t2, ..., tn}
  focus = 1
  while focus < |T| do
    (apply getFeatures and getAction to the pair at the focus point, then assignParent and deleteWord as dictated by the action, and move the focus according to the placement policy)</Paragraph>
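The extracted text does not reproduce the loop body of Algorithm 1, so the following Python sketch reconstructs it from the caption and the Step Back focus policy described above. Feature extraction and the trained classifier are abstracted behind a `get_action` callback; the names and indexing details are assumptions, not the paper's exact pseudo code.

```python
# Sketch of the parsing loop of Algorithm 1 (loop body reconstructed from the
# caption and the Step Back policy; details are illustrative).

SHIFT, LEFT, RIGHT, WAIT_LEFT = "S", "L", "R", "WL"

def parse(tokens, get_action):
    """tokens: list of (word, pos) pairs.  get_action(T, head, focus) stands in
    for getFeatures + getAction and returns one of the four actions for the
    consecutive pair (T[focus], T[focus + 1])."""
    T = list(range(len(tokens)))       # word list: indices of words still active
    head = {i: None for i in T}        # assignParent writes into this map
    focus = 0
    while focus < len(T) - 1:
        a, b = T[focus], T[focus + 1]
        action = get_action(T, head, focus)
        if action == LEFT:             # a is the parent of b
            head[b] = a
            del T[focus + 1]           # deleteWord: remove the child from T
            focus = max(focus - 1, 0)  # Step Back
        elif action == RIGHT:          # b is the parent of a
            head[a] = b
            del T[focus]
            focus = max(focus - 1, 0)  # Step Back
        else:                          # Shift or WaitLeft: move one word right
            focus += 1
    return head                        # predicted parent of every word
```

With correct actions and the Step Back policy, one left-to-right pass suffices for a sentence with projective relationships, as stated above.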
<Paragraph position="15"> Our algorithm is designed for projective languages. For non-projective relationships in some languages, we convert them into near-projective ones and then directly apply the algorithm to the modified data in the training stage. Because sentences in some languages, such as Czech, may have multiple roots, in our experiments we ran multiple rounds of Algorithm 1 to build the tree.</Paragraph>
</Section>
<Section position="2" start_page="186" end_page="187" type="sub_section">
<SectionTitle> 1.2 Labeling the Type of Dependencies </SectionTitle>
<Paragraph position="0"> In our work, labeling the type of dependencies is a post-processing task, carried out after the phase of predicting the heads of the tokens in the sentence. It is a multi-class classification task. The number of dependency types for each language can be found in the organizers' introduction paper of the CoNLL-X shared task. In the phase of learning dependency types, the parents of the tokens, which were labeled in the first phase, are used as features. The predicted actions also help us make accurate predictions for the dependency types.</Paragraph>
</Section>
<Section position="3" start_page="187" end_page="187" type="sub_section">
<SectionTitle> 1.3 Dealing with Crossing Edges </SectionTitle>
<Paragraph position="0"> The algorithm described in the previous section is primarily designed for projective languages. To deal with non-projective languages, we use an approach similar to that of (Nivre and Nilsson, 2005) to map non-projective trees to projective trees. Any single-rooted non-projective dependency tree can be mapped into a projective tree by the Lift operation, defined as follows: Lift(wj → wk) = parent(wj) → wk, where a → b means that a is the parent of b, and parent is a function that returns the parent word of a given word. The procedure is as follows: the mapping algorithm examines whether there is a crossing edge in the current tree; if there is, it performs Lift and replaces the edge, repeating until the tree becomes projective.</Paragraph>
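A minimal sketch of this projectivization step, assuming the tree is given as a dict mapping each word index to its parent (None for the root). The crossing-edge test and the driver loop are illustrative reconstructions, not the paper's code.

```python
# Repeatedly Lift non-projective edges until the tree is projective.
# head[child] = parent index; the root has parent None.  Illustrative only;
# the paper follows (Nivre and Nilsson, 2005).

def is_projective_edge(head, child):
    """An edge parent -> child is projective if every word strictly between
    the two endpoints is a (possibly indirect) descendant of the parent."""
    parent = head[child]
    lo, hi = min(parent, child), max(parent, child)
    for w in range(lo + 1, hi):
        a = w
        while a is not None and a != parent:
            a = head[a]            # walk up toward the root
        if a != parent:
            return False
    return True

def projectivize(head):
    """Lift(w_j -> w_k) = parent(w_j) -> w_k, applied until no crossing edge remains."""
    changed = True
    while changed:
        changed = False
        for child, parent in head.items():
            if parent is None:
                continue
            if not is_projective_edge(head, child):
                head[child] = head[parent]   # Lift: reattach child to the grandparent
                changed = True
    return head
```

The check walks up the tree to test whether the parent dominates every word between the two endpoints, which is the standard projectivity condition for a single edge.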
</Section>
</Section>
<Section position="4" start_page="187" end_page="188" type="metho">
<SectionTitle> 2 Local Search </SectionTitle>
<Paragraph position="0"> The advantage of a pipeline model is that it can use additional information taken from the outcomes of previous predictions. However, this may also result in accumulated errors. It is therefore essential for our algorithm to use a reliable action predictor. This motivates the following approach for making the local predictions in a pipeline model more reliable: informally, we devise a local search algorithm and use it as a look-ahead policy when determining the predicted action.</Paragraph>
<Paragraph position="1"> To improve accuracy, we might want to examine all combinations of proposed actions and choose the one that maximizes the score. However, it is clearly intractable to find the globally optimal prediction sequence in a pipeline model of the depth we consider: the number of possible action sequences grows exponentially, so we cannot examine every possibility. A local search framework that uses additional information, however, is both suitable and tractable.</Paragraph>
<Paragraph position="2"> The local search algorithm is presented in Algorithm 2.
Algorithm 2: Pseudo code for the local search algorithm. In the algorithm, y represents an action sequence. The function search considers all possible action sequences with depth actions and returns the sequence with the highest score.</Paragraph>
<Paragraph position="4"> The algorithm requires two parameters, model and depth. We assume a classifier that can give a confidence in its prediction; this is represented here by model. depth is the parameter determining the depth of the local search. State encodes the configuration of the environment (in the context of dependency parsing this includes the sentence, the focus point, and the current parent and children of each node). Note that the features extracted for the action classifier depend on State, and that State is changed by the update function when a prediction is made. In this paper, the update function handles child word elimination, relation addition, and focus point movement.</Paragraph>
<Paragraph position="5"> The search algorithm performs a search of length depth. Additive scoring is used to score each sequence, and the first action in the best sequence is performed. Then State is updated, determining the next features for the action classifiers, and search is called again.</Paragraph>
<Paragraph position="6"> One interesting property of this framework is that we use future information in addition to past information. The pipeline model naturally allows access to all the past information; since our algorithm additionally uses the search as a look-ahead policy, it can produce more robust results.</Paragraph>
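A sketch of this look-ahead inference, assuming the classifier exposes per-action confidence scores and that State supports an apply/update operation. The names (model.scores, state.apply, state.done) are hypothetical, chosen only for the illustration.

```python
# Sketch of the look-ahead (local search) inference described above.
# model.scores(state) -> {action: confidence}, state.apply(action), and
# state.done() are assumed interfaces, not the paper's actual API.
from itertools import product

ACTIONS = ["S", "L", "R", "WL"]

def best_first_action(state, model, depth):
    """Additively score every action sequence of length `depth` and return
    the first action of the highest-scoring sequence."""
    best_first, best_score = None, float("-inf")
    for seq in product(ACTIONS, repeat=depth):
        s, total = state, 0.0
        for action in seq:
            total += model.scores(s)[action]
            s = s.apply(action)      # update: add relation / delete child / move focus
            if s.done():             # sentence fully reduced before reaching depth
                break
        if total > best_score:
            best_first, best_score = seq[0], total
    return best_first

def parse_with_lookahead(state, model, depth):
    """Commit one action at a time, each chosen by a depth-step look-ahead."""
    while not state.done():
        state = state.apply(best_first_action(state, model, depth))
    return state
```

Exhaustive enumeration over the 4^depth sequences matches the description above and is tractable because depth is small; only the first action is committed, after which the search is repeated from the updated State.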
</Section>
<Section position="5" start_page="188" end_page="188" type="metho">
<SectionTitle> 3 Experiments and Results </SectionTitle>
<Paragraph position="0"> In this work we used as our learning algorithm a regularized variant of the perceptron update rule as incorporated in SNoW (Roth, 1998; Carlson et al., 1999), a multi-class classifier that is specifically tailored for large-scale learning tasks. SNoW uses softmax over the raw activation values as its confidence measure, which can be shown to be a reliable approximation of the labels' probabilities. It is used both for labeling the actions and for labeling the types of dependencies. No language-specific enhancements are required. The resources provided for the 12 languages are described in (Hajič et al., 2004; Chen et al., 2003; Böhmová et al., 2003; Kromann, 2003; van der Beek et al., 2002; Brants et al., 2002; Kawata and Bartels, 2000; Afonso et al., 2002; Džeroski et al., 2006; Civit Torruella and Martí Antonín, 2002; Nilsson et al., 2005; Oflazer et al., 2003; Atalay et al., 2003).</Paragraph>
<Section position="1" start_page="188" end_page="188" type="sub_section">
<SectionTitle> 3.1 Experimental Setting </SectionTitle>
<Paragraph position="0"> The feature set plays an important role in the quality of the classifier. We used essentially the same feature set for the action selection classifiers and for the label classifiers. In our work, each example has on average fifty active features. For each word pair (w1, w2), we used their LEMMA and POSTAG, and also the POSTAG of the children of w1 and w2. We also included the LEMMA and POSTAG of the surrounding words in a window of size (2, 4).</Paragraph>
<Paragraph position="1"> We considered 2 words before w1 and 4 words after w2 (this agrees with the window size in (Yamada and Matsumoto, 2003)). The major difference between our feature set and the one in (Yamada and Matsumoto, 2003) is that we include the previously predicted action. We also added some conjunctions of the above features to ensure the expressiveness of the model; (Yamada and Matsumoto, 2003) made use of a polynomial kernel of degree 2, so they in fact use more conjunctive features. Besides these features, we incorporated the FEATS information for the languages where it is available. The columns in the data files used in our work are the LEMMA, the POSTAG, and FEATS, which is treated as atomic. Due to time limitations, we did not apply the local search algorithm to the languages having the FEATS features.</Paragraph>
</Section>
<Section position="2" start_page="188" end_page="188" type="sub_section">
<SectionTitle> 3.2 Results </SectionTitle>
<Paragraph position="0"> Table 1 shows our results on Unlabeled Attachment Score (UAS), Labeled Attachment Score (LAS), and Label Accuracy (LAC) for the 12 languages.</Paragraph>
<Paragraph position="1"> Our results are compared with the average scores (AV) and the standard deviations (SD) of all the systems participating in the CoNLL-X shared task.</Paragraph>
<Paragraph position="2"> Our average UAS over the 12 languages is 83.54% with a standard deviation of 6.01, and our average LAS is 76.80% with a standard deviation of 9.43.</Paragraph>
</Section>
</Section>
</Paper>