<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2937">
  <Title>The Exploration of Deterministic and Efficient Dependency Parsing</Title>
  <Section position="4" start_page="0" end_page="242" type="metho">
    <SectionTitle>
2 System Description
</SectionTitle>
    <Paragraph position="0"> Over the past decades, many state-of-the-art parsing algorithms have been proposed, such as head-word lexicalized PCFG (Collins, 1998), Maximum Entropy (Charniak, 2000), Maximum/Minimum spanning tree (MST) (McDonald et al., 2005), bottom-up deterministic parsing (Yamada and Matsumoto, 2003), and constant-time deterministic parsing (Nivre, 2003). Among them, Nivre's algorithm (Nivre, 2003) was shown to be the most efficient: it needs at most 2n transition actions to parse a sentence of n words, compared with O(n^2) for the bottom-up or MST approaches. Nivre's method mainly consists of four transition actions: Left, Right, Reduce, and Shift. We extend these four actions by dividing "reduce" into two actions, "reduce" and "sleep" (reduce-but-shift), because a reduce action that is applied too early makes it difficult for the following words to find their parents. Thus, during training, if a word is the child of the top of the stack, it is assigned to the "sleep" category and pushed onto the stack; otherwise, the conventional reduce action is applied. In addition, we do not arrange these transition actions in a priority order; instead, the decision is made by the classifier. The overall parsing model can be found in Figure 1.</Paragraph>
    <Paragraph position="1"> Table 1 lists the detailed system specification of our model.</Paragraph>
    <Paragraph position="2"> Parsing Algorithm: 1. Nivre's Algorithm (Nivre, 2003); 2. Root Parser; 3. Exhaustive-based Post-processing.</Paragraph>
    <Paragraph position="3"> Parser Characteristics: 1. Top-down + Bottom-up; 2. Deterministic + Exhaustive; 3. Labeling integrated; 4. Non-Projective. Learner: SVMLight (Joachims, 1998): (1) One-versus-One; (2) Linear Kernel. Feature Set: 1. Lexical (Unigram/Bigram); 2. Fine-grained POS and Coarse grained</Paragraph>
    <Section position="1" start_page="241" end_page="241" type="sub_section">
      <SectionTitle>
2.1 Constant-time Parser and Analysis
</SectionTitle>
      <Paragraph position="0"> Nivre's algorithm uses a stack and an input list to model word dependency relations by identifying the transition action between the top token on the stack (Top) and the next token of the input list (Next). Typically, a learning algorithm is used to recognize these actions by encoding features of the two terms (Top and Next). The "Left" and "Reduce" actions pop Top from the stack, whereas the "Right", "Reduce-But-Shift", and "Shift" actions push the token Next onto the top of the stack. Nivre (2003) proved that this algorithm completes dependency parsing in at most 2n transition actions.</Paragraph>
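The pop/push grouping above can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the authors' implementation: the names ParserState, apply_action, and run are hypothetical, the arc directions for Left and Right follow the usual reading of Nivre (2003), and because the text does not spell out how "sleep" differs from "shift" beyond its class label, both are treated here as a plain push.

```python
from dataclasses import dataclass, field

ACTIONS = ("left", "right", "reduce", "sleep", "shift")


@dataclass
class ParserState:
    stack: list                                  # token indices already pushed
    buffer: list                                 # remaining input list
    heads: dict = field(default_factory=dict)    # child index -> head index
    steps: int = 0                               # number of transitions applied


def apply_action(state, action):
    """Apply one transition using the pop/push grouping described in the text."""
    if action not in ACTIONS:
        raise ValueError("unknown action: " + str(action))
    if action in ("left", "reduce", "right") and not state.stack:
        raise ValueError(action + " needs a Top token on the stack")
    if action != "reduce" and not state.buffer:
        raise ValueError(action + " needs a Next token in the input list")

    if action == "left":
        # Assumed arc direction: Next becomes the head of Top, then Top is popped.
        state.heads[state.stack[-1]] = state.buffer[0]
        state.stack.pop()
    elif action == "reduce":
        # Top is considered finished and is popped.
        state.stack.pop()
    elif action == "right":
        # Assumed arc direction: Top becomes the head of Next, then Next is pushed.
        state.heads[state.buffer[0]] = state.stack[-1]
        state.stack.append(state.buffer.pop(0))
    else:
        # "sleep" and "shift": push Next without adding an arc (assumption).
        state.stack.append(state.buffer.pop(0))
    state.steps += 1
    return state


def run(n_words, actions):
    """Replay a scripted action sequence; a classifier would normally choose them."""
    state = ParserState(stack=[], buffer=list(range(n_words)))
    for action in actions:
        apply_action(state, action)
    return state


# Toy sentence with tokens 0, 1, 2 where token 1 heads both neighbours.
state = run(3, ["shift", "left", "shift", "right"])
print(state.heads, state.stack, state.steps)
```

In this toy run a three-word sentence is processed in four actions, within the 2n bound, and the tokens left on the stack are what the later stages (root parser and post-processor) have to deal with.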
      <Paragraph position="1"> Although Nivre's algorithm is much more efficient than the others, it has three problems.</Paragraph>
      <Paragraph position="2">  1. It does not explicitly indicate which words are the roots.</Paragraph>
      <Paragraph position="3"> 2. Some of the terms left in the stack are not roots but still need to be parsed.</Paragraph>
      <Paragraph position="4"> 3. It only ever compares the Top and Next words.</Paragraph>
      <Paragraph position="5"> Problems (2) and (3) are complementary. A straightforward resolution is to adopt the exhaustive parsing strategy (Covington, 2001). Unfortunately, such a brute-force approach may require exponential training and testing space, which makes it impractical for large-scale corpora such as the Czech Treebank (1.3 million words).</Paragraph>
      <Paragraph position="6"> To overcome this while keeping the efficiency, we design a post-processor that recycles the residue in the stack and re-identifies the heads of those words. Since most of the terms (90-95%) have been processed in the previous stages, the post-processor exhaustively parses only a small part. In addition, for problem (1), we propose a root parser based on the parsed result of Nivre's algorithm. We discuss the root parser and the post-processor in the next two subsections.</Paragraph>
    </Section>
    <Section position="2" start_page="241" end_page="242" type="sub_section">
      <SectionTitle>
2.2 Root Parser
</SectionTitle>
      <Paragraph position="0"> After the first stage, the stack may contain root and un-parsed words. The root parser identifies the root word in the stack. The main advantage of this strategy is that it avoids a sequential classification process, since it focuses only on the terms in the stack.</Paragraph>
      <Paragraph position="1"> We build a classifier that learns to find the root word from encoded context and children features. Because most of the dependency relations have already been constructed in the first stage, we have richer head-modifier information than the contexts alone. The features used are listed as follows.</Paragraph>
      <Paragraph position="2"> Neighbor terms, bigrams, POS, BiCPOS (+/-2 window); leftmost child term, POS, bigram, BiCPOS; rightmost child term, POS, bigram, BiCPOS</Paragraph>
    </Section>
    <Section position="3" start_page="242" end_page="242" type="sub_section">
      <SectionTitle>
2.3 Post-Processing
</SectionTitle>
      <Paragraph position="0"> Before post-processing, we remove from the stack the root words identified by the root parser.</Paragraph>
      <Paragraph position="1"> The remaining un-parsed words in the stack are used to construct the actual dependency graph via exhaustive comparison with the parsed words. A post-processor is necessary since about 10% of the words in each training set remain un-parsed. We provide the un-parsed rate of each language in Table 2 (the right-hand part).</Paragraph>
      <Paragraph position="2"> By applying the previous two steps (constant-time parser and root parser) to the training data, we recorded the remaining un-parsed tokens. These statistics cover not only the forward parsing direction but also the backward direction. On average, the un-parsed rates of the forward and backward directions are 13% and 4%, respectively.</Paragraph>
      <Paragraph position="3"> Backward parsing achieves the lower un-parsed rate for all languages except Japanese and Turkish.</Paragraph>
      <Paragraph position="4"> To find the heads of the un-parsed words, we copy the whole sentence into the word list again and re-compare the un-parsed tokens (in the stack) with all of the words in the input list. Comparing a word with itself is disallowed. The comparison continues until the actual head is found.</Paragraph>
      <Paragraph position="5"> By default, we use the nearest root word as the head. Although such a brute-force approach is time-consuming, it only has to parse a small number of un-parsed tokens (usually 2 or 3 words per sentence).</Paragraph>
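The post-processing loop described in this subsection can be sketched as follows. It is only an illustration under assumptions: post_process and is_head are hypothetical names, is_head stands in for whatever trained classifier decides whether one word is the head of another, and the sketch does not check for cycles.

```python
def post_process(n_words, heads, unparsed, roots, is_head):
    """Attach each un-parsed token by comparing it with every other word.

    heads:    dict child index -> head index from the earlier stages
    unparsed: token indices still on the stack after the roots were removed
    roots:    token indices identified as roots by the root parser
    is_head:  hypothetical callable (candidate, child) -> bool
    """
    result = dict(heads)
    for child in unparsed:
        chosen = None
        for candidate in range(n_words):
            if candidate == child:
                continue              # comparing a word with itself is disallowed
            if is_head(candidate, child):
                chosen = candidate
                break                 # stop as soon as a head is found
        if chosen is None and roots:
            # Default case: fall back to the nearest root word.
            chosen = min(roots, key=lambda r: abs(r - child))
        if chosen is not None:
            result[child] = chosen
    return result


# Toy example: tokens 3 and 4 are un-parsed, token 1 is the root.
toy_heads = post_process(
    n_words=5,
    heads={0: 1, 2: 1},
    unparsed=[3, 4],
    roots=[1],
    is_head=lambda candidate, child: (candidate, child) == (4, 3),
)
print(toy_heads)
```

Because only the few tokens left on the stack enter the outer loop, the quadratic comparison is paid for a small part of each sentence rather than for every word.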
    </Section>
    <Section position="4" start_page="242" end_page="242" type="sub_section">
      <SectionTitle>
2.4 Features and Learners
</SectionTitle>
      <Paragraph position="0"> For the constant-time parser of the first stage, we employ the following features.</Paragraph>
      <Paragraph position="1"> In this paper, we use support vector machines (SVM) (Joachims, 1998) as the learner. SVMs are widely used in many natural language processing (NLP) areas, for example, POS tagging (Wu et al., 2006). However, an SVM is a binary classifier that only distinguishes true from false. For the multiclass problem, we use the so-called one-versus-one (OVO) method with a linear kernel to combine the results of the pairwise subclassifiers. The final class in the testing phase is determined mainly by majority voting.</Paragraph>
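One-versus-one combination by majority voting can be illustrated with a short sketch. The names ovo_predict and toy_decide are hypothetical, and toy_decide merely compares fixed scores in place of a trained linear-kernel SVM for each pair of classes; the tie-breaking rule is also an assumption, since the text does not describe one.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(classes, pairwise_decide, x):
    """Run every pairwise subclassifier on x and return the majority class."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(a, b, x)] += 1
    # Ties are broken by the order of `classes` (an assumption of this sketch).
    best = max(classes, key=lambda c: votes[c])
    return best, votes


def toy_decide(a, b, x):
    """Stand-in for a binary SVM trained on classes a and b."""
    return a if x[a] >= x[b] else b


classes = ["left", "right", "reduce", "sleep", "shift"]
scores = {"left": 0.1, "right": 0.9, "reduce": 0.3, "sleep": 0.2, "shift": 0.5}
label, votes = ovo_predict(classes, toy_decide, scores)
print(label, dict(votes))
```

With k classes this scheme needs k(k-1)/2 pairwise subclassifiers, which is 10 for the five transition actions when arc labels are ignored.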
      <Paragraph position="2"> For all languages, our parser uses the same settings and features. For all languages except Japanese and Turkish, we use the backward parsing direction to keep the un-parsed token rate low.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="242" end_page="243" type="metho">
    <SectionTitle>
3 Experimental Result
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="242" end_page="242" type="sub_section">
      <SectionTitle>
3.1 Dataset and Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The testing data is provided by Buchholz et al. (2006) and consists of 13 language treebanks.</Paragraph>
      <Paragraph position="1"> The experimental results are mainly evaluated by the unlabeled and labeled attachment scores. CoNLL also provided a Perl script to compute these scores automatically.</Paragraph>
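The two metrics can be made concrete with a small sketch. It is not the official CoNLL script: attachment_scores is a hypothetical helper, the dependency labels in the example are invented, and the sketch counts every token, whereas an official scorer may apply additional rules such as its own treatment of punctuation.

```python
def attachment_scores(gold, predicted):
    """Return (unlabeled, labeled) attachment scores as percentages.

    gold and predicted are lists with one (head_index, label) pair per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    total = len(gold)
    # Unlabeled: the predicted head matches the gold head.
    head_hits = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # Labeled: both the head and the dependency label match.
    full_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "NMOD")]
pred = [(2, "SBJ"), (0, "ROOT"), (2, "NMOD"), (2, "NMOD")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

The labeled score can never exceed the unlabeled one, because a token only counts toward it when its head is already correct.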
    </Section>
    <Section position="2" start_page="242" end_page="243" type="sub_section">
      <SectionTitle>
3.2 System Results
</SectionTitle>
      <Paragraph position="0"> Table 2 presents the overall parsing performance on the 13 languages. As shown in Table 2, we list two parsing results in the second and third columns (new and old). It is worth noting that result B is produced by removing the enhanced features and the post-processing step from our parser, while result A uses the complete set of enhanced features and the overall three-step parsing. This year, we submitted result B to the CoNLL shared task due to the time limitation.</Paragraph>
      <Paragraph position="1"> In addition, we apply MaltParser, which implements Nivre's algorithm (Nivre, 2003), for comparison. MaltParser also includes SVM and memory-based learners (MBL). However, its SVM is not optimized: the training and testing times are too long to be compared even when the linear kernel is used. Therefore, we use the default MBL and feature model 3 (M3) in this experiment. We also perform a significance test to evaluate the statistical difference among the three results. If the answer is "Yes", the two systems are significantly different at a confidence level of at least 95% (p &lt; 0.05).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="243" end_page="244" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="243" end_page="244" type="sub_section">
      <SectionTitle>
4.1 Analysis of Overview Aspect
</SectionTitle>
      <Paragraph position="0"> Although our method is efficient and achieves satisfactory results, it is still far from the state-of-the-art performance. The problems arise not only from language-specific characteristics but also from the parsing strategy. We found that our method is weak on datasets with a large training size and a large number of dependency classes, for example, German (Brants et al., 2002) and Czech. For Dutch, we observe a large number of non-projective tokens and relations in this set. Overall, we conclude the  The main reason for the first problem is the unbalanced distribution of the training data.</Paragraph>
      <Paragraph position="1"> Usually, the right-action categories obtain far fewer training examples. For example, in the Turkish data, 50% of the categories receive less than 0.1% of the training examples, and two thirds of them belong to the right dependency group. For Czech, 74.6% of the categories receive less than 0.1% of the training examples.</Paragraph>
      <Paragraph position="2"> Second, an overly fine-grained POS tag set often makes the features too specific for the learner to generalize. Although we found that the granularity is not the critical factor for our parser, it is closely related to the fourth problem, feature engineering. For example, in Chinese (Chen et al., 2003), there are 303 fine-grained POS types, which achieve a slightly higher labeled attachment score than the coarse-grained ones (81.25 vs. 81.17). Intuitively, the feature combinations deeply affect the system performance (see A vs. C, where we use more features than the original Nivre's algorithm).</Paragraph>
      <Paragraph position="3"> Problem 3 exposes a disadvantage of our method: it is weak at identifying long-distance dependencies. The main reason lies in Nivre's algorithm in step 1. This method is quite sensitive and cannot recover from errors since it is a deterministic parsing strategy. Abnormal or wrong push or pop actions usually propagate errors to the remaining words in the list. For example, a large part of the errors is caused by a too-early reduce or a missed left arc, which leaves some words unable to find their actual heads. Alternatively, one can use N-best selection to choose the optimal dependency graph, or apply MST or an exhaustive parsing scheme. Usually, these approaches are quite inefficient, requiring at least O(n^2).</Paragraph>
      <Paragraph position="4"> Finally, in this paper, we only take the surface lexical word and POS tag into account without employing language-specific features such as lemma and morphology. How to compile and investigate such features remains an open question. On the other hand, we also find that the performance of the root parser is poor in some languages. For example, for Dutch the root precision is only 38.52, while the recall is 76.07. This indicates that most of the words in the stack were wrongly recognized as roots. This is because a substantial un-parsed rate leaves many un-parsed words in the stack. One way to remedy the problem is to adjust the root parser to identify the root word independently by sequential word classification in the first step and then apply Nivre's algorithm. We leave this comparison as future work.</Paragraph>
    </Section>
    <Section position="2" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
4.2 Analysis of Specific View
</SectionTitle>
      <Paragraph position="0"> We select three languages, Arabic, Japanese, and Turkish, for more detailed analysis. Figure 2 illustrates the learning curves of the three languages and Table 3 summarizes the comparisons of "fine vs.</Paragraph>
      <Paragraph position="1"> coarse" POS types and "forward vs. backward" parsing directions.</Paragraph>
      <Paragraph position="2"> For the three languages, we found that most of the errors occur on the noun POS tags, which often account for half of the training set. In Turkish, the lower performance on the noun POS attachment rate deeply influences the overall parsing.</Paragraph>
      <Paragraph position="3"> For example, the error rate of nouns in Turkish is 39%, which is the highest error rate. In contrast, the head error rates fall in the middle rank for the other two languages.</Paragraph>
      <Paragraph position="4"> In Turkish, we also find an interesting result: the recall of distance=2 parsing (56.87) is lower than that of distance=3-6 and &gt;7 (62.65 and 57.83). In other words, for Turkish, our parser often failed to recognize the distance=2 dependency relations. For the other languages, the identification rate of longer-distance dependencies is usually lower than that of shorter ones. Thus, future work on parsing Turkish should put more emphasis on improving not only the noun POS type but also the distance=2 parsing.</Paragraph>
      <Paragraph position="5"> Besides, the root parsing accuracy is also an important factor for most languages. In Japanese, although our parser achieves left/right arc rates of more than 97%, the root word precision is considerably lower (85.97). Among all dependency relation classification rates, the root class usually ranks lowest for the three languages.</Paragraph>
    </Section>
  </Section>
</Paper>