<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1054">
  <Title>A Fast, Accurate Deterministic Parser for Chinese (Sydney, July 2006. © 2006 Association for Computational Linguistics)</Title>
  <Section position="4" start_page="0" end_page="425" type="metho">
    <SectionTitle>
(SVM), Maximum-Entropy (Maxent), Decision
</SectionTitle>
    <Paragraph position="0"> Tree (DTree) and memory-based learning (MBL).</Paragraph>
    <Paragraph position="1"> We also compared the performance of three different classifier ensemble approaches (simple voting, classifier stacking and meta-classifier).</Paragraph>
    <Paragraph position="2"> Our best model (using stacked classifiers) runs in linear time and has labeled precision and recall above 88% using gold-standard part-of-speech tags, surpassing the best published results (see Section 5). Our SVM parser is 2-13 times faster than state-of-the-art parsers, while producing more accurate results. Our Maxent and DTree parsers are 40-270 times faster than state-of-the-art parsers, but with 5-6% losses in accuracy.</Paragraph>
  </Section>
  <Section position="5" start_page="425" end_page="426" type="metho">
    <SectionTitle>
2 Deterministic parsing model
</SectionTitle>
    <Paragraph position="0"> Like other deterministic parsers, our parser assumes input has already been segmented and tagged with part-of-speech (POS) information during a preprocessing step (see footnote 1). The main data structures used in the parsing algorithm are a queue and a stack. The input word-POS pairs to be processed are stored in the queue. The stack holds the partial parse trees that are built during parsing. A parse state is represented by the content of the stack and queue.</Paragraph>
    <Paragraph position="1"> The classifier makes shift/reduce decisions based on contextual features that represent the parse state. A shift action removes the first item on the queue and puts it onto the stack. A reduce action is in the form of Reduce-{Binary|Unary}-X, where {Binary|Unary} denotes whether two items or one item are to be removed from the stack, and X is the label of a new tree node that will dominate the removed items. Because a reduction is either unary or binary, the resulting parse tree will only have binary and/or unary branching nodes.</Paragraph>
    <Paragraph position="2"> Parse trees are also lexicalized to produce dependency structures. For lexicalization, we used the same head-finding rules reported in (Bikel, 2004). With this additional information, reduce actions are now in the form of Reduce-{Binary|Unary}-X-Direction. The "Direction" tag gives information about whether to take the head-node of the left subtree or the right subtree to be the head of the new tree, in the case of binary reduction. A simple transformation process as described in (Sagae and Lavie, 2005) is employed to convert between arbitrary branching trees and binary trees. This transformation breaks multi-branching nodes down into binary-branching nodes by inserting temporary nodes; temporary nodes are collapsed and removed when we transform a binary tree back into a multi-branching tree.</Paragraph>
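    <Paragraph> As a rough illustration of the conversion between multi-branching and binary trees, the following Python sketch folds surplus children under temporary nodes and later splices them back. The tuple-based tree encoding, the right-branching order and the "*" suffix for temporary labels are assumptions made for this example; the actual procedure of Sagae and Lavie (2005) may differ, for instance in how it takes head positions into account.

```python
TEMP = "*"  # hypothetical suffix marking temporary nodes


def binarize(tree):
    """Return a copy of a (label, children) tree with at most two children per node."""
    if isinstance(tree, str):              # a leaf (word)
        return tree
    label, children = tree
    kids = [binarize(child) for child in children]
    while len(kids) > 2:
        # Fold the two rightmost children under a temporary node.
        right = kids.pop()
        left = kids.pop()
        kids.append((label + TEMP, [left, right]))
    return (label, kids)


def unbinarize(tree):
    """Collapse temporary nodes, restoring the multi-branching tree."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    kids = []
    for child in map(unbinarize, children):
        if not isinstance(child, str) and child[0].endswith(TEMP):
            kids.extend(child[1])          # splice the temporary node's children in
        else:
            kids.append(child)
    return (label, kids)


flat = ("NP", ["a", "b", "c", "d"])
binary = binarize(flat)
restored = unbinarize(binary)
```

 The sketch assumes that no genuine node label ends in the temporary suffix; otherwise the round trip would collapse real nodes.</Paragraph>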
    <Paragraph position="3"> The parsing process succeeds when all the items in the queue have been processed and there is only one item (the final parse tree) left on the stack.</Paragraph>
    <Paragraph position="4"> If the classifier returns a shift action when there are no items left on the queue, or a reduce action when there are no items on the stack, the parser fails. In this case, the parser simply combines all the items on the stack into one IP node, and outputs this as a partial parse. Sagae and Lavie (2005) have shown that this algorithm has linear time complexity, assuming that classification takes constant time. [Footnote 1: We constructed our own POS tagger based on SVM; see Section 3.3.]</Paragraph>
    <Paragraph position="5"> The next example illustrates the process for the input "布朗 (Brown) 访问 (visits) 上海 (Shanghai)", which is tagged with the POS sequence "NR (Proper Noun) VV (Verb) NR (Proper Noun)".</Paragraph>
    <Paragraph position="6"> 1. In the initial parsing state, the stack (S) is empty, and the queue (Q) holds word and POS tag pairs for the input sentence.</Paragraph>
    <Paragraph position="7"> 3. The next action is Reduce-Unary-NP, which reduces the first item on the stack to an NP node. Node (NR 布朗) becomes the head of the new NP node, and this information is marked by brackets. The new parse state is: [parse-state figure not preserved]. [...] After the last action is performed, there will be only one tree node (IP) left on the stack and no items on the queue, so this is the final action.</Paragraph>
    <Paragraph position="8"> The final state is:</Paragraph>
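    <Paragraph> The parsing loop and the worked example can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the action tuples, the scripted stand-in for the classifier and the particular action sequence are hypothetical, head directions are ignored, and the words are given by their English glosses.

```python
from collections import deque


def parse(tagged_words, classify):
    """Deterministic shift/reduce loop; classify(stack, queue) returns an action tuple."""
    queue = deque((pos, [word]) for word, pos in tagged_words)
    stack = []
    # Stop once the queue is empty and a single tree remains on the stack.
    while queue or len(stack) != 1:
        action = classify(stack, queue)
        if action[0] == "shift":
            if not queue:                    # illegal shift: the parser fails
                return ("IP", stack)         # partial parse under one IP node
            stack.append(queue.popleft())
        else:
            _, arity, label = action         # ("reduce", 1 or 2, label)
            if arity > len(stack):           # illegal reduce: the parser fails
                return ("IP", stack)
            children = stack[-arity:]
            del stack[-arity:]
            stack.append((label, children))
    return stack[0]


# Scripted stand-in for the classifier (assumed action sequence).
script = iter([
    ("shift",), ("reduce", 1, "NP"), ("shift",), ("shift",),
    ("reduce", 1, "NP"), ("reduce", 2, "VP"), ("reduce", 2, "IP"),
])
sentence = [("Brown", "NR"), ("visits", "VV"), ("Shanghai", "NR")]
tree = parse(sentence, lambda stack, queue: next(script))
```

 Each word is shifted exactly once and each reduce adds one node, which is in line with the linear-time behaviour noted above as long as every call to the classifier takes constant time.</Paragraph>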
  </Section>
  <Section position="6" start_page="426" end_page="427" type="metho">
    <SectionTitle>
3 Classifiers and Feature Selection
</SectionTitle>
    <Paragraph position="0"> Classification is the key component of our parsing model. We conducted experiments with four different types of classifiers.</Paragraph>
    <Section position="1" start_page="426" end_page="426" type="sub_section">
      <SectionTitle>
3.1 Classifiers
</SectionTitle>
      <Paragraph position="0"> Support Vector Machine: Support Vector Machine is a discriminative classification technique which solves the binary classification problem by finding a hyperplane in a high dimensional space that gives the maximum soft margin, based on the Structural Risk Minimization Principle. We used the TinySVM toolkit (Kudo and Matsumoto, 2000), with a degree 2 polynomial kernel. To train a multi-class classifier, we used the one-against-all scheme.</Paragraph>
      <Paragraph position="1"> Maximum-Entropy Classifier: In a maximum-entropy model, the goal is to estimate a set of parameters that would maximize the entropy over distributions that satisfy certain constraints. These constraints will force the model to best account for the training data (Ratnaparkhi, 1999). Maximum-entropy models have been used for Chinese character-based parsing (Fung et al., 2004; Luo, 2003) and POS tagging (Ng and Low, 2004). In our experiments, we used Zhang Le's Maxent toolkit (Zhang, 2004). This implementation uses the Limited-Memory Variable Metric method for parameter estimation. We trained all our models using 300 iterations with no event cut-off, and a Gaussian prior smoothing value of 2. Maxent classifiers output not only a single class label, but also a number of possible class labels and their associated probability estimates.</Paragraph>
      <Paragraph position="2"> Decision Tree Classifier: The statistical decision tree is a classic machine learning technique that has been extensively applied to NLP. For example, decision trees were used in the SPATTER system (Magerman, 1994) to assign a probability distribution over the space of possible parse trees. In our experiment, we used the C4.5 decision tree classifier, and ignored lexical features whose counts were less than 7.</Paragraph>
      <Paragraph position="3"> Memory-Based Learning: Memory-Based Learning approaches the classification problem by storing training examples explicitly in memory, and classifying the current case by finding the most similar stored cases (using k-nearest-neighbors). We used the TiMBL toolkit (Daelemans et al., 2004) in our experiment, with k = 5.</Paragraph>
    </Section>
    <Section position="2" start_page="426" end_page="427" type="sub_section">
      <SectionTitle>
3.2 Feature selection
</SectionTitle>
      <Paragraph position="0"> For each parse state, a set of features is extracted and fed to each classifier. Features are distributionally-derived or linguistically-based, and carry the context of a particular parse state. When input to the classifier, each feature is treated as a contextual predicate which maps an outcome and a context to a true/false value.</Paragraph>
      <Paragraph position="1"> The specific features used with the classifiers are listed in Table 1.</Paragraph>
      <Paragraph position="2"> Sun and Jurafsky (2003) studied the distributional property of rhythm in Chinese, and used the rhythmic feature to augment a PCFG model for a practical shallow parsing task. This feature has the value 1, 2 or 3 for monosyllabic, bi-syllabic or multi-syllabic nouns or verbs. For noun and verb phrases, the feature is defined as the number of words in the phrase. Sun and Jurafsky found that in NP and VP constructions there are strong constraints on the word length for verbs and nouns (a kind of rhythm), and on the number of words in a constituent. We employed these same rhythmic features to see whether this property holds for the Penn Chinese Treebank data, and if it helps in the disambiguation of phrase types. Experiments show that this feature does increase classification accuracy of the SVM model by about 1%.</Paragraph>
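      <Paragraph> The rhythmic feature can be illustrated with a small Python sketch. It assumes that each Chinese character corresponds to one syllable, so that word length in characters stands in for syllable count; the function names are hypothetical, and the caller is expected to apply the word-level function only to nouns and verbs.

```python
def rhythm_of_word(word):
    """1, 2 or 3 for a monosyllabic, bi-syllabic or multi-syllabic word."""
    return min(len(word), 3)


def rhythm_of_phrase(words):
    """For noun and verb phrases, the feature is the number of words in the phrase."""
    return len(words)


word_value = rhythm_of_word("上海")
phrase_value = rhythm_of_phrase(["访问", "上海"])
```
</Paragraph>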
      <Paragraph position="3"> In both Chinese and English, there are punctuation characters that come in pairs (e.g., parentheses). In Chinese, such pairs are more frequent (quotes, single quotes, and book-name marks).</Paragraph>
      <Paragraph position="4"> During parsing, we note how many opening punctuations we have seen on the stack. If the number is odd, then feature 1 will have value 1, otherwise 0. A Boolean feature is used to indicate whether or not an odd number of opening punctuations have been seen and a closing punctuation is expected; in this case the feature gives a strong hint to the parser that all the items in the queue before the closing punctuation, and the items on the stack after the opening punctuation, should be under a common constituent node which begins and ends with the two punctuations.</Paragraph>
      <Paragraph position="5"> Table 1: Features used with the classifiers.
 1. A Boolean feature indicates if a closing punctuation is expected or not.
 2. A Boolean value indicates if the queue is empty or not.
 3. A Boolean feature indicates whether there is a comma separating S(1) and S(2) or not.
 4. Last action given by the classifier, and number of words in S(1) and S(2).
 5. Headword and its POS of S(1), S(2), S(3) and S(4), and word and POS of Q(1), Q(2), Q(3) and Q(4).
 6. Nonterminal label of the root of S(1) and S(2), and number of punctuations in S(1) and S(2).
 7. Rhythmic features and the linear distance between the head-words of S(1) and S(2).
 8. Number of words found so far to be dependents of the head-words of S(1) and S(2).
 9. Nonterminal label, POS and headword of the immediate left and right child of the root of S(1) and S(2).
 10. Most recently found word and POS pair that is to the left of the head-word of S(1) and S(2).
 11. Most recently found word and POS pair that is to the right of the head-word of S(1) and S(2).
</Paragraph>
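      <Paragraph> The closing-punctuation feature amounts to a parity check, which the following Python sketch illustrates. The set of opening marks and the representation of the stack as a flat list of tokens are assumptions made for the example; marks whose opening and closing forms are identical would need separate handling.

```python
# Assumed inventory of opening marks: full-width parenthesis, double and
# single opening quotes, and the opening book-name mark.
OPENING = {"（", "“", "‘", "《"}


def closing_expected(stack_tokens):
    """Return 1 if an odd number of opening marks has been seen on the stack, else 0."""
    count = sum(1 for token in stack_tokens if token in OPENING)
    return count % 2


inside_quote = closing_expected(["“", "Brown", "visits"])
balanced = closing_expected([])
```
</Paragraph>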
    </Section>
    <Section position="3" start_page="427" end_page="427" type="sub_section">
      <SectionTitle>
3.3 POS tagging
</SectionTitle>
      <Paragraph position="0"> In our parsing model, POS tagging is treated as a separate problem and it is assumed that the input has already been tagged with POS. To compare with previously published work, we evaluated the parser performance on automatically tagged data. We constructed a simple POS tagger using an SVM classifier. The tagger makes two passes over the input sentence. The first pass extracts features from the two words and POS tags that come before the current word, the two words following the current word, and the current word itself (the length of the word, and whether the word contains numbers, special symbols that separate foreign first and last names, common Chinese family names, western alphabets or dates). Then the tag is assigned to the word according to the SVM classifier's output. In the second pass, additional features such as the POS tags of the two words following the current word, and the POS tag of the current word (assigned in the first pass) are used.</Paragraph>
      <Paragraph position="1"> This tagger had a measured precision of 92.5% for sentences ≤ 40 words.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="427" end_page="429" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We performed experiments using the Penn Chinese Treebank. Sections 001-270 (3484 sentences, 84,873 words) were used for training, 301-325 (6776 words) for development, and 271-300 (348 sentences, 7980 words) for testing.</Paragraph>
    <Paragraph position="1"> The whole dataset contains 99,629 words, which is about 1/10 of the size of the English Penn Treebank. Standard corpus preparation steps were done prior to parsing, so that empty nodes were removed, and the resulting A-over-A unary rewrite nodes were collapsed. Functional labels of the non-terminal nodes were also removed, but we did not relabel the punctuations, unlike in (Jiang, 2004).</Paragraph>
    <Paragraph position="2"> Bracket scoring was done by the EVALB program (footnote 2), and preterminals were not counted as constituents. In all our experiments, we used labeled recall (LR), labeled precision (LP) and F1 score (harmonic mean of LR and LP) as our evaluation metrics.</Paragraph>
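    <Paragraph> To make the three metrics concrete, here is a simplified Python sketch that scores one set of predicted brackets against the gold brackets. It is not EVALB: representing each constituent as a (label, start, end) triple and treating the brackets as plain sets (so duplicates are ignored) are assumptions of this example.

```python
def bracket_scores(gold, predicted):
    """Return (LR, LP, F1) in percent for two collections of (label, start, end) brackets."""
    gold, predicted = set(gold), set(predicted)
    matched = len(gold.intersection(predicted))
    lr = 100.0 * matched / len(gold)          # labeled recall
    lp = 100.0 * matched / len(predicted)     # labeled precision
    f1 = 2 * lr * lp / (lr + lp) if matched else 0.0   # harmonic mean
    return lr, lp, f1


gold = [("IP", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)]
predicted = [("IP", 0, 3), ("NP", 0, 1), ("VP", 1, 3)]
lr, lp, f1 = bracket_scores(gold, predicted)
```

 In this toy case three of the four gold brackets are found and every predicted bracket is correct, so recall is 75% and precision is 100%.</Paragraph>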
    <Section position="1" start_page="427" end_page="428" type="sub_section">
      <SectionTitle>
4.1 Results of different classifiers
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the classification accuracy and parsing accuracy of the four different classifiers on the development set for sentences ≤ 40 words, with gold-standard POS tagging. The runtime (Time) of each model and number of failed parses (Fail) are also shown.</Paragraph>
      <Paragraph position="1"> [Table 2: ... models' parsing accuracies on development set for sentences ≤ 40 words, with gold-standard POS] For the DTree learner, we experimented with two different classification strategies. In our first approach, the classification is done in a single stage (DTree1). The learner is trained for a multi-class classification problem where the class labels include shift and all possible reduce actions. But this approach yielded a lot of parse failures (42 out of 350 sentences failed during parsing, and a partial parse tree was returned). These failures were mostly due to false shift actions in cases where the queue is empty. To alleviate this problem, we broke the classification process down into two stages (DTree2). A first-stage classifier makes a binary decision on whether the action is shift or reduce. If the output is reduce, a second-stage classifier decides which reduce action to take. Results showed that breaking down the classification task into two stages increased overall accuracy, and the number of failures was reduced to 30.</Paragraph>
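      <Paragraph> The two-stage strategy can be sketched in Python as a simple cascade. The two toy decision functions below are hypothetical stand-ins for the trained first-stage and second-stage classifiers, and the feature names are invented for the example.

```python
def two_stage_action(features, is_shift, choose_reduce):
    """First decide shift versus reduce, then pick the specific reduce action."""
    if is_shift(features):
        return "Shift"
    return choose_reduce(features)


# Toy stand-ins for the two trained classifiers (hypothetical rules).
def toy_is_shift(features):
    return not features["queue_empty"]


def toy_choose_reduce(features):
    if features["top_label"] == "NR":
        return "Reduce-Unary-NP"
    return "Reduce-Binary-IP-Right"


state = {"queue_empty": True, "top_label": "NR"}
action = two_stage_action(state, toy_is_shift, toy_choose_reduce)
```
</Paragraph>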
      <Paragraph position="2"> The SVM model achieved the highest classification accuracy and the best parsing results. It also successfully parsed all sentences. The Maxent model's classification error rate (7.4%) was 30% higher than the error rate of the SVM model (5.7%), and its F1 (84.6%) was 3.2% lower in relative terms (2.8 points) than the SVM model's F1 (87.4%). However, the Maxent model was about 9.5 times faster than the SVM model. The DTree classifier achieved 81.6% LR and 83.6% LP. The MBL model did not perform well; although MBL and SVM differed in classification accuracy by only about 3 percent, the parsing results showed a difference of more than 10 percent. One possible explanation for the poor performance of the MBL model is that all the features we used were binary features, and memory-based learners are known to work better with multivalue features than binary features in natural language learning tasks (van den Bosch and Zavrel, 2000).</Paragraph>
      <Paragraph position="3"> In terms of speed and accuracy trade-off, there is a 5.5% trade-off in F1 (relative to SVM's F1) for a roughly 14 times speed-up between SVM and two-stage DTree. Maxent is more balanced in the sense that its accuracy was slightly lower (3.2%) than SVM, and was just about as fast as the two-stage DTree on the development set. The high speed of the DTree and Maxent models makes them very attractive in applications where speed is more critical than accuracy. While the SVM model takes more CPU time, we show in Section 5 that when compared to existing parsers, SVM achieves about the same or higher accuracy but is at least twice as fast.</Paragraph>
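      <Paragraph> The relative figures quoted in this subsection can be checked from the reported numbers. In the worked arithmetic below, the DTree F1 is computed here from its reported LR and LP and is not a value stated in the text; the symbols are introduced only for this check.

```latex
\[ \frac{7.4 - 5.7}{5.7} \approx 0.298 \quad \text{(Maxent error rate about 30\% higher than SVM)} \]
\[ \frac{87.4 - 84.6}{87.4} \approx 0.032 \quad \text{(Maxent F1 about 3.2\% lower than SVM)} \]
\[ F_1^{\mathrm{DTree}} = \frac{2 \times 81.6 \times 83.6}{81.6 + 83.6} \approx 82.6 \]
\[ \frac{87.4 - 82.6}{87.4} \approx 0.055 \quad \text{(DTree F1 about 5.5\% lower than SVM)} \]
```
</Paragraph>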
      <Paragraph position="4"> Using gold-standard POS tagging, the best classifier model (SVM) achieved LR of 87.2% and LP of 88.3%, as shown in Table 4. Both measures surpass the previously known best results on parsing using gold-standard tagging. We also tested the SVM model using data automatically tagged by our POS tagger, and it achieved LR of 78.1% and LP of 81.1% for sentences ≤ 40 words, as shown in Table 3.</Paragraph>
    </Section>
    <Section position="2" start_page="428" end_page="429" type="sub_section">
      <SectionTitle>
4.2 Classifier Ensemble Experiments
</SectionTitle>
      <Paragraph position="0"> Classifier ensemble by itself has been a fruitful research direction in machine learning in recent years. The basic idea in classifier ensemble is that combining multiple classifiers can often give significantly better results than any single classifier alone. We experimented with three different classifier ensemble strategies: classifier stacking, meta-classifier, and simple voting.</Paragraph>
      <Paragraph position="1"> Using the SVM classifier's results as a baseline, we tested these approaches on the development set. In classifier stacking, we collect the outputs from Maxent, DTree and TiMBL, which are all trained on a separate dataset from the training set (sections 400-650 of the Penn Chinese Treebank, smaller than the original training set). We use their classification output as features, in addition to the original feature set, to train a new SVM model on the original training set. We achieved LR of 90.3% and LP of 90.5% on the development set, a 3.4% and 2.6% improvement in LR and LP, respectively. When tested on the test set, we gained 1% improvement in F1 when gold-standard POS tagging is used. When tested with automatic tagging, we achieved a 0.5% improvement in F1. Using Bikel's significance tester with 10,000 random shuffles, the p-values for LR and LP are 0.008 and 0.457, respectively. The increase in recall is statistically significant (the increase in precision is not), and it shows classifier stacking can improve performance.</Paragraph>
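      <Paragraph> The feature construction used in classifier stacking can be sketched as follows. This Python fragment only shows how the base classifiers' outputs might be appended to the original features before training the final SVM; the dictionary representation, the "output_" prefix and the constant toy classifiers are assumptions of the example.

```python
def stacked_features(features, base_classifiers):
    """Extend the original feature set with each base classifier's predicted action."""
    augmented = dict(features)
    for name, classify in base_classifiers.items():
        augmented["output_" + name] = classify(features)
    return augmented


# Toy stand-ins for the trained Maxent, DTree and TiMBL models.
base = {
    "maxent": lambda features: "Shift",
    "dtree": lambda features: "Shift",
    "mbl": lambda features: "Reduce-Unary-NP",
}
example = stacked_features({"queue_empty": False}, base)
```

 Training the base classifiers on data separate from the final model's training set, as described above, keeps their predictions on the training examples from being unrealistically accurate.</Paragraph>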
      <Paragraph position="2"> On the other hand, we did not find meta-classification and simple voting very effective. In simple voting, we let the classifiers vote in each step for every parse action. The F1 of the simple voting method is downgraded by 5.9% relative to the SVM model's F1. By analyzing the inter-agreement among classifiers, we found that there were no cases where Maxent's top output and DTree's output were both correct and SVM's output was wrong. Using the top output from Maxent and DTree directly does not seem to be complementary to SVM.</Paragraph>
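      <Paragraph> Simple voting over the classifiers' proposed actions can be sketched in Python as a majority vote. The tie-breaking rule (the earliest-listed classifier wins) is an assumption of this example; the text does not say how ties were resolved.

```python
from collections import Counter


def vote(actions):
    """Return the most frequent action; ties go to the earliest-listed classifier."""
    counts = Counter(actions)
    # max() returns the first maximal element, which gives the tie-break above.
    return max(actions, key=counts.__getitem__)


majority = vote(["Shift", "Reduce-Unary-NP", "Reduce-Unary-NP"])
tie = vote(["Shift", "Reduce-Unary-NP", "Reduce-Binary-VP-Left"])
```
</Paragraph>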
      <Paragraph position="3"> In the meta-classifier approach, we first collect the output from each classifier trained on sections 1-210 (roughly 3/4 of the entire training set). Then specifically for Maxent, we collected the top output as well as its associated probability estimate. Then we used the outputs and probability estimate as features to train an SVM classifier that makes a decision on which classifier to pick.</Paragraph>
      <Paragraph position="4"> Meta-classifier results did not change at all from our baseline. In fact, the meta-classifier always picked SVM as its output. This agrees with our observation for the simple voting case.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="429" end_page="430" type="metho">
    <SectionTitle>
5 Comparison with Related Work
</SectionTitle>
    <Paragraph position="0"> Bikel and Chiang (2000) constructed two parsers using a lexicalized PCFG model that is based on Collins' model 2 (Collins, 1999), and a statistical Tree-adjoining Grammar (TAG) model. They used the same train/development/test split, and achieved LR/LP of 76.8%/77.8%. In Bikel's thesis (2004), the same Collins emulation model was used, but with tweaked head-finding rules.</Paragraph>
    <Paragraph position="1"> Also a POS tagger was used for assigning tags for unseen words. The refined model achieved LR/LP of 78.0%/81.2%. Chiang and Bikel (2002) used the inside-outside unsupervised learning algorithm to augment the rules for finding heads, and achieved an improved LR/LP of 78.8%/81.1%.</Paragraph>
    <Paragraph position="2"> Levy and Manning (2003) used a factored model that combines an unlexicalized PCFG model with a dependency model. They achieved LR/LP of 79.2%/78.4% on a different test/development split. Xiong et al. (2005) used a model similar to BBN's model in (Bikel and Chiang, 2000), and augmented the model with semantic categorical information and heuristic rules. They achieved LR/LP of 78.7%/80.1%. Hearne and Way (2004) used a Data-Oriented Parsing (DOP) approach that was optimized for top-down computation.</Paragraph>
    <Paragraph position="3"> They achieved F1 of 71.3 on a different test and training set. Jiang (2004) reported LR/LP of 80.1%/82.0% on sentences ≤ 40 words (results not available for sentences ≤ 100 words) by applying Collins' parser to Chinese. In their work on Chinese shallow semantic parsing, Sun and Jurafsky (2004) also applied Collins' parser to Chinese. They reported the best parsing performance to date on the Chinese Treebank. They achieved LR/LP of 85.5%/86.4% on sentences ≤ 40 words, and LR/LP of 83.3%/82.2% on sentences ≤ 100 words, far surpassing all other previously reported results. Luo (2003) and Fung et al. (2004) addressed the issue of Chinese text segmentation in their work by constructing character-based parsers. Luo integrated segmentation, POS tagging and parsing into one maximum-entropy framework. He achieved an F1 score of 81.4% in parsing, but the score was achieved using 90% of the 250K-CTB (roughly 2.5 times bigger than our training set) for training and 10% for testing. Fung et al. (2004) also took the maximum-entropy modeling approach, augmented by transformation-based learning. They used the standard training and testing split. When tested with gold-standard segmentation, they achieved an F1 score of 79.56%, but POS-tagged words were treated as constituents in their evaluation.</Paragraph>
    <Paragraph position="4"> In comparison with previous work, our parser's accuracy is very competitive. Compared to Jiang's work and Sun and Jurafsky's work, the classifier ensemble model of our parser is lagging behind by 1% and 5.8% in F1, respectively. But compared to all other works, our classifier stacking model gave better or equal results for all three measures.</Paragraph>
    <Paragraph position="5"> In particular, the classifier ensemble model and SVM model of our parser achieved second and third highest LP, LR and F1 for sentences ≤ 100 words as shown in Table 3. (Sun and Jurafsky did not report results on sentences ≤ 100 words, but it is worth noting that out of all the test sentences, only 2 sentences have length &gt; 100).</Paragraph>
    <Paragraph position="6"> Jiang (2004) and Bikel (2004; see footnote 3) also evaluated their parsers on the test set for sentences ≤ 40 words, using gold-standard POS tagged input. Our parser gives significantly better results as shown in Table 4. The implication of this result is twofold. On one hand, it shows that if POS tagging accuracy can be increased, our parser is likely to benefit more than the other two models; on the other hand, it also indicates that our deterministic model is less resilient to POS errors. Further detailed analysis is called for, to study the extent to which POS tagging errors affect the deterministic parsing model.</Paragraph>
  </Section>
  <Section position="9" start_page="430" end_page="430" type="metho">
    <SectionTitle>
POS
</SectionTitle>
    <Paragraph position="0"> To measure efficiency, we ran two publicly available parsers (Levy and Manning's PCFG parser (2003) and Bikel's parser (2004)) on the standard test set and compared the runtimes (footnote 4). The runtimes of these parsers are shown in minute:second format in Table 5. Our SVM model is more than 2 times faster than Levy and Manning's parser, and more than 13 times faster than Bikel's parser. Our DTree model is 40 times faster than Levy and Manning's parser, and 270 times faster than Bikel's parser. Another advantage of our parser is that it does not take as much memory as these other parsers do. In fact, none of the models except MBL takes more than 60 megabytes of memory at runtime. In comparison, Levy and Manning's PCFG parser requires more than 400 megabytes of memory when parsing long sentences (70 words or longer).</Paragraph>
  </Section>
</Paper>