<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1083"> <Title>Using Decision Trees to Construct a Practical Parser</Title> <Section position="5" start_page="507" end_page="510" type="evalu"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> We evaluated the proposed parser using the EDR Japanese annotated corpus (EDR, 199.5). The experiment consisted of two parts. One evaluated the single-tree parser and the other tile boosting counterpart. In tile rest of this section, parsing accuracy refers only to precision; how many of tile system's output are correct in terms of the annotated corpus.</Paragraph> <Paragraph position="1"> We do not show recall because we assume every bunsetsu modifies only one posterior bunsetsu. The features used for learning were non head-word features, (i.e., type 2 to 8 in Table 3). Section 4.1.4 investigates lexical information of head words such as frequent, words and thesaurus categories. Before going into details of tile experimental results, we sunnnarize here how training and test data were selected.</Paragraph> <Paragraph position="2"> 1. After all sentences in the EDR corpus were word-segmented and part-of-speech tagged (Matsumoto and others, 1996), they were then chunked into a sequence of bunsetsu.</Paragraph> <Paragraph position="3"> 2. All bunsetsu pairs were compared with EDR bracketing annotation (correct segmentations and modifications). If a sentence contained a pair inconsistent with the EDR annotation, the sentence was removed from the data.</Paragraph> <Paragraph position="4"> 3. All data examined (total number of sentences:207802, total number of bunset.su:1790920) were divided into 20 files, The training data were same number of first sentences of the 20 files according to the training data size. Test data (10000 sentences) were the 2501th to 3000th sentences of each file.</Paragraph> <Section position="1" start_page="508" end_page="508" type="sub_section"> <SectionTitle> 4.1 Single Tree Experiments </SectionTitle> <Paragraph position="0"> In the single tree experiments, we evaluated the following 4 properties of the new dependency parser.</Paragraph> <Paragraph position="1"> Table 5 summarizes the parsing accuracy with various confidence levels of pruning. The number of training sentences was 10000.</Paragraph> <Paragraph position="2"> In C4.5 programs, a larger value of confidence means weaker pruning and 25% is connnonly used in various domains (Quinlan, 1993). Our experimental results show that 75% pruning attains the best performance, i.e. weaker pruning than usual. In the remaining single tree experiments, we used the 75% confidence level. Although strong pruning treats infrequent data as noise, parsing involves many exceptional and infrequent modifications as mentioned before. Our result means that only information included in small numbers of samples are useful for disambiguating the syntactic structure of sentences.</Paragraph> </Section> <Section position="2" start_page="508" end_page="508" type="sub_section"> <SectionTitle> 4.1.2 The amount of Training Data and Parsing Accuracy </SectionTitle> <Paragraph position="0"> Table 6 and Figure 2 show how the number of training sentences influences parsing accuracy for the same 10000 test. sentences. They illustrate tile following two characteristics of the learning curve.</Paragraph> <Paragraph position="1"> 1. The parsing accuracy rapidly rises up to 30000 sentences and converges at around 50000 sentences. null 2. 
<Section position="1" start_page="508" end_page="508" type="sub_section"> <SectionTitle> 4.1 Single Tree Experiments </SectionTitle>
<Paragraph position="0"> In the single tree experiments, we evaluated the following 4 properties of the new dependency parser.</Paragraph>
<Paragraph position="1"> Table 5 summarizes the parsing accuracy with various confidence levels of pruning. The number of training sentences was 10000.</Paragraph>
<Paragraph position="2"> In the C4.5 program, a larger value of confidence means weaker pruning, and 25% is commonly used in various domains (Quinlan, 1993). Our experimental results show that 75% pruning attains the best performance, i.e., weaker pruning than usual. In the remaining single tree experiments, we used the 75% confidence level. Although strong pruning treats infrequent data as noise, parsing involves many exceptional and infrequent modifications, as mentioned before. Our result suggests that information contained in only a small number of samples is still useful for disambiguating the syntactic structure of sentences.</Paragraph> </Section>
<Section position="2" start_page="508" end_page="508" type="sub_section"> <SectionTitle> 4.1.2 The Amount of Training Data and Parsing Accuracy </SectionTitle>
<Paragraph position="0"> Table 6 and Figure 2 show how the number of training sentences influences parsing accuracy for the same 10000 test sentences. They illustrate the following two characteristics of the learning curve.</Paragraph>
<Paragraph position="1"> 1. The parsing accuracy rises rapidly up to 30000 sentences and converges at around 50000 sentences. 2. The maximum parsing accuracy is 84.33% at 50000 training sentences.</Paragraph>
<Paragraph position="2"> We now discuss the maximum accuracy of 84.33%. Compared to recent stochastic English parsers that yield 86 to 87% accuracy (Collins, 1996; Magerman, 1995), 84.33% seems unsatisfactory at first glance. The main reason for this lies in the difference between the two corpora used: the Penn Treebank (Marcus et al., 1993) and the EDR corpus (EDR, 1995). The Penn Treebank (Marcus et al., 1993) has also been used to induce part-of-speech (POS) taggers because the corpus contains very precise and detailed POS markers as well as bracket annotations. In addition, English parsers incorporate the syntactic tags that are contained in the corpus. The EDR corpus, on the other hand, contains only coarse POS tags. We used another Japanese POS tagger (Matsumoto and others, 1996) to make use of fine-grained information for disambiguating syntactic structures. Only the bracket information in the EDR corpus was considered. We conjecture that the difference between the parsing accuracies is due to this difference in corpus information. Fujio and Matsumoto (1997) constructed an EDR-based dependency parser using a method similar to Collins' (Collins, 1996). The parser attained 80.48% accuracy. Although their training and test sentences are not exactly the same as ours, the result seems to support our conjecture on the data difference between the EDR corpus and the Penn Treebank.</Paragraph> </Section>
<Section position="3" start_page="508" end_page="509" type="sub_section"> <SectionTitle> 4.1.3 Significance of Non Head-Word Features </SectionTitle>
<Paragraph position="0"> We now summarize the significance of each non head-word feature introduced in Section 3. The influence of the lexical information of head words will be discussed in the next section. Table 7 illustrates how the parsing accuracy is reduced when each feature is removed. The number of training sentences was 10000. In the table, ant and post represent the anterior and the posterior bunsetsu, respectively.</Paragraph>
<Paragraph position="1"> The most significant features are the anterior bunsetsu type and the distance between the two bunsetsu. This result may partially support an often-used heuristic: a bunsetsu modification should be as short-range as possible, provided the modification is syntactically possible. In particular, we need to concentrate on the types of bunsetsu to attain a higher level of accuracy. Most features contribute, to some extent, to the parsing performance. In our experiment, information on parentheses had no effect on the performance. The reason may be that the EDR corpus contains only a small number of parentheses. One exception among our features is the anterior POS of head. We currently hypothesize that this drop in accuracy arises from two reasons.</Paragraph>
<Paragraph position="2"> * In many cases, the POS of the head word can be determined from the bunsetsu type.</Paragraph>
<Paragraph position="3"> * Our POS tagger sometimes tags verb-derived nouns as verbs.</Paragraph>
<Paragraph position="4"> We focused on the head-word feature by testing the following 4 lexical sources. The first and the second are the 100 and 200 most frequent words, respectively. The third and the fourth are derived from a broadly used Japanese thesaurus, Word List by Semantic Principles (NLRI, 1964). Level 1 and Level 2 classify words into 15 and 67 categories, respectively. 1. 100 most frequent words 2. 200 most frequent words 3. Word List Level 1 4. Word List Level 2
Table 8 displays the parsing accuracy when each kind of head word information was used in addition to the previous features. The number of training sentences was 10000. In all cases, the performance was worse than the 83.52% attained without head word lexical information. More surprisingly, more head word information yielded worse performance. From this result, it may safely be said, at least for the Japanese language, that we cannot expect lexical information to always improve the performance. Further investigation of other thesaurus and clustering techniques (Charniak, 1997) is necessary to fully understand the influence of lexical information.</Paragraph> </Section>
<Section position="4" start_page="509" end_page="510" type="sub_section"> <SectionTitle> 4.2 Boosting Experiments </SectionTitle>
<Paragraph position="0"> This section reports experimental results for the boosting version of our parser. In all experiments, the pruning confidence level was set to 55%. Table 9 and Figure 3 show the parsing accuracy as the number of training examples was increased. Because the number of iterations for each data set varied between 5 and 8, we show the accuracy obtained by combining the first 5 decision trees (a sketch of this combination appears after the following list). In Figure 3, the dotted line plots the learning curve of the single tree case (identical to Figure 2) for the reader's convenience. The characteristics of the boosting version, compared to the single tree version, can be summarized as follows.</Paragraph>
<Paragraph position="1"> * The learning curve rises more rapidly with a small number of examples. It is surprising that the boosting version with 10000 sentences performs better than the single tree version with 50000 sentences.</Paragraph>
<Paragraph position="2"> * The boosting version significantly outperforms its single tree counterpart for any number of sentences, although they use the same features for learning.</Paragraph>
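As a concrete illustration of the tree combination described above, the sketch below (ours, and only a simplification) combines the first few boosted decision trees by weighted voting over candidate bunsetsu pairs. The representation of each tree as a +1/-1 scorer and the weighting formula, which is the standard AdaBoost.M1 choice, are assumptions for illustration and are not taken verbatim from the paper.

```python
# Illustrative sketch of combining the first T boosted decision trees by
# weighted voting. Each tree is abstracted as a function that scores a
# candidate modification (a feature vector for a bunsetsu pair) as
# +1 (modify) or -1 (do not modify). The per-tree weights would come from
# the boosting run; the formula below is the standard AdaBoost.M1 choice.

import math
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], int]      # returns +1 or -1 for one pair

def tree_weight(training_error: float) -> float:
    """Standard AdaBoost.M1 weight for a weak learner with the given error."""
    eps = min(max(training_error, 1e-6), 1 - 1e-6)   # guard against 0/1 errors
    return math.log((1.0 - eps) / eps)

def combined_score(trees: List[Tree], weights: List[float],
                   pair_features: Sequence[float]) -> float:
    """Weighted vote of the kept decision trees for one bunsetsu pair."""
    return sum(w * t(pair_features) for t, w in zip(trees, weights))

def pick_modifiee(trees: List[Tree], weights: List[float],
                  candidates: List[Sequence[float]]) -> int:
    """Choose the posterior bunsetsu whose pair gets the highest combined score."""
    scores = [combined_score(trees, weights, feats) for feats in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```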
<Paragraph position="3"> Next, we discuss how the number of iterations influences the parsing accuracy. Table 10 shows the parsing accuracy for various numbers of iterations when 50000 sentences were used as training data. The results have two characteristics.</Paragraph>
<Paragraph position="4"> * Parsing accuracy rose rapidly at the second iteration.</Paragraph>
<Paragraph position="5"> * No over-fitting to the data was seen, although the performance of each generated tree fell around 30% at the final stage of iteration.</Paragraph> </Section> </Section> </Paper>