<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1303">
  <Title>Japanese Dependency Structure Analysis Based on Support Vector Machines</Title>
  <Section position="5" start_page="20" end_page="24" type="evalu">
    <SectionTitle>
4 Experiments and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
4.1 Experiments Setting
</SectionTitle>
      <Paragraph position="0"> We use Kyoto University text corpus (Version 2.0) consisting of articles of Mainichi newspaper annotated with dependency structure(Kurohashi and Nagao, 1997). 7,958 sentences from the articles on January 1st to January 7th are used for the training data, and 1,246 sentences from the articles on January 9th are used for the test data. For the kernel function, we used the polynomial function (9).</Paragraph>
      <Paragraph position="1"> We set the soft margin parameter C to be 1.</Paragraph>
      <Paragraph position="2"> The feature set used in the experiments are shown in Table 1. The static features are basically taken from Uchimoto's list(Uchimoto et al., 1999) with little modification. In Table 1, 'Head' means the rightmost content word in a chunk whose part-of-speech is not a functional category. 'Type' means the rightmost functional word or the inflectional form of the rightmost predicate if there is no functional word in the chunk. The static features include the information on existence of brackets, question marks and punctuation marks etc. Besides, there are features that show the relative relation of two chunks, such as distance, and existence of brackets, quotation marks and punctuation marks between them.</Paragraph>
      <Paragraph position="3"> For dynamic features, we selected functional words or inflection forms of the right-most predicates in the chunks that appear between two chunks and depend on the modiflee. Considering data sparseness problem, we  apply a simple filtering based on the part-of-speech of functional words: We use the lexical form if the word's POS is particle, adverb, adnominal or conjunction. We use the inflection form if the word has inflection. We use the POS tags for others.</Paragraph>
    </Section>
    <Section position="2" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
4.2 Results of Experiments
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the result of parsing accuracy under the condition k = 5 (beam width), and d = 3 (dimension of the polynomial functions used for the kernel function).</Paragraph>
      <Paragraph position="1"> This table shows two types of dependency accuracy, A and B. The training data size is measured by the number of sentences. The accuracy A means the accuracy of the entire dependency relations. Since Japanese is a head-final language, the second chunk from the end of a sentence always modifies the last chunk.</Paragraph>
      <Paragraph position="2"> The accuracy B is calculated by excluding this dependency relation. Hereafter, we use the accuracy A, if it is not explicitly specified, since this measure is usually used in other literature. null</Paragraph>
    </Section>
    <Section position="3" start_page="20" end_page="22" type="sub_section">
      <SectionTitle>
4.3 Effects of Dynamic Features
</SectionTitle>
      <Paragraph position="0"> Table3 shows the accuracy when only static features are used. Generally, the results with</Paragraph>
      <Paragraph position="2"> dynamic feature set is better than the results without them. The results with dynamic features constantly outperform that with static features only. In most of cases, the improvements is significant. In the experiments, we restrict the features only from the chunks that appear between two chunks being in consideration, however, dynamic features could be also taken from the chunks that appear not between the two chunks. For example, we could also take into consideration the chunk that is modified by the right chunk, or the chunks  tences, k = 5) that modify the left chunk. We leave experiment in such a setting for the future work.</Paragraph>
    </Section>
    <Section position="4" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.4 Training data vs. Accuracy
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the relationship between the size of the training data and the parsing accuracy. This figure shows the accuracy of with and without the dynamic features.</Paragraph>
      <Paragraph position="1"> The parser achieves 86.52% accuracy for test data even with small training data (1172 sentences). This is due to a good characteristic of SVMs to cope with the data sparseness problem. Furthermore, it achieves almost 100% accuracy for the training data, showing that the training data are completely separated by appropriate combination of features. Generally, selecting those specific features of the training data tends to cause overfitting, and accuracy for test data may fall. However, the SVMs method achieve a high accuracy not only on the training data but also on the test data. We claim that this is due to the high generalization ability of SVMs. In addition, observing at the learning curve, further improvement will be possible if we increase the size of the training data.</Paragraph>
    </Section>
    <Section position="5" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.5 Kernel Function vs. Accuracy
</SectionTitle>
      <Paragraph position="0"> Table 4 shows the relationship between the dimension of the kernel function and the parsing accuracy under the condition k -- 5.</Paragraph>
      <Paragraph position="1"> As a result, the case of d ---- 4 gives the best accuracy. We could not carry out the training in realistic time for the case of d = 1.</Paragraph>
      <Paragraph position="2"> This result supports our intuition that we need a combination of at least two features.</Paragraph>
      <Paragraph position="3"> In other words, it will be hard to confirm a dependency relation with only the features of the modifier or the modfiee. It is natural that a dependency relation is decided by at least the information from both of two chunks. In addition, further improvement has been possible by considering combinations of three or more features.</Paragraph>
    </Section>
    <Section position="6" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.6 Beam width vs. Accuracy
</SectionTitle>
      <Paragraph position="0"> Sekine (Sekine et al., 2000) gives an interesting report about the relationship between the beam width and the parsing accuracy. Generally, high parsing accuracy is expected when a large beam width is employed in the dependency structure analysis. However, the result is against our intuition. They report that a beam width between 3 and 10 gives the best parsing accuracy, and parsing accuracy falls down with a width larger than 10. This result suggests that Japanese dependency structures may consist of a series of local optimization processes.</Paragraph>
      <Paragraph position="1"> We evaluate the relationship between the beam width and the parsing accuracy. Table 5 shows their relationships under the condition d = 3, along with the changes of the beam width from k = 1 to 15. The best parsing accuracy is achieved at k ---- 5 and the best sentence accuracy is achieved at k = 5 and k=7.</Paragraph>
      <Paragraph position="2"> We have to consider how we should set the beam width that gives the best parsing accuracy. We believe that the beam width that gives the best parsing accuracy is related not only with the length of the sentence, but also with the lexical entries and parts-of-speech that comprise the chunks.</Paragraph>
    </Section>
    <Section position="7" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.7 Committee based approach
</SectionTitle>
      <Paragraph position="0"> Instead of learning a single classier using all training data, we can make n classifiers dividing all training data by n, and the final result is decided by their voting. This approach would reduce computational overhead.</Paragraph>
      <Paragraph position="1"> The use of multi-processing computer would help to reduce their training time considerably since all individual training can be carried out in parallel.</Paragraph>
      <Paragraph position="2"> To investigate the effectiveness of this method, we perform a simple experiment: Dividing all training data (7958 sentences) by 4, the final dependency score is given by a weighted average of each scores. This simple voting approach is shown to achieve the accuracy of 88.66%, which is nearly the same accuracy achieved 5540 training sentences.</Paragraph>
      <Paragraph position="3"> In this experiment, we simply give an equal weight to each classifier. However, if we optimized the voting weight more carefully, the further improvements would be achieved (Inui and Inni, 2000).</Paragraph>
    </Section>
    <Section position="8" start_page="22" end_page="22" type="sub_section">
      <SectionTitle>
4.8 Comparison with Related Work
</SectionTitle>
      <Paragraph position="0"> Uchimoto (Uchimoto et al., 1999) and Sekine (Sekine et al., 2000) report that using Kyoto University Corpus for their training and testing, they achieve around 87.2% accuracy by building statistical model based on Maximum Entropy framework. For the training data, we used exactly the same data that they used in order to make a fair comparison. In our experiments, the accuracy of 89.09% is achieved using same training data. Our model outperforms Uchimoto's model as far as the accuracies are compared.</Paragraph>
      <Paragraph position="1"> Although Uchimoto suggests that the importance of considering combination of features, in ME framework we must expand these combination by introducing new feature set. Uchimoto heuristically selects &amp;quot;effective&amp;quot; combination of features. However, such a manual selection does not always cover all relevant combinations that are important in the determination of dependency relation.</Paragraph>
      <Paragraph position="2"> We believe that our model is better than others from the viewpoints of coverage and consistency, since our model learns the combination of features without increasing the computational complexity. If we want to reconsider them, all we have to do is just to change the Kernel function. The computational complexity depends on the number of support vectors not on the dimension of the Kernel function. null</Paragraph>
    </Section>
    <Section position="9" start_page="22" end_page="24" type="sub_section">
      <SectionTitle>
4.9 Future Work
</SectionTitle>
      <Paragraph position="0"> The simplest and most effective way to achieve better accuracy is to increase the training data. However, the proposed method that uses all candidates that form dependency relation requires a great amount of time to compute the separating hyperplaneas the size of the training data increases. The experiments given in this paper have actually taken long  training time 3 To handle large size of training data, we have to select only the related portion of examples that are effective for the analysis. This will reduce the training overhead as well as the analysis time. The committee-based approach discussed section 4.7 is one method of coping with this problem. For future research, to reduce the computational overhead, we will work on methods for sample selection as follows: null * Introduction of constraints on nondependency null Some pairs of chunks need not consider since there is no possibility of dependency between them from grammatical constraints. Such pairs of chunks are not necessary to use as negative examples in the training phase. For example, a chunk within quotation marks may not modify a chunk that locates outside of the quotation marks. Of course, we have to be careful in introducing such constraints, and they should be learned from existing corpus.</Paragraph>
      <Paragraph position="1"> * Integration with other simple models Suppose that a computationally light and moderately accuracy learning model is obtainable (there are actually such systems based on probabilistic parsing models). We can use the system to output some redundant parsing results and use only those results for the positive and negative examples. This is another way to reduce the size of training data.</Paragraph>
      <Paragraph position="2"> * Error-driven data selection We can start with a small size of training data with a small size of feature set. Then, by analyzing held-out training data and selecting the features that affect the parsing accuracy. This kind of gradual increase of training data and feature set will be another method for reducing the computational overhead.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>