<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1026">
  <Title>Japanese Dependency Structure Analysis Based on Maximum Entropy Models</Title>
  <Section position="4" start_page="196" end_page="201" type="evalu">
    <SectionTitle>
3 Experiments and Discussion
</SectionTitle>
    <Paragraph position="0"> In our experiment, we used the Kyoto University text corpus (version 2) (Kurohashi and Nagao, 1997), a tagged corpus of the Mainichi newspaper.</Paragraph>
    <Paragraph position="1"> For training we used 7,958 sentences from newspaper articles appearing from January 1st to January 8th, and for testing we used 1,246 sentences from articles appearing on January 9th. The input sentences were morphologically analyzed and their bunsetsus were identified. We assumed that this preprocessing was done correctly before parsing input sentences. If we used automatic morphological analysis and bunsetsu identification, the parsing accuracy would not decrease so much because the rightmost element in a bunsetsu is usually a case marker, a verb ending, or a adjective ending, and each of these is easily recognized. The automatic preprocessing by using public domain  Proceedings of EACL '99 tools, for example, can achieve 97% for morphological analysis (Kitauchi et al., 1998) and 99% for bunsetsu identification (Murata et al., 1998).</Paragraph>
    <Paragraph position="2"> We employed the Maximum Entropy tool made by Ristad (Ristad, 1998), which requires one to specify the number of iterations for learning. We set this number to 400 in all our experiments.</Paragraph>
    <Paragraph position="3"> In the following sections, we show the features used in our experiments and the results. Then we describe some interesting statistics that we found in our experiments. Finally, we compare our work with some related systems.</Paragraph>
    <Section position="1" start_page="197" end_page="199" type="sub_section">
      <SectionTitle>
3.1 Results of Experiments
</SectionTitle>
      <Paragraph position="0"> The features used in our experiments are listed in Tables 1 and 2. Each row in Table 1 contains a feature type, feature values, and an experimental result that will be explained later. Each feature consists of a type and a value. The features are basically some attributes of a bunsetsu itself or those between bunsetsus. We call them 'basic features.' The list is expanded from tIaruno's list (Haruno et al., 1998). The features in the list are classified into five categories that are related to the &amp;quot;Head&amp;quot; part of the anterior bunsetsu (category &amp;quot;a&amp;quot;), the '~rype&amp;quot; part of the anterior bunsetsu (category &amp;quot;b&amp;quot;), the &amp;quot;Head&amp;quot; part of the posterior bunsetsu (category &amp;quot;c&amp;quot;), the '~l~ype &amp;quot; part of the posterior bunsetsu (category &amp;quot;d&amp;quot;), and the features between bunsetsus (category &amp;quot;e&amp;quot;) respectively. The term &amp;quot;Head&amp;quot; basically means a right-most content word in a bunsetsu, and the term &amp;quot;Type&amp;quot; basically means a function word following a &amp;quot;Head&amp;quot; word or an inflection type of a &amp;quot;Head&amp;quot; word. The terms are defined in the following paragraph. The features in Table 2 are combinations of basic features ('combined features'). They are represented by the corresponding category name of basic features, and each feature set is represented by the feature numbers of the corresponding basic features. They are classified into nine categories we constructed manually. For example, twin features are combinations of the features related to the categories %&amp;quot; and &amp;quot;c.&amp;quot; Triplet, quadruplet and quintuplet features basically consist of the twin features plus the features of the remainder categories &amp;quot;a,&amp;quot; &amp;quot;d&amp;quot; and &amp;quot;e.&amp;quot; The total number of features is about 600,000. Among them, 40,893 were observed in the training corpus, and we used them in our experiment.</Paragraph>
      <Paragraph position="1"> The terms used in the table are the following: Anterior: left bunsetsu of the dependency Posterior: right bunsetsu of the dependency Head: the rightmost word in a bunsetsu other than those whose major part-of-speech 2 category is &amp;quot;~ (special marks),&amp;quot; &amp;quot;1~ (postpositional particles),&amp;quot; or &amp;quot;~ (suffix)&amp;quot; 2Part-of-speech categories follow those of JU-MAN(Kurohashi and Nagao, 1998).</Paragraph>
      <Paragraph position="2"> Head-Lex: the fundamental form (uninflected form) of the head word. Only words with a frequency of three or more are used.</Paragraph>
      <Paragraph position="3"> Head-Inf: the inflection type of a head Type: the rightmost word other than those whose major part-of-speech category is &amp;quot;~ (special marks).&amp;quot; If the major category of the word is neither &amp;quot;IIJJ~-~-\] (post-positional particles)&amp;quot; nor &amp;quot;~\[~:~. (suffix),&amp;quot; and the word is inflectable 3, then the type is represented by the inflection type.</Paragraph>
      <Paragraph position="4"> JOStiIl: the rightmost post-positional particle in the bunsetsu JOSttI2: the second rightmost post-positional particle in the bunsetsu if there are two or more post-positional particles in the bunsetsu TOUTEN, WA: TOUTEN means if a comma (Touten) exists in the bunsetsu. WA means if the word WA (a topic marker) exists in the bunsetsu BW: BW means &amp;quot;between bunsetsus&amp;quot; BW-Distance: the distance between the bunsetsus null BW-TOUTEN: if TOUTEN exists between bunsetsus BW-IDto-Anterior-Type: BW-IDto-Anterior-Type means if there is a bunsetsu whose type is identical to that of the anterior bunsetsu between bunsetsus BW-IDto-Anterior-Type-Head-P OS: the part-of-speech category of the head word of the bunsetsu of &amp;quot;BW-IDto-Anterior-Type&amp;quot; BW-IDto-Posterior-Head: if there is between bunsetsus a bunsetsu whose head is identical to that of the posterior bunsetsu BW-IDto-Posterior- Head-Type(String): the lexical information of the bunsetsu &amp;quot;BWIDto-Posterior-Head&amp;quot; null The results of our experiment are listed in Table 3. The dependency accuracy means the percentage of correct dependencies out of all dependencies. The sentence accuracy means the percentage of sentences in which all dependencies were analyzed correctly. We used input sentences that had already been morphologically analyzed and for which bunsetsus had been identified. The first line in Table 3 (deterministic) shows the accuracy achieved when the test sentences were analyzed deterministically (beam width k = 1). The second line in Table 3 (best beam search) shows the best accuracy among the experiments when changing the beam breadth k from 1 to 20. The best accuracy was achieved when k = 11, although the variation in accuracy was very small. This result supports assumption (4) in Chapter 1 because</Paragraph>
      <Paragraph position="6"> The same values as those of feature number 1.</Paragraph>
      <Paragraph position="7"> The same values as those of feature number 2.</Paragraph>
      <Paragraph position="8"> The same values as those of feature number 3.</Paragraph>
      <Paragraph position="9"> The same values as those of feature number 4.</Paragraph>
      <Paragraph position="10"> The same values as those of feature number 5.</Paragraph>
      <Paragraph position="11"> The same values as those of feature number 6.</Paragraph>
      <Paragraph position="12"> The same values as those of feature number 7.</Paragraph>
      <Paragraph position="13"> The same values as those of feature number 8.</Paragraph>
      <Paragraph position="14"> The same values as those of feature number 9.</Paragraph>
      <Paragraph position="15"> The same values as those of feature number 10.</Paragraph>
      <Paragraph position="16"> The same values as those of feature number 11.</Paragraph>
      <Paragraph position="17"> The same values as those of feature number 12.</Paragraph>
      <Paragraph position="18"> The same values as those of feature number 13.</Paragraph>
      <Paragraph position="19"> The same values as those of feature number 14.</Paragraph>
      <Paragraph position="20"> The same values as those of feature number 15.</Paragraph>
      <Paragraph position="21"> A(1), B~2 ~ 5), C(6 or more) (3) \[nil\], \[extstJ (2~ \[hill, \[exist\] (27 \[nil\], close, open, open-close (4) \[nil\], \[existJ (2) The same values as those of feature number 2. The same values as those of feature number 3.</Paragraph>
      <Paragraph position="22"> The same values as those of feature number 4.</Paragraph>
      <Paragraph position="23"> The same values as those of feature number 5.</Paragraph>
      <Paragraph position="24"> \[nilJ, \[exist\] (2) The same values as those of feature number 6. The same values as those of feature number 7. The same values as those of feature number 8.</Paragraph>
      <Paragraph position="26"> Combination type Twin features: related to the &amp;quot;Type&amp;quot; part of the anterior bunsetsu and the &amp;quot;Head&amp;quot; part of the posterior bunsetsu.</Paragraph>
      <Paragraph position="27"> Triplet features: basically consist of the twin features plus the features between bunsetsus.</Paragraph>
      <Paragraph position="28"> Quadruplet features: basically consist of the twin features plus the features related to the &amp;quot;Head&amp;quot; part of the anterior bunsetsu, and the &amp;quot;Type&amp;quot; part of the posterior</Paragraph>
      <Paragraph position="30"> Quintuplet features: (a, bl, b2, c, d) (a, c) = {(2, 17), (3, 18)}, 86.96% (-0.18%) basically consist of the (bl, b2) = {(9, 11), (I0, 12)}, d = {21,22,23} quadruplet features plus the (a, b, c, d, e) (a, c) = {(1, 16), (2, 17), (3, 18)}, features between bunsetsus. (b, d) = {(6, 21), (7, 22), (8, 23}, e = 31  it shows that the previous context has almost no effect on the accuracy. The last line in Table 3 represents the accuracy when we assumed that every bunsetsu depended on the next one (baseline).</Paragraph>
      <Paragraph position="31"> Figure 1 shows the relationship between the sentence length (the number of bunsetsus) and the dependency accuracy. The data for sentences longer than 28 segments are not shown, because there was at most one sentence of each length.</Paragraph>
      <Paragraph position="32"> Figure 1 shows that the accuracy degradation due to increasing sentence length is not significant.</Paragraph>
      <Paragraph position="33"> For the entire test corpus the average running time on a SUN Sparc Station 20 was 0.08 seconds per sentence.</Paragraph>
    </Section>
    <Section position="2" start_page="199" end_page="200" type="sub_section">
      <SectionTitle>
3.2 Features and Accuracy
</SectionTitle>
      <Paragraph position="0"> This section describes how much each feature set contributes to improve the accuracy.</Paragraph>
      <Paragraph position="1"> The rightmost column in Tables 1 and 2 shows the performance of the analysis without each feature set. In parenthesis, the percentage of improvement or degradation to the formal experiment is shown. In the experiments, when a basic feature was deleted, the combined features that included the basic feature were also deleted.</Paragraph>
      <Paragraph position="2"> We also conducted some experiments in which several types of features were deleted together.</Paragraph>
      <Paragraph position="3"> The results are shown in Table 4. All of the results in the experiments were carried out deterministically (beam width k = 1).</Paragraph>
      <Paragraph position="4"> The results shown in Table 1 were very close to our expectation. The most useful features are the type of the anterior bunsetsu and the part-of-speech tag of the head word on the posterior bunsetsu. Next important features are the distance between bunsetsus, the existence of punctuation in the bunsetsu, and the existence of brackets. These results indicate preferential rules with respect to the features.</Paragraph>
      <Paragraph position="5"> The accuracy obtained with the lexical features of the head word was better than that without them. In the experiment with the features, we found many idiomatic expressions, for example, &amp;quot;~,, 15-C (oujile, according to)-- b}~b (kimeru, decide)&amp;quot; and &amp;quot;~'~&amp;quot; (katachi_de, in the form of)-- ~b~ (okonawareru, be held).&amp;quot; We would expect to collect more of such expressions if we use more training data.</Paragraph>
      <Paragraph position="6"> The experiments without some combined features are reported in Tables 2 and 4. As can be seen from the results, the combined features are very useful to improve the accuracy. We used these combined features in addition to the basic features because we thought that the basic features were actually related to each other. Without the combined features, the features are independent of each other in the maximum entropy framework.</Paragraph>
      <Paragraph position="7"> We manually selected combined features, which are shown in Table 2. If we had used all combi-</Paragraph>
    </Section>
    <Section position="3" start_page="200" end_page="200" type="sub_section">
      <SectionTitle>
Features
</SectionTitle>
      <Paragraph position="0"> Without features 1 and 16 (lexical information about the head word) Without features 35 to 43 Without quadruplet and quintuplet features Without triplet, quadruplet, and quintuplet features Without all combinations</Paragraph>
      <Paragraph position="2"> nations, the number of combined features would have been very large, and the training would not have been completed on the available machine. Furthermore, we found that the accuracy decreased when several new features were added in our preliminary experiments. So, we should not use all combinations of the basic features. We selected the combined features based on our intuition. null In our future work, we believe some methods for automatic feature selection should be studied. One of the simplest ways of selecting features is to select features according to their frequencies in the training corpus. But using this method in our current experiments, the accuracy decreased in all of the experiments. Other methods that have been proposed are one based on using the gain (Berger et al., 1996) and an approximate method for selecting informative features (Shirai et al., 1998a), and several criteria for feature selection were proposed and compared with other criteria (Berger and Printz, 1998). We would like to try these methods.</Paragraph>
      <Paragraph position="3"> Investigating the sentences which could not be analyzed correctly, we found that many of those sentences included coordinate structures. We believe that coordinate structures can be detected to a certain extent by considering new features which take a wide range of information into account.</Paragraph>
    </Section>
    <Section position="4" start_page="200" end_page="200" type="sub_section">
      <SectionTitle>
3.3 Number of Training Data and
Accuracy
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the relationship between the number of training data (the number of sentences) and the accuracy. This figure shows dependency accuracies for the training corpus and the test corpus.</Paragraph>
      <Paragraph position="1"> Accuracy of 81.84% was achieved even with a very small training set (250 sentences). We believe that this is due to the strong characteristic of the maximum entropy framework to the data sparseness problem. From the learning curve, we can expect a certain amount of improvement if we have more training data.</Paragraph>
    </Section>
    <Section position="5" start_page="200" end_page="201" type="sub_section">
      <SectionTitle>
3.4 Comparison with Related Works
</SectionTitle>
      <Paragraph position="0"> This section compares our work with related statistical dependency structure analyses in Japanese.</Paragraph>
      <Paragraph position="1"> Comparison with Shirai's work (Shirai et al., 1998b) Shirai proposed a framework of statistical language modeling using several corpora: the EDR corpus, RWC corpus, and Kyoto University corpus. He combines a parser based on a hand-made CFG and a probabilistic dependency model. He also used the maximum entropy model to estimate the dependency probabilities between two or three post-positional particles and a verb. Accuracy of 84.34% was achieved using 500 test sentences of length 7 to 9 bunsetsus. In both his and our experiments, the input sentences were morphologically analyzed and their bunsetsus were identified. The comparison of the results cannot strictly be done because the conditions were different. However, it should be noted that the accuracy achieved by our model using sentences of the same length was about 3% higher than that of Shirai's model, although we used a much smaller set of training data. We believe that it is because his approach is based on a hand-made CFG.</Paragraph>
      <Paragraph position="2"> Comparison with Ehara's work (Ehara, 1998) Ehara also used the Maximum Entropy model, and a set of similar kinds of features to ours. However, there is a big difference in the number of features between Ehara's model and ours. Besides the difference in the number of basic features, Ehara uses only the combination of two features, but we also use triplet, quadruplet, and quintuplet features. As shown in Section 3.2, the accuracy increased more than 5% using triplet or larger combinations. We believe that the difference in the combination features between Ehara's model and ours may have led to the difference in the accuracy. The accuracy of his system was about 10% lower than ours. Note that Ehara used TV news articles for training and testing, which are different from our corpus. The average sentence length in those articles was 17.8, much longer than that (average: 10.0) in the Kyoto University text corpus.</Paragraph>
      <Paragraph position="3"> Comparison with Fujio's work (Fujio and Matsumoto, 1998) and Haruno's work (Haruno et al., 1998) Fujio used the Maximum Likelihood model with similar features to our model in his parser. Haruno proposed a parser that uses decision tree  k=l) models and a boosting method. It is difficult to directly compare these models with ours because they use a different corpus, the EDR corpus which is ten times as large as our corpus, for training and testing, and the way of collecting test data is also different. But they reported an accuracy of around 85%, which is slightly worse than our model.</Paragraph>
      <Paragraph position="4"> We carried out two experiments using almost the same attributes as those used in their experiments. The results are shown in Table 5, where the lines &amp;quot;Feature set(l)&amp;quot; and &amp;quot;Feature set(2)&amp;quot; show the accuracies achieved by using Fujio's attributes and Haruno's attributes respectively. Considering that both results are around 85% to 86%, which is about the same as ours. From these experiments, we believe that the important factor in the statistical approaches is not the model, i.e. Maximum Entropy, Maximum Likelihood, or Decision Tree, but the feature selection. However, it may be interesting to compare these models in terms of the number of training data, as we can imagine that some models are better at coping with the data sparseness problem than others. This is our future work.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>