File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-1022_metho.xml

Size: 19,394 bytes

Last Modified: 2025-10-06 14:10:17

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1022">
  <Title>Dependency Parsing of Japanese Spoken Monologue Based on Clause Boundaries. Tomohiro Ohno, Shigeki Matsubara, Hideki Kashioka</Title>
  <Section position="4" start_page="170" end_page="172" type="metho">
    <SectionTitle>
3 Dependency Parsing Based on Clause
Boundaries
</SectionTitle>
    <Paragraph position="0"> In accordance with the assumption described in Section 2, our method takes as input a transcribed sentence on which morphological analysis, clause boundary detection, and bunsetsu segmentation have already been performed (see footnote 4).</Paragraph>
    <Paragraph position="1"> Footnote 3: Asu-Wo-Yomu is a collection of transcriptions of a TV commentary program of the Japan Broadcasting Corporation (NHK), in which the commentator speaks on a current social issue for 10 minutes. Footnote 4: It is difficult to divide a monologue into sentences in advance because monologues contain no clear sentence breaks. However, since methods for detecting sentence boundaries have already been proposed (Huang and Zweig, 2002; Shitaoka et al., 2004), we assume that sentence boundaries can be detected automatically before dependency parsing.</Paragraph>
    <Paragraph position="2"> Dependency parsing is executed in the following two steps: 1. Clause-level parsing: The internal dependency relations of each clause boundary unit are identified for every clause boundary unit in a sentence.</Paragraph>
    <Paragraph position="3"> 2. Sentence-level parsing: The dependency relations whose dependent is the final bunsetsu of a clause boundary unit are identified.</Paragraph>
    <Paragraph position="4"> In this paper, we describe a sequence of clause boundary units in a sentence as $C_1 \cdots C_m$, a sequence of bunsetsus in a clause boundary unit $C_i$ as $b^i_1 \cdots b^i_{n_i}$, a dependency relation in which the dependent bunsetsu is a bunsetsu $b^i_k$ as $dep(b^i_k)$, and a dependency structure of a sentence as $\{dep(b^1_1), \cdots, dep(b^m_{n_m-1})\}$.</Paragraph>
    <Paragraph position="5"> First, whenever a clause boundary unit $C_i$ is input, our method parses the dependency structure $\{dep(b^i_1), \cdots, dep(b^i_{n_i-1})\}$ within that unit. Then, it parses the dependency structure $\{dep(b^1_{n_1}), \cdots, dep(b^{m-1}_{n_{m-1}})\}$, which is the set of dependency relations whose dependent bunsetsu is the final bunsetsu of each clause boundary unit in the input sentence. In both procedures, our method assumes the following three syntactic constraints: 1. No dependency is directed from right to left. 2. Dependencies do not cross each other.</Paragraph>
    <Paragraph position="6"> 3. Each bunsetsu, except the final one in a sentence, depends on exactly one bunsetsu.</Paragraph>
    <Paragraph position="7"> These constraints are usually used for Japanese dependency parsing.</Paragraph>
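    <Paragraph> The three constraints can be checked mechanically. The following is a minimal sketch, not part of the original method description: the function name violated_constraints and the encoding of a parse as a list of 0-based head indices (None for the final bunsetsu) are assumptions made for illustration.

```python
def violated_constraints(heads):
    """Return the names of the syntactic constraints that a parse violates.

    heads[k] is the 0-based index of the head of bunsetsu k (an assumed
    encoding); the final bunsetsu has no head and is given as None.
    """
    n = len(heads)
    found = set()
    # Constraint 3: every bunsetsu except the final one has a head,
    # and the final bunsetsu has none.
    if n == 0 or heads[-1] is not None or any(h is None for h in heads[:-1]):
        found.add("single-head")
    arcs = [(k, h) for k, h in enumerate(heads) if h is not None]
    # Constraint 1: the head must lie to the right of its dependent.
    if any(k >= h for k, h in arcs):
        found.add("left-to-right")
    # Constraint 2: two arcs cross when exactly one end of the second
    # lies strictly inside the span of the first.
    spans = [(min(k, h), max(k, h)) for k, h in arcs]
    if any(b > c > a and d > b for a, b in spans for c, d in spans):
        found.add("non-crossing")
    return found
```

    For example, the head list [2, 3, 3, None] violates only the non-crossing constraint, because the arc from bunsetsu 0 to 2 crosses the arc from bunsetsu 1 to 3.</Paragraph>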
    <Section position="1" start_page="170" end_page="171" type="sub_section">
      <SectionTitle>
3.1 Clause-level Dependency Parsing
</SectionTitle>
      <Paragraph position="0"> Dependency parsing within a clause boundary unit takes the sequence of bunsetsus in an input clause boundary unit $C_i$, written $B_i$ ($= b^i_1 \cdots b^i_{n_i}$), and identifies the dependency structure $S_i$ ($= \{dep(b^i_1), \cdots, dep(b^i_{n_i-1})\}$) that maximizes the conditional probability $P(S_i \mid B_i)$. At this level, the head bunsetsu of the final bunsetsu $b^i_{n_i}$ of a clause boundary unit is not identified.</Paragraph>
      <Paragraph position="1"> Assuming that each dependency is independent of the others, $P(S_i \mid B_i)$ can be calculated as follows: $$P(S_i \mid B_i) = \prod_{k=1}^{n_i-1} P(b^i_k \xrightarrow{rel} b^i_l \mid B_i) \qquad (1)$$</Paragraph>
      <Paragraph position="3"> where $P(b^i_k \xrightarrow{rel} b^i_l \mid B_i)$ is the probability that a bunsetsu $b^i_k$ depends on a bunsetsu $b^i_l$ when the sequence of bunsetsus $B_i$ is provided. Unlike conventional stochastic sentence-by-sentence dependency parsing, in our method $B_i$ is the sequence of bunsetsus that constitutes a clause rather than a sentence. The structure $S_i$ that maximizes the conditional probability $P(S_i \mid B_i)$ is regarded as the dependency structure of $B_i$ and is calculated by dynamic programming (DP).</Paragraph>
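      <Paragraph> The text does not spell out the DP itself. One way to maximize the product of independent dependency probabilities under the three constraints is an interval DP over spans whose rightmost bunsetsu heads the span; the sketch below assumes that formulation. The name parse_clause and the callable log_prob are hypothetical, and the recurrence is a standard construction rather than the authors' stated algorithm.

```python
import math


def parse_clause(n, log_prob):
    """Return (best log-probability, heads) for a clause of n bunsetsus.

    log_prob(k, l) is an assumed callable returning log P(b_k -> b_l | B_i)
    for 0-based positions with l > k. In the result, heads[k] is the head
    of bunsetsu k; the final bunsetsu stays unattached (None) at this level.
    """
    if n == 0:
        return 0.0, []
    # best[i][j]: best score of span i..j in which every bunsetsu except j
    # has its head inside the span, so that j heads the whole span.
    best = [[-math.inf] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]
    for j in range(n):
        best[j][j] = 0.0
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # m is the leftmost direct dependent of j inside the span:
            # span i..m is headed by m, span m+1..j is headed by j.
            for m in range(i, j):
                score = best[i][m] + log_prob(m, j) + best[m + 1][j]
                if split[i][j] is None or score > best[i][j]:
                    best[i][j], split[i][j] = score, m
    # Recover the head of every non-final bunsetsu from the split points.
    heads = [None] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i == j:
            continue
        m = split[i][j]
        heads[m] = j
        stack.extend([(i, m), (m + 1, j)])
    return best[0][n - 1], heads
```

    Because every span is headed by its rightmost bunsetsu, the result is head-final and non-crossing by construction, and the running time is cubic in the number of bunsetsus in the clause.</Paragraph>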
      <Paragraph position="4"> Next, we explain the calculation of $P(b^i_k \xrightarrow{rel} b^i_l \mid B_i)$. First, the basic form of the independent word in a dependent bunsetsu is represented by $h^i_k$, its part of speech by $t^i_k$, and its type of dependency by $r^i_k$, while the basic form of the independent word in a head bunsetsu is represented by $h^i_l$ and its part of speech by $t^i_l$. Furthermore, the distance between the two bunsetsus is denoted $d^i_{kl}$. Here, if a dependent bunsetsu has one or more ancillary words, the type of dependency is the lexical form, part of speech, and conjugated form of the rightmost ancillary word; otherwise, it is the part of speech and conjugated form of the rightmost morpheme. The type of dependency $r^i_k$ is the same attribute used in our stochastic method proposed for robust dependency parsing of spoken language dialogue (Ohno et al., 2005b). The distance $d^i_{kl}$ is binary: it is either 1 or greater than 1. Incidentally, the above attributes are the same as those used by conventional stochastic dependency parsing methods (Collins, 1996; Ratnaparkhi, 1997; Fujio and Matsumoto, 1998; Uchimoto et al., 1999; Charniak, 2000; Kudo and Matsumoto, 2002).</Paragraph>
      <Paragraph position="5"> Additionally, we prepared the attribute $e^i_l$ to indicate whether $b^i_l$ is the final bunsetsu of a clause boundary unit. Since we can consider a clause boundary unit as a unit corresponding to a simple sentence, we can treat the final bunsetsu of a clause boundary unit as a sentence-end bunsetsu.</Paragraph>
      <Paragraph position="6"> The attribute that indicates whether a head bunsetsu is a sentence-end bunsetsu has often been used in conventional sentence-by-sentence parsing methods (e.g. Uchimoto et al., 1999).</Paragraph>
      <Paragraph position="7"> By using the above attributes, the conditional probability $P(b^i_k \xrightarrow{rel} b^i_l \mid B_i)$ is calculated as follows:</Paragraph>
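      <Paragraph> The display for formula (2) is not present in this copy. A plausible form, assuming it is a relative-frequency estimate conditioned on exactly the attributes introduced above ($h$, $t$, $r$, $d$, and $e$), would be:

```latex
P(b^i_k \xrightarrow{rel} b^i_l \mid B_i)
  \cong P(b^i_k \xrightarrow{rel} b^i_l \mid h^i_k, h^i_l, t^i_k, t^i_l, r^i_k, d^i_{kl}, e^i_l)
  = \frac{F(b^i_k \xrightarrow{rel} b^i_l,\; h^i_k, h^i_l, t^i_k, t^i_l, r^i_k, d^i_{kl}, e^i_l)}
         {F(h^i_k, h^i_l, t^i_k, t^i_l, r^i_k, d^i_{kl}, e^i_l)}
```

    Under this reading, the numerator counts training pairs with these attribute values in which the dependency holds, and the denominator counts all training pairs with these attribute values.</Paragraph>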
      <Paragraph position="9"> Note that F is a co-occurrence frequency function.</Paragraph>
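      <Paragraph> Estimating a probability from co-occurrence frequencies can be sketched as follows. This is an illustration only: the class FrequencyModel, its method names, and the encoding of the attributes as a tuple are assumptions, and the smoothing step used by the paper is not reproduced.

```python
from collections import Counter


class FrequencyModel:
    """Relative-frequency estimate of P(dependency | attributes)."""

    def __init__(self):
        self.with_dep = Counter()  # F(dependency holds, attributes)
        self.total = Counter()     # F(attributes)

    def observe(self, attributes, depends):
        """Count one bunsetsu pair from the training data.

        attributes is a hashable tuple such as
        (h_k, h_l, t_k, t_l, r_k, d_kl, e_l); depends is True when the
        first bunsetsu of the pair depends on the second.
        """
        self.total[attributes] += 1
        if depends:
            self.with_dep[attributes] += 1

    def probability(self, attributes):
        """Return F(dep, attributes) / F(attributes), or None if unseen."""
        if self.total[attributes] == 0:
            return None  # sparse case, where smoothing would be needed
        return self.with_dep[attributes] / self.total[attributes]
```

    Returning None for unseen attribute combinations makes the sparse-data case explicit instead of silently assigning a probability of zero.</Paragraph>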
      <Paragraph position="10"> In order to resolve the sparse data problems caused by estimating $P(b^i_k \xrightarrow{rel} b^i_l \mid B_i)$ with formula (2), we adopted the smoothing method described by Fujio and Matsumoto (Fujio and Matsumoto, 1998).</Paragraph>
      <Paragraph position="12"/>
    </Section>
    <Section position="2" start_page="171" end_page="172" type="sub_section">
      <SectionTitle>
3.2 Sentence-level Dependency Parsing
</SectionTitle>
      <Paragraph position="0"> Here, the head bunsetsu of the final bunsetsu of a clause boundary unit is identified. Let $B$ ($= B_1 \cdots B_m$) be the sequence of bunsetsus of one sentence and $S_{fin}$ be the set of dependency relations whose dependent bunsetsu is the final bunsetsu of a clause boundary unit, $\{dep(b^1_{n_1}), \cdots, dep(b^{m-1}_{n_{m-1}})\}$; then the $S_{fin}$ that maximizes $P(S_{fin} \mid B)$ is calculated by DP. $P(S_{fin} \mid B)$ can be calculated as follows: $$P(S_{fin} \mid B) = \prod_{i=1}^{m-1} P(b^i_{n_i} \xrightarrow{rel} b^j_l \mid B)$$</Paragraph>
      <Paragraph position="2"> where $P(b^i_{n_i} \xrightarrow{rel} b^j_l \mid B)$ is the probability that a bunsetsu $b^i_{n_i}$ depends on a bunsetsu $b^j_l$ when the sequence of the sentence's bunsetsus, $B$, is provided. Our method takes into account the dependency structures within each clause boundary unit, which have already been parsed. That is, the method does not consider every bunsetsu on the right-hand side as a candidate head; within each clause boundary unit, it considers only those candidate heads whose dependency relation would not cross any relation in the previously parsed dependency structures. In the case of Fig. 1, the method assumes that only three bunsetsus, among them those glossed as &quot;the ratio of people&quot; and &quot;is,&quot; can be the head bunsetsu of the bunsetsu glossed as &quot;advocating.&quot; In addition, $P(b^i_{n_i} \xrightarrow{rel} b^j_l \mid B)$ is calculated as in Eq. (5). Equation (5) uses all of the attributes used in Eq. (2), in addition to the attribute $s^j_l$, which indicates whether the head bunsetsu of $b^j_l$ is the final bunsetsu of a sentence. Here, we take into account the analysis result that about 70% of the final bunsetsus of clause boundary units depend on the final bunsetsu of another clause boundary unit (see footnote 5), and therefore also use the attribute $e^j_l$ at this phase.</Paragraph>
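      <Paragraph> The narrowing of head candidates described above can be sketched as follows. The function names uncovered and head_candidates, and the representation of each clause boundary unit as a list of clause-local head indices (None for its final bunsetsu), are assumptions for illustration. The sketch filters candidates only against clause-internal relations; keeping the sentence-level relations from crossing one another is left to the DP.

```python
def uncovered(local_heads):
    """Positions in one clause that no clause-internal arc passes over.

    local_heads[k] is the clause-local index of the head of bunsetsu k,
    or None for the final bunsetsu of the clause (an assumed encoding).
    """
    free = []
    for l in range(len(local_heads)):
        covered = any(
            h is not None and l > k and h > l
            for k, h in enumerate(local_heads)
        )
        if not covered:
            free.append(l)
    return free


def head_candidates(clauses, i):
    """Candidate heads, as (clause index, local index) pairs, for the
    final bunsetsu of clause i. Only bunsetsus in later clauses that are
    not passed over by a clause-internal arc qualify, since attaching to
    any other bunsetsu would cross a previously parsed relation."""
    return [
        (j, l)
        for j in range(i + 1, len(clauses))
        for l in uncovered(clauses[j])
    ]
```

    With the hypothetical input [[1, None], [2, 2, None], [None]], the middle bunsetsu of the second clause is passed over by the relation from that clause's first bunsetsu to its last, so the final bunsetsu of the first clause keeps three candidates: (1, 0), (1, 2), and (2, 0).</Paragraph>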
      <Paragraph position="4"/>
    </Section>
  </Section>
  <Section position="5" start_page="172" end_page="173" type="metho">
    <SectionTitle>
4 Parsing Experiment
</SectionTitle>
    <Paragraph position="0"> To evaluate the effectiveness of our method for Japanese spoken monologue, we conducted an experiment on dependency parsing.</Paragraph>
    <Section position="1" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
4.1 Outline of Experiment
</SectionTitle>
      <Paragraph position="0"> We used the spoken monologue corpus Asu-Wo-Yomu, annotated with information on morphological analysis, clause boundary detection, bunsetsu segmentation, and dependency analysis (see footnote 6). Table 2 shows the data used for the experiment. We used 500 sentences as the test data. Although our method assumes that a dependency relation does not cross clause boundaries, the test data contained 152 dependency relations that contradicted this assumption. This means that the dependency accuracy of our method cannot exceed 96.8% (4,646/4,798). We used 5,532 sentences as the training data.</Paragraph>
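      <Paragraph> The accuracy ceiling quoted above follows directly from the two counts in the text. A minimal sketch of the arithmetic, with the hypothetical helper name accuracy_ceiling:

```python
def accuracy_ceiling(total_relations, crossing_relations):
    """Best attainable accuracy when no relation that crosses a clause
    boundary can ever be parsed correctly."""
    return (total_relations - crossing_relations) / total_relations


# 4,798 dependency relations in the test data, 152 of which cross a
# clause boundary.
print(f"{accuracy_ceiling(4798, 152):.1%}")  # prints 96.8%
```
      </Paragraph>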
      <Paragraph position="1"> Footnote 5: We analyzed the 200 sentences described in Section 2.3 and confirmed that 70.6% (522/751) of the final bunsetsus of clause boundary units depended on the final bunsetsu of another clause boundary unit.</Paragraph>
      <Paragraph position="2"> Footnote 6: The specifications of these annotations are in accordance with those described in Section 2.3.</Paragraph>
      <Paragraph position="3"> To evaluate our method comparatively, we parsed the above-mentioned data with the following two methods and measured the parsing time and parsing accuracy of each.</Paragraph>
      <Paragraph position="4"> * Our method: First, our method detects the clause boundaries in the bunsetsu sequence of an input sentence and identifies all clause boundary units in the sentence by performing clause boundary analysis with CBAP (Maruyama et al., 2004). It then executes the dependency parsing described in Section 3.</Paragraph>
      <Paragraph position="5"> * Conventional method: This method parses a whole sentence at once without dividing it into clause boundary units. Here, the probability that a bunsetsu depends on another bunsetsu, given the sequence of bunsetsus of the sentence, is calculated as in Eq. (5) with the attribute e removed. We implemented this conventional method based on previous research (Fujio and Matsumoto, 1998).</Paragraph>
    </Section>
    <Section position="2" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
4.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The parsing times of both methods are shown in Table 3. The parsing speed of our method improves by about 5 times on average in comparison with the conventional method. Here, the parsing time of our method includes the time taken not only for the dependency parsing but also for the clause boundary analysis. The average time taken for clause boundary analysis was about 1.2 milliseconds per sentence. Therefore, the time cost of performing clause boundary analysis as a preprocessing step for dependency parsing can be considered small enough to disregard. Figure 2 shows the relation between sentence length and parsing time for both methods, and it is clear from this figure that the parsing time of the conventional method begins to increase rapidly when the length of a sentence reaches 12 or more bunsetsus. In contrast, the parsing time of our method changes little with sentence length. Since the sentences used in the experiment are composed of 11.8 bunsetsus on average, this result shows that our method is suitable for reducing the parsing time of monologue sentences that are longer than average.
      Table 4: Parsing accuracy of both methods
      Dependent bunsetsu | Our method | Conventional method
      Bunsetsu within a clause boundary unit (except final bunsetsu) | 88.2% (2,701/3,061) | 84.7% (2,592/3,061)
      Final bunsetsu of a clause boundary unit | 65.6% (1,140/1,737) | 63.3% (1,100/1,737)
      Total | 80.1% (3,841/4,798) | 76.9% (3,692/4,798)</Paragraph>
      <Paragraph position="1"> Table 4 shows the parsing accuracy of both methods. The first line of Table 4 shows the parsing accuracy for all bunsetsus within clause boundary units except the final bunsetsus of the clause boundary units. The second line shows the parsing accuracy for the final bunsetsus of all clause boundary units except the sentence-end bunsetsus. We confirmed that our method parses with higher accuracy than the conventional method. Table 5 shows the accuracy of the clause boundary analysis executed by CBAP. Since both precision and recall are high, we can assume that the clause boundary analysis has almost no harmful influence on the subsequent dependency parsing.</Paragraph>
      <Paragraph position="2"> As mentioned above, it is clear that our method is more effective than the conventional method in shortening parsing time and increasing parsing accuracy.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="173" end_page="174" type="metho">
    <SectionTitle>
5 Discussions
</SectionTitle>
    <Paragraph position="0"> Our method assumes that dependency relations within a clause boundary unit do not cross clause boundaries. Due to this assumption, the method cannot correctly parse the dependency relations over clause boundaries. However, the experimental results indicated that the accuracy of our method was higher than that of the conventional method.</Paragraph>
    <Paragraph position="1"> In this section, we first discuss the effect of our method on the parsing accuracy for bunsetsus within clause boundary units (except the final bunsetsus) and for the final bunsetsus of clause boundary units. Next, we discuss the problem of our method's inability to parse dependency relations over clause boundaries.</Paragraph>
    <Section position="1" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
5.1 Parsing Accuracy for Bunsetsu within a
Clause Boundary Unit (except final bunsetsu)
</SectionTitle>
      <Paragraph position="0"> Table 6 compares the parsing accuracies of the conventional method and our method for bunsetsus within clause boundary units (except the final bunsetsus). There are 3,061 bunsetsus within clause boundary units except the final bunsetsu, among which 2,499 were correctly parsed by both methods. There were 202 dependency relations correctly parsed by our method but incorrectly parsed by the conventional method. This means that our method can narrow down the candidates for a head bunsetsu.</Paragraph>
      <Paragraph position="1"> In contrast, 93 dependency relations were correctly parsed solely by the conventional method. Among these, 46 were dependency relations over clause boundaries, which cannot in principle be parsed by our method. This means that our method can correctly parse almost all of the dependency relations that the conventional method can correctly parse except for dependency relations over clause boundaries.</Paragraph>
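      <Paragraph> The counts in this comparison can be cross-checked against the accuracies in Table 4. The sketch below is illustrative; the function name breakdown and its dictionary keys are hypothetical.

```python
def breakdown(both, ours_only, conv_only, total):
    """Derive per-method accuracy from an agreement breakdown.

    both:      relations parsed correctly by both methods
    ours_only: relations parsed correctly only by our method
    conv_only: relations parsed correctly only by the conventional method
    total:     all relations in the category
    """
    return {
        "ours": (both + ours_only) / total,
        "conventional": (both + conv_only) / total,
        "neither": total - both - ours_only - conv_only,
    }


# Counts reported for bunsetsus within clause boundary units.
result = breakdown(both=2499, ours_only=202, conv_only=93, total=3061)
print(f"{result['ours']:.1%} {result['conventional']:.1%}")  # prints 88.2% 84.7%
```

      The same counts imply that 267 of the 3,061 relations were parsed correctly by neither method, and that 47 of the 93 relations parsed correctly only by the conventional method (93 minus the 46 over clause boundaries) lie within a clause boundary unit.</Paragraph>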
    </Section>
    <Section position="2" start_page="173" end_page="174" type="sub_section">
      <SectionTitle>
5.2 Parsing Accuracy for Final Bunsetsu of a
Clause Boundary Unit
</SectionTitle>
      <Paragraph position="0"> We can see from Table 4 that, for both methods, the parsing accuracy for the final bunsetsus of clause boundary units is much worse than that for bunsetsus within the clause boundary units (except the final bunsetsus). This means that it is difficult to identify dependency relations whose dependent bunsetsu is the final one of a clause boundary unit. Table 7 compares how the two methods parse the dependency relations when the dependent bunsetsu is the final bunsetsu of a clause boundary unit. There are 1,737 dependency relations whose dependent bunsetsu is the final bunsetsu of a clause boundary unit, among which 1,037 were correctly parsed by both methods. The number of dependency relations correctly parsed only by our method was 103. This number is higher than the 63 (1,100 - 1,037) dependency relations correctly parsed only by the conventional method. This result might be attributed to the effect of our method; that is, our method narrows down the candidates for a head bunsetsu based on the dependency structures of the clause boundary units, which are parsed first.</Paragraph>
    </Section>
    <Section position="3" start_page="174" end_page="174" type="sub_section">
      <SectionTitle>
5.3 Dependency Relations over Clause
Boundaries
</SectionTitle>
      <Paragraph position="0"> Table 8 shows the accuracy of both methods for parsing dependency relations over clause boundaries. Since our method parses based on the assumption that such dependency relations do not exist, it cannot, in principle, parse any of them correctly. Although the experimental results show that our method identified two dependency relations over clause boundaries, these were identified only because dependency parsing for some sentences was performed based on wrong clause boundaries provided by the clause boundary analysis. On the other hand, the conventional method correctly parsed 46 of the 152 dependency relations that crossed a clause boundary in the test data. Since the conventional method could correctly parse only 30.3% of those dependency relations, we can see that such dependency relations are inherently difficult to identify.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="174" end_page="174" type="metho">
    <SectionTitle>
6 Related Works
</SectionTitle>
    <Paragraph position="0"> Since monologue sentences tend to be long and have complex structures, it is important to take these features into account. Although there have been very few studies on parsing monologue sentences, some studies on parsing written language have dealt with long-sentence parsing. To resolve the syntactic ambiguity of a long sentence, some of them have focused on the &quot;clause.&quot; First, some studies focused on compound clauses (Agarwal and Boggess, 1992; Kurohashi and Nagao, 1994). These tried to improve the parsing accuracy of long sentences by identifying the boundaries of coordinate structures. Next, other research efforts utilized the three categories into which various types of subordinate clauses are hierarchically classified based on the &quot;scope-embedding preference&quot; of Japanese subordinate clauses (Shirai et al., 1995; Utsuro et al., 2000). Furthermore, Kim and Lee (2004) divided a sentence into &quot;S(ubject)-clauses,&quot; which were defined as a group of words containing several predicates and their common subject. The above studies have attempted to reduce the parsing ambiguity between specific types of clauses in order to improve the parsing accuracy of an entire sentence.</Paragraph>
    <Paragraph position="1"> On the other hand, our method utilizes all types of clauses without limiting itself to specific types. To improve the accuracy of long-sentence parsing, we thought that it would be more effective to exhaustively divide a sentence into all types of clauses and then parse the local dependency structure of each clause. Moreover, since our method can perform dependency parsing clause by clause, we can reasonably expect it to be applicable to incremental parsing (Ohno et al., 2005a).</Paragraph>
  </Section>
</Paper>