<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2042">
  <Title>Detection of Quotations and Inserted Clauses and its Application to Dependency Structure Analysis in Spontaneous Japanese Ryoji Hamabe</Title>
  <Section position="4" start_page="0" end_page="324" type="metho">
    <SectionTitle>
2 Problems with Dependency Structure
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="324" type="sub_section">
      <SectionTitle>
Analysis in Spontaneous Japanese
</SectionTitle>
      <Paragraph position="0"> There are many differences between written text and spontaneous speech. Consequently, dependency structure analysis of spontaneous speech faces problems peculiar to speech, such as ambiguous clause boundaries, independent bunsetsus, crossed dependencies, self-corrections, and inversions. In this study, we address the problem of ambiguous clause boundaries. We treated the other problems in the same way as Shitaoka et al. (2004). For example, inversions are represented in the CSJ as dependency relationships going from right to left; in our experiments, we changed their direction to left to right. In this paper, therefore, all dependency relationships are assumed to go from left to right (Uchimoto et al., 2006).</Paragraph>
      <Paragraph position="1"> There are several types of clause boundaries, including sentence boundaries and the boundaries of quotations and inserted clauses. In the CSJ, clause boundaries were detected automatically from surface information (Maruyama et al., 2003), and sentence boundaries were then selected manually from among them (Takanashi et al., 2003). Boundaries of quotations and inserted clauses were also defined and detected manually. Dependency relationships between bunsetsus were annotated within sentences (Uchimoto et al., 2006). Our definition of clause boundaries follows the one used in the CSJ.</Paragraph>
      <Paragraph position="2"> Shitaoka et al. worked on automatic sentence boundary detection using SVM-based text chunking. However, they did not consider quotations and inserted clauses. In this paper, we focus on these two types of clauses in the context of ambiguous clause boundaries.</Paragraph>
    </Section>
    <Section position="2" start_page="324" end_page="324" type="sub_section">
      <SectionTitle>
Quotations
</SectionTitle>
      <Paragraph position="0"> In written text, quotations are often enclosed in angle brackets, but spontaneous speech carries no such brackets.</Paragraph>
      <Paragraph position="1"> ex) The clause glossed as &quot;I want to go there at any rate&quot; is a quotation. In the CSJ, quotations were manually annotated as follows. [...] ing utterances, which results in supplements, annotations, or paraphrases of main clauses. ex) The clause glossed as &quot;where I arrived at night&quot; is an inserted clause.</Paragraph>
      <Paragraph position="3"> Dependency relationships are closed within a quotation or an inserted clause. Therefore, except for the rightmost bunsetsu of each clause, no bunsetsu inside a clause has a dependency that crosses the boundary of that clause; that is, no dependency links a bunsetsu inside the clause to one outside it.</Paragraph>
      <Paragraph position="4"> However, automatically detected dependencies often cross clause boundaries erroneously because sentences including quotations or inserted clauses can have complicated clause structures. This is one of the reasons dependency structure analysis of spontaneous speech has more errors than that of written texts. We propose a method for improving dependency structure analysis based on automatic detection of quotations and inserted clauses.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="324" end_page="326" type="metho">
    <SectionTitle>
3 Dependency Structure Analysis and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="324" end_page="324" type="sub_section">
      <SectionTitle>
Detection of Quotations and Inserted
Clauses
</SectionTitle>
      <Paragraph position="0"> The outline of the proposed processes is shown in Figure 1. Here, we use &quot;clause&quot; to refer to either a quotation or an inserted clause.</Paragraph>
    </Section>
    <Section position="2" start_page="324" end_page="325" type="sub_section">
      <SectionTitle>
3.1 Dependency Structure Analysis
</SectionTitle>
      <Paragraph position="0"> In this research, we use the method proposed by Uchimoto et al. (2000) to analyze dependency structures. This method is a two-step procedure. The first step prepares a dependency matrix in which each element represents the likelihood that one bunsetsu depends on another. The second step finds an optimal set of dependencies for the entire sentence. The likelihood of a dependency is represented by a probability given by a dependency probability model. The model used in this study (Uchimoto et al., 2000) takes into account not only the relationship between two bunsetsus but also the relationship between the left bunsetsu and all the bunsetsus to its right.</Paragraph>
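      <Paragraph><![CDATA[ As an illustration only, the second step can be sketched as a search over a given dependency matrix. The sketch below is not the search procedure of Uchimoto et al. (2000); it is an exhaustive search, practical only for short sentences, and it assumes that every bunsetsu except the last has exactly one head to its right and that dependencies do not cross. The function name best_dependencies and the example matrix are hypothetical.

```python
from math import log


def best_dependencies(prob):
    """Return {dependent: head} maximizing the product of probabilities.

    prob[i][j] is the likelihood that bunsetsu i depends on bunsetsu j.
    Only entries with i < j are read (left-to-right dependencies).
    """
    n = len(prob)
    best = {"score": float("-inf"), "heads": {}}

    def search(i, heads, score):
        if i < 0:  # every bunsetsu except the last now has a head
            if score > best["score"]:
                best["score"] = score
                best["heads"] = dict(heads)
            return
        # Candidates that cannot cross an existing dependency:
        # the next bunsetsu, its head, that head's head, and so on.
        j = i + 1
        while j is not None:
            if prob[i][j] > 0.0:
                heads[i] = j
                search(i - 1, heads, score + log(prob[i][j]))
                del heads[i]
            j = heads.get(j)

    # Assign heads from right to left, starting with the second-to-last bunsetsu.
    search(n - 2, {}, 0.0)
    return best["heads"]


# Made-up matrix for a four-bunsetsu sentence (illustrative values only).
matrix = [
    [0.0, 0.05, 0.90, 0.05],
    [0.0, 0.00, 0.10, 0.90],
    [0.0, 0.00, 0.00, 1.00],
    [0.0, 0.00, 0.00, 0.00],
]
print(best_dependencies(matrix))
```

 In this made-up matrix, the individually most probable heads (bunsetsu 0 on 2, bunsetsu 1 on 3) would cross, so the search instead returns the best non-crossing assignment, in which bunsetsus 0 and 1 both depend on bunsetsu 2.]]></Paragraph>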
      <Paragraph position="1"> We implemented this model within a maximum entropy modeling framework. The features used in the model were basically attributes of the two target bunsetsus: attributes of each bunsetsu itself, such as character strings, parts of speech, and inflection types, together with attributes holding between the bunsetsus, such as the distance between them. Combinations of these features were also used. In this work, we added a feature indicating whether a boundary of a quotation or an inserted clause lies between the target bunsetsus. If it does, the probability that the left bunsetsu depends on the right bunsetsu is estimated to be low.</Paragraph>
      <Paragraph position="2"> In the CSJ, some bunsetsus are defined to have no modifiee. In our experiments, we defined their dependencies as follows.</Paragraph>
      <Paragraph position="3"> • The rightmost bunsetsu in a quotation or an inserted clause depends on the rightmost one in the sentence.</Paragraph>
      <Paragraph position="4"> • If a sentence boundary is included in a quotation or an inserted clause, the bunsetsu to the immediate left of the boundary depends on the rightmost bunsetsu in the quotation or the inserted clause.</Paragraph>
      <Paragraph position="5"> • Other bunsetsus that have no modifiee depend on the next one.</Paragraph>
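      <Paragraph><![CDATA[ The three rules above can be sketched as a small function. This is a minimal illustration under assumed data structures, not the authors' implementation: heads is a list in which None marks a bunsetsu with no modifiee, clauses holds inclusive (start, end) index spans of quotations and inserted clauses, and inner_boundaries holds the indices of bunsetsus immediately to the left of a sentence boundary that lies inside a clause. All of these names are hypothetical, and for nested clauses the sketch assumes the enclosing clause with the nearest end.

```python
def fill_missing_heads(heads, clauses, inner_boundaries):
    """Assign a head to every bunsetsu that has none, following the three rules."""
    last = len(heads) - 1
    clause_ends = {end for _, end in clauses}
    filled = list(heads)
    for i, head in enumerate(heads):
        if head is not None or i == last:
            continue  # already has a modifiee, or is the sentence-final bunsetsu
        if i in clause_ends:
            # Rule 1: the rightmost bunsetsu of a clause depends on
            # the rightmost bunsetsu of the sentence.
            filled[i] = last
            continue
        enclosing = [end for start, end in clauses if start <= i < end]
        if i in inner_boundaries and enclosing:
            # Rule 2: the bunsetsu left of a clause-internal sentence boundary
            # depends on the rightmost bunsetsu of the clause.
            filled[i] = min(enclosing)
        else:
            # Rule 3: any other bunsetsu depends on the next one.
            filled[i] = i + 1
    return filled


# Six bunsetsus; a clause spans indices 1 to 3, with a sentence
# boundary inside it right after bunsetsu 1 (made-up example).
example = fill_missing_heads(
    heads=[None, None, 3, None, 5, None],
    clauses=[(1, 3)],
    inner_boundaries={1},
)
print(example)
```

 The sentence-final bunsetsu keeps None because it has no candidate head to its right.]]></Paragraph>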
    </Section>
    <Section position="3" start_page="325" end_page="326" type="sub_section">
      <SectionTitle>
3.2 Detection of Quotations and Inserted
Clauses
</SectionTitle>
      <Paragraph position="0"> We regard clause boundary detection as a text chunking task. As the text chunker we used YamCha (Kudo and Matsumoto, 2001), which is based on Support Vector Machines (SVMs). Each chunk label consists of three tags, which correspond to sentence boundaries, boundaries of quotations, and boundaries of inserted clauses, respectively. The tag for sentence boundaries is either E (the rightmost bunsetsu in a sentence) or I (all others). The tags for the boundaries of quotations and inserted clauses are shown in Table 1. An example of the chunk labels assigned to each bunsetsu in a sentence is as follows. ex) The clause glossed as &quot;It is because of the budget&quot; is a quotation, and the clause glossed as &quot;which I think is because of the budget&quot; is an inserted clause. For example, a bunsetsu assigned the chunk label (I, B, B) is not at a sentence boundary but is at the beginning of both a quotation and an inserted clause.</Paragraph>
      <Paragraph position="2"> The three tags of each chunk label are estimated simultaneously, so the model takes into account the relationships between sentence boundaries, quotations, and inserted clauses. For instance, quotations and inserted clauses should not cross sentence boundaries, so a chunk label such as (E, I, O) is never estimated, because it would mean that a sentence boundary exists within a quotation.</Paragraph>
      <Paragraph position="3"> We used the following parameters for YamCha.</Paragraph>
      <Paragraph position="4">  The chunk label is estimated for each bunsetsu. The features used to estimate the chunk labels are as follows.</Paragraph>
      <Paragraph position="5"> (1) word information We used word information such as character strings, pronunciation, part of speech, inflection type, and inflection form. Specific expressions are often used at the ends of quotations and inserted clauses.</Paragraph>
      <Paragraph position="6"> [Figure caption fragment: ... left of beginning of quotations or inserted clauses] For instance, &quot;to-omou&quot; (think) and &quot;tte-iu&quot; (say) are used at the ends of quotations. Expressions such as &quot;desu-ga&quot; and &quot;keredo-mo&quot; are used at the ends of inserted clauses.</Paragraph>
      <Paragraph position="7"> (2) fillers and pauses: Fillers and pauses are often inserted just before or after quotations and inserted clauses. Pause duration is normalized within each talk using its mean and variance.</Paragraph>
      <Paragraph position="8"> (3) speaking rate: Speakers tend to speak fast inside inserted clauses. The speaking rate is also normalized within each talk.</Paragraph>
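      <Paragraph><![CDATA[ The per-talk normalization of pause duration and speaking rate can be illustrated as follows. The text states only that the values are normalized with the mean and variance of the talk; the sketch assumes the common z-score form (subtract the mean, divide by the standard deviation), and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def normalize_within_talk(values):
    """Z-score the values of one talk: zero mean, unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the talk
    if sigma == 0.0:
        # All values identical: nothing to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up pause durations (in seconds) observed in a single talk.
pauses = [0.2, 0.4, 0.6]
normalized = normalize_within_talk(pauses)
print(normalized)
```

 Normalizing each talk separately makes a pause comparable across speakers: a value above zero means the pause is longer than is typical for that talk.]]></Paragraph>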
      <Paragraph position="9"> Detecting the ends of clauses appears easy because, as mentioned above, specific expressions are frequently used there. However, determining the beginnings of clauses in a single process is difficult because all the features mentioned above carry only local information. Therefore, global information is also used to detect the beginnings of clauses. If the end of a clause is given, the bunsetsus to the left of the clause should satisfy the two conditions described in Figure 2. Our method uses these constraints as global information, incorporating them as additional features based on the dependency probabilities estimated for the bunsetsus to the left of the clause. Thus, our chunking method has two steps. First, clause boundaries are detected based on the three types of features itemized above. Second, the beginnings of clauses are determined after adding to the features the following probabilities obtained by automatic dependency structure analysis.</Paragraph>
      <Paragraph position="10"> (4) the probability that the bunsetsu to the left of the target depends on a bunsetsu inside the clause; (5) the probability that the bunsetsu to the immediate left of the target depends on a bunsetsu to the right of the clause. Figure 2 shows that the target bunsetsu is likely to be the beginning of the clause if probability (4) is low and probability (5) is high. For instance, the following example sentence has an inserted clause. In the first chunking step, the bunsetsu &quot;hanashi-nandesu-kedo&quot; (which is a story) is found to be the end of the inserted clause.</Paragraph>
      <Paragraph position="11"> ex) &quot;chichi-kara kii-ta hanashi-nandesu-kedo&quot; (which is a story that I heard from my father) is an inserted clause.</Paragraph>
      <Paragraph position="13"> The three bunsetsus &quot;atari-wa&quot;, &quot;kii-ta&quot;, and &quot;hanashi-nandesu-kedo&quot; are less likely to be the beginning of the inserted clause because, in all three cases, the bunsetsu to the immediate left depends on the target bunsetsu. On the other hand, the bunsetsu &quot;chichi-kara&quot; is the most likely to be the beginning, since the bunsetsu to its immediate left, &quot;atari-wa&quot;, depends on &quot;tanbo-datta-ndesu&quot;, the bunsetsu to the right of the inserted clause.</Paragraph>
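      <Paragraph><![CDATA[ The role of probabilities (4) and (5) in this example can be illustrated with a short sketch. In the method described above, these probabilities are added as features for the second chunking step; the sketch below does not reproduce that step. It only computes the two quantities from a dependency matrix and, as a simplifying assumption, picks the candidate for which (5) exceeds (4) by the largest margin. The function names and the probability values are hypothetical.

```python
def boundary_features(prob, target, end):
    """Return probabilities (4) and (5) for a candidate clause beginning.

    prob[i][j] is the probability that bunsetsu i depends on bunsetsu j,
    target is the candidate beginning, and end is the known clause end.
    """
    left = target - 1
    inside = sum(prob[left][j] for j in range(target, end + 1))      # (4)
    outside = sum(prob[left][j] for j in range(end + 1, len(prob)))  # (5)
    return inside, outside


def likeliest_beginning(prob, end):
    """Pick the candidate with low (4) and high (5); a heuristic stand-in."""
    def margin(target):
        inside, outside = boundary_features(prob, target, end)
        return outside - inside
    # Candidate 0 is skipped because it has no bunsetsu to its left.
    return max(range(1, end + 1), key=margin)


# Indices: 0 atari-wa, 1 chichi-kara, 2 kii-ta,
# 3 hanashi-nandesu-kedo (clause end), 4 tanbo-datta-ndesu.
# The probabilities are made up to mirror the example in the text.
prob = [
    [0.0, 0.10, 0.05, 0.05, 0.80],
    [0.0, 0.00, 0.90, 0.05, 0.05],
    [0.0, 0.00, 0.00, 0.95, 0.05],
    [0.0, 0.00, 0.00, 0.00, 1.00],
    [0.0, 0.00, 0.00, 0.00, 0.00],
]
print(likeliest_beginning(prob, end=3))
```

 With these values, index 1 ("chichi-kara") is selected, because the bunsetsu to its immediate left most probably depends on a bunsetsu to the right of the clause.]]></Paragraph>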
    </Section>
  </Section>
</Paper>