<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1159"> <Title>Dependency Structure Analysis and Sentence Boundary Detection in Spontaneous Japanese</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Dependency Structure Analysis </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> and Sentence Boundary Detection in Spontaneous Japanese </SectionTitle> <Paragraph position="0"> First, let us briefly describe how dependency structures are represented in Japanese sentences. In Japanese, word order is relatively free, and subjects and objects are often omitted. In languages with these characteristics, the syntactic structure of a sentence is generally represented by dependency relationships between phrasal units, or bunsetsus. A bunsetsu is a minimal linguistic unit obtained by segmenting a sentence naturally in terms of semantics and phonetics, and each bunsetsu consists of one or more morphemes. For example, the sentence &quot;kare-wa yukkuri aruite-iru&quot; (He is walking slowly) can be divided into three bunsetsus: &quot;kare-wa&quot; (he), &quot;yukkuri&quot; (slowly), and &quot;aruite-iru&quot; (is walking). In this sentence, the first and second bunsetsus depend on the third.</Paragraph> <Paragraph position="1"> There are many differences between written text and spontaneous speech, and dependency structure analysis and sentence boundary detection face problems peculiar to spontaneous speech. 
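As an illustrative sketch (not the CSJ annotation format), the bunsetsu-based dependency representation described above can be modeled with a simple head-index array; the `modifiers` helper is a hypothetical name introduced here for illustration.

```python
# A minimal sketch of bunsetsu dependencies for the example sentence
# "kare-wa yukkuri aruite-iru": the first and second bunsetsus both
# depend on the third.
bunsetsus = ["kare-wa", "yukkuri", "aruite-iru"]

# heads[i] is the index of the bunsetsu that bunsetsu i depends on;
# the rightmost bunsetsu has no head, so its entry is None.
heads = [2, 2, None]

def modifiers(heads, j):
    """Return the indices of bunsetsus that depend on bunsetsu j."""
    return [i for i, h in enumerate(heads) if h == j]

print(modifiers(heads, 2))  # -> [0, 1]: "kare-wa" and "yukkuri" modify "aruite-iru"
```

Storing one head index per bunsetsu is enough here because, as the text notes, every bunsetsu except the rightmost depends on exactly one bunsetsu to its right.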
The following sections describe typical problems and our solutions.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Problems with Dependency Structure Analysis </SectionTitle> <Paragraph position="0"> Ambiguous sentence boundaries As described in Section 1, in this study we assumed that ambiguous sentence boundaries are the biggest problem in dependency structure analysis of spontaneous speech.</Paragraph> <Paragraph position="1"> This paper therefore focuses mainly on this problem and describes our solution to it.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Independent bunsetsus </SectionTitle> <Paragraph position="0"> In spontaneous speech, modifiees are sometimes missing because utterance planning changes in the middle of the speech. We also sometimes find bunsetsus whose dependency relationships are useless for understanding the utterance. These include fillers such as &quot;anoh&quot; (well) and &quot;sonoh&quot; (well), adverbs that behave like fillers such as &quot;mou&quot;, responses such as &quot;hai&quot; (yes) and &quot;un&quot; (yes), conjunctions such as &quot;de&quot; (and), and disfluencies. In these cases, the bunsetsus are assumed to be independent and consequently have no modifiees in the CSJ. For example, 14,988 bunsetsus in 188 talks in the CSJ are independent.</Paragraph> <Paragraph position="1"> We cannot ignore fillers, responses, and disfluencies because they frequently appear in spontaneous speech. However, we can easily detect them by using the method of Asahara and Matsumoto (Asahara and Matsumoto, 2003).</Paragraph> <Paragraph position="2"> In this paper, fillers, responses, and disfluencies were eliminated before dependency structure analysis and sentence boundary detection by using morphological information and labels. 
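As a hedged illustration of this preprocessing step, the sketch below strips spans tagged as fillers or disfluencies. The `(F word)` / `(D word)` surface syntax and the `strip_fillers` helper are assumptions made here for illustration, not the exact CSJ transcription format.

```python
import re

# Remove spans tagged with (F ...) for fillers/responses and (D ...) for
# disfluencies before further analysis. The inline "(F word)" format is an
# assumed, simplified rendering of the CSJ labels.
LABELED_SPAN = re.compile(r"\((?:F|D)\s+[^)]*\)\s*")

def strip_fillers(transcript: str) -> str:
    """Return the transcript with (F ...) and (D ...) spans removed."""
    return LABELED_SPAN.sub("", transcript).strip()

print(strip_fillers("(F anoh) kyou-wa (D wata) watashi-ga hanashi-masu"))
# -> "kyou-wa watashi-ga hanashi-masu"
```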
In the CSJ, fillers and responses are interjections, and almost all of them are marked with label (F). Disfluencies are marked with label (D).</Paragraph> <Paragraph position="3"> In this paper, every independent bunsetsu was assumed to depend on the next one.</Paragraph> <Paragraph position="4"> However, practically speaking, independent bunsetsus should be correctly detected as &quot;independent&quot;. This detection is one of our future goals.</Paragraph> <Paragraph position="5"> Crossed dependency In general, dependencies in Japanese written text do not cross. In contrast, dependencies in spontaneous speech sometimes do. For example, in the sentence &quot;kore-ga / watashi-wa / tadashii-to / omou&quot;, where &quot;/&quot; denotes a bunsetsu boundary, &quot;kore-ga&quot; (this) depends on &quot;tadashii-to&quot; (is right) and &quot;watashi-wa&quot; (I) depends on &quot;omou&quot; (think). Therefore, the two dependencies cross.</Paragraph> <Paragraph position="6"> However, crossed dependencies are rare in the CSJ: in 188 talks, we found only 689 of them among a total of 170,760 bunsetsus. In our experiments, therefore, we assumed that dependencies do not cross. Correctly detecting crossed dependencies is one of our future goals.</Paragraph> <Paragraph position="7"> Self-correction We often find self-corrections in spontaneous speech. For example, in the 188 talks in the CSJ there were 2,544 self-corrections. In the CSJ, self-corrections are represented as dependency relationships between bunsetsus, and label D is assigned to them.</Paragraph> <Paragraph position="8"> Coordination and appositives are also represented as dependency relationships between bunsetsus, and labels P and A are assigned to them, respectively. The definitions of coordination and appositives follow those of the Kyoto University text corpus (Kurohashi and Nagao, 1997). Both the labels and the dependencies should be detected for applications such as automatic text summarization. 
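Returning to the crossed-dependency discussion above: under a left-to-right head-index representation, the non-crossing (projectivity) assumption can be checked with a short sketch. The `crossing_pairs` helper is a hypothetical name introduced here; it is not part of the paper's system.

```python
# heads[i] is the head index of bunsetsu i (None for the last bunsetsu).
# Two left-to-right arcs (i, j) and (k, l) cross when one arc starts
# strictly inside the other and ends strictly outside it.
def crossing_pairs(heads):
    """Return pairs of dependency arcs that cross each other."""
    arcs = [(i, h) for i, h in enumerate(heads) if h is not None]
    crossings = []
    for a, (i, j) in enumerate(arcs):
        for k, l in arcs[a + 1:]:
            if (l > j > k > i) or (j > l > i > k):
                crossings.append(((i, j), (k, l)))
    return crossings

# "kore-ga / watashi-wa / tadashii-to / omou": bunsetsu 0 depends on 2,
# bunsetsu 1 on 3, bunsetsu 2 on 3 -- arcs (0, 2) and (1, 3) cross.
print(crossing_pairs([2, 3, 3, None]))  # -> [((0, 2), (1, 3))]
```

An analyzer that assumes projectivity, as the experiments here do, would never produce the crossing arcs this check reports.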
However, in this study, we detected only the dependencies between bunsetsus, and we did it in the same manner as in previous studies using written text.</Paragraph> <Paragraph position="9"> Inversion Inversion occurs more frequently in spontaneous speech than in written text. For example, in the 188 talks in the CSJ there were 172 inversions. In the CSJ, inversions are represented as dependency relationships going from right to left. In this study, we considered it most important to detect the dependencies themselves, so we manually changed the direction of inverted dependencies to be from left to right.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Problems with Sentence Boundary Detection </SectionTitle> <Paragraph position="0"> In spontaneous Japanese speech, sentence boundaries are ambiguous. In the CSJ, therefore, sentence boundaries were defined based on clauses whose boundaries were automatically detected using surface information (Maruyama et al., 2003), and they were then identified manually (Takanashi et al., 2003). Clause boundaries can be classified into the following three groups.</Paragraph> <Paragraph position="1"> Absolute boundaries, or sentence boundaries in their usual meaning. Such boundaries are often indicated by verbs in their basic form.</Paragraph> <Paragraph position="2"> Strong boundaries, or points that can be regarded as major breaks in utterances and that can be used for segmentation. Such boundaries are often indicated by clauses whose rightmost word is &quot;ga&quot; (but) or &quot;shi&quot; (and).</Paragraph> <Paragraph position="3"> Weak boundaries, or points that can be used for segmentation because they strongly depend on other clauses. Such boundaries are often indicated by clauses whose rightmost word is &quot;node&quot; (because) or &quot;tara&quot; (if). 
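The three clause-boundary groups above can be sketched as a simple lookup on a clause's rightmost word. The cue lists below are illustrative stand-ins, not the actual rule set of Maruyama et al. (2003).

```python
# Toy classification of a clause boundary by the clause's rightmost word,
# following the three groups described in the text. The ABSOLUTE_CUES set
# is an assumption for illustration (basic-form predicates).
ABSOLUTE_CUES = {"da", "desu"}   # assumed sentence-final basic forms
STRONG_CUES = {"ga", "shi"}      # "ga" (but), "shi" (and)
WEAK_CUES = {"node", "tara"}     # "node" (because), "tara" (if)

def boundary_type(rightmost_word):
    """Return the clause-boundary group suggested by the rightmost word."""
    if rightmost_word in ABSOLUTE_CUES:
        return "absolute"
    if rightmost_word in STRONG_CUES:
        return "strong"
    if rightmost_word in WEAK_CUES:
        return "weak"
    return "none"

print(boundary_type("node"))  # -> "weak"
```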
These three types of boundary differ in their degree of syntactic and semantic completeness and in how strongly they depend on subsequent clauses. Absolute boundaries and strong boundaries are usually defined as sentence boundaries. However, sentence boundaries in the CSJ differ from these two types of clause boundaries, and rule-based automatic sentence boundary detection on the 188 talks in the CSJ achieves an F-measure of only approximately 81, even though this is the accuracy for a closed test. Therefore, we need a more accurate sentence boundary detection system. Shitaoka et al. (Shitaoka et al., 2002) proposed a method for detecting sentence boundaries in spontaneous Japanese speech; their definition of sentence boundaries is approximately the same as that of the absolute boundaries described above. In this method, sentence boundary candidates are extracted by character-based pattern matching using pause duration. However, it is difficult to extract appropriate candidates in this way because pauses correlate only weakly with the strong and weak boundaries described above. It is also hard to detect noun-final clauses by character-based pattern matching.</Paragraph> <Paragraph position="4"> A machine learning method based on maximum entropy models has been proposed by Reynar and Ratnaparkhi (Reynar and Ratnaparkhi, 2000). However, the target of their study was written text. This method cannot readily be used for spontaneous speech because speech contains no punctuation marks such as periods. 
Other features of utterances should be used to detect sentence boundaries in spontaneous speech.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Dependency Structure Analysis </SectionTitle> <Paragraph position="0"> In statistical dependency structure analysis of Japanese speech, the likelihood of a dependency is represented by a probability estimated by a dependency probability model.</Paragraph> <Paragraph position="1"> Given sentence S, let us assume that it is uniquely divided into n bunsetsus, b_1, b_2, ..., b_n,</Paragraph> <Paragraph position="3"> and that its dependency structure is represented as an ordered set of dependencies, D. Dependency structure analysis finds the dependencies D_best = argmax_D P(D|S) that maximize the probability P(D|S) given sentence S. The conventional statistical model (Collins, 1996; Fujio and Matsumoto, 1998; Haruno et al., 1998; Uchimoto et al., 1999) uses only the relationship between two bunsetsus to estimate the probability of a dependency, whereas the model in this study (Uchimoto et al., 2000) takes into account not only the relationship between two bunsetsus but also the relationships between the left bunsetsu and all the bunsetsus to its right. This model thus uses more information than the conventional one.</Paragraph> <Paragraph position="4"> We implemented this model within a maximum entropy modeling framework. The features used in the model were basically attributes of bunsetsus, such as character strings, parts of speech, and types of inflections, as well as features that describe the relationships between bunsetsus, such as the distance between them. Combinations of these features were also used. To find D_best, we analyzed the sentences backwards (from right to left). In backward analysis, we can limit the search space effectively by using a beam search, and sentences can also be analyzed deterministically without great loss of accuracy (Uchimoto et al., 1999). We therefore analyzed each sentence backwards and deterministically. 
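A simplified sketch of this backward, deterministic analysis (effectively a beam of width 1) is given below. Here `dep_prob` stands in for the maximum entropy model's dependency probability estimate; the toy stub shown is purely illustrative and not the paper's feature-based model.

```python
# Backward (right-to-left) deterministic dependency analysis: for each
# bunsetsu, greedily pick the head to its right with the highest estimated
# dependency probability. The rightmost bunsetsu gets no head.
def analyze_backward(n, dep_prob):
    heads = [None] * n
    for i in range(n - 2, -1, -1):
        candidates = range(i + 1, n)
        heads[i] = max(candidates, key=lambda j: dep_prob(i, j, heads))
    return heads

# Toy probability stub: prefer nearby heads, with a small bonus for the
# sentence-final bunsetsu. The real model uses rich bunsetsu features.
def toy_prob(i, j, heads):
    return 1.0 / (j - i) + (0.1 if j == len(heads) - 1 else 0.0)

print(analyze_backward(4, toy_prob))  # -> [1, 2, 3, None]
```

Analyzing right to left means each bunsetsu's candidate heads have already been processed, which is what makes the deterministic (and beam-limited) search effective.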
</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Sentence Boundary Detection </SectionTitle> <Paragraph position="0"> The framework of statistical machine translation is formulated as follows. Given input sequence X, the goal of statistical machine translation is to find the best output sequence, Y, that maximizes the conditional probability P(Y|X): Y_best = argmax_Y P(Y|X) = argmax_Y P(X|Y)P(Y).</Paragraph> <Paragraph position="2"> The problem of sentence boundary detection can be reduced to the problem of translating a sequence of words, X, that does not include periods but instead includes pauses, into a sequence of words, Y, that includes periods. Specifically, in places where a pause might be converted into a period, which means P(X|Y) = 1, the decision whether a period should be inserted is made by comparing the language model scores P(Y_1) and P(Y_2), where the only difference between Y_1 and Y_2</Paragraph> <Paragraph position="4"> is that one includes a period in a particular place and the other one does not.</Paragraph> <Paragraph position="5"> We used a model that uses pause duration and surface expressions around pauses as translation model P(X|Y). As surface expressions around pauses, we used expressions around the absolute and strong boundaries described in Section 2.2. A pause preceding or following these surface expressions can be converted into a period. Specifically, pauses following the expressions &quot;to&quot;, &quot;nai&quot;, and &quot;ta&quot;, and pauses preceding the expression &quot;de&quot;, can be converted into a period when they are longer than average. A pause preceding or following other surface expressions can be converted into a period even if its duration is short. 
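The decision rule above can be sketched as follows. Since P(X|Y) = 1 for both variants at a convertible pause, the choice reduces to comparing language model scores; `lm_logprob` is a stand-in for the word 3-gram model, and the toy LM below is purely illustrative.

```python
# At a pause that may be converted into a period, insert the period iff the
# language model prefers the word sequence that contains it.
def insert_period(left_words, right_words, lm_logprob):
    with_period = left_words + ["."] + right_words
    without_period = left_words + right_words
    return lm_logprob(with_period) > lm_logprob(without_period)

# Toy LM: rewards a period right after "ta" (an assumed sentence-final cue);
# the real system uses a word 3-gram model trained on CSJ transcriptions.
def toy_lm(words):
    score = -len(words) * 0.1
    for prev, cur in zip(words, words[1:]):
        if prev == "ta" and cur == ".":
            score += 1.0
    return score

print(insert_period(["hanashi", "ta"], ["sore", "de"], toy_lm))  # -> True
```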
To calculate P(Y), we used a word 3-gram model trained on transcriptions in the CSJ.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Sentence Boundary Detection Using Dependency Information (Method 1) </SectionTitle> <Paragraph position="0"> There are three assumptions that should be satisfied by the rightmost bunsetsu in every sentence. In the following, this bunsetsu is referred to as the target bunsetsu.</Paragraph> <Paragraph position="1"> (1) One or more bunsetsus depend on the target bunsetsu (Figure 2). Since every bunsetsu depends on another bunsetsu in the same sentence, the second rightmost bunsetsu always depends on the rightmost bunsetsu in any sentence, except in inverted sentences. For inverted sentences in this study, we changed the direction of all dependencies to be from left to right.</Paragraph> <Paragraph position="2"> Figure 2: One or more bunsetsus depend on the target bunsetsu. (&quot;|&quot; represents a sentence boundary.)</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Assumptions (2) and (3) </SectionTitle> <Paragraph position="0"> (2) There is no bunsetsu that depends on a bunsetsu beyond the target bunsetsu (Figure 3). Each bunsetsu in a sentence depends on a bunsetsu in the same sentence.</Paragraph> <Paragraph position="1"> (3) The dependency probability of the target bunsetsu is low (Figure 4). The target bunsetsu does not depend on any bunsetsu.</Paragraph> <Paragraph position="2"> Figures 3 and 4: No bunsetsu depends in this way. Bunsetsus that satisfy assumptions (1)-(3) are extracted as rightmost bunsetsu candidates in a sentence. Then, for every point following the extracted bunsetsus and for every pause preceding or following the expressions described in Section 3.2, a decision is made as to whether a period should be inserted. 
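A hedged sketch of the candidate extraction under assumptions (1)-(3) follows. The function name, data layout, and default values for the thresholds p and q are assumptions made here for illustration; the actual thresholds are tuned on held-out data.

```python
# heads[i] is the analyzed head of bunsetsu i (None for the last bunsetsu),
# dep_prob[i] is the probability of that dependency, and p, q are the
# thresholds for assumptions (2) and (3).
def boundary_candidates(heads, dep_prob, p=0.1, q=0.2):
    candidates = []
    for t in range(len(heads)):
        # (1) one or more bunsetsus depend on t
        has_modifier = any(h == t for h in heads)
        # (2) no reliable dependency jumps over t from the left
        jumps = any(h is not None and h > t and t >= i and dep_prob[i] >= p
                    for i, h in enumerate(heads))
        # (3) t's own dependency probability is low (or t has no head)
        weak_head = heads[t] is None or q >= dep_prob[t]
        if has_modifier and not jumps and weak_head:
            candidates.append(t)
    return candidates

# Two toy "sentences": bunsetsus 0,1 -> 2 and bunsetsu 3 -> 4.
print(boundary_candidates([2, 2, None, 4, None],
                          [0.9, 0.8, None, 0.7, None]))  # -> [2, 4]
```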
In assumption (2), bunsetsus that depend on a bunsetsu more than 50 bunsetsus away are ignored, because no such long-distance dependencies were found in the 188 talks of the CSJ used in our experiments. Bunsetsus whose dependency probability is very low are also ignored, because their dependencies are highly likely to be incorrect. Let this threshold probability be p, and let the threshold probability in assumption (3) be q. The optimal parameters p and q are determined by using held-out data. In this approach, about one third of all bunsetsu boundaries are extracted as sentence boundary candidates. An output sequence is then selected from all possible conversion patterns generated using the two words to the left and the two words to the right of each sentence boundary candidate. Because a large number of conversion patterns can be generated, we used a beam search with a width of 10 for this operation.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Sentence Boundary Detection Based on Machine Learning (Method 2) </SectionTitle> <Paragraph position="0"> We used Support Vector Machines (SVMs) as the machine learning model and approached sentence boundary detection as a text chunking task. We used YamCha (Kudo and Matsumoto, 2001) as the text chunker; it is based on SVMs and uses polynomial kernel functions. To determine the appropriate chunk label for a target word, YamCha uses the two words to the right and the two words to the left of the target word as statistical features, and it uses the chunk labels dynamically assigned to the two preceding or the two following words as dynamic features, depending on the analysis direction. To solve the multi-class problem, we used pairwise classification. This method generates N x (N - 1)/2 classifiers for all pairs of the N classes and makes a final decision by their weighted voting.</Paragraph> <Paragraph position="1"> The features used in our experiments were the following: 1. morphological information on the three words to the right and the three words to the left of the target word, such as character strings, pronunciation, part of speech, type of inflection, and inflection form; 2. pause duration normalized in terms of Mahalanobis distance; 3. clause boundaries; 4. the dependency probability of the target bunsetsu; 5. the number of bunsetsus that depend on the target bunsetsu and their dependency probabilities. We used the IOE labeling scheme for proper chunking, and the following parameters for YamCha.</Paragraph> </Section> </Section> </Paper>