<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1159">
  <Title>Dependency Structure Analysis and Sentence Boundary Detection in Spontaneous Japanese</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The &quot;Spontaneous Speech: Corpus and Processing Technology&quot; project has been sponsoring the construction of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000). The CSJ is the largest spontaneous speech corpus in the world. It is a collection of monologues and dialogues, the majority being monologues such as academic presentations. The CSJ includes transcriptions of the speeches as well as audio recordings. Approximately one tenth of the CSJ has been manually annotated with information about morphemes, sentence boundaries, dependency structures, discourse structures, and so on. The remaining nine tenths of the CSJ have been annotated semi-automatically. A future goal of the project is to extract sentence boundaries, dependency structures, and discourse structures from the remaining transcriptions. This paper focuses on methods for automatically detecting sentence boundaries and dependency structures in Japanese spoken text.</Paragraph>
    <Paragraph position="1"> In many cases, Japanese dependency structures are defined in terms of the dependency relationships between Japanese phrasal units called bunsetsus. To define dependency relationships between all bunsetsus in spontaneous speech, we need to define not only the dependency structures within every sentence but also the inter-sentential relationships, or discourse relationships, between the sentences as dependency relationships between bunsetsus. However, it is difficult to define and detect discourse relationships between sentences because of significant inconsistencies in human annotations of discourse structures, especially with regard to spontaneous speech. We also need to know intra-sentential dependency structures in order to use the results of dependency structure analysis for sentence compaction in automatic text summarization or for case frame acquisition. Because discourse relationships between sentences are difficult to define, it is usually enough, depending on the actual application, to define and detect the dependency structure of each sentence. Therefore, the CSJ was annotated with intra-sentential dependency structures in the same way as is usually done for a written text corpus. However, there is a big difference between a written text corpus and a spontaneous speech corpus: in spontaneous speech, especially when it is long, sentence boundaries are often ambiguous. In the CSJ, therefore, sentence boundaries were defined based on clauses whose boundaries were automatically detected by using surface information (Maruyama et al., 2003), and the sentence boundaries were then detected manually (Takanashi et al., 2003). Our definition of sentence boundaries follows the definition used in the CSJ.</Paragraph>
    <Paragraph position="2"> Almost all previous research on Japanese dependency structure analysis dealt with dependency structures in written text (Fujio and Matsumoto, 1998; Haruno et al., 1998; Uchimoto et al., 1999; Uchimoto et al., 2000; Kudo and Matsumoto, 2000). Although Matsubara and colleagues did investigate dependency structures in spontaneous speech (Matsubara et al., 2002), the target speech consisted of dialogues in which the utterances were short and sentence boundaries could be easily defined based on turn-taking data. In contrast, we investigated dependency structures in long spontaneous speeches in the CSJ. The biggest problem in dependency structure analysis of long spontaneous speeches is that sentence boundaries are ambiguous. Therefore, sentence boundaries should be detected before or during dependency structure analysis in order to obtain the dependency structure of each sentence.</Paragraph>
    <Paragraph position="3"> In this paper, we first describe the problems with dependency structure analysis of spontaneous speech. Because the biggest problem is ambiguous sentence boundaries, we focus on sentence boundary detection and propose two methods for improving the accuracy of detection.</Paragraph>
  </Section>
</Paper>