<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2019"> <Title>Morphological Analysis of The Spontaneous Speech Corpus</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In recent years, systems developed for analyzing written-language texts have become considerably more accurate. This accuracy is largely due to the availability of large tagged corpora and rapid progress in corpus-based natural language processing. However, systems developed for written language are not always as accurate when they are used to analyze spoken-language texts.</Paragraph> <Paragraph position="1"> This remaining inaccuracy is due to several differences between the two types of language. For example, the expressions used in written language are often quite different from those in spoken language, and sentence boundaries are frequently ambiguous in spoken language. The &quot;Spontaneous Speech: Corpus and Processing Technology&quot; project was launched in 1999 to overcome this problem. Spoken language includes both monologue and dialogue texts; the former (e.g., the text of a talk) was selected as the target of the project because it was considered appropriate to the current level of study on spoken language.</Paragraph> <Paragraph position="2"> Tagging the spontaneous speech corpus with morphological information such as word segmentation and parts-of-speech is one of the goals of the project. The tagged corpus helps us build a language model for speech recognition and helps linguists investigate the distribution of morphemes in spontaneous speech. Tagging the corpus with morphological information requires a morphological analysis system. Morphological analysis is one of the basic techniques used in Japanese sentence analysis.
A morpheme is a minimal grammatical unit, such as a word or a suffix, and morphological analysis is the process of segmenting a given sentence into a sequence of morphemes and assigning to each morpheme grammatical attributes such as part-of-speech (POS) and inflection type. One of the most important problems in morphological analysis is that posed by unknown words, which are words found in neither a dictionary nor a training corpus. Two statistical approaches have been applied to this problem. One is to extract unknown words from corpora and add them to a dictionary (e.g., (Mori and Nagao, 1996)), and the other is to estimate a model that can identify unknown words correctly (e.g., (Kashioka et al., 1997; Nagata, 1999)). Uchimoto et al. used both approaches: they proposed a morphological analysis method based on a maximum entropy (M.E.) model (Uchimoto et al., 2001). We used their method to tag a spontaneous speech corpus. Their method uses a model that can not only consult a dictionary but also identify unknown words by learning certain characteristics. To learn these characteristics, we focused on information such as whether or not a string is found in a dictionary and what types of characters are used in the string. The model estimates how likely a string is to be a morpheme.</Paragraph> <Paragraph position="3"> This model is independent of the domain of the corpus; in this paper we demonstrate this by applying our model to a spontaneous speech corpus, the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000). We also show that a dictionary developed for a corpus in one domain helps improve accuracy in analyzing a corpus in another domain.</Paragraph> </Section> </Paper>