<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1061"> <Title>Satoshi Sekine ++</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Problems and Their Solutions </SectionTitle> <Paragraph position="0"> As we mentioned in Section 1, tagging the whole of the CSJ manually would be difficult. Therefore, we are taking a semi-automatic approach. This section describes the major problems in tagging a large spontaneous speech corpus with high precision in a semi-automatic way, and our solutions to those problems.</Paragraph> <Paragraph position="1"> One of the most important problems in morphological analysis is that posed by unknown words, which are words found in neither a dictionary nor a training corpus. Two statistical approaches have been applied to this problem. One is to find unknown words in corpora and add them to a dictionary (e.g., Mori and Nagao, 1996), and the other is to estimate a model that can identify unknown words correctly (e.g., Kashioka et al., 1997; Nagata, 1999). Uchimoto et al. used both approaches. They proposed a morphological analysis method based on a maximum entropy (ME) model (Uchimoto et al., 2001). Their method uses a model that estimates the probability that a string is a morpheme, and thus it has the potential to overcome the unknown word problem. Therefore, we use their method for morphological analysis of the CSJ. However, Uchimoto et al. reported that the accuracy of automatic word segmentation and POS tagging was 94 points in F-measure (Uchimoto et al., 2002). That is much lower than the accuracy obtained by manual tagging. Several problems led to this inaccuracy. In the following, we describe these problems and our solutions to them.</Paragraph> <Paragraph position="2"> * Fillers and disfluencies: Fillers and disfluencies are characteristic expressions often used in spoken language, but they are inserted into text at random, so detecting their boundaries is difficult.
In the CSJ, they are tagged manually. Therefore, we first delete the fillers and disfluencies, analyze the text, and then put them back in their original positions.</Paragraph> </Section> </Paper>