<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1032"> <Title>Language Independent Morphological Analysis</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The first step in natural language processing is to identify words in a sentence. We call this process morphological analysis. Many languages exist in the world, and strategies for morphological analysis differ by type of language. Conventionally, morphological analyzers have been developed under a "one analyzer for each language" approach, which is language dependent. In contrast, we propose a framework for a language independent morphological analysis system that follows a "one analyzer for any language" approach. This approach enables rapid implementation of morphological analysis systems for new languages.</Paragraph> <Paragraph position="1"> We define two types of written languages: segmented languages and non-segmented languages. In non-segmented languages such as Chinese and Japanese, words are not separated by delimiters such as white spaces, so tokenization is an important and difficult task. In segmented languages such as English, words are seemingly separated by white spaces or punctuation marks, so tokenization is regarded as a relatively easy task and little attention has been paid to it. Therefore, each language dependent morphological analyzer has its own strategy for tokenization. We call a string defined in the dictionary a lexeme. From an algorithmic point of view, tokenization is regarded as the process of converting an input stream of characters into a stream of lexemes.</Paragraph> <Paragraph position="2"> We assume that morphological analysis consists of three processes: tokenization, dictionary look-up, and disambiguation. Dictionary look-up takes a string and returns a set of lexemes with part-of-speech information. This step implicitly includes lemmatization. 
Disambiguation selects the most plausible sequence of lexemes using a rule-based model or a hidden Markov model (HMM) (Manning and Schütze, 1999). Disambiguation is already language independent, since it does not process strings directly, and is therefore not discussed further. On the other hand, tokenization and dictionary look-up are language dependent and are explained in more detail in this paper.</Paragraph> <Paragraph position="3"> We consider problems concerning tokenization of segmented languages in Section 2. To resolve these problems, we first apply the method of non-segmented language processing to segmented languages (Section 3). However, this does not yield a satisfactory result. We then introduce the concept of morpho-fragments to generalize the method of non-segmented language processing (Section 4).</Paragraph> <Paragraph position="4"> The proposed framework resolves most problems in tokenization and makes efficient language independent part-of-speech tagging possible.</Paragraph> </Section></Paper>