<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1032">
  <Title>Language Independent Morphological Analysis</Title>
  <Section position="3" start_page="0" end_page="233" type="metho">
    <SectionTitle>
2 Problems of Tokenization in
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="233" type="sub_section">
      <SectionTitle>
Segmented Languages
</SectionTitle>
      <Paragraph position="0"> In segmented languages such as English, tokenization is regarded as a relatively easy task and has received little attention. When a sentence has clear word boundaries, the analyzer simply asks the dictionary look-up component whether each string between delimiters exists in the dictionary. If a string exists, the dictionary look-up component returns the set of its possible parts of speech. Such a string is known as a graphic word, which is defined as &quot;a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks&quot; (Kučera and Francis, 1967).</Paragraph>
      <Paragraph position="1"> Conventionally, in segmented languages, an analyzer converts a stream of characters into graphic words (see the rows labeled &quot;Characters&quot; and &quot;Graphic Words&quot; in Figure 1) and searches the dictionary for these graphic words. In practice, however, we want a sequence of lexemes (see the row labeled &quot;Lexemes&quot; in Figure 1). Below we list two major problems of tokenization in segmented languages, with examples from English. We use the term segment to refer to a string delimited by white space.</Paragraph>
      <Paragraph position="2"> 1. Segmentation (one segment into several lexemes): Segments with a period at the end (e.g., &quot;Calif.&quot; and &quot;etc.&quot;) suffer from segmentation ambiguity. The period can denote an abbreviation, the end of a sentence, or both. The problem of sentence boundary ambiguity is not easy to solve (Palmer and Hearst, 1997). A segment with an apostrophe also has segmentation ambiguity.</Paragraph>
      <Paragraph position="3"> For example, &quot;McDonald's&quot; is ambiguous since this string can be segmented into either &quot;McDonald / Proper noun&quot; + &quot;'s / Possessive ending&quot; or &quot;McDonald's / Proper noun (company name)&quot;. In addition, &quot;boys'&quot; in the sentence &quot;... the boys' toys ...&quot; is ambiguous. The string can be segmented into either &quot;boys' / Plural possessive&quot; or &quot;boys / Plural noun&quot; + &quot;' / Punctuation (the end of a quotation)&quot; (Manning and Schütze, 1999). If a hyphenated segment such as &quot;data-base,&quot; &quot;F-16,&quot; or &quot;MS-DOS&quot; exists in the dictionary, it should be an independent lexeme. However, if a hyphenated segment such as &quot;55-years-old&quot; does not exist in the dictionary, its hyphens should be treated as independent tokens (Fox, 1992). Other punctuation marks such as &quot;/&quot; or &quot;_&quot; pose the same problem in &quot;OS/2&quot; or &quot;max_size&quot; (in programming languages).</Paragraph>
      <Paragraph position="4"> 2. Round-up (several segments into one lexeme): If a lexeme consisting of a sequence of segments, such as a proper noun (e.g., &quot;New York&quot;) or a phrasal verb (e.g., &quot;look at&quot; and &quot;get up&quot;), exists in the dictionary, it should be treated as a single lexeme.</Paragraph>
      <Paragraph position="5"> To handle such lexemes, we need to store multi-segment lexemes in the dictionary. Webster and Kit handle idioms and fixed expressions in this way (Webster and Kit, 1992). In the Penn Treebank (Santorini, 1990), a proper noun like &quot;New York&quot; is tagged as two individual proper nouns, &quot;New / NNP&quot; + &quot;York / NNP,&quot; disregarding the round-up of several segments into one lexeme.</Paragraph>
      <Paragraph position="6"> The definition of lexemes in a dictionary depends on the requirements of the application. Therefore, a simple pattern matcher is not enough to deal with language independent tokenization.</Paragraph>
      <Paragraph position="7"> Non-segmented languages do not have a delimiter between lexemes (Figure 2). Therefore, further segmentation and round-up have been well studied for these languages. In a non-segmented language, the analyzer considers all prefixes from each position in the sentence, checks whether each prefix matches a lexeme in the dictionary, stores the matching lexemes in a graph structure, and finds the most plausible sequence of lexemes in that graph. To find the sequence, Nagata proposed a probabilistic language model for non-segmented languages (Nagata, 1994; Nagata, 1999).</Paragraph>
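The graph search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the probabilistic model of Nagata: the toy dictionary `TOY_COSTS`, its cost values, and the function name `best_segmentation` are hypothetical, and a simple additive cost per lexeme stands in for the language model.

```python
# Hypothetical toy dictionary: lexeme -> cost (lower means more plausible).
TOY_COSTS = {"to": 1.0, "get": 2.0, "her": 2.0, "together": 3.0, "go": 2.0}

def best_segmentation(sentence, costs):
    """Return (total_cost, lexemes) for the cheapest path, or None."""
    n = len(sentence)
    # best[i] holds the cheapest (cost, lexemes) analysis of sentence[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for start in range(n):
        if best[start] is None:
            continue  # no analysis reaches this position
        base_cost, base_path = best[start]
        # Consider every prefix of the remaining text that is a lexeme.
        for end in range(start + 1, n + 1):
            word = sentence[start:end]
            if word not in costs:
                continue
            cand = (base_cost + costs[word], base_path + [word])
            if best[end] is None or best[end][0] > cand[0]:
                best[end] = cand
    return best[n]
```

With these toy costs, `best_segmentation("together", TOY_COSTS)` prefers the single lexeme "together" (cost 3.0) over the path "to / get / her" (cost 5.0). A real analyzer would derive the costs from a trained model instead of a hand-written table.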
      <Paragraph position="8"> The crucial difference between segmented and non-segmented languages in morphological analysis lies in the dictionary look-up. The standard technique for looking up lexemes in Japanese dictionaries is to use a trie structure (Fredkin, 1960; Knuth, 1998). A trie-structured dictionary efficiently gives all possible lexemes that start at a given position in a sentence (Morimoto and Aoe, 1993). We call this look-up method &quot;common prefix search&quot; (hereafter CPS). Figure 3 shows part of the trie for a Japanese lexeme dictionary. The results of CPS for &quot;[...]&quot; (I go to Ebina.) are &quot;[...]&quot; and &quot;[...].&quot; To get all possible lexemes in the sentence, the analyzer has to slide the start position for CPS to the right one character at a time.</Paragraph>
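A common prefix search over a character trie can be sketched as follows. This is an illustrative sketch only: the nested-dict trie, the function names, and the romanized entries used in the usage note are assumptions made for the example and are not taken from the paper's dictionary.

```python
def build_trie(lexemes):
    """Build a nested-dict trie; the key None marks the end of a lexeme."""
    root = {}
    for lexeme in lexemes:
        node = root
        for ch in lexeme:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def common_prefix_search(trie, text, start):
    """Return every dictionary lexeme that begins at text[start]."""
    found = []
    node = trie
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:
            break  # no lexeme continues with this character
        if None in node:
            found.append(text[start:end + 1])
    return found

def all_lexeme_candidates(trie, text):
    """Slide the start position one character at a time."""
    return {i: common_prefix_search(trie, text, i) for i in range(len(text))}
```

For a hypothetical dictionary `["e", "ebi", "ebina", "iku"]`, a search from position 0 of `"ebinaeiku"` walks the trie once and returns `["e", "ebi", "ebina"]`; every lexeme sharing the prefix is found in a single pass.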
    </Section>
  </Section>
  <Section position="4" start_page="233" end_page="233" type="metho">
    <SectionTitle>
3 A Naive Approach
</SectionTitle>
    <Paragraph position="0"> A simple method that directly applies the morphological analysis method for non-segmented languages can handle the problems of segmented languages. For instance, to analyze the sentence &quot;They've gone to school together.&quot; we first delete all white spaces and get &quot;They'vegonetoschooltogether.&quot; We then pass it to the analyzer for non-segmented languages. However, the analyzer may return a result such as &quot;They / 've / gone / to / school / to / get / her / .&quot;, which introduces a spurious ambiguity. Mills applied this method to tokenize a medieval manuscript in Cornish (Mills, 1998). We carried out experiments to examine the influence of delimiter deletion. We use the Penn Treebank (Santorini, 1990) part-of-speech tagged corpus (1.3M lexemes) to train an HMM and analyze sentences with the HMM-based morphological analyzer MOZ (Yamashita, 1999; Yamashita et al., 1999).</Paragraph>
    <Paragraph position="1"> The HMM is a bigram model trained on this corpus. The test data is the same as the training corpus.</Paragraph>
    <Paragraph position="2"> Table 1 shows the accuracy of segmentation and part-of-speech tagging. The accuracy is expressed in terms of recall and precision (Nagata, 1999). Let the number of lexemes in the tagged corpus be Std, the number of lexemes in the output of the analyzer be Sys, and the number of matched lexemes be M. Recall is defined as M/Std, and precision is defined as M/Sys.</Paragraph>
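The two measures can be computed as in the sketch below. The matching criterion is an assumption made for this illustration: a system lexeme counts as matched when it covers the same character span and carries the same tag as a corpus lexeme. The text above defines only the counts Std, Sys, and M; the function names and tags here are hypothetical.

```python
def spans(lexemes):
    """Map a sequence of (surface, tag) pairs to a set of
    (start, end, surface, tag) items over the unspaced text."""
    out, pos = set(), 0
    for surface, tag in lexemes:
        out.add((pos, pos + len(surface), surface, tag))
        pos += len(surface)
    return out

def recall_precision(gold, system):
    """Return (M / Std, M / Sys) for two analyses of the same sentence."""
    std, sys_out = spans(gold), spans(system)
    matched = len(std.intersection(sys_out))
    return matched / len(std), matched / len(sys_out)

gold = [("They", "PRP"), ("'ve", "VBP"), ("gone", "VBN")]
system = [("They", "PRP"), ("'ve", "VBP"), ("go", "VB"), ("ne", "XX")]
recall, precision = recall_precision(gold, system)
```

In this example M = 2, Std = 3, and Sys = 4, so recall is 2/3 and precision is 1/2.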
    <Paragraph position="3"> The following are the labels in Table 1 (the sentence formats and methods we use); in the example sentences, ␣ denotes a white space. LXS: We isolate all the lexemes in the sentences and apply the method for segmented languages to them. This situation is ideal, since the problems discussed in Section 2 do not arise; in other words, none of the sentences has segmentation ambiguity. We use the results as the baseline. Example sentence: &quot;It␣'s␣Mr.␣Lee␣'s␣pen␣.&quot; NSP: We remove all the spaces in the sentences and apply the method for non-segmented languages to them. Example sentence: &quot;It'sMr.Lee'spen.&quot; NOR: Sentences are in the original normal format.</Paragraph>
    <Paragraph position="4"> We apply the method for non-segmented languages to the sentences. Example sentence: &quot;It's␣Mr.␣Lee's␣pen.&quot; Because it has no segmentation ambiguity, &quot;LXS&quot; performs better than &quot;NSP&quot; and &quot;NOR.&quot; The following are typical examples of segmentation errors.</Paragraph>
    <Paragraph position="5"> The errors originate from conjunctive ambiguity and disjunctive ambiguity (Guo, 1997).</Paragraph>
    <Paragraph position="6"> conjunctive ambiguity: The analyzer recognized &quot;away,&quot; &quot;ahead,&quot; &quot;anymore,&quot; and &quot;workforce&quot; as &quot;a way,&quot; &quot;a head,&quot; &quot;any more,&quot; and &quot;work force,&quot; respectively. In the results of &quot;NSP,&quot; the number of errors of this type is 11,267.</Paragraph>
    <Paragraph position="7"> disjunctive ambiguity: The analyzer recognized &quot;a tour,&quot; &quot;a ton,&quot; and &quot;Alaskan or&quot; as &quot;at our,&quot; &quot;at on,&quot; and &quot;Alaska nor,&quot; respectively. In the results of &quot;NSP,&quot; the number of errors of this type is 233.</Paragraph>
    <Paragraph position="8"> Since only &quot;NSP&quot; has disjunctive ambiguity, &quot;NOR&quot; performs better than &quot;NSP.&quot; This shows that white spaces between segments help to reduce segmentation ambiguity.</Paragraph>
    <Paragraph position="9"> Though the difference in accuracy between these models looks slight, the effect on analysis efficiency is considerable. In the cases of &quot;NSP&quot; and &quot;NOR,&quot; the analyzer may look up the dictionary from any position in a given sentence; therefore the number of lexeme candidates increases, and the analysis time increases as well. Our experiments show that analysis of &quot;NSP&quot; or &quot;NOR&quot; takes about four times as long as that of &quot;LXS.&quot;</Paragraph>
  </Section>
  <Section position="5" start_page="233" end_page="236" type="metho">
    <SectionTitle>
4 Morpho-fragments: The Building
Blocks
</SectionTitle>
    <Paragraph position="0"> Although segmented languages seemingly have clear word boundaries, they have the problems of further segmentation and round-up introduced in Section 2. The naive approach in Section 3 does not work well. In this section, we propose an efficient and sophisticated method that solves these problems by introducing the concept of morpho-fragments. We also show that a uniform treatment of segmented and non-segmented languages is possible without inducing spurious ambiguity.</Paragraph>
    <Section position="1" start_page="233" end_page="235" type="sub_section">
      <SectionTitle>
4.1 Definition
</SectionTitle>
      <Paragraph position="0"> The morpho-fragments (MFs) of a language are defined as the smallest set of strings over the alphabet that can compose all lexemes in the dictionary. In other words, MFs are intermediate units between characters and lexemes (see Figure 1 and Figure 2).</Paragraph>
      <Paragraph position="3"> MFs are well-defined tokens that are specialized for language independent morphological analysis.</Paragraph>
      <Paragraph position="4"> For example, in English, all punctuation marks are MFs. Parts of a token separated by a punctuation mark, such as &quot;He,&quot; &quot;s,&quot; and the punctuation mark itself, &quot;',&quot; in &quot;He's&quot; are MFs. The tokens in a compound lexeme, such as &quot;New&quot; and &quot;York&quot; in &quot;New York,&quot; are also MFs. In non-segmented languages such as Chinese and Japanese, every single character is an MF. Figure 4 shows the decomposition of sentences into MFs (enclosed by &quot;[&quot; and &quot;]&quot;) for several languages. Delimiters (denoted &quot;␣&quot;) are treated as special MFs that can neither start nor end a lexeme.</Paragraph>
      <Paragraph position="5"> Once the set of MFs is determined, the dictionary is compiled into a trie structure in which the edges are labeled by MFs, as shown in Figure 5 for English and in Figure 3 for Japanese. A trie structure guarantees that a single consultation of the dictionary returns all and only the possible lexemes starting at a particular position in a sentence, resulting in an efficient dictionary look-up with no spurious ambiguity. When we analyze a sentence of a non-segmented language, to get all possible lexemes in the sentence, the analyzer slides the position one character at a time from the beginning to the end of the sentence and consults the trie-structured dictionary (Section 2). Note that every character is an MF in non-segmented languages. In the same way, to analyze a sentence of a segmented language, the analyzer slides the position one MF at a time and consults the trie-structured dictionary; all possible lexemes are then obtained. For example, in Figure 5, the results of CPS for &quot;'m in ...&quot; are &quot;'&quot; and &quot;'m,&quot; and the results for &quot;New York is ...&quot; are &quot;New&quot; and &quot;New York.&quot; Therefore, a morphological analyzer with CPS-based dictionary look-up for non-segmented languages can be used for the analysis of segmented languages. In other words, MFs make language independent morphological analysis possible. We can also say that MFs specify the positions at which dictionary look-up may start as well as end.</Paragraph>
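The MF-labeled trie and the MF-by-MF look-up described above can be sketched as follows. This is a minimal illustration, assuming a nested-dict trie, lexemes stored as tuples of MFs, and a plain space character standing in for the delimiter MF; the function names and the tiny dictionary are hypothetical.

```python
def build_mf_trie(entries):
    """entries: iterable of lexemes, each given as a tuple of MFs."""
    root = {}
    for mfs in entries:
        node = root
        for mf in mfs:
            node = node.setdefault(mf, {})
        node[None] = True  # None marks the end of a lexeme
    return root

def cps(trie, mfs, start):
    """Return all lexemes (as MF tuples) that begin at mfs[start]."""
    found, node = [], trie
    for end in range(start, len(mfs)):
        node = node.get(mfs[end])
        if node is None:
            break  # no lexeme continues with this MF
        if None in node:
            found.append(tuple(mfs[start:end + 1]))
    return found

def candidates(trie, mfs, delimiter=" "):
    """Slide one MF at a time; a delimiter never starts a lexeme."""
    return {i: cps(trie, mfs, i) for i in range(len(mfs)) if mfs[i] != delimiter}

# Hypothetical dictionary fragment mirroring the examples in the text.
TRIE = build_mf_trie([
    ("'",), ("'", "m"), ("New",), ("New", " ", "York"), ("in",), ("is",),
])
```

With this fragment, `cps(TRIE, ["New", " ", "York", " ", "is"], 0)` returns both `("New",)` and `("New", " ", "York")`, and `cps(TRIE, ["'", "m", " ", "in"], 0)` returns `("'",)` and `("'", "m")`, matching the look-up results quoted above. Because look-up can only start and end at MF boundaries, a split such as "to / get / her" inside "together" is never proposed.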
    </Section>
    <Section position="2" start_page="235" end_page="236" type="sub_section">
      <SectionTitle>
4.2 How to Recognize Morpho-fragments
</SectionTitle>
      <Paragraph position="0"> The problem is that it is not easy to identify the complete set of MFs for a segmented language. We do not attempt to find the minimum and complete set of MFs. Instead, we specify all the possible delimiters and punctuation marks appearing in the dictionary, which may separate MFs or be MFs themselves. By specifying the following three kinds of information for the language under consideration, we attain a pseudo-complete MF definition. This setting not only simplifies the identification of MFs but also provides a uniform framework for a language independent morphological analysis system.</Paragraph>
      <Paragraph position="1"> 1. The language type: Languages are classified into two groups: segmented and non-segmented languages.</Paragraph>
      <Paragraph position="2"> &quot;Language type&quot; decides whether every character in the language can be an MF. In a non-segmented language, every character can be an MF. In a segmented language, punctuation marks and sequences of characters other than delimiters can be MFs.</Paragraph>
      <Paragraph position="3"> 2. The set of delimiters acting as boundaries: These act as boundaries of MFs. However, they cannot be independent MFs (they can neither start nor end a lexeme). For example, white spaces are delimiters in segmented languages.</Paragraph>
      <Paragraph position="4"> 3. The set of punctuation marks and other symbols: Each of these acts as a boundary of MFs and is also an MF itself. Examples are the apostrophe in &quot;It's,&quot; the period in &quot;Mr.,&quot; and the hyphen in &quot;F-16.&quot; Using this information, the process of recognizing MFs becomes simple and easy. The process can be implemented by a finite state machine or a simple pattern matcher.</Paragraph>
      <Paragraph position="5"> The following is an example of the definition for English:
1. Language type: segmented language
2. Delimiters: white spaces, tabs, and carriage returns
3. Punctuation marks: [.] [,] [:] [;] ['] [&quot;] [-] ... [0] [1] [2] ...
As is clear from the definition, &quot;punctuation marks&quot; are not necessary for non-segmented languages, since every character is an MF. The following is an example for Japanese and Chinese.</Paragraph>
      <Paragraph position="6">
1. Language type: non-segmented language
2. Delimiters: not required
3. Punctuation marks: not required
Though Korean sentences are separated by spaces into phrasal segments, Korean is essentially a non-segmented language, since lexeme boundaries are not marked inside a phrasal segment. We call languages of this type incompletely-segmented languages. German is also categorized as this type. The following is an example for Korean.</Paragraph>
      <Paragraph position="7">
1. Language type: non-segmented language
2. Delimiters: spaces, tabs, and carriage returns
3. Punctuation marks: not required
In incompletely-segmented languages such as Korean, we have to consider two types of connection between lexemes: one is &quot;over a delimiter&quot; and the other is &quot;inside a segment&quot; (Hirano and Matsumoto, 1996). If we regard delimiters as lexemes, a trigram model makes it possible to treat both types.</Paragraph>
      <Paragraph position="8"> The definition gives the possible starting positions of MFs in sentences of the language, and the same morphological analysis system is usable for any language. We examined the effect of applying morpho-fragments to the analysis. The conditions of the experiment are almost the same as for &quot;NOR.&quot; The difference is that we use the morpho-fragment definition for English. The row labeled &quot;MF&quot; in Table 1 shows the results of the analysis. Using morpho-fragments decreases the analysis time drastically. The accuracy is also better than that of the naive approaches. A well-studied language such as English may have a hand-tuned tokenizer that is superior to ours. However, tuning a tokenizer by hand is not practical when implementing analyzers for many minor languages.</Paragraph>
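The three-part language definition in this section lends itself to a very small MF recognizer. The sketch below is illustrative only: the function name, the keyword names, and the `ENGLISH` and `JAPANESE` settings are assumptions. In particular, the English symbol set contains only the marks visible in the example definition plus the remaining digits, since the full list is elided in the text.

```python
import string

def split_into_mfs(text, segmented, delimiters="", symbols=""):
    """Split text into MFs using the three-part language definition.

    Delimiters and symbols each become a one-character MF; in a
    non-segmented language every character is an MF.
    """
    mfs, buffer = [], ""
    for ch in text:
        if ch in delimiters or ch in symbols or not segmented:
            if buffer:
                mfs.append(buffer)  # close the pending character run
                buffer = ""
            mfs.append(ch)
        else:
            buffer += ch
    if buffer:
        mfs.append(buffer)
    return mfs

# Hypothetical settings following the example definitions above.
ENGLISH = dict(segmented=True, delimiters=" \t\r\n",
               symbols=".,:;'\"-" + string.digits)
JAPANESE = dict(segmented=False)
```

For instance, `split_into_mfs("It's Mr. Lee's F-16.", **ENGLISH)` yields `["It", "'", "s", " ", "Mr", ".", " ", "Lee", "'", "s", " ", "F", "-", "1", "6", "."]`. The delimiter MFs stay in the output so that a later look-up stage can refuse to start or end a lexeme on them.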
    </Section>
    <Section position="3" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
4.3 Implementation
</SectionTitle>
      <Paragraph position="0"> We implemented a language independent morphological analysis system based on the concept of morpho-fragments (Yamashita, 1999). Given a tagged corpus, it is straightforward to implement a part-of-speech tagger, and we have implemented several such taggers. The system uses an HMM.</Paragraph>
      <Paragraph position="1"> The HMM is trained on a part-of-speech tagged corpus.</Paragraph>
      <Paragraph position="2"> We give an overview of the settings and tagging performance for several languages.</Paragraph>
      <Paragraph position="3"> English: An HMM is trained on the part-of-speech tagged portion of the Penn Treebank (Santorini, 1990) (1.3 million morphemes). We use a trigram model. The lexemes in the dictionary are taken from the corpus as well as from the entry words in the Oxford Advanced Learner's Dictionary (Mitton, 1992). The system achieves 97% precision and recall for training data and 95% precision and recall for test data.</Paragraph>
      <Paragraph position="4"> Japanese: An HMM is trained on a Japanese part-of-speech tagged corpus (Rea, 1998) (0.9 million morphemes). We use a trigram model. The lexemes in the dictionary are taken from the corpus as well as from the dictionary of ChaSen (Matsumoto et al., 1999), a freely available Japanese morphological analyzer. The system achieves 97% precision and recall for both training and test data.</Paragraph>
      <Paragraph position="5"> Chinese: An HMM is trained on the Chinese part-of-speech tagged corpus released by CKIP (Chinese Knowledge Information Processing Group, 1995) (2.1 million morphemes). We use a bigram model. The lexemes in the dictionary are taken only from the corpus. The system achieves 95% precision and recall for training data and 91% precision and recall for test data.</Paragraph>
    </Section>
  </Section>
</Paper>