<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1108">
  <Title>Use of Mutual Information Based Character Clusters in Dictionary-less Morphological Analysis of Japanese Hideki Kashioka, Yasuhiro Kawata, Yumiko Kinjo,</Title>
  <Section position="3" start_page="0" end_page="658" type="metho">
    <SectionTitle>
2 Use of Information on Characters
</SectionTitle>
    <Paragraph position="0"> Many languages in the world do not insert a space between words in the written text.</Paragraph>
    <Paragraph position="1"> Japanese is one of them. Moreover, the number of characters involved in Japanese is very large. 1 Unlike English, which is basically written with a 26-character alphabet, the domain of possible characters appearing in an average Japanese text is a set involving tens of thousands of characters.</Paragraph>
    <Section position="1" start_page="658" end_page="658" type="sub_section">
      <SectionTitle>
2.1 Character Sort
</SectionTitle>
      <Paragraph position="0"> There are three clearly identifiable character sorts in Japanese: 2 Kanji are Chinese characters adopted for historical reasons and deeply rooted in Japanese. Each character carries a semantic sense.</Paragraph>
      <Paragraph position="1"> Hiragana are basic Japanese phonograms representing syllables. About fifty of them constitute the syllabary.</Paragraph>
      <Paragraph position="2"> Katakana are characters corresponding to hiragana, but their use is restricted mainly to foreign loan words.</Paragraph>
      <Paragraph position="3"> Each character sort has a limited number of elements, except for Kanji whose exhaustive list is hard to obtain.</Paragraph>
      <Paragraph position="4"> Identifying each character sort in a sentence would help in predicting word boundaries and subsequently in assigning parts-of-speech. For example, a word boundary is highly likely to fall between characters of different sorts. Accordingly, in formalizing such heuristics, character sorts must be assumed.</Paragraph>
    </Section>
    <Section position="2" start_page="658" end_page="658" type="sub_section">
      <SectionTitle>
2.2 Character Cluster
</SectionTitle>
      <Paragraph position="0"> Apart from the distinctions mentioned above, are there such things as natural classes with respect to the distribution of characters in a certain set of sentences (so that the classes are empirically learnable)? If there are, how can we obtain such knowledge? It seems that only a certain group of characters tends to occur in a certain restricted context. For example, in Japanese, there are many numerical classifier expressions attached immediately after numericals. 3 If such is the case, these classifiers can be clustered in terms of their distributions with respect to a presumably natural class called numericals. Supposing one of a certain group of characters often occurs as a neighbor to one of another group of characters, and supposing characters are clustered and organized in a hierarchical fashion, then it is possible to refer to such groupings by pointing out a certain node in the structure. Having a way of organizing classes of characters is clearly an advantage in describing facts in Japanese.</Paragraph>
      <Paragraph position="1"> 2 Other sorts found in ordinary text are Arabic numerals, punctuation marks, other symbols, etc.</Paragraph>
      <Paragraph position="2"> 3 For example, "3冊 (san-satsu)" for bound objects ("3 copies of"), "2枚 (ni-mai)" for flat objects ("2 pieces/sheets of").</Paragraph>
      <Paragraph position="3"> The next section presents such a method.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="658" end_page="659" type="metho">
    <SectionTitle>
3 Mutual Information-Based
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="658" end_page="659" type="sub_section">
      <SectionTitle>
Character Clustering
</SectionTitle>
      <Paragraph position="0"> One idea is to sort words out in terms of their neighboring contexts. Accordingly, research has been carried out on n-gram models of word clustering (Brown et al. 1992) to obtain hierarchical clusters of words by classifying words so as to minimize the reduction of MI.</Paragraph>
      <Paragraph position="1"> This idea generalizes to the clustering of any kind of list of items into hierarchical classes. 4 We have therefore adopted this approach not only to compute word classes but also to compute character clusters for Japanese.</Paragraph>
      <Paragraph position="2"> The basic algorithm for clustering items based on the amount of MI is as follows (a small illustrative sketch in code is given after the list): 5  1) Assign a singleton class to every item in the set.</Paragraph>
      <Paragraph position="3"> 2) Choose two appropriate classes to create a new class which subsumes them.</Paragraph>
      <Paragraph position="4"> 3) Repeat 2) until the newly created classes include all of the items in the set.</Paragraph>
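The following is a minimal, illustrative sketch of this greedy MI-based clustering, not the authors' implementation. The function names, the brute-force search over all class pairs, and the toy input string are assumptions made for the example; the paper's actual procedure limits the working set of items for efficiency (see footnote 5).

```python
# Minimal sketch of MI-based hierarchical clustering (after Brown et al. 1992).
# All names and the toy corpus are illustrative assumptions, not the paper's code.
from collections import Counter
from itertools import combinations
from math import log


def average_mi(bigrams, total):
    """Average mutual information between adjacent class occurrences."""
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n
    mi = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total
        mi += p_ab * log(p_ab / ((left[a] / total) * (right[b] / total)))
    return mi


def merged(bigrams, a, b, new):
    """Bigram counts after classes a and b are replaced by the class `new`."""
    out = Counter()
    for (x, y), n in bigrams.items():
        out[(new if x in (a, b) else x, new if y in (a, b) else y)] += n
    return out


def cluster(text):
    """Return a bit-string address for every distinct character in `text`."""
    bigrams = Counter(zip(text, text[1:]))
    total = sum(bigrams.values())
    # 1) assign a singleton class to every item
    tree = {c: c for c in set(text)}          # class id -> subtree (leaf = char)
    while len(tree) > 1:
        base = average_mi(bigrams, total)
        # 2) choose the two classes whose merge minimizes the reduction of MI
        a, b = min(
            combinations(tree, 2),
            key=lambda p: base - average_mi(merged(bigrams, p[0], p[1], p), total),
        )
        new_id = (a, b)
        bigrams = merged(bigrams, a, b, new_id)
        tree[new_id] = (tree.pop(a), tree.pop(b))
        # 3) repeat until a single class subsumes all items
    bits = {}

    def assign(node, prefix=""):
        if isinstance(node, tuple):           # internal node: recurse with 0/1
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                 # leaf: a single character
            bits[node] = prefix

    assign(next(iter(tree.values())))
    return bits


if __name__ == "__main__":
    # Toy data only; the paper used the ATR travel conversation corpus.
    print(cluster("すもももももももものうち"))
```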
      <Paragraph position="5"> With this method, we conducted an experimental clustering over the ATR travel conversation corpus. 6 As a result, all of the characters in the corpus were hierarchically clustered according to their distributions.</Paragraph>
      <Paragraph position="6"> Example: A partial character clustering</Paragraph>
      <Paragraph position="8"> Each node represents a subset of all of the different characters found in the training data.</Paragraph>
      <Paragraph position="9"> We represent tree-structured clusters with bit strings, so that we may specify any node in the structure by using a bit substring.</Paragraph>
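As an illustration of such bit-string addressing (the characters and bit values below are invented for the example, not taken from the trained hierarchy): a node of the cluster tree is a bit prefix, so asking whether a character is dominated by that node amounts to a prefix test on the character's full address.

```python
# Hypothetical bit-string addresses; dominance by a node is a prefix test.
addresses = {"円": "0110", "個": "0111", "は": "10"}   # invented example values

def dominated_by(char, node_bits):
    """True if `char` falls under the cluster node addressed by `node_bits`."""
    return addresses[char].startswith(node_bits)

assert dominated_by("円", "011") and not dominated_by("は", "011")
```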
      <Paragraph position="10"> 4 See Brown et al. (1992) for details.</Paragraph>
      <Paragraph position="11"> 5 This algorithm, however, is too costly because the amount of computation increases exponentially with the number of items. For practical processing, the basic procedure is carried out over a certain limited number of items, and a new item is supplied to the processing set each time a clustering step is done.</Paragraph>
      <Paragraph position="12"> 6 80,000 sentences, with a total of 1,585,009 characters and 1,831 different characters.</Paragraph>
      <Paragraph position="13"> Numerous significant clusters are found among them. 7 They are all natural classes computed from the events in the training set.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="659" end_page="660" type="metho">
    <SectionTitle>
4 Decision-Tree Morphological Analysis
</SectionTitle>
    <Paragraph position="0"> The Decision-Tree model consists of a set of questions structured into a dendrogram with a probability distribution associated with each leaf of the tree. In general, a decision-tree is a complex of n-ary branching trees in which questions are associated with each parent node, and a choice or class is associated with each child node. 8 We represent answers to questions as bits.</Paragraph>
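A minimal sketch of such a binary decision-tree as a data structure (the names and the leaf payload are assumptions for illustration): each internal node holds a yes/no question, each leaf holds a probability distribution, and classifying an event means following the answers down to a leaf.

```python
# Illustrative binary decision-tree: questions at internal nodes,
# probability distributions at the leaves. Names are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    distribution: Dict[str, float]        # e.g. P(tag) or P(Word+/Word-)


@dataclass
class Node:
    question: Callable[[dict], bool]      # yes/no question about an event
    yes: "Tree"
    no: "Tree"


Tree = Union[Leaf, Node]


def classify(tree: Tree, event: dict) -> Dict[str, float]:
    """Follow the answers to the questions down to a leaf distribution."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(event) else tree.no
    return tree.distribution
```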
    <Paragraph position="1"> Among other advantages of using decision-trees, it is important to note that they are able to assign integrated costs for classification by all types of questions at different feature levels, provided each feature has a different cost.</Paragraph>
    <Section position="1" start_page="659" end_page="659" type="sub_section">
      <SectionTitle>
4.1 Model
</SectionTitle>
      <Paragraph position="0"> Let us assume that an input sentence C = c1 c2 ... cn denotes a sequence of n characters that constitute the words W = w1 w2 ... wm, where each word wi is assigned a tag ti (T = t1 t2 ... tm).</Paragraph>
      <Paragraph position="1"> The morphological analysis task can be formally defined as finding the set of word segmentations and part-of-speech assignments that maximizes the joint probability of the word sequence and tag sequence given the character sequence, P(W,T|C).</Paragraph>
      <Paragraph position="2"> The joint probability P(W,T|C) is calculated by the following formulae:</Paragraph>
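A plausible reconstruction of the formula, assuming the usual chain-rule decomposition into the Word Model (footnote 9) and the Tagging Model (footnote 10); the exact conditioning in the original may differ, but it is consistent with the approximation of P(tk) given below.

```latex
% Reconstructed decomposition (assumption)
P(W, T \mid C) \;=\; \prod_{k=1}^{m}
    P(w_k \mid w_1,\ldots,w_{k-1},\, t_1,\ldots,t_{k-1},\, C)\;
    P(t_k \mid w_1,\ldots,w_k,\, t_1,\ldots,t_{k-1},\, C)
```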
      <Paragraph position="4"> The Word Model decision-tree is used as the word tokenizer. While finding word boundaries, we use two different labels: Word+ and Word-. In the training data, we label a complete word string Word+, and every proper substring of a relevant word Word-, since these substrings are not in fact words in the current context. 11 The probability of a word is estimated from the distributions associated with the leaves of the word decision-tree.</Paragraph>
      <Paragraph position="5"> 7 For example, katakana, numerical classifiers, numerics, postpositional case particles, and prefixes of demonstrative pronouns.</Paragraph>
      <Paragraph position="6"> 8 The work described here employs only binary decision-trees. Multiple alternative questions are represented as more than two yes/no questions. The main reason for this is computational efficiency: allowing questions to have more answers complicates the decision-tree growth algorithm.</Paragraph>
      <Paragraph position="7"> 9 We call this the "Word Model".</Paragraph>
      <Paragraph position="8"> 10 We call this the "Tagging Model".</Paragraph>
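As an illustration of the Word+/Word- labeling of training events described above (a sketch following the "mo-shi-mo-shi" example in footnote 11; the event encoding and names are assumptions):

```python
# Generate Word+/Word- training events for one segmented sentence.
# The event encoding is an illustrative assumption, not the paper's format.
def word_boundary_events(words):
    """For each word, label the full string Word+ and every proper prefix Word-."""
    events = []
    for word in words:
        for i in range(1, len(word)):
            events.append((word[:i], "Word-"))    # substring: not a word here
        events.append((word, "Word+"))            # the complete word string
    return events

# e.g. for the greeting "moshimoshi" treated as one word:
print(word_boundary_events(["もしもし"]))
# [('も', 'Word-'), ('もし', 'Word-'), ('もしも', 'Word-'), ('もしもし', 'Word+')]
```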
      <Paragraph position="9"> We use the Tagging Model decision-tree as our part-of-speech tagger. For an input sentence C, let us consider the character sequence from c1 to cp-1 (assigned the words w1 w2 ... wk-1) and the following character sequence from cp to cp+l to be the word wk; also, the word wk is assumed to be assigned the tag tk.</Paragraph>
      <Paragraph position="10"> We approximate the probability that the word wk is assigned the tag tk as follows: P(tk) = P(tk | w1, ..., wk, t1, ..., tk-1, C). This probability is estimated from the distributions associated with the leaves of the part-of-speech tag decision-tree.</Paragraph>
    </Section>
    <Section position="2" start_page="659" end_page="660" type="sub_section">
      <SectionTitle>
4.2 Growing Decision-Trees
</SectionTitle>
      <Paragraph position="0"> Growing a decision-tree requires two steps: selecting a question to ask at each node, and determining the probability distribution for each leaf from the distribution of events in the training set. At each node, we choose, from among all possible questions, the question that maximizes the reduction in entropy.</Paragraph>
      <Paragraph position="1"> The two steps are repeated until the following conditions are no longer satisfied: * The number of events at the node exceeds a constant number.</Paragraph>
      <Paragraph position="2"> * The reduction in entropy is more than the threshold.</Paragraph>
      <Paragraph position="3"> Consequently, the list of questions is optimally structured in such a way that, when data flows through the decision-tree, the most efficient question is asked at each decision point.</Paragraph>
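A minimal sketch of this question-selection step (the entropy computation is standard; the candidate-question interface and all names are assumptions):

```python
# Choose, among candidate yes/no questions, the one that maximizes the
# reduction in entropy of the labels at a node. Illustrative sketch only.
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_question(events, labels, questions):
    """events: list of feature dicts; questions: list of (name, predicate)."""
    node_h = entropy(labels)
    best, best_gain = None, 0.0
    for name, ask in questions:
        yes = [lab for e, lab in zip(events, labels) if ask(e)]
        no = [lab for e, lab in zip(events, labels) if not ask(e)]
        if not yes or not no:
            continue
        split_h = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        gain = node_h - split_h            # reduction in entropy for this split
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```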
      <Paragraph position="4"> Given a set of training sentences with word boundaries in which each word is assigned a part-of-speech tag, we have a) the necessary structured character clusters, and b) the necessary structured word clusters; 12 both of them are based on the n-gram language model.</Paragraph>
      <Paragraph position="5"> 11 For instance, for the word "mo-shi-mo-shi" (hello), "mo-shi-mo-shi" is labeled Word+, and "mo-shi-mo", "mo-shi", "mo" are all labeled Word-. Note that "mo-shi" or "mo-shi-mo" may be real words in other contexts, e.g., "mo-shi/wa-ta-shi/ga ... (If I do ... )". 12 Here, a word token is based only on a word string, not on a word string tagged with a part-of-speech.</Paragraph>
      <Paragraph position="6">  We also have c) the necessary decision-trees for word-splitting and part-of-speech tagging, each of which contains a set of questions about events. We have considered the following points in making decision-tree questions.</Paragraph>
      <Paragraph position="7"> 1) MI character bits We define self-organizing character classes represented by binary trees, each of whose nodes is significant in the n-gram language model. We can ask which node a character is dominated by (a sketch of such questions follows this list).</Paragraph>
      <Paragraph position="8"> 2) MI word bits Likewise, MI word bits (Brown et al. 1992) are also available, so that we may ask which node a word is dominated by.</Paragraph>
      <Paragraph position="10"> 3) Questions about the target word These questions mostly relate to the morphology of a word (e.g., Does it end in 'shi-i' (an adjective ending)? Does it start with 'do-'?).</Paragraph>
      <Paragraph position="11"> 4) Questions about the context  Many of these questions concern continuous part-of-speech tags (e.g., Is the previous word an adjective?). However, the questions may concern information at different remote locations in a sentence (e.g., Is the initial word in the sentence a noun?). These questions can be combined in order to form questions of greater complexity.</Paragraph>
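A sketch of how such questions might be expressed over the MI bit strings and the context (the event fields, node addresses, and names are all assumptions for illustration):

```python
# Illustrative question predicates over an analysis event. The event fields
# (char_bits, word_bits, word, prev_tag) are assumed, not the paper's format.
def char_under_node(event, node_bits):
    """1) MI character bits: is the current character dominated by this node?"""
    return event["char_bits"].startswith(node_bits)

def word_under_node(event, node_bits):
    """2) MI word bits: is the candidate word dominated by this node?"""
    return event["word_bits"].startswith(node_bits)

def ends_with(event, suffix):
    """3) Target-word morphology, e.g. an adjective ending such as 'しい'."""
    return event["word"].endswith(suffix)

def prev_tag_is(event, tag):
    """4) Context: does the previous word carry the given part-of-speech?"""
    return event["prev_tag"] == tag

# Questions can be combined to form questions of greater complexity:
def complex_question(event):
    return prev_tag_is(event, "adjective") and ends_with(event, "しい")
```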
    </Section>
  </Section>
  <Section position="6" start_page="660" end_page="660" type="metho">
    <SectionTitle>
5 Analysis with Decision-Trees
</SectionTitle>
    <Paragraph position="0"> Our proposed morphological analyzer processes each character in a string from left to right.</Paragraph>
    <Paragraph position="1"> Candidates for a word are examined, and a tag candidate is assigned to each word. As each word candidate is checked, it is given a probability by the Word Model decision-tree.</Paragraph>
    <Paragraph position="2"> We can either exhaustively enumerate and score all of the cases or use a stack decoder algorithm (Jelinek 1969; Paul 1991) to search through the most probable candidates.</Paragraph>
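A minimal sketch of this left-to-right search, using a simple beam-style approximation of a stack decoder (the model interfaces, the beam width, the word-length bound, and all names are assumptions):

```python
# Left-to-right search over word segmentations and tags. Illustrative only:
# word_model and tag_model stand in for the two decision-tree models.
import heapq
from math import log

MAX_WORD_LEN = 6   # assumed bound on candidate word length
BEAM = 20          # assumed beam width


def analyze(sentence, word_model, tag_model, tags):
    """Return an approximate best (words, tags, log-prob) analysis of `sentence`.

    word_model(word, history, sentence) -> P(Word+) for this candidate.
    tag_model(tag, word, history, sentence) -> P(tag | ...).
    """
    # Each hypothesis: (negative log-prob, chars consumed, words, tag sequence)
    heap = [(0.0, 0, (), ())]
    best = None
    while heap:
        neg_logp, pos, words, tag_seq = heapq.heappop(heap)
        if pos == len(sentence):
            # first complete hypothesis popped: best found under the beam
            best = (list(words), list(tag_seq), -neg_logp)
            break
        expansions = []
        for length in range(1, min(MAX_WORD_LEN, len(sentence) - pos) + 1):
            word = sentence[pos:pos + length]
            p_word = word_model(word, (words, tag_seq), sentence)
            if p_word <= 0.0:
                continue
            for tag in tags:
                p_tag = tag_model(tag, word, (words, tag_seq), sentence)
                if p_tag <= 0.0:
                    continue
                score = neg_logp - log(p_word) - log(p_tag)
                expansions.append(
                    (score, pos + length, words + (word,), tag_seq + (tag,)))
        for item in heapq.nsmallest(BEAM, expansions):
            heapq.heappush(heap, item)
    return best
```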
    <Paragraph position="3"> The fact that we do not use a dictionary 13 is one of the great advantages. By using a dictionary, a morphological analyzer has to deal with unknown words and unknown tags, 14 and is also fooled by many words sharing common substrings. In practical contexts, the system refers to the dictionary by using heuristic rules to find the more likely word boundaries, e.g., the minimum number of words, or the maximum word length available at the minimum cost. If the system could learn how to find word boundaries without a dictionary, then there would be no need for such an extra device or process.</Paragraph>
    <Paragraph position="4"> 13 Here, a dictionary is a listing of words attached to part-of-speech tags.</Paragraph>
    <Paragraph position="5"> 14 Words that are not found in the dictionary, and necessary tags that are not assigned in the dictionary.</Paragraph>
  </Section>
</Paper>