File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-0127_metho.xml
Size: 21,608 bytes
Last Modified: 2025-10-06 14:14:37
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0127"> <Title>I : I I I I I</Title> <Section position="3" start_page="298" end_page="304" type="metho"> <SectionTitle> 2. Chinese Word Classification Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="298" end_page="299" type="sub_section"> <SectionTitle> 2.1 Basic Idea </SectionTitle> <Paragraph position="0"> We apply a top-down binary splitting technique to all the words, using average mutual information as the similarity metric, as in McMahon (1995). This method has its merits: the top-down technique represents the hierarchy information explicitly, and the position of a word in class space can be obtained without reference to the positions of other words. The bottom-up technique, in contrast, treats every word in the vocabulary as one class, merges two classes according to a certain similarity metric, and repeats the merging process until the demanded number of classes is obtained.</Paragraph> <Paragraph position="1"> Brown et al. (1992) have shown that any classification system whose average class mutual information is maximized will lead to class-based language models of lower perplexity.</Paragraph> <Paragraph position="2"> The concept of mutual information, taken from information theory, was proposed as a measure of word association (Church, 1990; Jelinek et al., 1990, 1992; Dagan, 1995). It reflects the strength of the relationship between words by comparing their actual co-occurrence probability with the probability that would be expected by chance. The mutual information of two events x_1 and x_2 is defined as follows:</Paragraph> <Paragraph position="3"> I(x_1, x_2) = log [ P(x_1, x_2) / ( P(x_1) P(x_2) ) ]</Paragraph> <Paragraph position="4"> where P(x_1) and P(x_2) are the probabilities of the events, and P(x_1, x_2) is the probability of the joint event. If there is a strong association between x_1 and x_2, then P(x_1, x_2) >> P(x_1)P(x_2), and as a result I(x_1, x_2) >> 0. If there is a weak association between x_1 and x_2, then P(x_1, x_2) is approximately P(x_1)P(x_2) and I(x_1, x_2) is approximately 0. If P(x_1, x_2) << P(x_1)P(x_2), then I(x_1, x_2) << 0. Owing to the unreliability of measuring negative mutual information values between content words in corpora that are not extremely large, we consider any negative value to be 0. We also set I(x_1, x_2) to 0 if P(x_1, x_2) = 0.</Paragraph> <Paragraph position="5"> The average mutual information I_avg between events x_1, x_2, ..., x_N is defined similarly.</Paragraph> <Paragraph position="7"> Rather than estimate the relationship between words, we measure the mutual information between classes. Let C_i, C_j be the classes, i, j = 0, 1, 2, ..., N; N denotes the number of classes.</Paragraph> <Paragraph position="8"> Then the average mutual information between classes C_1, C_2, ..., C_N is</Paragraph> <Paragraph position="9"> I_avg = SUM_{i,j} P(C_i, C_j) log [ P(C_i, C_j) / ( P(C_i) P(C_j) ) ]</Paragraph> <Paragraph position="10"> The complete process is described as follows: we split the vocabulary into a binary tree, considering only a one-dimensional (adjacent-word) neighborhood.</Paragraph> <Paragraph position="11"> #1: Take all the words in the vocabulary as one class and take this level of the binary tree as 0, that is, Level = 0, Branch = 0; otherwise go to the end.</Paragraph> <Paragraph position="12"> From the algorithm described above, we can conclude that the computation time for moving a word from one class to another is of order O(hV^3) for tree height h and vocabulary size V. If the height of the binary tree is h, the number of all possible classes will be 2^h. During the splitting process, especially at the bottom of the binary tree, some classes may be empty because the classes above them cannot be split any further.</Paragraph>
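As a rough illustration of the average class mutual information criterion used for the splitting (this is our own sketch, not the authors' implementation, and the function and variable names are assumptions), the following Python fragment estimates SUM_{i,j} P(C_i, C_j) log [ P(C_i, C_j) / (P(C_i) P(C_j)) ] from a sequence of class labels, clipping negative pointwise terms to 0 as described above:

# Illustrative sketch (not the authors' code): average class mutual information
# from adjacent-class co-occurrence counts, with negative pointwise terms clipped to 0.
import math
from collections import Counter

def average_class_mutual_information(class_sequence):
    """class_sequence: list of class ids, one per corpus token (adjacent neighbours only)."""
    unigrams = Counter(class_sequence)
    bigrams = Counter(zip(class_sequence, class_sequence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def p(c):
        return unigrams[c] / n_uni

    total = 0.0
    for (ci, cj), count in bigrams.items():
        p_ij = count / n_bi
        mi = math.log(p_ij / (p(ci) * p(cj)))   # pointwise term log P(Ci,Cj)/(P(Ci)P(Cj))
        total += p_ij * max(mi, 0.0)            # clip negative values to 0
    return total

# Toy example: a short "corpus" of class labels
print(average_class_mutual_information([0, 1, 0, 1, 2, 0, 1, 2, 2, 0, 1]))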
</Section> <Section position="2" start_page="299" end_page="304" type="sub_section"> <SectionTitle> 2.2 Improvement to the Basic Algorithm </SectionTitle> <Paragraph position="0"> As mentioned in the Introduction, Brill (1993) and McMahon (1995) consider only a one-dimensional neighborhood, while Schutze (1995) considers neighborhoods of 50 dimensions. How large should the neighborhood be? For the long-distance bigrams studied by Huang et al. (1993) and Rosenfeld (1994), training-set perplexity is lowest for the conventional bigram (d = 1) and increases significantly as d moves through 2, 3, 4 and 5. For d = 6, ..., 10, training-set perplexity remains at about the same level. Thus Huang et al. (1993) conclude that some information indeed exists in the more distant past, but it is spread thinly across the entire history. We performed the same test on Chinese and obtained similar results. So 50 dimensions is too many, and the resulting search space is computationally prohibitive, while a single dimension is so small that much information is lost.</Paragraph> <Paragraph position="1"> In this paper, we let d = 2.</Paragraph> <Paragraph position="2"> Then P(C_i), P(C_j) and P(C_i, C_j) can be calculated as follows:</Paragraph> <Paragraph position="3"> P(C_i) = ( SUM_{w in C_i} N_w ) / N_total,   P(C_i, C_j) = ( SUM_{w1 in C_i, w2 in C_j} N^d_{w1 w2} ) / ( SUM_{w1, w2 in V} N^d_{w1 w2} )</Paragraph> <Paragraph position="4"> where N_total is the total number of occurrences in the corpus of words that are in the vocabulary.</Paragraph> <Paragraph position="5"> d is the distance considered in the corpus, N_w is the total number of times word w occurs in the corpus, and N^d_{w1 w2} is the total number of times the word pair w1 w2 occurs in the corpus within the distance d.</Paragraph> <Paragraph position="6"> In the work of Brill (1993), Brill et al. use the sum of two relative entropies as the similarity metric to compare two words. They treat a word's neighbors equally, without considering the possibly different influences of the left neighbor and the right neighbor on the word. But in natural language the effects of the left neighbor and the right neighbor are asymmetric, that is, the effect is directional. For example, in the Chinese sentence meaning "I ate an apple", the word for "I" and the word for "apple" have different functions in the sentence; we cannot reverse them and say "An apple ate I". So it is necessary to introduce a similarity metric that reflects this directional property. Applying this idea in our algorithm, we create two binary trees to represent the two directions: one binary tree represents the word relation direction from left to right, and the other represents the word relation direction from right to left. The former, left-to-right, direction is the default circumstance mentioned in 2.1.2.</Paragraph> <Paragraph position="7"> A similar idea about the directional property is also presented by Dagan et al. (1995), who define a similarity metric between two words, based on mutual information, that reflects the directional property. But the metric is not transitive, and this intransitivity means it cannot be used to cluster words into equivalence classes.</Paragraph>
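The following sketch (not from the paper; the function and variable names are ours) illustrates how directional co-occurrence counts within the distance d = 2 might be collected, keeping left-to-right and right-to-left neighbours separate as motivated above:

# Illustrative sketch, not the authors' implementation: directional co-occurrence
# counts within distance d, as used above (d = 2). All names here are assumptions.
from collections import Counter

def directional_counts(tokens, d=2):
    """Return (left_to_right, right_to_left) counters of word pairs within distance d."""
    left_to_right = Counter()   # (w1, w2) with w2 occurring up to d words after w1
    right_to_left = Counter()   # (w1, w2) with w2 occurring up to d words before w1
    for i, w1 in enumerate(tokens):
        for offset in range(1, d + 1):
            if i + offset < len(tokens):
                left_to_right[(w1, tokens[i + offset])] += 1
            if i - offset >= 0:
                right_to_left[(w1, tokens[i - offset])] += 1
    return left_to_right, right_to_left

tokens = "I ate an apple and I ate a pear".split()
lr, rl = directional_counts(tokens, d=2)
print(lr[("I", "ate")], rl[("apple", "an")])  # asymmetric: (w1, w2) differs from (w2, w1)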
<Paragraph position="8"> To reflect the different influences of a word's left neighbor and right neighbor, we introduce a probability for each word w with respect to every class. That is, for the classes produced by the binary tree that represents the word relation direction from left to right, we assign a probability P_lr(C_i|w) to each word w for every class C_i; the probability P_lr(C_i|w) reflects the degree to which the word w belongs to class C_i. For the classes produced by the binary tree that represents the word relation direction from right to left, P_rl(C_i|w) is calculated likewise.</Paragraph> <Paragraph position="9"> Mutual information can be interpreted as the ability to dispel the uncertainty of an information source, and the entropy of information is defined as the uncertainty of the information source. So the probability P_{C_i} that word w belongs to class C_i can be expressed in terms of I(C_i, C_j), the mutual information between the class C_i and the other class C_j in the same binary branch as C_i, and S(C_i), the entropy of class C_i.</Paragraph> <Paragraph position="10"> Then 1 - P_{C_i} denotes the probability that word w does not belong to class C_i; that is, in the binary tree, 1 - P_{C_i} denotes the probability of the other branch class corresponding to C_i. Because the average mutual information is small, it is possible for P_{C_i} to be less than 1 - P_{C_i}. To avoid giving the word assigned to a class a smaller probability than a word not assigned to that class, we distribute the probability 1 to the word assigned to the class. Thus, for each class at a certain level of the binary tree, we multiply its original probabilities by either 1 or 1 - P_{C_j}, where C_j is the branch class opposite to the class the word does not belong to.</Paragraph> <Paragraph position="11"> The description above concerns only whether word w belongs to a certain class at a certain level, without considering the effect of its upper levels. To obtain the real probability of word w belonging to a certain class, the belonging probabilities of all its ancestors should be multiplied together.</Paragraph> <Paragraph position="12"> This distribution of probability is not optimal, but it reflects the degree to which a word belongs to a class. It should be noted that SUM_i P(C_i|w) must be normalized, both for the left-right and for the right-left results, and the normalized results of the left-right and right-left binary trees must then be normalized together.</Paragraph> <Paragraph position="13"> Since there is a directional property between words, transitivity is not satisfied between the different directions. That is, if we did not introduce the probabilities P_lr(C_i|w) and P_rl(C_i|w), we could not merge the classes, because there is no transitivity between a class in which the word relation runs from left to right and a class in which it runs from right to left. For example, the Chinese words for "we" and "you" are contained in one class derived from the left-right binary tree, while two other words, "you" and "apple", belong to another class derived from the right-left binary tree. This does not mean that the words "we" and "apple" belong to one class.</Paragraph>
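A minimal sketch of this "soft" membership scheme, under our own reading of the description above (the tree encoding, the per-branch probabilities and all names are assumptions, not the authors' code): at each level the word's own branch contributes a factor of 1 and the opposite branch a factor of 1 - P of the word's branch, and the products along root-to-leaf paths are normalized:

# Illustrative sketch (not the authors' code) of soft class membership along a binary tree.
def soft_membership(word_path, branch_prob, depth):
    """word_path: list of 0/1 branch choices made for the word at each level.
    branch_prob[(level, branch)]: probability mass P_C attached to that branch.
    Returns a normalized distribution over the 2**depth leaf classes."""
    scores = {}
    for leaf in range(2 ** depth):
        p = 1.0
        for level in range(depth):
            branch = (leaf >> (depth - 1 - level)) & 1        # branch taken at this level
            if branch == word_path[level]:
                p *= 1.0                                      # word assigned here: factor 1
            else:
                p *= 1.0 - branch_prob[(level, word_path[level])]  # opposite branch
        scores[leaf] = p
    total = sum(scores.values())
    return {leaf: s / total for leaf, s in scores.items()}

# Toy example: a depth-2 tree, word assigned to branches [0, 1]; probabilities are made up.
probs = {(0, 0): 0.6, (0, 1): 0.6, (1, 0): 0.55, (1, 1): 0.55}
print(soft_membership([0, 1], probs, depth=2))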
<Paragraph position="14"> But when we introduce these probabilities, unlike with the intransitive similarity metric presented by Dagan et al. (1995), the classes generated by the two binary trees can be merged, because the probabilities make the "hard" intransitivity "soft".</Paragraph> <Paragraph position="15"> Although this top-down splitting method has the advantages mentioned above, it has obvious shortcomings, which Magerman (1994) describes in detail. Since the splitting procedure is restricted to trees, as opposed to arbitrary directed graphs, there is no mechanism for merging two or more nodes during the tree-growing process. That is to say, if we assign a word to the wrong class in the global sense, we are no longer able to move it back. So it is difficult to merge the classes obtained by the left-right binary tree and the right-left binary tree during the process of growing the trees. To solve this problem, we apply a bottom-up merging method to the resulting classes.</Paragraph> <Paragraph position="16"> A number of different similarity measures can be used. We choose relative entropy, also known as the Kullback-Leibler distance (Pereira et al., 1993; Brill, 1993). Rather than merge two words, we merge two classes taken from the resulting classes generated by the left-right binary tree and the right-left binary tree respectively, and select the merged class that leads to the maximum value of the similarity metric. This procedure is repeated recursively until the demanded number of classes is reached.</Paragraph> <Paragraph position="17"> Let P and Q be probability distributions. The Kullback-Leibler distance from P to Q is defined as:</Paragraph> <Paragraph position="18"> D(P || Q) = SUM_x P(x) log [ P(x) / Q(x) ]</Paragraph> <Paragraph position="19"> For two words w and w_1, let P_d(w, w_1) be the probability of word w occurring to the left of w_1 within the distance d, and let P_lr(C_i|w_1) and P_rl(C_j|w_1) be the probabilities that w_1 belongs to class C_i of the left-right tree and class C_j of the right-left tree respectively. Then the Kullback-Leibler distance between words w_1 and w_2 in the left-right tree is: D_lr(w_1 || w_2) = SUM_{w in V} P_d(w, w_1) P_lr(C_i|w_1) log [ P_d(w, w_1) P_lr(C_i|w_1) / ( P_d(w, w_2) P_lr(C_i|w_2) ) ], where V is the vocabulary. The divergence of words w_1 and w_2 in the left-right tree is: Div_lr(w_1, w_2) = D_lr(w_1 || w_2) + D_lr(w_2 || w_1). Similarly, the Kullback-Leibler distance between words w_1 and w_2 in the right-left tree is: D_rl(w_1 || w_2) = SUM_{w in V} P_d(w, w_1) P_rl(C_j|w_1) log [ P_d(w, w_1) P_rl(C_j|w_1) / ( P_d(w, w_2) P_rl(C_j|w_2) ) ].</Paragraph> <Paragraph position="20"> We can then define the similarity of w_1 and w_2 as: S(w_1, w_2) = 1 - (1/2) [ Div_lr(w_1, w_2) + Div_rl(w_1, w_2) ]   (10). S(w_1, w_2) ranges from 0 to 1, with S(w, w) = 1.</Paragraph> <Paragraph position="21"> The computational cost of this similarity is not high, because the components of equation (10) have already been obtained during the earlier computation.</Paragraph>
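A small sketch of the divergence-based similarity of equation (10) (ours, not the authors' code; the per-word, per-direction distributions are assumed to be supplied as dictionaries, for example built from the P_d(w, w_1) P_lr(C_i|w_1) terms above):

# Illustrative sketch (not the authors' code): symmetrized Kullback-Leibler distance and
# the similarity of equation (10). The input dictionaries are assumptions for the example.
import math

def kl_distance(p, q, eps=1e-12):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x)); eps guards against zero probabilities in Q."""
    return sum(px * math.log(px / max(q.get(x, 0.0), eps)) for x, px in p.items() if px > 0)

def divergence(p, q):
    return kl_distance(p, q) + kl_distance(q, p)

def similarity(dist_lr_w1, dist_lr_w2, dist_rl_w1, dist_rl_w2):
    """S(w1, w2) = 1 - (Div_lr + Div_rl) / 2, so that S(w, w) = 1."""
    return 1.0 - 0.5 * (divergence(dist_lr_w1, dist_lr_w2) + divergence(dist_rl_w1, dist_rl_w2))

# Toy example: identical distributions give similarity 1
d = {"a": 0.5, "b": 0.5}
print(similarity(d, d, d, d))   # 1.0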
<Paragraph position="22"> The number of all possible classes is 2^h. During the splitting process, especially at the bottom of the binary tree, some classes may be empty because the classes at higher levels cannot be split any further according to the rule of maximum average mutual information. The number of resulting classes therefore cannot be controlled exactly, so we define the number of demanded classes in advance. As long as the number of resulting classes is less than the pre-defined number, the splitting process continues. When the number of resulting classes is larger than the pre-defined number, we use the merging technique presented above to reduce the number until it equals the pre-defined number. The procedure can be described as follows: after we have merged two classes taken from the left-right and right-left trees respectively, we use the merged class to replace the two original classes. We then repeat this process until a certain number of steps is reached; in this paper we define the number of steps as the larger of the numbers of resulting classes in the two trees. Finally, we merge all resulting classes until the pre-defined number is reached.</Paragraph> <Paragraph position="23"> This merging process guarantees that the probabilities are nonzero whenever the word distributions are. This is a useful advantage compared with agglomerative clustering techniques, which need to compare the individual objects being considered for grouping.</Paragraph> </Section> </Section> <Section position="4" start_page="304" end_page="25130" type="metho"> <SectionTitle> 3. Experimental Results and Discussion 3.1 Word Classification Results </SectionTitle> <Paragraph position="0"> We ran the computation on a Pentium 586/133MHz machine with 32M of memory. The operating system is Windows NT 4.0, and Visual C++ 4.0 is our programming language.</Paragraph> <Paragraph position="1"> We use the electronic news corpus named "Chinese One Hundred Kinds of Newspapers---1994". Its total size is 780 million bytes. It is not feasible to do classification experiments on this original corpus.</Paragraph> <Paragraph position="2"> So we extract from the original news texts the part containing the news published in April.</Paragraph> <Paragraph position="3"> For convenience, the various sentence-boundary punctuation marks are replaced by only two sentence boundary markers, which denote the beginning and the end of a sentence or phrase respectively.</Paragraph> <Paragraph position="4"> The texts are segmented by the Variable-distance algorithm [Gao, J. and Chen, X.X. (1996)]. We select four subcorpora which contain 10323, 17451, 25130 and 44326 Chinese words. The corresponding vocabularies contain 2103, 3577, 4606 and 6472 words. The results of the classification without introducing probabilities are summarized in Table I.</Paragraph> <Paragraph position="5"> The computation for the merging process is only equal to the splitting calculation at one level of the tree. From Table I, we find, surprisingly, that the computation time for the right-left tree is much shorter than that for the left-right tree. But this is reasonable. In the left-right process the left branch contains more words than the right branch, and to move each word from the left branch to the right branch we need to match this word throughout the corpus. In the right-left process the left branch has fewer words than the right, so we only need to match a small number of words in the corpus. From this we can see that the preprocessing procedure costs much of the time.</Paragraph> <Paragraph position="6"> The number of empty classes increases as the tree grows. Table II shows the number of empty classes at different levels of the left-right tree when we process the subcorpus containing 10323 words. Although our method computes a purely distributional classification, it still demonstrates strong part-of-speech behaviour.</Paragraph>
<Paragraph position="7"> Some typical word classes, part of the results for the subcorpus containing 17451 words, are listed below (resulting classes of the left-right binary tree).</Paragraph> <Paragraph position="8"> class 13: [Chinese words] class ...: [Chinese words] class 96: [Chinese words] But some classes present no obvious part-of-speech category. Most of these contain only a very small number of words. This may be caused by the predefined number of classes; thus excessive or insufficient classification may be encountered. Another shortcoming is that in almost every resulting class a small number of words do not belong to the part-of-speech categories to which most of the words in that class belong.</Paragraph> <Section position="1" start_page="351" end_page="25130" type="sub_section"> <SectionTitle> 3.2 Using Word Classification Results in Statistical Language Modeling </SectionTitle> <Paragraph position="0"> A word class-based language model is more competitive than a word-based language model. It has far fewer parameters, and thus makes better use of the training data and alleviates the problem of data sparseness. We compare the word class-based N-gram language model with the typical N-gram language model using perplexity.</Paragraph> <Paragraph position="1"> Perplexity (Jelinek, 1990a; McCandless, 1994) is an information-theoretic measure for evaluating how well a statistical language model predicts a particular test set. It is an excellent metric for comparing two language models because it is entirely independent of how each language model functions internally, and also because it is very simple to compute. For a given vocabulary size, a language model with lower perplexity models the language more accurately, which generally correlates with lower error rates during speech recognition.</Paragraph> <Paragraph position="2"> Perplexity is derived from the average log probability that the language model assigns to each word in the test set:</Paragraph> <Paragraph position="3"> S = -(1/N) SUM_{i=1}^{N} log_2 P(w_i | w_1, ..., w_{i-1})</Paragraph> <Paragraph position="4"> where w_1, ..., w_N are all the words of the test set, constructed by listing the sentences of the test set end to end, separated by a sentence boundary marker. The perplexity is then 2^S. S may be interpreted as the average number of bits of information needed to compress each word in the test set, given that the language model is providing us with information.</Paragraph> <Paragraph position="5"> We compare the perplexity of the N-gram language model with that of the class-based N-gram language model. The perplexity PP of the bigram for class is: PP = exp( -(1/N) SUM_i ln( P(w_i | C(w_i)) P(C(w_i) | C(w_{i-1})) ) )   (14) where w_i denotes the ith word in the corpus and C(w_i) denotes the class that w_i is assigned to. N is the number of words in the corpus. P(C(w_i) | C(w_{i-1})) can be estimated by:</Paragraph> <Paragraph position="6"> P(C(w_i) | C(w_{i-1})) = Count(C(w_{i-1}), C(w_i)) / Count(C(w_{i-1}))   (15)</Paragraph> <Paragraph position="7"> The perplexities PP based on the different N-grams for word and class are presented in Table III.</Paragraph> <Paragraph position="8"> Note that we present the "hard" classification and "soft" classification results for the word class-based language model respectively. For probabilistic classification, we define a word as belonging to the class for which it has the largest probability.</Paragraph>
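A minimal sketch of the class-based bigram perplexity of equation (14) (ours, not the authors' code; the word-to-class map and the probability tables are assumed inputs, and the class transition for the first word is simply skipped for brevity):

# Illustrative sketch (not the authors' code) of the class-based bigram perplexity:
# PP = exp( -(1/N) * sum_i ln( P(w_i | C(w_i)) * P(C(w_i) | C(w_{i-1})) ) ).
import math

def class_bigram_perplexity(words, word_to_class, p_word_given_class, p_class_given_class):
    log_sum = 0.0
    prev_class = None
    for w in words:
        c = word_to_class[w]
        p = p_word_given_class[(w, c)]
        if prev_class is not None:              # first word: no class transition term
            p *= p_class_given_class[(prev_class, c)]
        log_sum += math.log(p)
        prev_class = c
    return math.exp(-log_sum / len(words))

# Toy usage with made-up numbers
words = ["a", "b", "a"]
w2c = {"a": 0, "b": 1}
pwc = {("a", 0): 0.5, ("b", 1): 1.0}
pcc = {(0, 1): 0.6, (1, 0): 0.7, (0, 0): 0.4}
print(class_bigram_perplexity(words, w2c, pwc, pcc))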
<Paragraph position="9"> The training corpus contains more than 12,000 Chinese words, and the vocabulary contains the 1034 most frequent Chinese words. We use the four subcorpora mentioned above as test sets.</Paragraph> <Paragraph position="10"> An arbitrary nonzero probability is given to all Chinese words and symbols that do not exist in the vocabulary: we set P(w) = 1/(2N) for a word w that is not in the vocabulary, where N is the number of words in the training corpus.</Paragraph> <Paragraph position="11"> From Table III, we can see that the perplexity of the "hard" class-based bigram is 28.7% lower than that of the word-based bigram, while the perplexity of the "soft" class-based bigram is much lower still: the reduction is about 43% compared with the "hard" class-based bigram.</Paragraph> </Section> </Section> </Paper>