<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0109"> <Title>Automatic Construction of a Chinese Electronic Dictionary</Title> <Section position="3" start_page="107" end_page="111" type="metho"> <SectionTitle> 2. Automatic Construction of Electronic Dictionary with Reestimation Approach </SectionTitle> <Paragraph position="0"> The fundamental building blocks of the above-mentioned automatic Chinese electronic dictionary construction system are the following modules: (i) an automatic word extraction system, and (ii) an automatic part-of-speech tagging system. Figure 1 shows the block diagram of such a system, where the word extraction system is realized as a word segmentation module implemented with the Viterbi Training procedure for words.</Paragraph> <Paragraph position="1"> The system reads a large untagged plain text and produces its segmented version based on a segmentation model (with or without TCC post-filtering). The main purpose of the segmentation module is to segment the Chinese text corpus into words, because there is no natural delimiter between Chinese words in a text. After segmentation, each word in the segmented text is automatically tagged with its part of speech. The possible parts of speech for each word in the segmented plain text are then collected to form a POS-annotated electronic dictionary.</Paragraph> <Paragraph position="2"> A Viterbi reestimation process, as outlined below, can be used for both the word segmentation and POS tagging tasks to optimize the tagging patterns (including segmentation patterns and POS tagging patterns) in a reasonable way. The principle is to estimate an initial set of segmentation or tagging parameters from the small segmented or tagged seed corpus, and then to use this set of parameters to carry out the segmentation or POS tagging task. After the task is done, the best tagging pattern is updated, and the set of parameters is reestimated based on the distribution of the new tagging patterns and the seed. This process is repeated until a stopping criterion is met.</Paragraph> <Paragraph position="3"> Since only the best tagging pattern for each sentence is used for reestimating the parameters, such a training procedure will be referred to as a Viterbi Training (VT) procedure, in contrast to an EM algorithm [Dempster 77], which considers all possible patterns and their expectations. Since an EM version of the training procedure may require a long computation time, we leave this option to future research. 3. Automatic Word Identification: Viterbi Training for Words (VTW) To compile an electronic dictionary (i.e., a word-tag list in the current task), we first need to gather the word list within the corpus. Since there is no natural delimiter, such as a space, between Chinese words, all the character n-grams in the text corpus are potential word candidates. The first lexicon acquisition task is therefore to identify appropriate words embedded in the text corpus which are not known to the seed corpus. This task can be resolved by using a word segmentation model or a two-class classifier (to be described in the next sections).</Paragraph> <Paragraph position="4"> Rule-based approaches [Ho 83, Chen 86, Yeh 91] as well as probabilistic approaches [Fan 88, Sproat 90, Chang 91, Chiang 92] to word segmentation have been proposed. For a large-scale system, the probabilistic approach is more practical when the capability of automatic training and the cost are considered. Practical probabilistic segmentation models can achieve quite satisfactory results [Chang 91, Chiang 92] provided that there are no words unknown to the system.</Paragraph>
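The generic Viterbi Training loop shared by the word segmentation and POS tagging tasks can be sketched in Python as follows. This is a minimal illustration only, not the paper's implementation; the corpus objects and the `decode_best` and `estimate_params` callbacks are hypothetical placeholders standing in for the decoder and the parameter estimator described above.

```python
def viterbi_training(seed_corpus, raw_corpus, decode_best, estimate_params,
                     max_iter=20):
    """Generic Viterbi Training (VT) loop: only the single best pattern per
    sentence is used for reestimation (in contrast to EM, which would weight
    all possible patterns by their expectations)."""
    # Initial parameters are estimated from the small annotated seed corpus.
    params = estimate_params(seed_patterns=seed_corpus, best_patterns=[])
    prev_best = None
    for _ in range(max_iter):
        # Decode the best segmentation/tagging pattern for every sentence.
        best = [decode_best(sentence, params) for sentence in raw_corpus]
        if best == prev_best:  # stopping criterion: patterns no longer change
            break
        prev_best = best
        # Reestimate parameters jointly from the seed and the new best patterns.
        params = estimate_params(seed_patterns=seed_corpus, best_patterns=best)
    return params, prev_best
```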
<Paragraph position="5"> A particular segmentation pattern can be expressed in terms of the words it contains. Given a string of n Chinese characters $c_1, c_2, \ldots, c_n$, represented as $c_1^n$, a Bayesian decision rule requires that we find the best word segmentation pattern $\hat{W}$ among all possible segmentation patterns $W_j$ which maximizes the following probability: $\hat{W} = \arg\max_{W_j} P(W_j \mid c_1^n)$,</Paragraph> <Paragraph position="7"> where $W_j = w_{j,1}, w_{j,2}, \ldots, w_{j,m_j}$ are the $m_j$ words in the j-th alternative segmentation pattern $W_j$. In the current task, we assume that there is only a small segmented seed corpus available. To reduce estimation error, we adopt the simple model used in [Chang 91], $P(W_j = w_{j,1}, \ldots, w_{j,m_j} \mid c_1^n) \approx \prod_{i=1}^{m_j} P(w_{j,i})$, which uses the product of word probabilities as the scoring function for segmentation. Other, more complicated segmentation models [Chiang 92] may get better results. However, a more complicated model might not be appropriate in the current unsupervised mode of learning, since the estimation error for the parameters may be high due to the small seed corpus. The following figure (Figure 2) shows the block diagram of such a system.</Paragraph> <Paragraph position="8"> Note the loop for re-estimating the word probabilities. Initially, the n-grams embedded in the unsegmented corpus are gathered to form a word candidate list. For practical purposes, we retain only n-grams that are more frequent than a lower bound (LB=5), and only n-grams up to n=4 are considered (since most Chinese words are of length 1, 2, 3, or 4). The frequency lower bound restriction is applied to reduce the number of possible word candidates; it also removes n-grams that are not sufficiently useful even though they are judged as word candidates. Note that the words in the seed corpus are always included in the candidate list. In this sense, the seed corpus plays the role of an initial dictionary. Furthermore, all the characters (1-grams) are included to avoid the generation of 'unknown word regions' in the segmented patterns.</Paragraph> <Paragraph position="11"> Each word candidate will be associated with a non-zero word probability; the various segmentation patterns of the unsegmented corpus are then expanded in terms of such word candidates. The path (i.e., the segmentation pattern) with the highest score, as evaluated according to the initial set of parameters (i.e., word probabilities), is then marked as the best path for the current iteration. A new set of parameters is then re-estimated based on the best path. This process repeats until the segmentation patterns no longer change or a maximum number of iterations is reached. We then derive the word list to be included in the electronic dictionary from the segmented text corpus.</Paragraph> <Paragraph position="12"> Initially, the word probability $P(w_{j,i})$ is estimated from the small tagged seed corpus. In the reestimation cycle, both the seed corpus and the segmented text corpus acquired in the previous iteration are jointly considered to get a better estimation of the word probabilities.</Paragraph>
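As a rough illustration of this segmentation step, the sketch below finds the highest-scoring segmentation under the product-of-word-probabilities model by dynamic programming over character positions. The `segment` function and the `word_prob` dictionary (candidate n-grams and all single characters mapped to non-zero probabilities) are illustrative assumptions rather than the paper's actual data structures.

```python
import math

def segment(sentence, word_prob, max_len=4):
    """Best segmentation under P(W_j | c_1^n) ~ prod_i P(w_{j,i}):
    dynamic programming over character positions, maximizing the sum of
    log word probabilities."""
    n = len(sentence)
    best = [(-math.inf, 0)] * (n + 1)   # (score, backpointer) per position
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            w = sentence[start:end]
            if w in word_prob:
                score = best[start][0] + math.log(word_prob[w])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Recover the best path by following the backpointers.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return list(reversed(words))
```

In the reestimation cycle, `word_prob` would be recomputed from the relative word frequencies in the seed corpus together with the best segmentations obtained in the previous iteration.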
<Paragraph position="13"> 4. Automatic Word Identification: A Two-Class Classification (TCC) Model The word list acquired through the above reestimation process is based on the optimization of the likelihood value of the word segmentation pattern in a sentence, which implicitly takes the contextual words into account. However, it may not take into account the features for forming a word from characters. It is desirable, for instance, to take some &quot;strength&quot; measures of the chunks of characters into account in order to decide whether an n-gram is a word. Therefore, an alternative approach, which can also be used to supplement the VTW reestimation approach, is a Two-Class Classification model for classifying the character n-grams into words and non-words.</Paragraph> <Paragraph position="14"> To identify whether an n-gram belongs to the word class (w) or the non-word class ($\bar{w}$), each n-gram is associated with a feature vector $\vec{x}_s$ observed from the large untagged corpus. It is then judged whether the n-gram is more likely to be generated from a word model or from a non-word model based on $\vec{x}_s$.</Paragraph> <Paragraph position="15"> To simplify the design of the classifier, we use a simple linear discrimination function for classification: $g(\vec{x}_s, \vec{w}_s) = \vec{w}_s \cdot \vec{x}_s$, where $\vec{x}_s$ is the feature vector (or score vector) and $\vec{w}_s$ is a set of weights, acquired from the seed corpus, for the various components of the score vector. An n-gram will be classified as a word if the weighted sum of $\vec{w}_s$ and $\vec{x}_s$ is greater than zero (or larger than a given threshold). (For better results, a score vector derived from a log-likelihood ratio test as in [Su 94] could be used. Such an approach is being studied.) For estimating the weights, the seed n-grams are first separated into the word and non-word classes by checking them against the known segmentation boundaries in the seed corpus. The feature values for the n-grams are estimated from the statistics of the n-grams in the large unsegmented corpus. A set of initial weights is used to classify the word and non-word n-grams in the seed corpus according to their feature values. The weights are then adjusted according to the misclassified instances among the word or non-word n-grams until some optimization criteria for the classification results are achieved. A probabilistic descent method is used for adjusting the weights [Amari 67]. In brief, the weights are adjusted in the direction which is likely to decrease the risk, in terms of precision and recall, of the classifier.</Paragraph> </Section>
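The following sketch illustrates a linear two-class classifier of this kind. Since the exact probabilistic descent update is not reproduced here, a simple error-driven (perceptron-style) adjustment is used as a stand-in; `train_linear_classifier`, `is_word`, and their inputs are hypothetical names for illustration only.

```python
import numpy as np

def train_linear_classifier(word_feats, nonword_feats, lr=0.1, epochs=50):
    """Two-class linear classifier g(x_s, w_s) = w_s . x_s with a bias term.
    An n-gram is classified as a word when the weighted sum exceeds zero.
    The update below is a simple error-driven stand-in for the probabilistic
    descent method used in the paper."""
    X = np.vstack([word_feats, nonword_feats]).astype(float)
    X = np.hstack([X, np.ones((len(X), 1))])            # append bias feature
    y = np.array([1] * len(word_feats) + [-1] * len(nonword_feats))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:                 # misclassified n-gram
                w += lr * yi * xi                       # move toward the correct side
                errors += 1
        if errors == 0:                                 # all seed n-grams classified
            break
    return w

def is_word(w, x):
    """Classify a feature vector: word if the weighted sum is above zero."""
    return np.dot(w, np.append(x, 1.0)) > 0
```

The stopping condition and update rule here are deliberately simple; the paper instead adjusts the weights so as to reduce the classification risk measured in terms of precision and recall.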
<Section position="4" start_page="111" end_page="114" type="metho"> <SectionTitle> 5. Features for Classification </SectionTitle> <Paragraph position="0"> To classify the character n-grams, we need some discriminative features for the classifier. In particular, we found that the following features may be useful [Wu 93, Su 94, Tung 94].</Paragraph> <Paragraph position="1"> Frequency. Intuitively, a character n-gram is likely to be a word if it appears more frequently than the average. Therefore, we use the frequency measure f(x) as the first feature for classification.</Paragraph> <Paragraph position="2"> Mutual Information. In general, a word n-gram should contain characters that are strongly associated. One possible measure of the strength of character association is the mutual information measure [Church 90], which has been applied successfully to measuring word association in 2-word compounds. The mutual information of a bigram is defined as $I(x;y) = \log \frac{P(x,y)}{P(x)P(y)}$,</Paragraph> <Paragraph position="4"> where P(x) and P(y) are the prior probabilities of the individual characters and P(x,y) is the joint probability for the two characters to appear in the same 2-gram. This measure compares the probability for the individual characters to occur independently (denominator) with the probability for the characters to appear dependently (numerator). If the mutual information measure is much larger than 0, the characters tend to have a strong association. To deal with n-grams with n greater than 2, this idea of dependent vs. independent occurrence was extended to the following definition of the 3-gram mutual information: $I(x;y;z) = \log \frac{P_D(x,y,z)}{P_I(x,y,z)}$, with $P_D(x,y,z) = P(x,y,z)$ and $P_I(x,y,z) = P(x)P(y)P(z) + P(x)P(y,z) + P(x,y)P(z)$.</Paragraph> <Paragraph position="6"> In the above definition, the numerator $P_D$ is the probability for the three characters to occur dependently (i.e., the probability for the three characters to form a 3-character word), and the denominator $P_I$ is the total probability (or average probability, up to a scaling factor of 3) for the three characters to appear in the same 3-gram independently (i.e., by chance, possibly from two or three individual words). The extension to other n-grams can be made in a similar way.</Paragraph> <Paragraph position="7"> Entropy. It is also desirable to know how the neighboring characters of an n-gram are distributed. If the distribution of the neighboring characters is random, it may suggest that the n-gram has a natural break at the n-gram boundary, and thus that the n-gram is a potential word. Therefore, we use the left entropy $H_L$ and right entropy $H_R$ of an n-gram as another feature for classification. The left and right entropy measures are defined as follows [Tung 94]: $H_L(x) = -\sum_{i} P_L(c_i;x) \log P_L(c_i;x)$ and $H_R(x) = -\sum_{i} P_R(x;c_i) \log P_R(x;c_i)$,</Paragraph> <Paragraph position="11"> where $P_L(c_i;x)$ are the probabilities of the left neighboring characters $c_i$ of the n-gram x, and $P_R(x;c_i)$ are the probabilities of the right neighboring characters. It is possible to use any function of the left and right entropies for the classification task. In this paper, the average of the left and right entropies is used as a feature.</Paragraph> <Paragraph position="12"> Furthermore, since the dynamic ranges of the frequencies and mutual information are very large, we use the log-scaled frequency, log-scaled mutual information, and unscaled entropy measure as the features for the two-class classifier. Without confusion, we will still use the terms frequency and mutual information throughout the paper. In other words, the score vector for the classifier is $\vec{x}_s = [\, \log f(x),\ I(x),\ (H_L(x)+H_R(x))/2 \,]$, where $I(x)$ denotes the mutual information of the n-gram x.</Paragraph>
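As a rough illustration of how such a score vector could be computed, the sketch below estimates the three features for a character bigram from a raw text string, using simple relative-frequency probability estimates. The `score_vector` function and its inputs are illustrative assumptions; the paper gathers these statistics over the full unsegmented corpus rather than a single string.

```python
import math
from collections import Counter

def score_vector(bigram, text):
    """Score vector [log f, I, (H_L + H_R)/2] for a character bigram, where I
    is the (log-scaled) mutual information.  Probabilities are estimated by
    relative frequency over a raw text string."""
    N = len(text)
    char_counts = Counter(text)
    # Collect all (possibly overlapping) occurrences of the bigram.
    positions, start = [], text.find(bigram)
    while start != -1:
        positions.append(start)
        start = text.find(bigram, start + 1)
    f = len(positions)
    if f == 0:
        return [float("-inf"), float("-inf"), 0.0]
    # Mutual information: log P(x,y) / (P(x) P(y)).
    x, y = bigram[0], bigram[1]
    mi = math.log((f / N) / ((char_counts[x] / N) * (char_counts[y] / N)))
    # Entropies of the left/right neighboring-character distributions.
    left = Counter(text[p - 1] for p in positions if p > 0)
    right = Counter(text[p + 2] for p in positions if p + 2 < N)
    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total) for c in counts.values())
    return [math.log(f), mi, (entropy(left) + entropy(right)) / 2.0]
```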
<Paragraph position="14"> 6. Automatic Lexical Tagging: Viterbi Training for POS Tags (VTT) Once a word-segmented text corpus is acquired, the segmented version can be annotated with parts of speech so as to extract a POS-annotated electronic dictionary. The problem of POS tagging can be formulated as the problem of finding the best possible tagging pattern that maximizes the following lexical score [Church 88, Lin 92]: $\hat{T} = \arg\max_{T_j} P(T_j \mid W) \approx \arg\max_{T_j} \prod_{i} P(w_i \mid t_{j,i})\, P(t_{j,i} \mid t_{j,i-1})$,</Paragraph> <Paragraph position="16"> where $T_j$ is the j-th possible set of lexical tags (parts of speech) for the segmentation pattern W. The tagging process can thus be optimized based on the product of the POS tag transition probabilities $P(t_i \mid t_{i-1})$ and the lexical distribution $P(w_i \mid t_i)$. The Viterbi training process for POS tagging based on this optimization function is shown in Figure 3.</Paragraph> <Paragraph position="17"> Initially, $P(t_i \mid t_{i-1})$ and $P(w_i \mid t_i)$ are estimated from the small seed corpus. Furthermore, each n-gram in the segmented text corpus will be assigned the N most frequently encountered POS tags in the seed corpus; in our experiments, N is set to 10, since the 10 most frequently used POS tags already cover over 90% of the tags in the seed.</Paragraph> <Paragraph position="18"> During the training sessions, the various part-of-speech sequences for the untagged text corpus are expanded first, and the lexical score of each path is evaluated. We then choose the path with the highest score, and the corresponding parts of speech of that path, for re-estimating the required probabilities. The re-estimated probabilities are acquired from both the seed corpus and the highest-scored tagging results. This process repeats until the tagging results no longer change or until a maximum number of iterations is reached.</Paragraph> <Paragraph position="19"> 7. Integrated Systems for Dictionary Construction There are several ways to combine the above techniques to form an integrated automatic dictionary construction system. The following sections describe two such possibilities. Their performances will be compared later in the paper.</Paragraph> <Section position="1" start_page="113" end_page="113" type="sub_section"> <SectionTitle> 7.1 Basic Model: Viterbi Training for Words + Viterbi Training for POS Tags (VTW + VTT) </SectionTitle> <Paragraph position="0"> In the simplest topology, the Viterbi Training procedure for words is applied until the word segmentation parameters converge. The segmented text thus acquired (and hence the word n-grams) is then labelled with POS tags using the Viterbi Training procedure for POS tags. As shown in Figure 2, the n-grams are acquired from the unsegmented text corpus; n-grams that are less frequent than a lower bound (LB) are filtered out. The remaining n-grams then form the word candidates for expanding the various segmentation patterns.</Paragraph> </Section> <Section position="2" start_page="113" end_page="114" type="sub_section"> <SectionTitle> 7.2 PostFiltering Model: Viterbi Training for Words + Two-Class Classifier PostFiltering + Viterbi Training for Tags (VTW + TCC + VTT) </SectionTitle> <Paragraph position="0"> In the Basic Model, all n-grams that occur more frequently than 5 times in the large text corpus are considered potential words. Therefore, the number of possible segmentation patterns is extremely large. In fact, however, only about 17% of the bigrams, 3% of the trigrams, and 4% of the 4-grams in the frequency-filtered word candidates are recognized as words in a human-constructed dictionary of more than 80K entries. It is therefore very difficult to find the best segmentation patterns, and thus the word list, with the Basic Model. To relieve this problem, the VTW module can be regarded as a filter on the frequency-filtered word candidates, and we can further filter out inappropriate candidates with a TCC postfilter at the output end of the Basic Model. Intuitively, the post-TCC module has a better chance of identifying real words in the output word list of the Basic Model, even if the VTW module does not perform well. The configuration is shown in Figure 4.</Paragraph> <Paragraph position="1"> In this topology, the Viterbi training procedure for words is applied first to acquire the possible word list which maximizes the likelihood of the segmentation patterns. The two-class classifier is then used as a postfilter to confirm whether the candidates are real word n-grams. The word n-grams thus acquired are then used as the word candidates of a second word segmentation module to produce a segmented text corpus. The segmented version is then labelled with POS tags using the Viterbi Training procedure for POS tagging.</Paragraph>
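Both topologies end with the Viterbi Training procedure for POS tags described in Section 6. As a rough illustration of its decoding step, the sketch below performs a Viterbi search over tag sequences maximizing the product of the lexical probabilities P(w_i | t_i) and the transition probabilities P(t_i | t_{i-1}). The probability tables, the candidate-tag lists, and the sentence-initial pseudo tag are illustrative assumptions, not the paper's exact data structures.

```python
import math

def tag_sentence(words, candidate_tags, trans_p, emit_p):
    """Viterbi search for the tag sequence maximizing
    prod_i P(w_i | t_i) * P(t_i | t_{i-1}).  `candidate_tags[w]` lists the N
    most frequent tags for w (N = 10 in the paper's experiments);
    `trans_p[(t_prev, t)]` and `emit_p[(w, t)]` are probability tables
    estimated from the seed corpus plus the previous iteration's best tagging."""
    BOS = "<s>"                                  # sentence-initial pseudo tag
    trellis = [{BOS: (0.0, None)}]               # tag -> (log score, back tag)
    for w in words:
        column = {}
        for t in candidate_tags[w]:
            # Best predecessor tag for t, with a small floor for unseen events.
            column[t] = max(
                ((s + math.log(trans_p.get((pt, t), 1e-12))
                    + math.log(emit_p.get((w, t), 1e-12)), pt)
                 for pt, (s, _) in trellis[-1].items()),
                key=lambda item: item[0])
        trellis.append(column)
    # Trace back the highest-scoring path.
    tag, (score, prev) = max(trellis[-1].items(), key=lambda kv: kv[1][0])
    tags = [tag]
    for column in reversed(trellis[1:-1]):
        tags.append(prev)
        prev = column[prev][1]
    return list(reversed(tags))
```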
<Paragraph position="2"> 8. Experiment Environments In our experiments, the untagged Chinese text corpus contains 311,591 sentences (about 1,670,000 words, 9 Mbytes). Its major domain is news articles and reports from the China Times daily news. There are 246,036 distinct n-grams in this corpus, including 3,994 1-grams, 99,407 2-grams, 99,211 3-grams and 43,424 4-grams. Since most Chinese words are not longer than 4 characters, only 1-, 2-, 3- and 4-grams are in the word candidate list.</Paragraph> <Paragraph position="3"> A seed corpus of 9,676 sentences (127,052 words, about 415 Kbytes) from the computer domain is available. A smaller seed of 1,000 sentences is uniformly sampled from the above corpus. This small seed corpus contains 12,849 words (about 42 Kbytes). The numbers of n-grams for n=1, 2, 3, 4 are 893, 7,782, 12,289 and 12,989, respectively. Among these n-grams, only 1,275 bigrams, 317 trigrams and 40 4-grams are registered as words in a dictionary.</Paragraph> <Paragraph position="4"> Note that, since the numbers of word n-grams for n=3 and 4 are very small, the parameters (and performances) estimated from such n-grams will involve large estimation errors. Hence, the estimated performance will be very unreliable. For this reason, the conclusions will be drawn from the 2-gram performances; the performances for 3-grams and 4-grams will be listed for reference only.</Paragraph> </Section> </Section> </Paper>