<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0305"> <Title>HMM-based Part-of-Speech Tagging for Chinese Corpora</Title> <Section position="4" start_page="0" end_page="42" type="metho"> <SectionTitle> 2 The HMM-based Part-of-Speech Tagger </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="40" type="sub_section">
<Paragraph position="0"> Kupiec \[4\] describes an HMM-based tagging system which can be trained with a corpus of untagged text.</Paragraph>
<Paragraph position="1"> There are two new features in Kupiec's tagger: (1) word equivalence classes and (2) predefined networks. Words with the same set of parts-of-speech are defined as an equivalence class. For example, &quot;type&quot; and &quot;store&quot; belong to the equivalence class noun-or-verb. This not only reduces the number of parameters effectively but also makes the tagging system robust. The first-order model is extended with predefined networks based on error analysis and linguistic considerations. Their experimental results show that the predefined networks reduced the overall error rate by only 0.2%. Thus, we adopt the concept of equivalence classes but consider that predefined networks are not worthwhile.</Paragraph>
<Paragraph position="2"> Let us briefly review the formulation of HMM for part-of-speech tagging. A first-order HMM of N states and M possible observations has three sets of parameters: the state transition probability distribution A (N by N), the observation probability distribution B (N by M), and the initial state distribution P (N). For an observation sequence O of length T, there are algorithms, e.g., Viterbi's, to uncover the hidden state sequence. For tagging, N is the number of parts-of-speech in the language, and M can be the number of words or the number of equivalence classes (as Kupiec defined). In Chinese, the number of words is more than 100,000 while the number of equivalence classes is less than 1,000, so the use of equivalence classes reduces the size of B by a factor of 100.</Paragraph>
<Paragraph position="3"> The problem of tagging is: given a word sequence (observations), find the correct part-of-speech sequence (states).</Paragraph>
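As a concrete illustration of this decoding step, here is a minimal sketch of Viterbi decoding for the tagger just formulated, computed in the log domain to avoid numerical underflow on long clauses. Only the shapes of A, B, and P follow the formulation above; the NumPy representation and the function signature are our assumptions, not the paper's implementation (which is in C).

```python
import numpy as np

def viterbi(obs, A, B, P):
    """Find the most probable state (tag) sequence for a sequence of
    observation indices (EQC-ids), given HMM parameters.

    A: (N, N) state transition probabilities
    B: (N, M) observation probabilities per state
    P: (N,)   initial state distribution
    """
    N = A.shape[0]
    T = len(obs)
    with np.errstate(divide="ignore"):          # log(0) -> -inf is fine here
        logA, logB, logP = np.log(A), np.log(B), np.log(P)
    # delta[t, j]: log-probability of the best path ending in state j at time t
    delta = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = logP + logB[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + logA[:, j]
            backptr[t, j] = np.argmax(scores)
            delta[t, j] = scores[backptr[t, j]] + logB[j, obs[t]]
    # Trace back the best state sequence from the last time step.
    states = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        states.append(int(backptr[t, states[-1]]))
    return states[::-1]
```

The decoder runs in O(TN^2) time, which is what makes first-order decoding with a tag set of 57 tags (46 regular plus 11 special, as described next) practical on the hardware of the time.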
</Section> <Section position="2" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 2.1 The Part-of-Speech Tag Set </SectionTitle>
<Paragraph position="0"> The tag set contains 46 regular tags plus 11 special tags. Regular tags include A0 (adjective), C0-C1 (conjunctions), D0-D2 (pronouns), I0 (interjection), ..., and Z0-Z2 (adverbs). Special tags are for punctuation (PAR, SEN, PCT, DUN, COM, SEM, COL), unknown words (UNK), foreign words (ABC), and composed numbers (NUM, ARA). The set is simplified and reorganized from the classification of the Chinese Knowledge Information Processing Group (CKIP), Academia Sinica, Taipei. The original CKIP classification is a five-level system, too complicated even for humans to use. A three-level tag set, TUCWS, of 120 tags was designed in \[12\] for Chinese word segmentation. However, that corpus was tagged by hand without an automatic tagger, so it is difficult to decide whether the set is good for automatic tagging. Other Chinese tag sets can be found in the literature: 33 tags in Su \[11\], 30 tags in Lee and Chang Chien \[5\], and 34 tags in Lee et al. \[6\]. These three tag sets are of two origins, CKIP \[5\] and NTHU \[6, 11\]. The numbers of tags in them are considered too small.</Paragraph>
</Section> <Section position="3" start_page="40" end_page="41" type="sub_section"> <SectionTitle> 2.2 Corpus Preparation </SectionTitle>
<Paragraph position="0"> The 1991 United Daily corpus contains more than 10 million Chinese characters, about twenty days of news articles published by United Informatics, Inc. during January through March 1991. Basically, it is a collection of articles in the form of raw text (i.e., a character stream). Thus, we have to segment the character stream into a word stream before it can be used for training or testing the model. The corpus preparation process consists of the following steps:</Paragraph>
<Paragraph position="1"> Preprocessing Clean up inappropriate parts, such as titles, parenthesized texts, reporter information, figures, etc., in the input article. Articles mostly composed of inappropriate parts are deleted.</Paragraph>
<Paragraph position="2"> Clause identification Divide the article into clauses delimited by clause-ending punctuation such as periods, commas, and question marks.</Paragraph>
<Paragraph position="3"> Automatic word segmentation Segment the characters in a clause into words using a dictionary-based, Viterbi-decoding word identification system.</Paragraph>
<Paragraph position="4"> Manual correction (optional) Check the segmented text to correct segmentation errors due to unregistered words or inaccuracy of the segmentation algorithm. This step is optional but helpful, especially for training.</Paragraph>
<Paragraph position="5"> Equivalence class look-up Words in the clause are then converted to identifiers of equivalence classes (EQC-ids) via dictionary look-up.</Paragraph>
<Paragraph position="6"> After the above steps, an article is converted into a series of sequences of EQC-ids.</Paragraph>
<Paragraph position="7"> Manual tagging of the whole corpus would take several man-years. However, a tagged corpus is necessary for evaluation of the model and helpful for initialization of the HMM parameters, as Merialdo \[8\] pointed out. Thus, we also tag part of the corpus by the steps below: (1) train the HMM using the articles to be tagged; (2) tag the articles using the trained HMM; (3) correct the erroneous tags by hand.</Paragraph>
</Section> <Section position="4" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.3 Training the Model </SectionTitle>
<Paragraph position="0"> The untagged corpus of EQC-ids is then used to train the HMM for tagging, using the Baum-Welch reestimation procedure with multiple observation sequences \[9\]. Before training, the model parameters A, B, and P can be initialized with a tagged corpus, as sketched below.</Paragraph>
<Paragraph position="1"> A The tag bigrams in the tagged corpus are counted to initialize A, the state transition matrix. All counts are incremented by one and then normalized.</Paragraph>
<Paragraph position="2"> B The EQC-id to tag correspondences are counted to set up B, the observation matrix. All possible states for an EQC are then incremented by one.</Paragraph>
<Paragraph position="3"> P The initial state matrix P is initialized by counting the tags of the first words of the clauses. All counts are incremented by one and then normalized.</Paragraph>
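The following sketch illustrates this add-one initialization under an assumed corpus representation (each clause a list of (EQC-id, tag-id) pairs) and a dictionary-derived boolean mask of the possible tags for each equivalence class; none of these names come from the paper.

```python
import numpy as np

def initialize(tagged_clauses, possible, N, M):
    """Initialize HMM parameters A (N x N), B (N x M), P (N) from a tagged corpus.

    tagged_clauses: list of clauses, each a list of (eqc_id, tag_id) pairs
    possible:       boolean (N, M) mask; possible[j, k] is True iff tag j is
                    a possible tag for equivalence class k (from the dictionary)
    """
    A = np.ones((N, N))       # tag bigram counts, add-one smoothed
    P = np.ones(N)            # clause-initial tag counts, add-one smoothed
    B = np.zeros((N, M))      # EQC-to-tag correspondence counts
    for clause in tagged_clauses:
        if not clause:
            continue
        P[clause[0][1]] += 1
        for eqc, tag in clause:
            B[tag, eqc] += 1
        for (_, t1), (_, t2) in zip(clause, clause[1:]):
            A[t1, t2] += 1
    B[possible] += 1          # add one only to the possible tags of each EQC
    # Normalize: each row of A and B, and P itself, sums to one.
    A /= A.sum(axis=1, keepdims=True)
    row_sum = B.sum(axis=1, keepdims=True)
    B = np.divide(B, row_sum, out=np.zeros_like(B), where=row_sum > 0)
    P /= P.sum()
    return A, B, P
```

Note that, following the description above, only A and P receive an unconditional add-one, while B is incremented only at entries the dictionary allows, so impossible tag-class pairs keep probability zero.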
<Paragraph position="4"> After training, the model parameters have been adjusted so as to best predict the tag sequences of the training data.</Paragraph>
</Section> <Section position="5" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.4 Automatic Tagging </SectionTitle>
<Paragraph position="0"> Given the trained model parameters, we can automatically tag unseen text with an HMM decoding algorithm such as Viterbi's. For a given clause, the tagging process is:</Paragraph>
<Paragraph position="1"> Automatic word segmentation Segment the characters in the clause into words using the above-mentioned word identification system.</Paragraph>
<Paragraph position="2"> Equivalence class look-up Words in the clause are then converted to EQC-ids via dictionary look-up.</Paragraph>
<Paragraph position="3"> Viterbi decoding The sequence of EQC-ids, as observations, is then fed to the Viterbi decoder in order to find the most probable hidden state sequence, namely, the tag sequence.</Paragraph>
<Paragraph position="4"> Pattern-driven tag correction First-order models are not enough to describe the local constraints needed for predicting part-of-speech tags. Higher-order models have many more parameters to estimate and need far more training data and resources (memory, CPU time). Kupiec \[4\] proposed using networks to model higher-order context based on error analysis and linguistic considerations. However, using networks is inelegant and had only very limited success. We instead use a simple pattern-driven tag corrector to post-process the tag output: the EQC-id sequence is matched against predefined patterns; when a match is found, the corresponding tag corrections are made. These patterns are designed according to an analysis of error patterns; a sketch of the mechanism follows.</Paragraph>
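A minimal sketch of such a corrector is given below. The paper does not list its patterns, so the rules shown are hypothetical placeholders that only illustrate the mechanism: match a window of EQC-ids against the clause, then overwrite the decoded tags for that window.

```python
# A minimal pattern-driven tag corrector. Each rule maps a window of
# EQC-ids to the corrected tag sequence for that window. The rules below
# are hypothetical placeholders, not the paper's actual patterns.
RULES = [
    # (EQC-id pattern, corrected tags)
    ((17, 42), ("P0", "N0")),
    ((17, 42, 8), ("P0", "N0", "V0")),
]

def correct_tags(eqc_ids, tags):
    """Post-process Viterbi output: wherever a rule's EQC-id pattern
    matches the clause, overwrite the decoded tags with the corrections."""
    tags = list(tags)
    for pattern, fixes in RULES:
        n = len(pattern)
        for i in range(len(eqc_ids) - n + 1):
            if tuple(eqc_ids[i:i + n]) == pattern:
                tags[i:i + n] = fixes
    return tags
```

Rules are applied in order, so a later rule may refine the output of an earlier one; this keeps the corrector a simple linear scan over the clause.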
</Section> <Section position="6" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 2.5 The Dictionary </SectionTitle>
<Paragraph position="0"> The general dictionary has some 80,000 lexical entries, each of which contains the Chinese characters of a word and its EQC-id. The original dictionary is a collaborative work of CCL/ITRI and Academia Sinica, Taipei: ITRI collected the words, their pronunciations, and word frequencies, while Academia Sinica provided syntactic and semantic markers. For our purpose, only the words and their syntactic information (parts-of-speech) are useful. As mentioned, we restructured the general dictionary based on our newly designed compact tag set. For purposes of comparison, we also constructed a closed dictionary which collects exactly the words, and their tags, occurring in the training and testing corpora.</Paragraph>
</Section> <Section position="7" start_page="41" end_page="42" type="sub_section"> <SectionTitle> 2.6 An Example </SectionTitle>
<Paragraph position="0"> In the following, we use a real-world example to illustrate the tagging process.</Paragraph>
<Paragraph position="1"> A tagged corpus, called corpus1, was prepared through the steps described in the subsection Corpus Preparation. The corpus is composed of 1,418 clauses, or 12,284 word tokens. A larger corpus, called corpus3, contains 3,784 clauses; corpus3 is segmented but untagged, useful only for training.</Paragraph>
<Paragraph position="2"> There are in total 338 word equivalence classes: each of the 100 most frequently used ambiguous words is assigned a unique EQC-id, and the remaining 238 EQC-ids are assigned to sets of words with the same set of possible tags, as sketched below.</Paragraph>
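A sketch of this class construction, under an assumed dictionary format (each word mapped to its frozenset of possible tags), might look as follows; the function and variable names are ours, and `top_ambiguous` stands for the 100 most frequent ambiguous words.

```python
def build_equivalence_classes(dictionary, top_ambiguous):
    """Assign an EQC-id to every word in the dictionary.

    dictionary:    maps each word to its frozenset of possible tags
    top_ambiguous: the 100 most frequent ambiguous words, each of which
                   receives its own unique EQC-id
    Returns a word -> EQC-id mapping.
    """
    eqc_of = {}
    next_id = 0
    for word in top_ambiguous:        # one class per frequent ambiguous word
        eqc_of[word] = next_id
        next_id += 1
    class_of_tagset = {}              # remaining words share a class per tag set
    for word, tags in dictionary.items():
        if word in eqc_of:
            continue
        if tags not in class_of_tagset:
            class_of_tagset[tags] = next_id
            next_id += 1
        eqc_of[word] = class_of_tagset[tags]
    return eqc_of
```

Because the remaining classes are keyed by the tag set itself, an 80,000-word dictionary collapses into a few hundred observation symbols, which is exactly the reduction of B described in Section 2.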
</Section> </Section> <Section position="5" start_page="42" end_page="43" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle>
<Paragraph position="0"> The whole tagging system, including the word segmentation module, the equivalence class mapper, the HMM trainer, and the Viterbi decoder, is implemented in C on a Sun SPARCstation.</Paragraph>
<Section position="1" start_page="42" end_page="42" type="sub_section"> <SectionTitle> 3.1 Inside Test, Uniformly Initialized, General Dictionary </SectionTitle>
<Paragraph position="0"> In this test, the general dictionary was used and the model parameters were uniformly initialized, i.e., the tags in the corpus were not used to initialize the parameters.</Paragraph>
<Paragraph position="1"> The accuracy rate for all words is 86.37% (1,674 errors out of 12,284 words). Excluding unknown words (words not in the dictionary), the accuracy rate is 93.16% (779 errors). In other words, approximately half of the errors can be attributed to unknown words. If we only consider ambiguous (multi-POS) words, the accuracy is 80.26% (771 errors). We can also observe that only about 35% of the words are ambiguous. (The difference between the latter two error counts is due to special usage of some registered words; e.g., the word glossed 'everyday' is Z0 (adverb) in the dictionary but is used as part of a company name (N2) in 'Everyday Department Store'.)</Paragraph>
</Section> <Section position="2" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.2 Inside Test, Initialized with Tagged Text, General Dictionary </SectionTitle>
<Paragraph position="0"> Tagged texts are useful for initializing the model parameters before training. Table 2 shows that the accuracy for ambiguous words was improved by about three percent (from 80.26% to 83.21%).</Paragraph>
</Section> <Section position="3" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.3 Inside Test, Closed Dictionary </SectionTitle>
<Paragraph position="0"> All words and their used tags in corpus1 are collected to form an ideal dictionary, the so-called closed dictionary, for tagging the corpus. The HMM-based tagger is able to correctly tag 96.83% of all words, or 84.00% of ambiguous words (Table 3). The accuracy rate is comparable to that of Kupiec's HMM-based English tagger on the well-known Brown corpus.</Paragraph>
</Section> <Section position="4" start_page="42" end_page="43" type="sub_section"> <SectionTitle> 3.4 Outside Test, General Dictionary </SectionTitle>
<Paragraph position="0"> For the outside tests, the corpus is divided into two parts: one for training, the other for testing. The first two columns (Train and Test) are the numbers of clauses (not words) used for training and testing, respectively. The accuracy rates are not as good as those for inside tests: degraded by about 2 percent for known words and by 5 percent for ambiguous words. In general, the system is able to tag approximately 80 percent of ambiguous words correctly.</Paragraph>
<Paragraph position="1"> In the last row, corpus3 (3,784 clauses, 35,849 words, translated AP news) was used for training while corpus1 (1,418 clauses, 12,284 words, domestic news) was used for testing. Due to the difference in text type, accuracy rates are degraded by about 3 percent for ambiguous words. However, the system is still able to assign correct tags to 91.83 percent of all words. This shows the robustness of the model, due to the concept of equivalence classes.</Paragraph>
</Section> </Section> <Section position="6" start_page="43" end_page="43" type="metho"> <SectionTitle> 3.5 Outside Test, Closed Dictionary </SectionTitle>
<Paragraph position="0"> The outside tests were repeated with the closed dictionary. Approximately 96% of all words and 80% of ambiguous words are tagged correctly.</Paragraph>
</Section> <Section position="7" start_page="43" end_page="45" type="metho"> <SectionTitle> 4 Error Analysis </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="43" end_page="45" type="sub_section"> <SectionTitle> 4.1 Confusion Matrix </SectionTitle>
<Paragraph position="0"> Table 6 shows part of the confusion matrix for the test described in subsection 3.2; only the frequently confused parts-of-speech are shown.</Paragraph>
<Paragraph position="1"> The ANVZ problem: Due to the lack of inflections in Chinese, a Chinese word can have many different parts-of-speech, yet only one form. It is sometimes very difficult even for humans to identify the correct tag. For example, Chinese does not have the -ing ending for nominalization of verbs, -ly for adverbs, -tion for verbal nouns, or -en for past participles. Thus, a single word form (e.g., one glossed 'distribute') can be a verb (V0) 'distribute', a noun (N0) 'distribution', an adjective (A0) 'distributive', 'distributing', or 'distributed', and an adverb (Z0) 'distributively' in different contexts. Nouns and verbs are especially hard to distinguish. That is why the V0-N0 (180) and N0-V0 (47) confusions are common.</Paragraph>
<Paragraph position="2"> The RP problem: Open classes, such as nouns and verbs, have large populations, while closed classes, such as prepositions and particles, have small populations. In general, this is not a problem for tagging. However, in our tag set, R5 (aspect prefix) has only three members, the first two of which are also common prepositions (P0). From the experiments, we observed that although the first of them is a preposition in most instances, it is always tagged as R5 (aspect). After studying the trained model parameters A, B, and P, we found (Figure 1) that R5 was assigned large probabilities in the B matrix (0.683 and 0.227 for its first two members), since R5 has only three words, while P0 was assigned much smaller probabilities (because of the probabilistic nature of the model, the observation probabilities for a state, such as P0 or R5, must sum to one; the constraint is written out below). In addition, R5 and P0 show no significant difference in the incoming or outgoing entries of the A matrix because of the nature of unsupervised learning: all instances of such a word are considered possible candidates for R5. We consider this a weakness of HMM-based tagging.</Paragraph>
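For clarity, the normalization constraint invoked above can be written out explicitly (the notation $b_j(k)$ for the entries of the B matrix is standard HMM notation rather than the paper's own):

$$\sum_{k=1}^{M} b_j(k) = 1 \qquad \text{for every state } j.$$

Because R5 can emit only three equivalence classes, its entire probability mass is split among them (hence entries as large as 0.683 and 0.227), whereas P0 must spread the same unit mass over a great many classes, so each individual entry is small and the decoder is biased toward R5.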
</Section> <Section position="2" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 4.2 Error Patterns </SectionTitle>
<Paragraph position="0"> Tagging errors usually occur in clusters; that is, an error may cause further mistagging of its neighbors if they are also ambiguous. Common patterns of mistagging include V0-V0 (as N0-N0), Z0-V0 (as A0-N0), V0-N0 (as C1-Z2), V0-P0 (as N0-R5), P0-N0 (as R5-V0), P0-N1 (as R5-V4), and N0-V0-N0 (as U1-C1-Z2). They can be classified into three types:</Paragraph>
<Paragraph position="1"> ANVZ type These error patterns are due to the above-mentioned ANVZ problem. This type of error is to be expected.</Paragraph>
<Paragraph position="2"> RP type The error patterns involving R5 are due to the RP problem. This type of error should be eliminated by model improvement or postprocessing.</Paragraph>
<Paragraph position="3"> Idiomatic type Some idiomatic expressions are composed of highly ambiguous words. For example, in one such three-word expression, all three words, with possible tag sets (C1 N3 P0), (C1 P0 V0), and (A0 N0 Z2) respectively, are 3-way ambiguous. That is why the V0-N0 sequence is frequently mistagged as C1-Z2.</Paragraph>
<Paragraph position="4"> If we also consider the mistagging of unknown words, even longer clusters of tagging errors appear. In fact, an unknown word not only causes mistagging of the word itself but also affects the tagging of its neighbors.</Paragraph>
</Section> <Section position="3" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 4.3 Without Equivalence Classes (Closed Dictionary Only) </SectionTitle>
<Paragraph position="0"> To verify the feasibility of the concept of equivalence classes, we implemented a version of the HMM tagger that treats each word as a unique observation (without EQCs). Table 7 compares the results of inside/outside tests on the closed dictionary. To our surprise, the concept of equivalence classes not only has the advantages of saving space and time and making the tagger robust, but also achieves higher tagging accuracy, especially in the case of outside tests. This might be due to insufficient training data for the much larger number of parameters to estimate. Nevertheless, it also indicates that the concept is valid and useful.</Paragraph>
</Section> </Section> </Paper>