<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0316"> <Title>POS-Tagger for English-Vietnamese Bilingual Corpus</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 English - Vietnamese Bilingual Corpus </SectionTitle> <Paragraph position="0"> The bilingual corpus to be POS-tagged in this paper is named EVC (English - Vietnamese Corpus).</Paragraph> <Paragraph position="1"> This corpus was collected from many different sources of bilingual text (books, dictionaries, corpora, etc.) in selected fields such as science, technology and daily conversation (see table 1). After collection, the parallel corpus was normalized with respect to form (text only), tone marks (diacritics), Vietnamese character code (TCVN3), character font (VN-Times), etc. Next, the corpus was sentence-aligned and spell-checked semi-automatically. An example of unannotated EVC is the following: *D02:01323: Jet planes fly about nine miles high.</Paragraph> <Paragraph position="2"> +D02:01323: Các phi cơ phản lực bay cao khoảng chín dặm.</Paragraph> <Paragraph position="3"> The codes at the beginning of each line identify the corresponding sentence in the EVC corpus. For full details of how the EVC corpus was built (collection, normalization, sentence alignment, spell checking, etc.), please refer to Dinh Dien (2001b).</Paragraph> <Paragraph position="4"> Next, this bilingual corpus was automatically word-aligned by a hybrid model combining the semantic class-based model with the GIZA++ model.</Paragraph> <Paragraph position="5"> An example of the word-alignment result is shown in figure 1 below. The word-alignment accuracy on this parallel corpus is reported as approximately 87% in (Dinh Dien et al., 2002b). 
For full details of the word alignment of the EVC corpus (precision, recall, coverage, etc.), please refer to (Dinh Dien et al., 2002a).</Paragraph> <Paragraph position="6"> The word-aligned parallel corpus has been used in various Vietnamese NLP tasks, such as training the Vietnamese word segmenter (Dinh Dien et al., 2001a), word sense disambiguation (Dinh Dien, 2002b), etc.</Paragraph> <Paragraph position="7"> Notably, the EVC includes the SUSANNE corpus (Sampson, 1995), a golden corpus that has been manually annotated with English linguistic annotations such as lemmas, POS tags, chunking tags, syntactic trees, etc. This English corpus was translated into Vietnamese by English teachers of the Foreign Language Department of Vietnam University of HCM City. In this paper, we use this valuable annotated corpus as the training corpus for our bootstrapped English POS-tagger.</Paragraph> <Paragraph position="8"> No. | Source | Sentence pairs | English words | Vietnamese words | English words per sentence | % of EVC
1. | Computer books | 9,475 | 165,042 | 239,984 | 17.42 | 7.67
2. | LLOCE dictionary | 33,078 | 312,655 | 410,760 | 9.45 | 14.53
3. | EV bilingual dictionaries | 174,906 | 1,110,003 | 1,460,010 | 6.35 | 51.58
4. | SUSANNE corpus | 6,269 | 131,500 | 181,781 | 20.98 | 6.11
5. | Electronics books | 12,120 | 226,953 | 297,920 | 18.73 | 10.55
6. | Children's Encyclopedia | 4,953 | 79,927 | 101,023 | 16.14 | 3.71
7. | Other books | 9,210 | 126,060 | 160,585 | 13.69 | 5.86
[Figure 1: word alignment between &quot;Jet planes fly about nine miles high&quot; and &quot;Các phi cơ phản lực bay cao khoảng chín dặm&quot;]</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Our Bootstrapped English POS-Tagger </SectionTitle> <Paragraph position="0"> Existing POS-taggers for monolingual English are well developed and achieve satisfactory results, and it is very difficult (nearly impossible for us) to improve on them. 
Those advanced POS-taggers have already exploited virtually all the linguistic information available in English text, leaving us no way to improve English POS-tagging on monolingual English text alone. In bilingual text, by contrast, we can use the linguistic information of the second language to improve the POS-tag annotations of the first.</Paragraph> <Paragraph position="1"> Our solution is motivated by I. Dagan, I. Alon and S. Ulrike (1991) and W. Gale, K. Church and D. Yarowsky (1992), who proposed using bilingual corpora to avoid hand-tagging of training data. Their premise is that &quot;different senses of a given word often translate differently in another language (for example, pen in English is stylo in French for its writing implement sense, and enclos for its enclosure sense). By using a parallel aligned corpus, the translation of each occurrence of a word such as pen can be used to automatically determine its sense&quot;. This observation holds not only for word senses but also for POS-tags, and it applies even more strongly to typologically different languages such as English and Vietnamese.</Paragraph> <Paragraph position="2"> In fact, the POS-tags of English words and of Vietnamese words are both often ambiguous, but the ambiguities rarely coincide (table 4). For example, &quot;can&quot; in English may be &quot;Aux&quot; in its ability sense, &quot;V&quot; in its sense of putting something into a container, and &quot;N&quot; in its container sense, and hardly any existing POS-tagger tags &quot;can&quot; correctly in all contexts. Nevertheless, if &quot;can&quot; is already word-aligned with its corresponding Vietnamese word, it can easily be POS-disambiguated using the POS-tags of that Vietnamese word. 
For example, if &quot;can&quot; is aligned with &quot;có thể&quot;, it must be an auxiliary; if it is aligned with &quot;đóng hộp&quot;, it must be a verb; and if it is aligned with &quot;cái hộp&quot;, it must be a noun. However, not all Vietnamese POS-tag information is useful and deterministic. The key question is when and how to use the Vietnamese POS-tag information. Our answer is to train the English POS-tagger with the TBL method (section 2) on the SUSANNE training corpus (section 3). After training, we extract an ordered sequence of optimal transformation rules. We use these rules to improve an existing English POS-tagger (the baseline tagger) when tagging the English side of the word-aligned EVC corpus. The English POS-tagging result is then projected to the Vietnamese side via the word alignments to form a new Vietnamese training corpus annotated with POS-tags.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The English POS-Tagger by TBL method </SectionTitle> <Paragraph position="0"> To make the presentation clearer, we re-use the notation from the introduction to the fnTBL-toolkit by Radu Florian and Grace Ngai (2001b), as follows: * $\chi$ : the space of samples, i.e. the set of words that need POS-tagging. In English, word boundaries are simple to recognize, but in Vietnamese (an isolating language) this is rather complicated.</Paragraph> <Paragraph position="1"> Therefore, it is presented in another work (Dinh Dien, 2001a).</Paragraph> <Paragraph position="2"> * C : the set of possible POS classifications c (the tagset). 
For example: noun (N), verb (V), adjective (A), ...</Paragraph> <Paragraph position="3"> For English, we use the Penn TreeBank tagset; for Vietnamese, we use the POS-tagset mapping table (see appendix A).</Paragraph> <Paragraph position="4"> * $S = \chi \times C$ : the space of states, i.e. the cross-product of the sample space (words) and the classification space (tagset), where each point is a pair (word, tag).</Paragraph> <Paragraph position="5"> * p : a predicate defined on the space $S^{+}$, that is, on sequences of states. The predicate p follows the specified templates of transformation rules. In the POS-tagger for English, this predicate consists only of English factors that affect the POS-tagging process, for example the current word, or the word or tag at offset i from it. Positive values of i mean preceding (to the left), and negative ones mean following (to the right); i ranges over a window from -m to +n. In this English-Vietnamese bilingual POS-tagger, we add a new factor: the Vietnamese POS-tag corresponding to the current English word via its word alignment. These Vietnamese POS-tags are taken to be the most frequent tag according to the Vietnamese dictionary.</Paragraph> <Paragraph position="6"> * A rule r is defined as a pair (p, c) consisting of a predicate p and a tag c, and is written in the form $p \Rightarrow c$. The rule r = (p, c) applies to a sample x if the predicate p is satisfied on it, in which case the tag of x is changed to c.</Paragraph> <Paragraph position="7"> * Given a state s = (x, c) and a rule r = (p, c'), the state r(s) obtained by applying r to s is defined as: r(s) = s if p(s) = False, and r(s) = (x, c') if p(s) = True. * T : the set of training samples, which have been assigned the correct tags. 
Here we use the SUSANNE golden corpus (Sampson, 1995), whose POS-tagset was converted into the PTB tagset.</Paragraph> <Paragraph position="8"> * The score associated with a rule r = (p, c) is usually the difference in performance (on the training data) that results from applying the rule.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 The TBL algorithm for POS-Tagging </SectionTitle> <Paragraph position="0"> The TBL algorithm for POS-tagging can be briefly described as follows (see the flowchart in figure 2): Step 1: Baseline tagging: initialize each sample x in the SUSANNE training data with its most likely POS-tag c. For English, we use the available English tagger (and parser) of Eugene Charniak (1997) at Brown University (version 2001). For Vietnamese, the baseline is the set of possible part-of-speech tags (ordered by the probability with which each part of speech appears in the dictionary). We call the starting training data $T_0$.</Paragraph> <Paragraph position="2"> Step 2: Among the candidate transformation rules, choose the one with the highest Score(r) and apply it to the current training data $T_k$ to obtain the new corpus $T_{k+1}$.</Paragraph> <Paragraph position="4"> If no remaining transformation rule satisfies Score(r) > b, the algorithm stops. 
b is a threshold, which is preset and then adjusted for the new corpus EVC after that corpus has been POS-tagged with baseline tags in the same way as during training.</Paragraph> <Paragraph position="5"> * Convergence of the algorithm: let $e_k$ be the number of errors at step k (the differences between the tagging result after applying rule r and the correct tags in the golden corpus). We have $e_k > e_{k+1}$ for every applied rule with a positive score,</Paragraph> <Paragraph position="7"> and $e_k \in \mathbb{N}$, so the algorithm converges after a finite number of steps.</Paragraph> <Paragraph position="8"> * Complexity of the algorithm: O(n*t*c), where n is the size of the training set (number of words), t is the size of the set of possible transformation rules (number of candidate rules), and c is the size of the part of the corpus that satisfies the rule's application condition (the number of positions at which predicate p is satisfied).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Experiment and Results of Bootstrapped English POS-Tagger </SectionTitle> <Paragraph position="0"> After training, the system extracts an ordered sequence of optimal transformation rules in a fixed format. These rules are intuitive and easy for humans to understand. For example, the 2nd rule in the sequence reads as follows: &quot;if the POS-tag of the current word is VB (Verb), its word form is &quot;can&quot;, and the tag of its corresponding Vietnamese word is MD (Modal), then the POS-tag of the current word is changed to MD&quot;.</Paragraph> <Paragraph position="1"> We tested this method on the EVC corpus, with the SUSANNE corpus as training data. To evaluate the method, we held back a 6,000-word part of the training corpus (not used during training) and obtained the following POS-tagging results: [table caption fragment: ... tagger for English side in EVC]</Paragraph> <Paragraph position="2"> The English POS-tagging results improve thanks to the information from the corresponding Vietnamese POS-tags. 
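As an illustration only, the 2nd rule quoted above can be written as a short Python sketch. The names Token, Rule and apply_rule, the sample sentence, and the Vietnamese tags other than MD are hypothetical assumptions made for this example; they are not taken from the fnTBL-toolkit.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str               # English word form
    tag: str                # current English POS-tag (Penn TreeBank style)
    vn_tag: Optional[str]   # tag of the aligned Vietnamese word, None if unaligned


class Rule(NamedTuple):
    # Hypothetical rule layout: the predicate tests the current tag, the word
    # form and the aligned Vietnamese tag; to_tag is the new English tag.
    from_tag: str
    word: str
    vn_tag: str
    to_tag: str


def apply_rule(rule, tokens):
    """Return a new token list with the rule applied wherever its predicate holds."""
    out = []
    for tok in tokens:
        fires = (
            tok.tag == rule.from_tag
            and tok.word.lower() == rule.word
            and tok.vn_tag == rule.vn_tag
        )
        out.append(tok._replace(tag=rule.to_tag) if fires else tok)
    return out


# "current tag is VB, word form is can, Vietnamese tag is MD, so change to MD"
rule_2 = Rule(from_tag="VB", word="can", vn_tag="MD", to_tag="MD")

# Made-up baseline output in which "can" was wrongly tagged VB.
sentence = [
    Token("They", "PRP", "PRP"),
    Token("can", "VB", "MD"),
    Token("swim", "VB", "VB"),
]

tagged = apply_rule(rule_2, sentence)
print([(t.word, t.tag) for t in tagged])
```

Because the Vietnamese tag is part of the predicate, the rule does not fire when the aligned word gives no supporting evidence, for instance when "can" is aligned with a noun.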
If we used only the available English information, it would be very difficult to improve the output of the Brown POS-tagger. Despite the improvement, the results can hardly be said to be fully satisfactory, because not all Vietnamese POS information is effective enough to disambiguate the POS of English words (please refer to table 3). According to the statistics in table 3 below, the Vietnamese POS-tag information contributes as follows: - Cases 1, 2, 3, 4: no disambiguation of English POS-tags is needed.</Paragraph> <Paragraph position="3"> - Cases 5, 7: full disambiguation of English POS-tags (the majority).</Paragraph> <Paragraph position="4"> - Cases 6, 8, 9: partial disambiguation of English POS-tags by the TBL method.</Paragraph> <Paragraph position="6"> Case | English word | Vietnamese word | Relation between the tags | %
1. | One POS-tag only | One POS-tag only | Two POS-tags are identical | 25.2
2. | One POS-tag only | One POS-tag only | Two POS-tags are different | 1.2
3. | One POS-tag only | More than 1 POS-tag | One common POS-tag only | 5.3
4. | One POS-tag only | More than 1 POS-tag | No common POS-tag | 3.5
5. | More than 1 POS-tag | One POS-tag only | One common POS-tag only | 50.5
6. | More than 1 POS-tag | One POS-tag only | No common POS-tag | 2.8
7. | More than 1 POS-tag | More than 1 POS-tag | One common POS-tag only | 6.1
8. | More than 1 POS-tag | More than 1 POS-tag | More than 1 common POS-tag | 4.1
9. | More than 1 POS-tag | More than 1 POS-tag | No common POS-tag | 1.3
After obtaining high-precision English POS-tag annotations, we project them directly from the English side to the Vietnamese side. Our solution is motivated by similar work by David Yarowsky and Grace Ngai (2001). 
This projection is based on the available word alignments in the automatically word-aligned English-Vietnamese parallel corpus.</Paragraph> <Paragraph position="7"> Nevertheless, owing to the typological difference between English (an inflecting language) and</Paragraph> <Paragraph position="8"> Vietnamese (an isolating language), direct projection is not a simple 1-1 mapping but may be a complex m-n mapping: * Regarding grammatical meanings, English usually uses inflectional devices such as suffixes. For example: -s - plural, -ed - past, -ing - continuous, 's - possessive case, etc.</Paragraph> <Paragraph position="9"> Vietnamese, by contrast, often uses function words and word order. For example: &quot;các&quot;, &quot;những&quot; - plural, &quot;đã&quot; - past, &quot;đang&quot; - continuous, &quot;của&quot; - possessive case, etc.</Paragraph> <Paragraph position="10"> * Regarding lexicalization, some words in English must be rendered by a phrase in Vietnamese and vice versa. For example: &quot;cow&quot; and &quot;ox&quot; in English are each rendered as two words in Vietnamese, &quot;bò cái&quot; (female) and &quot;bò đực&quot; (male); and &quot;nghé&quot; in Vietnamese is rendered as the two words &quot;buffalo calf&quot; in English.</Paragraph> <Paragraph position="11"> The result of the projection is shown in table 4 below.</Paragraph> <Paragraph position="12"> In addition, the tagsets of the two languages are different. Owing to the characteristics of each language, we must use two different tagsets for POS-tagging.</Paragraph> <Paragraph position="13"> For English, we use the available Penn TreeBank POS-tagset. 
For Vietnamese, we use the POS-tagset of the standard Vietnamese dictionary of Hoang Phe (1998), together with some new tags.</Paragraph> <Paragraph position="14"> We therefore need an English-Vietnamese consensus tagset map (please refer to Appendix A).</Paragraph> <Paragraph position="15"> [table caption fragment: ... tagging in parallel corpus EVC] Regarding the evaluation of the POS-tag projection: because no POS-annotated corpus has so far been available for Vietnamese, we had to manually build a small golden corpus of approximately 1000 words for Vietnamese POS-tagging. The results of Vietnamese POS-tagging are shown in table 5 below: [table 5 caption fragment: ... from English side to Vietnamese in EVC]</Paragraph> </Section> </Section> </Paper>