File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-0316_intro.xml
Size: 8,343 bytes
Last Modified: 2025-10-06 14:01:53
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0316"> <Title>POS-Tagger for English-Vietnamese Bilingual Corpus</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> POS-tagging is the task of assigning to each word of a text the proper POS tag in its context of appearance. Although each word can be classified under various POS tags, in a given context it can be assigned only one definite POS. As an example, in this sentence: &quot;I can can a can&quot;, the POS-tagger must be able to perform the following: &quot;I In order to perform POS-tagging, various methods can be used, such as Hidden Markov Models (HMM), memory-based models (Daelemans, 1996), Transformation-Based Learning (TBL) (Brill, 1995), Maximum Entropy, decision trees (Schmid, 1994a), neural networks (Schmid, 1994b), and so on. Among these, methods based on machine learning in general, and TBL in particular, have proven effective and are currently very popular.</Paragraph> <Paragraph position="1"> To achieve good results, the abovementioned methods must be supplied with accurately annotated training corpora. Such training corpora are available for popular languages such as English and French (e.g. Penn Tree Bank, SUSANNE, etc.). Unfortunately, so far, no such annotated training data has been available for Vietnamese POS-taggers. Furthermore, building manually annotated training data is very expensive (for example, over 1 million dollars and many person-years were invested in the Penn Tree Bank). To overcome this drawback, this paper presents a solution for indirectly building such an annotated training corpus for Vietnamese by taking advantage of an available English-Vietnamese bilingual corpus named EVC (Dinh Dien, 2001b). This EVC has been automatically word-aligned (Dinh Dien et al., 2002a). 
Our approach in this work is to use a bootstrapped POS tagger for English to annotate the English side of a word-aligned parallel corpus, then directly project the tag annotations to the second language (Vietnamese) via existing word-alignments (Yarowsky and Ngai, 2001). In this work, we made use of the TBL method and the SUSANNE training corpus to train our English POS-tagger. The remainder of this paper is organized as follows: POS-Tagging by TBL method: an introduction to the original TBL, the improved fTBL, and the traditional English POS-tagger by TBL.</Paragraph> <Paragraph position="2"> English-Vietnamese bilingual Corpus (EVC): resources of EVC and word-alignment of EVC.</Paragraph> <Paragraph position="3"> Bootstrapping English-POS-Tagger: bootstrapping the English POS-tagger by the POS tags of the corresponding Vietnamese words, and its evaluation. Projecting English POS-tag annotations to the Vietnamese side, and its evaluation.</Paragraph> <Paragraph position="4"> Conclusion: conclusions, limitations and future developments.</Paragraph> <Paragraph position="5"> 2 POS-Tagging by TBL method Transformation-Based Learning (TBL) was proposed by Eric Brill in 1993 in his doctoral dissertation (Brill, 1993) on the foundation of the structural linguistics of Z.S. Harris. TBL has been applied with success to various natural language processing tasks (mainly classification tasks). In 2001, Radu Florian and Grace Ngai proposed fast Transformation-Based Learning (fTBL) (Florian and Ngai, 2001a) to improve the learning speed of TBL without affecting the accuracy of the original algorithm.</Paragraph> <Paragraph position="6"> The central idea of TBL is to start with some simple (or sophisticated) solution to the problem (called baseline tagging), and step by step apply optimal transformation rules (which are extracted from an annotated training corpus at each step) to improve the solution (changing incorrect tags into correct ones). 
The algorithm stops when no more optimal transformation rules can be selected or the data is exhausted. The optimal transformation rule is the one which results in the largest benefit (it repairs as many incorrect tags into correct tags as possible).</Paragraph> <Paragraph position="7"> A striking feature of TBL in comparison with other learning methods is that it is perceptible and symbolic: linguists are able to observe and intervene in all of the learning and implementation processes, as well as in the intermediate and final results. Besides, TBL can inherit the tagging results of another system (treated as the baseline or initial tagging) and correct those results using the transformation rules learned during the training period.</Paragraph> <Paragraph position="8"> TBL operates according to transformation rules in order to change wrong tags into right ones. All these rules follow templates specified by humans. In these templates, we need to specify the factors affecting the tagging. In order to evaluate the optimal transformation rules, TBL needs an annotated training corpus (a corpus to which the correct tags have been attached, usually referred to as the golden corpus) so that it can compare the result of the current tagging to the correct tags in the training corpus. In the executing period, these optimal rules are used for tagging new corpora (in the order in which they were learned), and these new corpora must also be assigned baseline tags in the same way as in the training period. These linguistic annotation tags can be morphological ones (sentence boundary, word boundary), POS tags, syntactic tags (phrase chunker), sense tags, grammatical relation tags, etc.</Paragraph> <Paragraph position="9"> POS-tagging was the first application of TBL and remains the most popular one, and it has been extended to various languages (e.g. Korean, Spanish, German, etc.) 
(Curran, 1999).</Paragraph> <Paragraph position="10"> The approach of the TBL POS-tagger is simple but effective, and it reaches an accuracy competitive with other powerful POS-taggers. The TBL algorithm for POS-tagging can be briefly described in two periods as follows: * The training period: Starting with the annotated training corpus (also called the golden corpus, which has been assigned correct POS tag annotations), TBL copies this golden corpus into a new unannotated corpus (called the current corpus, from which the POS tag annotations have been removed).</Paragraph> <Paragraph position="11"> TBL assigns an initial POS-tag to each word in the corpus. This initial tag is the most likely tag for the word if the word is known, and is guessed based upon properties of the word if the word is not known.</Paragraph> <Paragraph position="12"> TBL applies each instance of each candidate rule (following the format of templates designed by human beings) to the current corpus. These rules change the POS tags of words based upon the contexts they appear in. TBL evaluates the result of applying each candidate rule by comparing the resulting POS-tag annotations with those of the golden corpus, in order to choose the best rule, i.e. the one with the highest score. The best rules are repeatedly extracted until there is no more optimal rule (i.e. no rule whose score is higher than a preset threshold). These optimal rules form an ordered sequence.</Paragraph> <Paragraph position="13"> * The executing period: Starting with the new unannotated text, TBL assigns an initial POS-tag to each word in the text in a way similar to that of the training period.</Paragraph> <Paragraph position="14"> The sequence of optimal rules (extracted in the training period) is applied, changing the POS tag annotations based upon the contexts they appear in. 
These rules are applied deterministically in the order they appear in the sequence.</Paragraph> <Paragraph position="15"> In addition to the above-mentioned TBL algorithm, which is applied in the supervised POS-tagger, Brill (1997) also presented an unsupervised POS-tagger that is trained on unannotated corpora. The accuracy of the unsupervised POS-tagger was reported to be lower than that of the supervised POS-tagger.</Paragraph> <Paragraph position="16"> Because the goal of our work is to build POS-tag annotated training data for Vietnamese, we need an annotated corpus with the highest possible accuracy. So, we will concentrate on the supervised POS-tagger only. For full details of TBL and fTBL, please refer to Eric Brill (1993, 1995) and Radu Florian and Grace Ngai (2001a).</Paragraph> </Section> </Paper>