<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2006"> <Title>Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> CX </SectionTitle> <Paragraph position="0"> given in the translation list (TL), as illustrated in Figure 2. These probabilities may be estimated from a large, and preferably balanced, corpus. In this work, we used statistics from the Brown and WSJ corpora combined.</Paragraph> <Paragraph position="1"> In this section, we will use the term POS tag to denote only the main part-of-speech tags (noun, verb, adjective, adverb, preposition, etc.) and not the fine-grained tags (such as Noun-Genitive-fem-plur-def).</Paragraph> <Paragraph position="2"> [Figure 2: Romanian words, their true POS, and their English translation lists: mandat N: warrant; proxy; mandate; money order; power of attorney / manechin N: model, dummy / manifesta V: arise, express itself, show / manual Adj: manual]</Paragraph> <Paragraph position="4"> The POS tags are used only for evaluation and are not available in many bilingual dictionaries.</Paragraph> <Paragraph position="5"> However, when a translation candidate is phrasal (e.g. mandat → money order), one can model the more general probability of the foreign word's part-of-speech tag given the part-of-speech sequence of the English phrasal translation, P(t_f | t_e1 ... t_ek). Because English words often have multiple parts of speech (e.g. order may be a verb), one may weight the phrasal POS-sequence probabilities (making an independence assumption) as P(t_e1 ... t_ek | e_1 ... e_k) = ∏_i P(t_ei | e_i). These phrasal probabilities can be trained in two ways. The first is to assume that the part-of-speech usage of phrasal (English) translations is generally consistent across dictionaries (e.g. regardless of publisher or language).
Hence one could use any foreign-English bilingual dictionary that also includes the true foreign-word part of speech in addition to its translations to train these probabilities. Alternatively, one could do a first-pass assignment of foreign-word part of speech based only on single-word translations, as in Figure 2, and use this first pass to train the phrasal probabilities for those foreign words having both phrasal and single-word definitions (such as mandat). The advantage of this approach is that it may benefit dictionaries whose phrasal translation styles differ from the training dictionary's (e.g. use or omission of the word 'to' in verb definitions). However, given the assumption of relatively consistent dictionary formatting styles (which was unfortunately not the case for Kurdish), we evaluated this work based on supervised phrasal training from a single independent third-language dictionary. Table 1 measures the POS induction performance on three languages, where the true POS tags were given in the dictionary (as in Figure 1) but ignored except for evaluation. The accuracy values in this table are based on exact matches between a word's dictionary-provided POS and the most probable tag in its induced distribution.</Paragraph> <Paragraph position="1"> For our target application of part-of-speech tagging, what matters is to have a robust tag probability distribution that includes the true candidate with sufficiently large probability to seed further training.
By setting this baseline threshold to 0.1 and deleting lower-ranked candidates, up to 98% of the true POS were found to be above this threshold and hence were considered in subsequent training.</Paragraph> <Paragraph position="2"> The Mean Probability of Truth, as shown in Table 1, is another measure of the quality of the POS predictions made by the algorithm, representing the probability mass associated with the true POS tag averaged over all words.</Paragraph> <Paragraph position="3"> In some cases the algorithm could not predict a POS tag, primarily due to English translations for which no POS distribution was known (often an obscure word, a proper name or an OCR error). This occasional omission is measured by the coverage column. [Table 1: POS induction from English translation lists. Results are measured by type (all dictionary entries are weighted equally).] Most of the observed errors are due to differences in phrasal definitional conventions between the training and testing dictionaries, long phrasal idioms, single-word definitions with ambiguous English parts of speech, and OCR errors. The Kurdish dictionary was particularly hindered by frequent long phrasal translations, which often included an explanation or definition. Because all dictionary entries are equally weighted, errors on rare words such as mythological characters or kinship terms can substantially degrade performance. But for the purpose of providing seed POS distributions to context-sensitive taggers, performance is quite adequate for this follow-on task.</Paragraph> </Section> <Section position="6" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 Inducing Morphological Analyses </SectionTitle> <Paragraph position="0"> There has been extensive previous work in the supervised and minimally supervised induction of both affix paradigms (e.g. Goldsmith, 2000; Snover and Brent, 2001) and diverse models of regular and irregular concatenative and non-concatenative morphology (e.g.
Schone and Jurafsky, 2000; van den Bosch and Daelemans, 1999; Yarowsky and Wicentowski, 2000). While such approaches are important from the perspective of learning theory or broad-coverage handling of irregular forms, another possible paradigm for minimal supervision is to begin with whatever knowledge can be efficiently manually entered from the grammar book in several hours' work.</Paragraph> <Paragraph position="1"> We defined such grammar-based &quot;supervision&quot; as the entry of regular inflectional affix changes and their associated parts of speech in a standardized ordering of fine-grained attributes, as in Table 2 for Spanish and Romanian. The full tables have approximately 200 lines each and required roughly 1.5-2 person-hours for entry.</Paragraph> <Paragraph position="2"> Given a dictionary marked with core parts of speech, it is trivial to generate hypothesized inflected forms following the regular paradigms, as shown in the left side of Figure 3. However, due to irregularities and semi-regularities such as stem-changes, such generation will clearly have substantial inaccuracies and overgenerations.</Paragraph> <Paragraph position="4"> [Table 2: regular inflectional paradigms (suffix context is marked by $).]</Paragraph> <Paragraph position="6"> Through weighted-Levenshtein-based iterative alignment models, such as those described in Yarowsky and Wicentowski (2000), one can nonetheless perform a probabilistic string match from all lexical tokens actually observed in a monolingual corpus to the inflections hypothesized from dictionary roots under the regular paradigms, as in the right side of Figure 3.</Paragraph> <Paragraph position="7"> For example, when looking for a potential analysis path for the Spanish irregular inflection destrocen, the closest string match is the regular hypothesis destrozar/V → destrozen/V-pres_subj-3pl. Likewise, the closest string match for destruyen is destruir/V → destruen/V-pres_indic-3pl.
The differences between these regular hypotheses and the observed inflected forms are the relatively productive stem changes z→c and u→uy, neither of which was listed in the inflectional supervision table, and yet both were correctly handled. Note that a traditional P(POS|suffix) model would fail to handle this case, given that the common inflectional suffix -en corresponds to two different parts of speech here (present indicative or subjunctive, depending on the -ir or -ar paradigm).</Paragraph> <Paragraph position="8"> Also note that irregular stem-change processes such as dormir→duermen receive a correct best-fit analysis, despite the absence of any internal stem-change exemplars (e.g. o→ue) in the human-generated inflectional supervision table.</Paragraph> <Paragraph position="9"> For further robustness, the consensus model of P(tag|word) is estimated as a weighted mixture of the part-of-speech tags of the most closely aligned pseudo-regular generated inflections.</Paragraph> <Paragraph position="11"> For processing efficiency, one additional constraint is that potential hypothesized→observed string-pair candidates must exactly match in both the initial consonant cluster and the suffix of the generated hypothesis.</Paragraph> <Paragraph position="13"> The inflections of closed-class words (such as pronouns, determiners and auxiliary verbs) are not well handled by this generative-alignment model, due both to their often very high irregularity (e.g. the Spanish verb ser (to be)) and to their typical shortness (e.g. the pronominal inflections mi, tu, su). Thus, as one final source of supervision, lists of closed-class words paired with their inflections and fine-grained part-of-speech tags were entered manually from the grammar book (e.g. aquellas#(aquel)Adj_Dem-fem-plur-p3).
This final source of supervision utilized an average of 400 lines and 3 person-hours per language.</Paragraph> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 POS Model Induction </SectionTitle> <Paragraph position="0"> The non-traditional supervision methodology of Sections 2 and 3 yields a noisy but broad-coverage candidate space of parts of speech with little human effort.</Paragraph> <Paragraph position="1"> We then perform a noise-robust combination of model estimation and re-estimation techniques: • A model based on smoothed tries is trained on the raw initial tag distributions, yielding coverage of unseen words and smoothing of low-confidence initial tag assignments.</Paragraph> <Paragraph position="2"> • Paradigmatic cross-context tag modeling is performed as in Cucerzan and Yarowsky (2000) when sufficiently large unannotated corpora are available.</Paragraph> <Paragraph position="3"> • Sub-part-of-speech contextual agreement for features such as gender is performed as described in Section 4.1.</Paragraph> <Paragraph position="4"> • The part-of-speech tag sequence models [...] • Both the tag-sequence and lexical prior models are iteratively retrained using these additional evidence sources and first-pass probability distributions. The success of this model rests on the assumptions that (a) words of the same part of speech tend to have similar tag-sequence behavior, and (b) there are sufficient instances of each POS tag labeled by either the morphology models or the closed-class entries described in Section 3. One example where these assumptions do not hold is the Romanian word a, which has 5 possible POS tags, including Infinitive_Marker (corresponding to the English word to).
But because the Infinitive_Marker tag has no other word instances in Romanian, no other initial supervision exists to resolve the ambiguity of a if no context-sensitive tagging is provided (such as the preference for a to be labeled Infinitive_Marker when followed by a Verb-Infinitive). Thus one avenue of potential improvement to these models would be to include limited tagged contexts for ambiguous small-class (or singleton-class) words, although such supervision is less readily extractable from grammar books by non-native speakers, and was not employed here.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Contextual-agreement models for part-of-speech subtags </SectionTitle> <Paragraph position="0"> Traditional part-of-speech models assume a strict Markovian sequential dependency. However, Adj-Noun, Det-Noun and Noun-Verb agreement at the subtag level (e.g. for person, number, case and gender) often does not require direct adjacency, and is based on the selective matching of isolated subfeatures. This is particularly important for grammatical gender, where the lack of gender features projected from English rootwords in a bilingual dictionary (as in Section 2) requires contextual agreement to assign gender to many inflected and root forms.</Paragraph> <Paragraph position="1"> However, given the assumptions of minimal supervision, it is not reasonable to require a parser or dependency model to identify non-adjacent agreeing pairs explicitly. Rather, we utilize the much more general tendency of words exhibiting a property such as grammatical gender to co-occur in a relatively narrow window with other words of the same gender (etc.) with a probability greater than chance.</Paragraph> <Paragraph position="2"> Empirically, we observe this in Figures 4-5, which show the gender-agreement ratio between a target noun/adjective and other gender-marked words appearing in context at relative position ±i.
Adjectives in Romanian exhibit a stronger agreement tendency with words to their left (a 5/1 ratio), while for nouns the agreement ratio is quite closely balanced between positions -1 (primarily determiners) and +1 (primarily adjectives), although weaker (a 2.4/1 ratio), perhaps due to a greater relative tendency for nouns to be juxtaposed directly with other independent clauses of different gender. Also, both parts of speech converge on the agreement ratio expected by chance (0.82) relatively quickly. Thus, while any individual context may suggest an incorrect gender based on agreement, if one aggregates over all occurrences of a word in a corpus, a consensus gender preference emerges, with the true gender-agreement signal exceeding nearby spurious gender noise.</Paragraph> <Paragraph position="3"> [Figure 4: Ratio of the frequency with which a gender-marked adjective (above) or noun (below) agrees in gender with another noun/adjective/determiner at relative position i, over the frequency of gender disagreement at that relative position.]</Paragraph> <Paragraph position="4"> [Figure 5: Probability that a gender-marked word will occur within a window of ±i words relative to another gender-marked word (of any part of speech).]</Paragraph> <Paragraph position="5"> Formally, we model this as a window-weighted global feature consensus, aggregating the gender-agreement evidence from all gender-marked words within a ±3 window of each occurrence of the target word. The ±3 window-size parameter was selected prior to the studies shown in Figures 4-5, but is supported by them.
Beyond this window the agreement/disagreement ratio approaches chance, while with a smaller window the probability of finding any gender-marked word in the window drops below the 80% coverage observed for ±3, trading lower coverage for increased accuracy.</Paragraph> <Paragraph position="6"> Making the assumption that the overwhelming majority of nouns have a single grammatical gender independent of context, we perform smoothing to force nouns with sufficient global-context frequency towards their single most likely gender.</Paragraph> <Paragraph position="7"> Finally, the trie-based suffix model noted in Section 3 can be utilized here to further generalize gender affixal tendencies for use in smoothing poorly represented single words. Through this approach we successfully discover a wide space of low-entropy gender affix tendencies, including the common -a, -dad and -ción feminine affixes in Spanish, without any human or dictionary supervision of nominal gender. But even words without gender-distinguishing affixes (e.g. parte, cabal) can be successfully learned via global context maximization.</Paragraph> </Section> </Section> <Section position="8" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Evaluation of the Full Part-of-speech Tagger </SectionTitle> <Paragraph position="0"> One problem with minimally supervised learning of foreign languages is that annotated evaluation data are often not available for the features being induced, or are otherwise difficult to obtain. Thus, as initial test languages, we used two languages familiar to the authors (Romanian and Spanish) for which sufficient evaluation resources could be obtained. However, the monolingual corpora utilized for bootstrapping were quite small (123 thousand words of the book 1984 for Romanian and 3.2 million words of newswire for Spanish), sizes easily comparable to those that can be accessed on-line for 60-100 world languages.
The seed dictionaries were located online (42k entries for Spanish) and via OCR (7k entries for Romanian), and small grammar references were obtained at a local bookstore. A 1000-word test set was annotated with a standardized, finely detailed part-of-speech tag inventory including the full complex distinctions for gender, person, number, case, detailed tense and nominal definiteness (inventories of 259 and 230 fine-grained tags were used for Spanish and Romanian, respectively).</Paragraph> <Paragraph position="1"> The minimal supervision in this study consisted of an average total of 4 person-hours per language for manually entering the inflectional paradigms and associated parts of speech from a grammar, as in Section 3, and an additional average of 3 person-hours per language for dictionary extraction and entry parsing (OCR itself, on our high-speed 2-sided scanner with OmniPage Pro, took under 30 minutes). As would be expected given that data entry was done by computer scientists who were not native speakers of the test languages, significant analysis errors or gaps were introduced when rather blindly transferring from the reference grammar.</Paragraph> <Paragraph position="2"> Thus, to test the relative contribution of limited native-speaker help when available, in a second test condition for Romanian a native speaker spent roughly 4 additional person-hours correcting and augmenting gaps in the patterns previously entered from the grammar book, focusing almost exclusively on the complex inflections of closed-class words.</Paragraph> <Paragraph position="3"> A summary of the results for these three supervision modes is given in Table 3. Performance is broken down by fine-grained part of speech.
Exact-match accuracy is measured over both the full fine-grained (up to 5-feature) part-of-speech space and the 12-class core POS tags (noun and proper noun, pronoun, verb, adjective, adverb, numeral, determiner, conjunction, preposition, interjection, particle, punctuation). The feature of grammatical gender was specifically isolated because it is rarely salient for cross-language applications such as machine translation (where grammatical gender rarely transfers), and because its induction algorithm in Section 4.1 depends heavily on the size of the monolingual corpus (which is small in these experiments, suggesting size-dependent potential for significant further improvement here).</Paragraph> <Paragraph position="4"> Finally, a post-hoc analysis of the discrepancies between the system and the test data showed that a significant number were simply arbitrary differences in annotation convention between the grammar-book analyses and the test-data tagging policy. For example, one such &quot;error&quot;/discrepancy is the rather arbitrary question of whether the Romanian word oricare (meaning any) should be considered an adjective (as listed in a standard bilingual dictionary) or a determiner.</Paragraph> <Paragraph position="5"> Another difference is whether proper-name citations of common nouns (e.g. Casa Blanca) should be annotated for gender/number, etc., or not.</Paragraph> <Paragraph position="6"> [Table 3: Tagger performance based on 1 person-day of supervision, no tagged training corpora and a fine-grained (≈250 tags) tagset. NNS and NN refer to non-native-speaker and native-speaker effort.]</Paragraph> <Paragraph position="7"> Yet regardless of exactly how many system-test discrepancies are just policy differences rather than errors, even the raw accuracy here is very promising given the very fine-grained part-of-speech inventory and the small monolingual data size used for bootstrapping. And ultimately the performance is quite remarkable given that it is the result of less than 1 total person-day of data collection and supervision, in contrast to the thousands of hours and $100,000-$1,000,000 spent on some annotated training corpora with much more limited tagset inventories. Thus, in terms of cost-benefit analysis, the supervision paradigm and associated bootstrapping models presented here offer quite a good return of new functionality per unit of labor invested.</Paragraph> </Section> </Paper>
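The generate-and-align morphological analysis of Section 3 (hypothesize regular inflections from dictionary roots, then match observed corpus tokens to the closest hypothesis by string distance) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the two-line paradigm table and the unweighted Levenshtein distance are toy stand-ins for the ~200-line grammar-book tables and the weighted, iteratively re-estimated alignment model of Yarowsky and Wicentowski (2000), and the paper's additional initial-consonant-cluster/suffix matching constraint is omitted.

```python
def levenshtein(a, b):
    # Unweighted edit distance via the standard dynamic program
    # (the paper uses a weighted, iteratively retrained variant).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Toy fragment of a Table 2-style paradigm: root suffix -> (inflection suffix, tag).
PARADIGMS = {
    "ar": [("en", "V-pres_subj-3pl")],
    "ir": [("en", "V-pres_indic-3pl")],
}

def generate_hypotheses(roots):
    # Overgenerate regular inflections from dictionary roots
    # (left side of Figure 3); irregular forms are NOT generated.
    hyps = []
    for root in roots:
        for suffix, tag in PARADIGMS.get(root[-2:], []):
            hyps.append((root[:-2] + suffix, root, tag))
    return hyps

def analyze(observed, hypotheses):
    # Match an observed token to its closest regular hypothesis
    # (right side of Figure 3), tolerating stem changes such as z->c.
    return min(hypotheses, key=lambda h: levenshtein(observed, h[0]))

hyps = generate_hypotheses(["destrozar", "destruir"])
print(analyze("destrocen", hyps))  # closest hypothesis is destrozen, root destrozar
print(analyze("destruyen", hyps))  # closest hypothesis is destruen, root destruir
```

Even though destrocen and destruyen are never generated, the nearest-hypothesis match recovers the correct root and tag, mirroring the paper's destrocen/destruyen example.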