<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2013"> <Title>Morphological Tagging: Data vs. Dictionaries</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Full Morphological Tagging </SectionTitle> <Paragraph position="0"> English Part of Speech (POS) tagging has been widely described in the recent past, starting with the (Church, 1988) paper, followed by numerous others using various methods: neural networks (Benello et al., 1989), HMM tagging (Merialdo, 1992), decision trees (Schmid, 1994), transformation-based error-driven learning (Brill, 1995), and maximum entropy (Ratnaparkhi, 1996), to select just a few. However different the methods were, English dominated in these tests.</Paragraph> <Paragraph position="1"> Unfortunately, English is a morphologically &quot;impoverished&quot; language: there are no complicated agreement relations, word order variation is minimal, and the morphological categories are either extremely simple (-s for the plural of nouns, for example) or (almost) nonexistent (cases expressed by inflection, for example) - with not too many exceptions and irregularities. Therefore the number of tags selected for an English tagset is not that large (40-75 in the typical case). Also, the average ambiguity is low (2.32 tags per token on the manually tagged Wall Street Journal part of the Penn Treebank, for example).</Paragraph> <Paragraph position="2"> * The work described herein has been started and largely done within the author's home institution, the Institute of Formal and Applied Linguistics, Charles University, Prague, CZ, within the project VS96151 of the Ministry of Education of the Czech Republic and partially also under the grant 405/96/K214 of the Grant Agency of the Czech Republic.</Paragraph> <Paragraph position="3"> Highly inflective and agglutinative languages are different. Obviously we can limit the number of tags to the major part-of-speech classes, plus some more (as the Xerox Language Tools (Chanod, 1997) for such languages do), and in fact achieve similar performance, but that limits the usefulness of the results thus obtained for further analysis. These languages obviously do not use their rich inflection just for the amusement (or embarrassment) of their speakers (or NLP researchers): the inflectional categories carry important information which ought to be known at a later time (e.g., during parsing). Thus one wants not only to tell apart verbs from nouns, but also nominative from genitive, masculine animate from inanimate, singular from plural - all of which are often ambiguous one way or the other.</Paragraph> <Paragraph position="4"> The average tagset, as found even in a moderate corpus, contains between 500 and 1,000 distinct tags - whereas the size of the set of possible and plausible tags can reach 3,000 to 5,000.
Obviously, any of the statistical methods used for English (even if fully supervised) run into the data sparseness problem (see Table 1 below for details).</Paragraph> <Paragraph position="5"> There have been attempts to solve this problem for some of the highly inflectional European languages ((Daelemans et al., 1996), (Erjavec et al., 1999), (Tufiş, 1999), and also our own in (Hajič and Hladká, 1997), (Hajič and Hladká, 1998); see also below), but so far no method or tagger has been evaluated against a larger number of those languages in a similar setting that would allow for a side-by-side comparison of the difficulty (or ease) of full morphological tagging of those languages. Thanks to the Multext-East project (Véronis, 1996a), there are now five annotated corpora available (which are manually and fully morphologically tagged) on which to perform such experiments.</Paragraph> </Section> <Section position="3" start_page="0" end_page="94" type="metho"> <SectionTitle> 2 The Languages Used and The Training Data </SectionTitle> <Paragraph position="0"> We use the Multext-East-annotated version of Orwell's novel 1984 in Czech, Estonian, Hungarian, Romanian and Slovene 1. The annotation uses a single SGML-based formal scheme and even common guidelines for tagset design and annotation; nevertheless, the tagsets differ substantially since the languages differ as well: Romanian is a French-like Romance language, Hungarian is agglutinative, and the other languages are more or less inflectional-type languages 2. The annotated data contain about 100k tokens (including punctuation) for each language; out of those, the first 20k tokens have been used for testing, the rest for training. We have also extended the tag identifiers by appending a string of hyphens ('-') to suit the exponential tagger, which expects the tags to be of equal length; the mapping was 1:1 for all tags in all languages, since the &quot;long&quot; tags are in fact the Multext-East standard.</Paragraph> <Paragraph position="1"> From the tagging point of view, the language characteristics displayed in Table 1 are the most relevant 3.</Paragraph> </Section> <Section position="4" start_page="94" end_page="95" type="metho"> <SectionTitle> 3 The Methodology </SectionTitle> <Paragraph position="0"> The main tagger used for the comparison experiment is the probabilistic exponential-model-based, error-driven learner we described in detail in (Hajič and Hladká, 1998). Modifications had to be made, however, to make it more universal across languages.</Paragraph> <Section position="1" start_page="94" end_page="94" type="sub_section"> <SectionTitle> 3.1 Structure of the Model </SectionTitle> <Paragraph position="0"> The model described in (Hajič and Hladká, 1998) is a general exponential (specifically, a log-linear) model (such as the one used for Maximum Entropy-based models): p_{AC}(y|x) = \exp(\sum_{i=1}^{n} \lambda_i f_i(y, x)) / Z(x) (1) where f_i(y, x) is a binary-valued feature of the event value being predicted and its context, \lambda_i is the weight of the feature f_i, and Z(x) is the natural normalization factor.</Paragraph>
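For concreteness, the computation behind equation (1) can be sketched as follows; the candidate value set, feature functions, and weights below are illustrative assumptions, not the tagger's actual features:

    import math

    TAG_VALUES = ["genitive", "accusative"]  # a hypothetical CASE ambiguity class

    def loglinear_prob(y, x, features, weights, values=TAG_VALUES):
        """p(y|x) per equation (1): exponentiated sum of weighted binary
        features, normalized by Z(x) over the competing values."""
        def score(v):
            return sum(w * f(v, x) for f, w in zip(features, weights))
        z = sum(math.exp(score(v)) for v in values)  # Z(x), the normalization factor
        return math.exp(score(y)) / z

    # Two hypothetical binary features f_i(y, x) with weights lambda_i:
    features = [
        lambda y, x: 1.0 if y == "genitive" and x.get("prev_pos") == "N" else 0.0,
        lambda y, x: 1.0 if y == "accusative" and x.get("prev_pos") == "V" else 0.0,
    ]
    weights = [0.7, 1.2]
    print(loglinear_prob("genitive", {"prev_pos": "N"}, features, weights))  # ~0.668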
<Paragraph position="1"> This model is then essentially reduced to Naive Bayes by the approximation of the \lambda_i parameters, which is done because there are millions of possible features in the pool and thus the full entropy maximization is prohibitively expensive if we want to select a small number of features instead of keeping them all.</Paragraph> <Paragraph position="2"> 1 There are more languages involved in the Multext-East project, but only these five languages have been really carefully tagged; English is unfortunately tagged using Eric Brill's tagger trained in unsupervised mode, leaving multiple outputs at almost every ambiguous token, and Bulgarian is totally unusable since it has been tagged automatically with only a baseline tagger. The English results reported below thus come from the Penn Treebank data, from which we have used roughly 100,000 words to match the training data sizes for the remaining languages. For Czech, Hungarian, and Slovene we use later versions of the annotated data (than those found on the Multext-East CD), which we obtained directly from the authors of the annotations after the Multext-East CD had been published, since the new data contain rather substantial improvements over the originally published data.</Paragraph> <Paragraph position="3"> 2 For a detailed account of the lexical characteristics of these languages, see (Véronis, 1996b).</Paragraph> <Paragraph position="4"> 3 We have included English here for comparison purposes, since these characteristics are independent of the annotation.</Paragraph> <Paragraph position="5"> The tags are predicted separately for each morphological category (such as POS, NUMBER, CASE, DEGREE OF COMPARISON, etc.). The model makes extensive use of so-called &quot;ambiguity classes&quot; (ACs). An ambiguity class is a set of values (such as genitive and accusative) of a single category (such as CASE) which arises for some word forms as a result of morphological analysis. For unambiguous word forms (unambiguous from the point of view of a certain category), the ambiguity class contains only a single value; for ambiguous forms, there are two or more values in the AC. For example, let's suppose we use part of speech (POS), number and tense as morphological categories for English; then the word form &quot;borrowed&quot; is 2-way ambiguous in POS ({V, J} for verb and adjective, respectively), unambiguous in number (linguistic arguments apart, number is typically regarded as &quot;not applicable&quot; to adjectives as well as to almost all forms of verbs in English), and 3-way ambiguous in tense ({P, N, -} for past tense, past participle, and &quot;not applicable&quot; in the adjective reading).</Paragraph> <Paragraph position="6"> The predictions of the models are always conditioned on the ambiguity class of the category (POS, NUMBER, ...) in question. In other words, there is a separate model for each category and ambiguity class from that category. Naturally, there is no model for unambiguous ACs.</Paragraph>
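A minimal sketch of how such per-category ambiguity classes can be derived from morphological analysis, using the &quot;borrowed&quot; example above (the analyzer output is hard-coded here purely for illustration):

    # Each reading is a (POS, TENSE) subtag pair, as in the example above.
    ANALYSES = {"borrowed": [("V", "P"), ("V", "N"), ("J", "-")]}

    POS, TENSE = 0, 1

    def ambiguity_class(form, category):
        """The AC of `form` for one category: the set of values the category
        can take; a frozenset, so it can index the per-(category, AC) model."""
        return frozenset(reading[category] for reading in ANALYSES[form])

    print(ambiguity_class("borrowed", POS))    # frozenset({'V', 'J'}): 2-way ambiguous
    print(ambiguity_class("borrowed", TENSE))  # frozenset({'P', 'N', '-'}): 3-way ambiguous
    # Singleton ACs are unambiguous: no model is needed, the value is output directly.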
<Paragraph position="7"> However, even though the ambiguity classes bring very valuable information about the word form being tagged and reliable information about the context (since they are fixed during tagging), using ACs also has the unwelcome effect of partitioning the already scarce data, and it effectively ignores the statistics of the unambiguous cases.</Paragraph> <Paragraph position="8"> The context of features uses the neighboring words (original word forms) and ambiguity classes on subtags, where their relative position in the text may be either fixed (0, -1, +1) or &quot;variable&quot;, using a value of the POS subtag as the &quot;stop here&quot; criterion, up to 4 text positions (words) apart.</Paragraph> </Section> <Section position="2" start_page="94" end_page="95" type="sub_section"> <SectionTitle> 3.2 General Subtag Features </SectionTitle> <Paragraph position="0"> The original model uses the ambiguity classes not only for conditioning on context in features, but also for the individual models based on a category and an AC.</Paragraph> <Paragraph position="1"> More general features have been introduced which do not depend on the ambiguity class of the subtag being predicted any more. This allows the tagger to learn also from unambiguous tokens. However, the training time is increased dramatically by doing so, since all events in the training data have to be taken into consideration, as opposed to the case of training the small AC-based model, when only those training events which contain the particular AC are used.</Paragraph> </Section> <Section position="3" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 3.3 Variable Distance Condition </SectionTitle> <Paragraph position="0"> The &quot;stop&quot; criterion for finding the appropriate relative position was originally based on hard-coded choices suitable for the Czech language only, and of course it depended on the tagset as well. This dependency has been removed by selecting the appropriate conditions automatically when building the pool of possible features at the initialization phase 5 (using the relative frequency of the POS ambiguity classes, and a threshold to cut off less frequent categories to limit the size of the feature pool).</Paragraph> </Section> <Section position="4" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 3.4 Weight Variation </SectionTitle> <Paragraph position="0"> Even though the full computation of the appropriate feature weight is still prohibitive (the more so when the general features are added), the learner is now allowed to vary the weights (in several discrete steps) during feature selection, as a (somewhat crude) attempt to depart from the Naive Bayes simplification toward the approximation of the &quot;correct&quot; Maximum Entropy estimation.</Paragraph> </Section> <Section position="5" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 3.5 Handling Unknown Words </SectionTitle> <Paragraph position="0"> In order to compare the effects of (not) using an independent dictionary, we have added an unknown word handling module to the code. 6 It extracts the prefix and suffix frequency information (and the combination thereof) from the training data. Then, for each of the combinations, it selects the most frequent set of tags seen in the training data and stores it for later use. When tagging, the data is first piped through a &quot;guesser&quot; which assigns to each unknown word the set of possible tags stored with the longest matching prefix/suffix combination, as sketched below.</Paragraph>
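One way the guesser's training and lookup could work is sketched here; the affix length cut-off, the number of tags kept per combination, and the reading of &quot;longest matching&quot; as largest total affix length are our assumptions, since the paper does not specify them:

    from collections import Counter, defaultdict

    def train_guesser(tagged_corpus, max_affix=4, keep=3):
        """For every (prefix, suffix) combination seen in the training data,
        store the most frequent tags (the cut-offs are assumed, see above)."""
        counts = defaultdict(Counter)
        for form, tag in tagged_corpus:
            for p in range(1, min(max_affix, len(form)) + 1):
                for s in range(1, min(max_affix, len(form)) + 1):
                    counts[(form[:p], form[-s:])][tag] += 1
        return {k: {t for t, _ in c.most_common(keep)} for k, c in counts.items()}

    def guess_tags(form, guesser, max_affix=4, full_tagset=frozenset()):
        """Assign the tag set stored with the longest matching prefix/suffix
        combination; fall back to the full tagset if nothing matches."""
        best_len, best_tags = -1, full_tagset
        m = min(max_affix, len(form))
        for p in range(1, m + 1):
            for s in range(1, m + 1):
                key = (form[:p], form[-s:])
                if key in guesser and p + s > best_len:
                    best_len, best_tags = p + s, guesser[key]
        return best_tags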
<Paragraph position="1"> 5 Also, the use of variable-distance context may be switched off entirely.</Paragraph> <Paragraph position="2"> 6 Originally, the code relied exclusively on the use of such an independent dictionary. Since the coverage of the Czech dictionary we have used is extensive, we have simply been ignoring the unknown word problem altogether in the past.</Paragraph> </Section> </Section> <Section position="5" start_page="95" end_page="98" type="metho"> <SectionTitle> 4 The Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="95" end_page="95" type="sub_section"> <SectionTitle> 4.1 Reporting Error Rate: Words vs. Tokens </SectionTitle> <Paragraph position="0"> Since &quot;best-only&quot; tagging has been carried out, the error rate (i.e., 100% minus accuracy in %) has been used throughout as the only evaluation criterion. However, since some results reported previously were apparently obtained using only the &quot;real&quot; words as the total for accuracy evaluation, whereas in other experiments every token counts (including punctuation 7, for example), we have computed both and report them separately 8.</Paragraph> <Paragraph position="1"> 7 And sometimes a separate token for the sentence boundary.</Paragraph> <Paragraph position="2"> 8 Table 1 has been computed using all tokens. In fact, the languages differ significantly in the proportion of punctuation: from about 18% (English) to 30% (Estonian).</Paragraph> </Section>
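The two counts of Section 4.1 differ only in which tokens enter the denominator (and which errors are counted); a small sketch follows, where treating a token as punctuation when it contains no alphanumeric character is our assumption rather than the paper's stated test:

    def error_rates(gold_tags, predicted_tags, forms):
        """Error rate over all tokens vs. over "true" words only (Section 4.1)."""
        is_word = [any(ch.isalnum() for ch in f) for f in forms]
        errors_all = sum(g != p for g, p in zip(gold_tags, predicted_tags))
        errors_words = sum(g != p for g, p, w
                           in zip(gold_tags, predicted_tags, is_word) if w)
        token_rate = 100.0 * errors_all / len(forms)     # every token counts
        word_rate = 100.0 * errors_words / sum(is_word)  # punctuation excluded
        return token_rate, word_rate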
<Section position="2" start_page="95" end_page="96" type="sub_section"> <SectionTitle> 4.2 Availability of Dictionary Information </SectionTitle> <Paragraph position="0"> We use two methods to obtain the set of possible tags for any given word form (i.e., to analyze it morphologically). Both methods include handling unknown words. First, we use only information which may be obtained automatically from the manually annotated corpus (we call this method automatic). This is the way the Maximum Entropy tagger (Ratnaparkhi, 1996) runs if one uses the binary version from the website (see the comparison in Section 5).</Paragraph> <Paragraph position="1"> However, it is not unreasonable to assume that a larger independent dictionary exists which can help to obtain a list of possible tags for each word form in the test data. This is what we have at our disposal for the languages in question, since the development of such a dictionary was part of the Multext-East project. We can thus assume that dictionary information is available for unknown words in the test data, i.e., even though there are no statistics available for them (since they did not appear in the training data), all possible tags for (almost 9) every test token are available. This method is referred to as independent in the following text.</Paragraph> <Paragraph position="2"> 9 Depending on the quality of the independent dictionary. Of course, the tagsets must match, which could be a problem per se. Here it is simple, since the dictionaries have been developed using the same tagsets as the tagged data.</Paragraph> <Paragraph position="3"> We have also used a third method of obtaining dictionary information (called mixed), namely, by using only the words from the training data, but complementing the information about them obtained from the training data by including all other possible tags for such words. The net result is that during testing, we have only the training words at our disposal, but with complete dictionary information (as if coming from a full morphological dictionary) 10.</Paragraph> <Paragraph position="4"> The results on the full training data set are summarized in Table 2.</Paragraph> <Paragraph position="5"> The baseline error rate is computed as follows. First of all, we use the independent dictionary for obtaining the possible tags for each word. Then we extract only the lexical information from the current position 11 and the counts used for smoothing (which is based on the ambiguity classes only and does not use lexical information). The system is then trained normally, which means it uses the lexical information only if the AC-based smoothing 12 alone does not work. This baseline method is thus very close to the usual baseline method of using the simple conditional distribution of tags given words.</Paragraph> <Paragraph position="6"> The message of Table 2 seems to be obvious; but before we jump to conclusions, let's present another set of experiments.</Paragraph> <Paragraph position="7"> In view of the recent interest in dealing with &quot;small languages&quot;, and with regard to the questions of cost-effectiveness of using &quot;human&quot; resources (i.e., annotation vs. rule-writing vs. tools development etc.), we have also performed experiments with reduced training data size (but with an enriched feature pool - by lowering thresholds, adding more of the &quot;general features&quot; as described above, etc. - as allowed by reasonable time/space constraints). 13 These results are summarized in Table 3 (using only the dictionary derived from the training data), Table 4 (using words from the training data with morphological information complemented from a dictionary) and Table 5 (using the &quot;independent&quot; dictionary). In all cases, we again count only true words (no punctuation). Accordingly, the major POS error rate is reported, too (only 12 POS tags to be distinguished: Noun, Verb, Adjective, ...; see Tables 6, 7, and 8).</Paragraph> <Paragraph position="8"> 10 This arrangement removes the &quot;closed vocabulary&quot; phenomenon from the test data, since for the Multext-East data, we did not have a truly independent vocabulary available.</Paragraph> <Paragraph position="9"> 11 Words from the training data which are not singletons (freq > 1) are used. Surprisingly enough, it would not hurt to use the singletons too. We believe this is due to the smoothing method used. Even though this is valid only for the baseline experiment, we have observed in general that this form of exponential model (with error-driven training, that is) is remarkably resistant to overtraining.</Paragraph> <Paragraph position="10"> 12 Using ACs linearly interpolated with the global unigram subtag distribution and finally the uniform distribution.</Paragraph> <Paragraph position="11"> 13 By reasonable we mean less than a day of CPU time for training.</Paragraph>
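Footnote 12 outlines the smoothing applied to the AC-based distribution; a minimal sketch of that interpolation is given below, with the interpolation weights chosen arbitrarily for illustration (in practice they would be estimated, e.g. on held-out data):

    def smoothed_subtag_prob(value, ac_dist, unigram_dist, n_values,
                             l_ac=0.7, l_uni=0.2, l_unif=0.1):
        """Linear interpolation of the AC-conditioned distribution, the global
        unigram subtag distribution, and the uniform distribution (footnote 12).
        The weights l_ac, l_uni, l_unif are illustrative assumptions."""
        assert abs(l_ac + l_uni + l_unif - 1.0) < 1e-9
        return (l_ac * ac_dist.get(value, 0.0)
                + l_uni * unigram_dist.get(value, 0.0)
                + l_unif / n_values)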
</Section> <Section position="3" start_page="96" end_page="98" type="sub_section"> <SectionTitle> 4.3 Tagger Comparison </SectionTitle> <Paragraph position="0"> The work (Erjavec et al., 1999) consistently compares several taggers (an HMM tagger, Brill's Transformation-based Tagger, Ratnaparkhi's Maximum Entropy tagger, and Daelemans et al.'s Memory-based Tagger) on Slovene. We have chosen the Maximum Entropy tagger (Ratnaparkhi, 1996) for a comparison with our universal tagger, since it achieved (by a small margin) the best overall result on Slovene as reported there (86.360% on all tokens) of the taggers available to us (MBT, the best overall, was not freely available to us at the time of writing). We have trained and tested the Maximum Entropy tagger on exactly the same data, using the off-the-shelf (java binary only) version.</Paragraph> <Paragraph position="1"> The results are compared in Table 9.</Paragraph> <Paragraph position="2"> Since we want to show how tagger accuracy is influenced by the amount of training data available, we have run a series of experiments comparing the results of the exponential tagger to the Maximum Entropy tagger when there is only a limited amount of data available. The results are summarized in Table 10. Since the public version of the MaxEnt tagger cannot be modified to take advantage of either the mixed or the independent dictionary, we have compared it only to the automatic dictionary version of the exponential tagger. To save space, the results are tabulated only for the training data sizes of 2,000, 5,000 and 20,000 words. Again, only the &quot;true&quot; word error rate is reported.</Paragraph> <Paragraph position="3"> As the tables show, for the languages we tested, the exponential, feature-based tagger we adapted from (Hajič and Hladká, 1998) achieves results similar to those of the Maximum Entropy tagger 14 15 (using exactly the same (full) training data; the &quot;score&quot; is 3:3, with the MaxEnt tagger being substantially better on English; probably the development language bias shows herein 16). However, when the training data size goes down, the advantage of predicting the single morphological categories separately favors the exponential tagger (with the notable and substantial exception of English). The less data, the larger the difference (Table 10).</Paragraph> <Paragraph position="4"> 14 Otherwise the acknowledged leader in English tagging.</Paragraph> <Paragraph position="5"> 15 The only substantial difference we noticed was in tagging speed. The runtime speed of the MaxEnt tagger is lower, only about 10 words per second vs. almost 500 words per second; it should be noted, however, that we are comparing MaxEnt's java bytecode and C.</Paragraph> <Paragraph position="6"> 16 On the other hand, the exponential tagger was originally developed on Czech, and it lost on this language. It should be noted that the original version of the exponential tagger did contain more Czech-specific features, and thus might in fact do better.</Paragraph> <Paragraph position="7"> The resulting accuracy (of both taggers) is still unsatisfactory, not only compared to the results obtained on English, but also from the practical point of view: approx. 85% accuracy (Czech, Slovene) typically means that about five out of six 10-word sentences contain at least one error. That is bad news, e.g.,
for parsing projects involving tagging as a preliminary step.</Paragraph> </Section> </Section> <Section position="6" start_page="98" end_page="100" type="metho"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="98" end_page="98" type="sub_section"> <SectionTitle> 5.1 The Differences Among Languages </SectionTitle> <Paragraph position="0"> The following discussion abstracts away from tagset design, relying on the fact that the Multext-East project has been driven by common tagset guidelines to an unprecedented extent, given the very different languages involved. At the same time, we acknowledge that even so, the tagset design for the individual languages might have influenced the results. Also, the quality of the annotation is an important factor; we believe, though, that the later data we obtained for the experiments described here are within the range of usual human error and do not suffer from negligence 17.</Paragraph> <Paragraph position="1"> 17 Specifically, we are sure that the post-release Czech, Slovene and Hungarian data we are using are without annotation defects beyond the usual occasional annotation error, as they have been double-checked, and we also believe that the other two languages are reasonably clean. Bulgarian, although present on the CD, is unfortunately unusable since it has not been manually annotated; for English, see above.</Paragraph> <Paragraph position="2"> First of all, it is clear that these languages differ substantially just by looking at the simple training data statistics, where the number of unique tags seen in a relatively small collection of about 100k tokens is high - from 401 (Hungarian) to 1033 (Slovene); compare that to English with only 139 tags. However, it is interesting to see that the average per-token ambiguity is much more narrowly distributed, and in fact English ranks 3rd (after Hungarian and Slovene), Czech being the last with almost every other token ambiguous on average. This ambiguity does not correspond to the results obtained: Slovene, being the second least ambiguous, is the second most difficult to tag. Only Czech behaves consistently by tailing the pack in both cases.</Paragraph> </Section> <Section position="2" start_page="98" end_page="99" type="sub_section"> <SectionTitle> 5.2 Comparison to Previous Results </SectionTitle> <Paragraph position="0"> Any comparison is necessarily difficult due to different evaluation methodologies, even within &quot;best-only&quot;, accuracy-based reporting. Nevertheless, we will try.</Paragraph> <Paragraph position="1"> For Romanian, Tufiş in his recent work (Tufiş, 1999) reports 98.5% accuracy (i.e., a 1.5% error rate), using the classifier combination approach advocated by e.g. (Brill and Wu, 1998). His results are well ahead of the 3.29% error rate achieved here (with an even larger tagset of 1391 vs. 486 here), but the paper does not say how this number has been computed (training data size, the all-tokens/words-only question), thus making any conclusions difficult to draw. He also argues that his method is language independent, but no results are mentioned for other languages.</Paragraph> <Paragraph position="2"> For Czech, previous work achieved similar results (6.20% on newspaper text using the all-tokens-based error rate computation, on 160,000 training tokens; vs. 7.04% here on approx. half that amount of training data; same handling of unknown words).
This is in line with the expectations, since the same methodology (tagging as well as evaluation) has been used, except that the features used in that work were specifically tuned to Czech.</Paragraph> <Paragraph position="3"> The most detailed account of Slovene (Erjavec et al., 1999) reports various results, which might not be directly comparable because it is unclear whether they use the all-tokens-based or words-only computation of the error rate. They report a 6.421% error rate on the full tagset on known words, and 13.583% on all words (tokens?) including unknown words (the exponential tagger we used achieved 13.82% on all tokens, 16.26% on words only). They use almost the same data (Orwell's 1984, but leaving out the Appendices) 18. They also report that the original Czech-specific exponential tagger used as a basis for the work reported here achieved a 7.28% error rate on Slovene on full tags on the same data, which means that the changes to the exponential tagger aimed at language independence, introduced in Section 3, have not brought any improvement (on Slovene; the error rate stayed at 7.26%, using all-tokens-based evaluation numbers, dictionary available - but the data was presumably not exactly the same).</Paragraph> <Paragraph position="4"> 18 Their tag count is lower (1021) than here (1033), but that's not really relevant. They do not report the average ambiguity or a similar measure.</Paragraph> </Section> <Section position="3" start_page="99" end_page="100" type="sub_section"> <SectionTitle> 5.3 Dictionary vs. Training Data </SectionTitle> <Paragraph position="0"> This is, in our opinion, the most interesting result of the experiments described so far. As Table 2 already clearly suggests, even the baseline tagging results obtained with the help of an independent dictionary are comparable to (if not better than) those of the fully-trained tagger on 100k words without the dictionary information. The situation is even clearer when comparing the POS-only results: here the &quot;independent&quot; dictionary results are better by far, with almost no training data needed.</Paragraph> <Paragraph position="1"> Looking at the characteristics of the languages, it is apparent that the inflections cause the problem: the coverage of a previously unseen text is inferior to the usual coverage of English or another analytical language. Therefore, unless we can come up with a really clever way of learning rules for dealing with previously unseen words, it is clearly strongly preferable to work on a morphological dictionary 19, rather than to try to annotate more data.</Paragraph> </Section> </Section> </Paper>