<?xml version="1.0" standalone="yes"?> <Paper uid="J95-2001"> <Title>Automatic Stochastic Tagging of Natural Language Texts</Title> <Section position="3" start_page="138" end_page="143" type="metho"> <SectionTitle> 2. Stochastic Tagging Models </SectionTitle> <Paragraph position="0"> A stochastically optimal sequence of tags T, to be assigned to the words of a sentence W, can be expressed as a function of both the lexical P(W | T) and the language model P(T) probabilities using Bayes' rule:</Paragraph> <Paragraph position="2"> Several assumptions and approximations on the probabilities P(W | T) and P(T) lead to good compromises concerning memory and computational complexity.</Paragraph> <Section position="1" start_page="138" end_page="143" type="sub_section"> <SectionTitle> 2.1 Hidden Markov Model (HMM) Approach </SectionTitle> <Paragraph position="0"> The tagging process can be modeled by an HMM by assuming that each hidden tag state produces a word in the sentence, that each word wi is uncorrelated with neighboring words and their tags, and that each tag is probabilistically dependent on the N previous tags only.</Paragraph> <Paragraph position="1"> 2.1.1 Most probable tag sequence (HMM-TS). The optimal tag sequence for a given observation sequence of words is given by the following equation:</Paragraph> <Paragraph position="3"> where M is the number of words in the sentence W.</Paragraph> <Paragraph position="4"> The optimal solution is estimated by the well-known Viterbi algorithm. The first-order (Rabiner 1989) and second-order (He 1988) Viterbi algorithms have been presented elsewhere. Recently, Tao (1992) described the Viterbi algorithm for generalized HMMs. 2.1.2 Most probable tags (HMM-T). The optimality criterion is to choose the tags that are individually most likely, computed independently at each word event:</Paragraph> <Paragraph position="6"> The optimum tag $t_i^o$ is estimated using the probabilities of the forward-backward algorithm (Rabiner 1989): $$t_i^o = \arg\max_{t_i} P(t_i, W) = \arg\max_{t_i} P(t_i, w_1, \ldots, w_i)\, P(w_{i+1}, \ldots, w_M \mid t_i) \quad (4)$$ The probabilities in equation 4 are estimated recursively for the first-order (Rabiner 1989) and the second-order HMM (Watson and Chung 1992).</Paragraph> <Paragraph position="7"> The main difference between the optimization criterion in 2.1.1 and that in 2.1.2 results from the definition of the expected correct tagging rate; the HMM-TS model maximizes the number of correctly tagged sentences, while the HMM-T model maximizes the number of correctly tagged words.</Paragraph>
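The HMM-TS criterion of Section 2.1.1 can be made concrete with a small sketch of a first-order Viterbi decoder over log probabilities (anticipating the logarithmic transformation of Section 4.1). This is only an illustration under assumed data structures (dictionaries trans_p, lex_p, and init_p holding P(t | t_prev), P(w | t), and the initial tag probabilities), not the authors' implementation; unseen events are simply floored to a small constant here, whereas the paper handles unknown words as described in 2.1.3 below.

```python
# Illustrative first-order Viterbi decoder for the HMM-TS criterion.
# All names and data structures are hypothetical; not the authors' code.
import math

def viterbi_tag(words, tags, trans_p, lex_p, init_p):
    """Return the most probable tag sequence for a non-empty list of words.

    trans_p[(t_prev, t)] = P(t | t_prev), lex_p[(w, t)] = P(w | t),
    init_p[t] = P(t) at the sentence start.  Unseen events are floored.
    """
    floor = 1e-10
    def lp(p):                       # log probability with flooring
        return math.log(p if p > 0 else floor)

    # delta[t] = best log score of any tag path ending in tag t at the current word
    delta = {t: lp(init_p.get(t, 0)) + lp(lex_p.get((words[0], t), 0)) for t in tags}
    backpointers = []
    for w in words[1:]:
        new_delta, back = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: delta[tp] + lp(trans_p.get((tp, t), 0)))
            new_delta[t] = (delta[best_prev]
                            + lp(trans_p.get((best_prev, t), 0))
                            + lp(lex_p.get((w, t), 0)))
            back[t] = best_prev
        delta = new_delta
        backpointers.append(back)

    # Backtrack from the best final tag
    last = max(tags, key=lambda t: delta[t])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return list(reversed(path))
```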
<Paragraph position="8"> 2.1.3 Stochastic hypothesis for the unknown words. When a new text is processed, some words are unknown to the tagger lexicon (i.e., they are not included in the training text). In this case, in order to use the forward-backward and the Viterbi algorithms we must estimate the unknown words' conditional probabilities P(w | t). Methods for the estimation of these probabilities have already been proposed (e.g., the use of word-ending morphology). Nevertheless, these methods fail if only a small training text is available, because of the huge number of events not occurring in this text, such as pairs of tags and word endings. To address this problem we have approximated the conditional probabilities of the unknown word tags by the conditional probabilities of the less probable word tags, i.e., the tags of the words occurring only once. In the following we demonstrate experimentally that this approximation is valid and independent of the training text size.</Paragraph> <Paragraph position="9"> Figures 1 and 2 show the probability distributions of the tags in the training text (known words) and of the words occurring only once in this text, for English and French, respectively. They also show the tag probability distribution of the words that are not included in the training text and are therefore characterized as unknown words. This distribution is measured in a different, open testing text, i.e., a text that may include both known and unknown words. The measurements were carried out on newspaper text, which was split into two parts of equal size: the training text and the open testing text. Each part contained 90,000 words for the English text and 50,000 words for the French text. In this experiment, a tagset comprising the main grammatical categories was used: Verb (Ver), Noun (Nou), Adjective (Adj), Adverb (Adv), Pronoun (Pro), Preposition (Pre), Article/Determiner (A-D), Conjunction (Con), Particle (Par), Interjection (Int), Miscellaneous (Mis; i.e., tags that cannot be classified in the previous categories).</Paragraph> <Paragraph position="10"> This experiment has two significant results: a. The probability distribution of the tags of unknown words is significantly different from the distribution for known words, while it is very close to the probability distribution of the tags of the less probable known words, in both the English and the French text.</Paragraph> <Paragraph position="11"> b. A number of closed and functional grammatical classes have very low probability both for unknown words and for words occurring only once, e.g., the tags article, determiner, conjunction, pronoun, and miscellaneous in the English text, and article, determiner, conjunction, pronoun, interjection, and miscellaneous in the French text.</Paragraph> <Paragraph position="12"> In the English text, verbs, adjectives, and conjunctions are more frequent than in the French text. On the other hand, prepositions in the French text have a probability that is 0.05 greater, which is also the most significant difference between the distributions of the two languages. In the English text, prepositions are rare among the words occurring only once and among the unknown words, while in the French text one out of ten unknown words is a preposition. The text coverage by prepositions is 11.2 percent for the English and 16.2 percent for the French corpus. This difference increases significantly in the lexicon coverage: 0.47 percent for the English and 1.54 percent for the French lexicon. Figures 3 and 4 show the results of chi-square tests that measure the difference between the tag probability distribution of the less probable words and that of the unknown words. Various sizes of training text and two sets of grammatical categories, the main set (11 classes) and an extended set (described in detail in Section 5), were used.</Paragraph>
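The measurement behind Figures 1 and 2 amounts to three relative-frequency estimates over a tagged corpus. The following sketch shows one way to compute them; the corpus format (a list of (word, tag) pairs) is an assumption made for illustration, and the code is not the authors'.

```python
# Tag probability distributions of the known words, of the words occurring
# only once (the "less probable" words), and of the unknown words in an open
# testing text.  Hypothetical corpus format; not the authors' code.
from collections import Counter

def tag_distributions(train, test):
    """train, test: lists of (word, tag) pairs (training and open testing text)."""
    word_freq = Counter(w for w, _ in train)

    def normalize(counts):
        total = sum(counts.values()) or 1
        return {t: n / total for t, n in counts.items()}

    known   = normalize(Counter(t for _, t in train))                      # all training tokens
    once    = normalize(Counter(t for w, t in train if word_freq[w] == 1)) # words seen once
    unknown = normalize(Counter(t for w, t in test if w not in word_freq)) # open-test unknowns
    return known, once, unknown
```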
<Paragraph position="13"> Figure 1. Distribution of the main grammatical classes of the known and unknown words and of the words occurring only once in the English text.</Paragraph> <Paragraph position="14"> Figure 2. Distribution of the main grammatical classes of the known and unknown words and of the words occurring only once in the French text.</Paragraph> <Paragraph position="15"> Specifically, the grammatically labeled English text of 180,000 word entries was separated into two parts: the training text, where the tag probability distribution of the less probable words was estimated, and the open testing text, where the tag probability distribution of the unknown words was measured. Multiple chi-square experiments were carried out by successively transferring a portion of 30,000 words from the open testing text to the training text and by varying the word occurrence threshold from 1 to 15, in order to determine the experimentally optimal threshold. Words whose occurrence in the training text is less than or equal to this threshold are counted as less probable words. The results of the tests shown in Figures 3 and 4 include threshold values up to 15 because the difference between the distributions increases significantly for values greater than 15.</Paragraph> <Paragraph position="16"> As shown in the above figures, the close relation between the tested probability distributions is evident for all sizes of training and testing text. Furthermore, we observe that:</Paragraph> <Paragraph position="17"> a. The chi-square distance between the tag probability distributions is minimized for low values of the word occurrence threshold. In the tagset of main grammatical classes, this distance is minimized for threshold values less than three, four, or five, depending on the training text size. In the extended set of grammatical classes the distance is minimized in all cases for the threshold value one, i.e., when only the words occurring once in the training text are regarded as less probable words.</Paragraph> <Paragraph position="18"> b. In the English text the chi-square distance between the tag probability distributions is minimized for a 120,000-word training text for the set of main grammatical classes and for a 60,000-word training text for the extended set. The same results are measured in the French text.</Paragraph> <Paragraph position="19"> c. There is no significant variation in the chi-square test results for additional training text.</Paragraph> <Paragraph position="20"> d. The closed and functional grammatical classes can be estimated automatically as the less probable grammatical classes of the less probable words in the tagged text. (The manual definition process is time-consuming when a set of detailed grammatical classes is used.)</Paragraph> <Paragraph position="21"> e. The probability distribution of some grammatical classes of the unknown words changes significantly when the size of the training text is increased. These changes can be measured in the training text from the tags' distribution of the less probable words.</Paragraph>
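The chi-square experiments of Figures 3 and 4 sweep the word occurrence threshold from 1 to 15 and compare, for each value, the tag distribution of the less probable training words with that of the unknown words in the open testing text. A minimal sketch of this sweep is given below, assuming the same (word, tag) corpus format as before and a simple chi-square statistic; it is not the authors' code.

```python
# Word-occurrence-threshold sweep: for each threshold, compare the tag
# distribution of the "less probable" training words with that of the
# unknown words in the open testing text.  Not the authors' code.
from collections import Counter

def normalize(counts):
    total = sum(counts.values()) or 1
    return {t: n / total for t, n in counts.items()}

def chi_square(p, q):
    """A simple chi-square statistic between two tag distributions."""
    return sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 / (q.get(t, 0.0) or 1e-9)
               for t in set(p) | set(q))

def threshold_sweep(train, test, max_threshold=15):
    """train, test: lists of (word, tag) pairs; returns threshold -> distance."""
    word_freq = Counter(w for w, _ in train)
    unknown = normalize(Counter(t for w, t in test if w not in word_freq))
    results = {}
    for threshold in range(1, max_threshold + 1):
        less_probable = normalize(Counter(t for w, t in train
                                          if word_freq[w] <= threshold))
        results[threshold] = chi_square(less_probable, unknown)
    return results  # as plotted (per training text size) in Figures 3 and 4
```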
<Paragraph position="26"> Similar results have been achieved by testing the Dutch, German, Greek, Italian, and Spanish texts, both with the tagset of the main grammatical categories and with the common extended set of grammatical categories.</Paragraph> <Paragraph position="27"> Based on the above, we can complete both optimization criteria of the HMM formulation, given in 2.1.1 and 2.1.2, by calculating the conditional probability of the unknown word tags using Bayes' rule:</Paragraph> <Paragraph position="29"> Figure 3. Chi-square test for the main grammatical classes' distribution of the unknown and the less probable words in the English text, for various training text sizes.</Paragraph> <Paragraph position="32"> Figure 4. Chi-square test for the distribution of the grammatical tags of the unknown words and the less probable words in the English text, for the extended tagset of grammatical classes and various training text sizes.</Paragraph> <Paragraph position="33"> The probability P(Unknown word) is approximated in open testing texts by measuring the unknown word frequency. Therefore the model parameters are adapted each time an open testing text is tagged. The probability P(t | Less probable word) and the tag probability P(t) are measured in the training text. Finally, each tag-conditional probability of the unknown word tags is normalized:</Paragraph> <Paragraph position="35"> where L is the number of known words and T is the number of tags.</Paragraph> </Section> <Section position="2" start_page="143" end_page="143" type="sub_section"> <SectionTitle> 2.2 Tagging without Lexical Probabilities </SectionTitle> <Paragraph position="0"> When the corresponding lexical probabilities p(w | t) are not available in the dictionary that specifies the possible tags for each word, a simple tagger can be implemented by assuming that each word wi in a sentence is uncorrelated with the assigned tag ti, i.e., p(wi | ti) = p(wi).</Paragraph> <Paragraph position="1"> In this case the most probable tag sequence, according to equation 2, is given by:</Paragraph> <Paragraph position="3"> which is an Nth-order Markov-chain language model (MLM).</Paragraph> <Paragraph position="4"> Taggers based on MLM require the training process to store each tag assigned to every lexicon entry and to define the unknown word tagset.</Paragraph> <Paragraph position="5"> The unknown word tagset is defined by the selection of the most probable tags that have been assigned to the less probable words of the training text. In this way the ambiguity of the unknown words is decreased significantly. The word occurrence threshold used to define the less probable words and a tag probability threshold used to isolate the less probable tags are estimated experimentally.</Paragraph> <Paragraph position="6"> Extensive experiments have shown insignificant differences in the tagging error rate when alternative word occurrence thresholds have been tested. The best results are obtained when values less than 10 are used. In this paper the word occurrence threshold has been set to one in all experiments.</Paragraph> </Section> </Section>
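The Bayes-rule approximation and the normalization described above can be sketched as follows. The normalization step here is only a placeholder that makes the values sum to one, since the paper's exact normalization formula is not reproduced; the corpus format and function names are assumptions, not the authors' code.

```python
# Unknown-word hypothesis of Section 2.1.3: approximate P(unknown word | t)
# through Bayes' rule from the tag distribution of the words occurring only
# once in the training text.  The final normalization is a simple
# placeholder, not the paper's equation.
from collections import Counter

def unknown_word_lexical_probs(train, unknown_rate):
    """Return an approximation of P(unknown word | t) for every tag t.

    train        : list of (word, tag) pairs from the tagged training text
    unknown_rate : P(Unknown word), measured on the open testing text
    """
    word_freq = Counter(w for w, _ in train)
    tag_freq = Counter(t for _, t in train)
    hapax_tags = Counter(t for w, t in train if word_freq[w] == 1)

    n_tokens = len(train)
    n_hapax = sum(hapax_tags.values())
    probs = {}
    for t, n in tag_freq.items():
        p_t = n / n_tokens                              # P(t)
        p_t_given_lp = hapax_tags.get(t, 0) / n_hapax   # P(t | less probable word)
        probs[t] = p_t_given_lp * unknown_rate / p_t    # Bayes' rule
    # Placeholder normalization so the values behave like probabilities.
    total = sum(probs.values()) or 1.0
    return {t: p / total for t, p in probs.items()}
```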
<Section position="4" start_page="143" end_page="144" type="metho"> <SectionTitle> 3. Tagger Errors </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="143" end_page="144" type="sub_section"> <SectionTitle> 3.1 Errors in the Training Text </SectionTitle> <Paragraph position="0"> Taggers based on the HMM technique compensate for some serious training problems inherent in the MLM approach. The most important one is the presence of errors in the training text. This situation arises when uncorrected tags or analysts' mistakes remain in the text used to estimate the stochastic model parameters. These errors generate tag assignments that are not valid. In MLM taggers these tags are weighted equally with the correct ones. In contrast, in HMM taggers invalid assignments are biased by the very low value of the corresponding conditional probability of the tags (the wrong tag rarely appears in the specific word environment), which decreases the overall probability of incorrect tag assignments.</Paragraph> <Paragraph position="1"> Another important issue concerns the HMM's ability to handle lexicon information, e.g., to find how frequently the tags have been assigned to each lexicon entry. In some languages, taggers based on HMMs reduce the prediction error to almost half of that of the MLM approach.</Paragraph> </Section> <Section position="2" start_page="144" end_page="144" type="sub_section"> <SectionTitle> 3.2 Tagger Prediction Errors </SectionTitle> <Paragraph position="0"> Generally, tagger errors can be classified into three categories:</Paragraph> <Paragraph position="1"> a. Errors due to inadequate training data. When the model parameters are estimated from a limited amount of training data, tagging errors appear because of unknown or inaccurately estimated conditional probabilities. Various interpolation techniques have been proposed for the estimation of the model parameters for unseen events or to smooth the model parameters (Church and Gale 1991; Essen and Steinbiss 1992; Jardino and Adda 1993; Katz 1987; McInnes 1992).</Paragraph> <Paragraph position="2"> b. Errors due to the syntactical or grammatical style of the testing text. This type of error appears when the testing text has a style unknown to the model (i.e., a style used in the open testing text but not included in the training text). It can be reduced by using multiple models that have been previously trained on different text styles.</Paragraph> <Paragraph position="3"> c. Errors due to insufficient model hypotheses. In this case the model hypotheses are not satisfied; e.g., there are strong intra-tag relations at distances greater than the model order, idiomatic expressions, language-dependent exceptions, etc. A general solution to variable length and depth of dependency for HMMs has already been proposed (Tao 1992) but has not been implemented in taggers.</Paragraph> </Section> </Section> <Section position="5" start_page="144" end_page="152" type="metho"> <SectionTitle> 4. Implementation </SectionTitle> <Paragraph position="0"> In this section we present techniques to speed up the tagging process and avoid underflow or overflow phenomena during the estimation of the optimum solution. As shown in the experiments, these techniques either do not increase the prediction error rate at all or have only a minimal influence on it.</Paragraph> <Paragraph position="1"> Two modules consume the majority of the tagger computational time.
The first module extracts from the model parameters the intra-tag and the word-tag conditional probabilities requested by the second module, which computes the optimum solution by multiplying the corresponding conditional probabilities. Binary search maximizes the search speed of the first module, while the following three transformation techniques decrease the computing time of the second module, avoid underflow or overflow phenomena, and exploit the faster, lower-cost fixed-point arithmetic system.</Paragraph> <Section position="1" start_page="144" end_page="145" type="sub_section"> <SectionTitle> 4.1 Logarithmic Transformation </SectionTitle> <Paragraph position="0"> The stochastic solutions described by equations 2 and 7 are computed by multiplying several conditional probabilities. The floating-point multiplications of these probabilities are transformed into an equal number of floating-point additions by computing the logarithm of the optimum-criterion probability. This technique solves the underflow problem, which arises when many small probabilities are multiplied, and accelerates the tagger response time.</Paragraph> </Section> <Section position="2" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 4.2 Fixed-Point Transformation </SectionTitle> <Paragraph position="0"> The fixed-point transformation converts the floating-point logarithmic additions into an equal number of fixed-point additions. It is realized by the following quantization process: $$I_x = \mathrm{Round}\!\left[\frac{I_{\max}\,(\ln P_{\min} - \ln P_x)}{M_w \ln P_{\min}}\right] \quad (8)$$ where $P_x$ is a conditional probability, $P_{\min}$ is the minimum conditional probability in the model parameter set, $I_{\max}$ is the maximum integer of the fixed-point arithmetic system, $M_w$ is the maximum number of words in a sentence, and $\mathrm{Round}[\cdot]$ is a quantization function mapping real numbers to the nearest integer.</Paragraph> <Paragraph position="1"> After the logarithmic and the fixed-point transformations, equations 2 and 7 become:</Paragraph> <Paragraph position="3"> The quantization function approximates the computations, producing solutions that may differ in theory. In practice, the prediction error differences measured for all languages, taggers, and tagsets were less than 0.02 percent.</Paragraph> </Section> <Section position="3" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 4.3 Scaling </SectionTitle> <Paragraph position="0"> The solution obtained by the forward-backward algorithm cannot be logarithmically transformed because of the presence of summations. It is well known that for HMMs the forward and backward probabilities tend exponentially to zero. The scaling process introduced in this case multiplies the forward and backward probabilities by a scaling factor at selected word events in order to keep the computations within the floating-point dynamic range of the computer (Rabiner 1989).</Paragraph> </Section>
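A minimal sketch of the transformations of Sections 4.1 and 4.2 follows: each conditional probability is replaced by a non-negative integer score according to equation 8, so that a sum of at most M_w scores stays within the fixed-point range, and maximizing the integer sum is equivalent to maximizing the product of the original probabilities. The parameter values (I_max, maximum sentence length) are illustrative assumptions, not the authors' settings.

```python
# Logarithmic plus fixed-point transformation (equation 8): probabilities are
# mapped to integer scores; hypothetical parameter values, not the authors' code.
import math

def quantize_model(probs, i_max=2**31 - 1, max_sentence_len=200):
    """probs: dict mapping an event (e.g. a tag pair or a word-tag pair) to its
    conditional probability.  Returns the fixed-point scores I_x."""
    p_min = min(probs.values())
    log_p_min = math.log(p_min)
    return {
        event: round(i_max * (log_p_min - math.log(p)) /
                     (max_sentence_len * log_p_min))
        for event, p in probs.items()
    }

# Example: a more probable event receives a larger integer score, so the
# Viterbi search can maximize a sum of integers instead of a product of floats.
scores = quantize_model({"a": 0.5, "b": 0.3, "c": 0.2})
assert scores["a"] > scores["b"] > scores["c"]
```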
<Section position="4" start_page="145" end_page="146" type="sub_section"> <SectionTitle> 4.4 Hardware--Software </SectionTitle> <Paragraph position="0"> The taggers have been realized under MS-DOS using a 32-bit C compiler. The lexicon size is limited by the available RAM; a mean value of 35 bytes per word is allocated.</Paragraph> <Paragraph position="1"> The tagger speed exceeds 500 words/sec on an 80386 (33 MHz) for all languages and tagsets on text containing only known words. A maximum memory requirement of 930 KB has been measured in the experiments described in this paper.</Paragraph> <Paragraph position="2"> A set of symbols and keywords (a sentence-separator set) and the maximum sentence length are the only manually defined parameters when the HMM taggers are applied.</Paragraph> <Paragraph position="3"> In the MLM taggers, the manually defined parameters are the word occurrence threshold that isolates the less probable words and the tag probability threshold used to reject the less probable tags from the unknown word tagset.</Paragraph> <Paragraph position="4"> The training process has been designed to estimate or update the model parameters from fully tagged text without any manual intervention. Therefore, frequency counts, rather than conditional probabilities, are stored and updated as model parameters; the conditional probabilities are computed afterwards from the corresponding relative frequencies.</Paragraph> </Section> <SectionTitle> 5. Performance of the Systems </SectionTitle> <Section position="5" start_page="146" end_page="146" type="sub_section"> <SectionTitle> 5.1 Taggers </SectionTitle> <Paragraph position="0"> Five taggers have been realized and tested using bi-POS and tri-POS transition probabilities: the first- and second-order MLM taggers (MLM1 and MLM2, respectively), the first- and second-order HMM taggers based on the most probable tag sequence criterion (HMM-TS1 and HMM-TS2, respectively), and the first-order HMM tagger based on the most probable tag criterion (HMM-T1).</Paragraph> </Section> <Section position="6" start_page="146" end_page="146" type="sub_section"> <SectionTitle> 5.2 Corpora </SectionTitle> <Paragraph position="0"> The tagger performance has been measured in extensive experiments carried out on corpora of seven languages (English, Dutch, German, French, Greek, Italian, and Spanish) annotated according to detailed grammatical categories. In Table 1 the type and the size of these corpora are shown. They are part of corpora selected in the framework of the ESPRIT-I project 291/860, &quot;Linguistic Analysis of the European Languages&quot; (1985-1989), by the project partners (Table 2) and annotated using semi-automatic taggers. Manual correction was performed by experienced, native analysts for each language separately. In all languages the entries were tagged as they appeared in the text. In the German corpus, for example, where multiple words are concatenated, the component words were not separated.</Paragraph> </Section> <Section position="7" start_page="146" end_page="147" type="sub_section"> <SectionTitle> 5.3 Tagsets </SectionTitle> <Paragraph position="0"> Two sets of grammatical tags were isolated from a unified set of grammatical categories defined in the ESPRIT-I project 291/860 (ESPRIT-860, Internal report, 1986): A common tagset of 11 main grammatical categories for each language, as described in 2.1.3.</Paragraph> <Paragraph position="1"> An extended set including a common categorization of the grammatical information for all languages, as shown in Table 3. In some languages a number of grammatical categories are not applicable. The depth of grammatical analysis and the grammatical structure of each language produce a different number of POS tags.
In Table 4 the number of POS tags used for each language and each set of grammatical categories is shown.</Paragraph> </Section> <Section position="8" start_page="147" end_page="149" type="sub_section"> <SectionTitle> 5.4 Corpus Ambiguity </SectionTitle> <Paragraph position="0"> The corpus ambiguity was measured as the mean number of possible tags for each word of the corpus, for both sets of grammatical tags (Table 5). The most ambiguous texts are the French, Italian, and English texts for the tagset of main grammatical classes, and the German, Greek, Italian, and French texts for the extended set of grammatical categories.</Paragraph> <Paragraph position="1"> Figure 5 shows the percentage of unknown words in an open testing text of 10,000 words versus the size of the training text.</Paragraph> <Paragraph position="2"> The Italian and Greek corpora have the greatest number of unknown words, followed by the Spanish corpus (for the available results with restricted training text). Taking into account the word ambiguity in the training text (Table 5), the occurrence of unknown words in the open testing text (Figure 5), and the hypothesis that the unknown word tagset and the application tagset are the same, the ambiguity of the open testing corpus was computed for both sets of grammatical categories. For the set of main grammatical classes the ambiguity of the open testing corpus is more or less the same for all languages, varying from a minimum of 7.83 tags per word in the Dutch text to a maximum of 9.32 in the Greek corpus. For the extended set of grammatical categories three types of corpora can be distinguished: a. The most ambiguous is the corpus of the Greek language, because of the great number of grammatical tags (443) and the strong presence of unknown words in the open testing text.</Paragraph> <Paragraph position="3"> b. In the German, Spanish, and Italian texts the same ambiguity is measured.</Paragraph> <Paragraph position="4"> c. The least ambiguous are the Dutch and French texts.</Paragraph> <Paragraph position="5"> Taking into account the previous results, it is important to note that the great differences between languages in text ambiguity, in the presence of unknown words, and in the statistics of the grammatical categories (e.g., the different occurrence of prepositions in the English and French corpora) prevent a direct comparison of the languages on the basis of the taggers' error rates. Apart from a few obvious observations given in Section 5.7, such a comparison would require a detailed examination of the corpora and the taggers' errors by experienced linguists. Therefore, the prediction error rates presented in this paper should be regarded only as an indication of the probabilistic taggers' efficiency in each separate language when small training texts are available.</Paragraph> </Section> <Section position="9" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 5.5 Experiments </SectionTitle> <Paragraph position="0"> The corpora were divided into 10,000-word parts. All parts except the last one were used to create (initially) and then successively update the model parameters. The last part was tagged each time after the model parameters were updated, giving results of the tagger performance on open testing text. The influence of the application tagset on the tagger performance was measured by testing the two totally different tagsets described in Section 5.3.</Paragraph>
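A sketch of the incremental protocol just described follows: the model is updated one 10,000-word part at a time, and the held-out last part is re-tagged after every update. The model and tagger interfaces are hypothetical; this is not the authors' experimental code.

```python
# Incremental training/evaluation protocol of Section 5.5.
# Hypothetical model/tagger interfaces; not the authors' code.
def incremental_evaluation(parts, model, tagger):
    """parts: list of tagged 10,000-word segments, each a list of (word, tag).
    model must support .update(segment); tagger(model, words) returns tags."""
    held_out = parts[-1]
    words = [w for w, _ in held_out]
    gold = [t for _, t in held_out]
    error_rates = []
    for segment in parts[:-1]:
        model.update(segment)                     # successive parameter update
        predicted = tagger(model, words)
        errors = sum(p != g for p, g in zip(predicted, gold))
        error_rates.append(errors / len(gold))    # open-test error after each update
    return error_rates
```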
<Paragraph position="1"> The experimental process was repeated for each language, tagset, and tagger.</Paragraph> <Paragraph position="2"> Thus a total of 2 (tagsets) × 5 (taggers) × [7 (languages) + 1 (test on English EEC-law text)] = 80 experiments was carried out.</Paragraph> </Section> <Section position="10" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 5.6 Tagger Speed and Memory Requirements </SectionTitle> <Paragraph position="0"> In Figures 6 and 7 the tagger speed and the memory requirements after the last memory adaptation process are presented for all taggers and languages, for the extended tagset.</Paragraph> <Paragraph position="1"> The Greek and Italian corpora have a large number of lexical entries (different word forms) for the same 100,000-word training text, as shown in Table 7. As a result, these taggers require more memory (Figure 7). In contrast, the small size of the German lexicon decreases the required memory.</Paragraph> <Paragraph position="2"> Tagger speed is closely related to the corpus ambiguity (Table 6). The ambiguity of the Greek corpus is more than three times greater than that of the next one, the German corpus.</Paragraph> <Paragraph position="3"> The significant influence of the training text size on tagger speed is shown by comparing the experimental results on the English corpora (newspaper and EEC-law). When the taggers are trained using the 170,000 words of the English newspaper corpus, a greater number of lexicon entries and a greater number of transition probabilities (Figure 7) are measured than in the case of the EEC-law corpus (100,000-word training text). The model becomes more complex, but tagger speed is slightly higher because of the greater size of the training text, which reduces the presence of unknown words in the testing text. Generally, tagger speed increases when the training text is increased.</Paragraph> </Section> <Section position="11" start_page="149" end_page="152" type="sub_section"> <SectionTitle> 5.7 Tagger Error Rate </SectionTitle> <Paragraph position="0"> The actual tagger error rates for all experiments are given in Appendices A and B. In this section we present a discussion of these error rates.</Paragraph> <Paragraph position="1"> The error rate depends strongly on the test text and language, and on the type and size of the tagset. The worst results have been obtained for the Greek language because of its significantly greater ambiguity, its large number of tags (which requires a significantly larger training text), and its freer syntax.</Paragraph> <Paragraph position="2"> Figure 8. Unknown word error rate for the HMM-TS2 tagger and the set of main grammatical categories.</Paragraph> <Paragraph position="3"> In the main-category tagset experiments, the model parameters for the MLM systems are estimated accurately when the training text exceeds 50,000-90,000 words, in contrast to the extended tagset experiments, where a larger training text is required for the German, Greek, and Spanish languages. This phenomenon becomes stronger in taggers based on the HMM, where the accuracy of the P(w | t) estimation is proportional to the word and the tag frequency of occurrence in the training text. Thus, for all tagsets and languages a larger training text is required in order to minimize the error rate.</Paragraph>
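The dependence of the estimation accuracy on training-text frequencies follows from the relative-frequency form of the parameters; a minimal sketch, assuming a (word, tag) corpus format, is given below. It is not the authors' code, and no smoothing of unseen events is applied.

```python
# Relative-frequency estimation of the model parameters: P(w | t) and
# P(t | t_prev) are ratios of counts, so their accuracy grows with the word
# and tag frequencies in the training text.  Not the authors' code.
from collections import Counter

def estimate_parameters(train):
    """train: list of (word, tag) pairs; returns (lex_p, trans_p) dictionaries."""
    tag_counts = Counter(t for _, t in train)
    word_tag = Counter((w, t) for w, t in train)
    tag_bigrams = Counter((train[i - 1][1], train[i][1]) for i in range(1, len(train)))

    lex_p = {(w, t): c / tag_counts[t] for (w, t), c in word_tag.items()}        # P(w | t)
    trans_p = {(tp, t): c / tag_counts[tp] for (tp, t), c in tag_bigrams.items()} # P(t | t_prev)
    return lex_p, trans_p
```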
<Paragraph position="4"> The taggers based on the HMM reduce the prediction error to almost half of that of the same-order taggers based on the MLM. Strong dependencies on the language and on the estimation accuracy of the model parameters influence this reduction. The alternative HMM solutions show only trivial performance differences, confirming recent results obtained on the Treebank corpus using an HMM tagger (Merialdo 1991). Concerning the performance of the taggers on unknown words, we present in Figure 8, as an example, the HMM-TS2 error rate for the tagset of the main grammatical categories, which is also the worst case for this set of grammatical categories. Generally, the error rate decreases when the training text is increased. The stochastic model is successful for only half of the unknown words in the Italian text and for approximately two out of three unknown words in the English text. In all other languages the HMM-TS2 tagger gives the correct solution for three out of four unknown words. Similar results are achieved when the extended set of grammatical categories is tested. In this case the unknown word error rate increases by about 10-20 percent for all the languages except Greek. In the Greek text the error rate reaches approximately 65 percent when a 100,000-word text is used to define the parameters of the HMM.</Paragraph> <Paragraph position="5"> The unknown words, which initially cover about 25-35 percent of the text, are reduced to 8-15 percent when all the available text is used as training data. In the majority of the experiments, the tagger error rate decreases when new text updates the model parameters. The trivial differences in the tagger learning rates between languages and tagsets show the efficiency of the training method in estimating the model transition probabilities for the tested languages and the validity of the stochastic hypothesis for the unknown words.</Paragraph> </Section> </Section> </Paper>