<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1031"> <Title>TnT -- A Statistical Part-of-Speech Tagger</Title> <Section position="3" start_page="0" end_page="226" type="intro"> <SectionTitle> 2 Architecture 2.1 The Underlying Model </SectionTitle> <Paragraph position="0"> TnT uses second order Markov models for part-of-speech tagging. The states of the model represent tags, outputs represent the words. Transition probabilities depend on the states, thus on pairs of tags. Output probabilities only depend on the most recent category. To be explicit, we calculate</Paragraph> <Paragraph position="2"> $$\operatorname*{argmax}_{t_1 \ldots t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)$$</Paragraph> <Paragraph position="3"> for a given sequence of words $w_1 \ldots w_T$ of length $T$.</Paragraph> <Paragraph position="4"> $t_1 \ldots t_T$ are elements of the tagset; the additional tags $t_{-1}$, $t_0$, and $t_{T+1}$ are beginning-of-sequence and end-of-sequence markers. Using these additional tags, even if they stem from rudimentary processing of punctuation marks, slightly improves tagging results. This is different from formulas presented in other publications, which just stop with a &quot;loose end&quot; at the last word. If sentence boundaries are not marked in the input, TnT adds these tags if it encounters one of [.!?;] as a token.</Paragraph> <Paragraph position="5"> Transition and output probabilities are estimated from a tagged corpus. As a first step, we use the maximum likelihood probabilities $\hat{P}$ which are derived from the relative frequencies:</Paragraph> <Paragraph position="6"> Unigrams: $\hat{P}(t_3) = f(t_3)/N$
Bigrams: $\hat{P}(t_3 \mid t_2) = f(t_2, t_3)/f(t_2)$
Trigrams: $\hat{P}(t_3 \mid t_1, t_2) = f(t_1, t_2, t_3)/f(t_1, t_2)$
Lexical: $\hat{P}(w_3 \mid t_3) = f(w_3, t_3)/f(t_3)$</Paragraph> <Paragraph position="7"> for all $t_1$, $t_2$, $t_3$ in the tagset and $w_3$ in the lexicon. $N$ is the total number of tokens in the training corpus. We define a maximum likelihood probability to be zero if the corresponding numerator and denominator are zero. As a second step, contextual frequencies are smoothed and lexical frequencies are completed by handling words that are not in the lexicon (see below).</Paragraph> <Section position="1" start_page="224" end_page="224" type="sub_section"> <SectionTitle> 2.2 Smoothing </SectionTitle> <Paragraph position="0"> Trigram probabilities generated from a corpus usually cannot be used directly because of the sparse-data problem: there are not enough instances of each trigram to reliably estimate its probability. Furthermore, setting a probability to zero because the corresponding trigram never occurred in the corpus has an undesired effect: it sets the probability of a complete sequence to zero whenever that trigram is needed for a new text sequence, thus making it impossible to rank different sequences that contain a zero probability.</Paragraph> <Paragraph position="1"> The smoothing paradigm that delivers the best results in TnT is linear interpolation of unigrams, bigrams, and trigrams. Therefore, we estimate a trigram probability as follows:</Paragraph> <Paragraph position="2"> $$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)$$</Paragraph> <Paragraph position="3"> $\hat{P}$ are maximum likelihood estimates of the probabilities, and $\lambda_1 + \lambda_2 + \lambda_3 = 1$, so $P$ again represents a probability distribution.</Paragraph>
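To make the interpolation step concrete, here is a minimal Python sketch under stated assumptions: f_uni, f_bi, and f_tri are precomputed tag, tag-pair, and tag-triple frequency tables, N is the corpus size, and lambdas is a fixed weight triple. The names and data structures are illustrative, not TnT's actual implementation.

def interpolated_trigram_prob(t1, t2, t3, f_uni, f_bi, f_tri, N, lambdas):
    """P(t3 | t1, t2) as a weighted sum of unigram, bigram, and trigram ML estimates."""
    l1, l2, l3 = lambdas  # interpolation weights, assumed to sum to 1

    # Maximum likelihood estimates; each is defined to be 0 when its denominator is 0.
    p_uni = f_uni.get(t3, 0) / N if N > 0 else 0.0
    p_bi = f_bi.get((t2, t3), 0) / f_uni[t2] if f_uni.get(t2, 0) > 0 else 0.0
    p_tri = (f_tri.get((t1, t2, t3), 0) / f_bi[(t1, t2)]
             if f_bi.get((t1, t2), 0) > 0 else 0.0)

    return l1 * p_uni + l2 * p_bi + l3 * p_tri

A zero denominator yields a zero estimate, matching the convention stated above for the maximum likelihood probabilities.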
<Paragraph position="4"> We use the context-independent variant of linear interpolation, i.e., the values of the $\lambda$s do not depend on the particular trigram. Contrary to intuition, this yields better results than the context-dependent variant. Due to sparse-data problems, one cannot estimate a different set of $\lambda$s for each trigram. Therefore, it is common practice to group trigrams by frequency and estimate tied sets of $\lambda$s. However, we are not aware of any publication that has investigated frequency groupings for linear interpolation in part-of-speech tagging. All groupings that we have tested yielded results at most equivalent to context-independent linear interpolation; some groupings even yielded worse results. The tested groupings included a) one set of $\lambda$s for each frequency value and b) two classes (low and high frequency) at the two ends of the scale, as well as several groupings in between and several settings for partitioning the classes.</Paragraph> <Paragraph position="5"> The values of $\lambda_1$, $\lambda_2$, and $\lambda_3$ are estimated by deleted interpolation. This technique successively removes each trigram from the training corpus and estimates best values for the $\lambda$s from all other n-grams in the corpus. Given the frequency counts for uni-, bi-, and trigrams, the weights can be determined very efficiently, with a processing time linear in the number of different trigrams. The algorithm is given in figure 1. Note that subtracting 1 means taking unseen data into account. Without this subtraction the model would overfit the training data and would generally yield worse results.</Paragraph> <Paragraph position="6"> set $\lambda_1 = \lambda_2 = \lambda_3 = 0$
foreach trigram $t_1, t_2, t_3$ with $f(t_1, t_2, t_3) > 0$
depending on the maximum of the following three values:
case $\frac{f(t_1, t_2, t_3) - 1}{f(t_1, t_2) - 1}$: increment $\lambda_3$ by $f(t_1, t_2, t_3)$
case $\frac{f(t_2, t_3) - 1}{f(t_2) - 1}$: increment $\lambda_2$ by $f(t_1, t_2, t_3)$
case $\frac{f(t_3) - 1}{N - 1}$: increment $\lambda_1$ by $f(t_1, t_2, t_3)$
end
end
normalize $\lambda_1, \lambda_2, \lambda_3$
Figure 1: Algorithm for estimating the weights for context-independent linear interpolation when the n-gram frequencies are known. $N$ is the size of the corpus. If the denominator in one of the expressions is 0, we define the result of that expression to be 0.</Paragraph> </Section>
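The procedure of figure 1 translates directly into code. The sketch below is an illustrative Python rendering under the same assumed frequency tables as in the previous sketch (f_uni, f_bi, f_tri, and corpus size N); it is not TnT's implementation. Each trigram votes, weighted by its count, for whichever of the three leave-one-out relative frequencies is largest.

def deleted_interpolation(f_uni, f_bi, f_tri, N):
    """Estimate lambda_1, lambda_2, lambda_3 by deleted interpolation (cf. figure 1)."""
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), count in f_tri.items():
        # Relative frequencies with the current trigram removed; a fraction
        # whose denominator is 0 is defined to be 0.
        c3 = (count - 1) / (f_bi[(t1, t2)] - 1) if f_bi[(t1, t2)] > 1 else 0.0
        c2 = (f_bi[(t2, t3)] - 1) / (f_uni[t2] - 1) if f_uni[t2] > 1 else 0.0
        c1 = (f_uni[t3] - 1) / (N - 1) if N > 1 else 0.0
        # The lambda of the winning estimate is incremented by the trigram count.
        best = max(c1, c2, c3)
        if best == c3:
            l3 += count
        elif best == c2:
            l2 += count
        else:
            l1 += count
    total = l1 + l2 + l3
    return (l1 / total, l2 / total, l3 / total) if total > 0 else (1 / 3, 1 / 3, 1 / 3)

The final division normalizes the weights so that $\lambda_1 + \lambda_2 + \lambda_3 = 1$; since figure 1 does not specify how ties between the three values are broken, the sketch simply prefers the higher-order estimate.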
<Section position="2" start_page="224" end_page="225" type="sub_section"> <SectionTitle> 2.3 Handling of Unknown Words </SectionTitle> <Paragraph position="0"> Currently, the method of handling unknown words that seems to work best for inflected languages is a suffix analysis as proposed in (Samuelsson, 1993).</Paragraph> <Paragraph position="1"> Tag probabilities are set according to the word's ending. The suffix is a strong predictor for word classes, e.g., words in the Wall Street Journal part of the Penn Treebank ending in able are adjectives (JJ) in 98% of the cases (e.g. fashionable, variable); the remaining 2% are nouns (e.g. cable, variable).</Paragraph> <Paragraph position="2"> The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix of some predefined maximum length. The term suffix as used here means &quot;final sequence of characters of a word&quot;, which is not necessarily a linguistically meaningful suffix.</Paragraph> <Paragraph position="3"> Probabilities are smoothed by successive abstraction. This calculates the probability of a tag $t$ given the last $m$ letters $l_i$ of an $n$-letter word: $P(t \mid l_{n-m+1}, \ldots, l_n)$. The sequence of increasingly more general contexts omits more and more characters of the suffix, such that $P(t \mid l_{n-m+2}, \ldots, l_n)$, $P(t \mid l_{n-m+3}, \ldots, l_n)$, ..., $P(t)$ are used for smoothing. The recursion formula is</Paragraph> <Paragraph position="4"> $$P(t \mid l_{n-i+1}, \ldots, l_n) = \frac{\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) + \theta_i\, P(t \mid l_{n-i+2}, \ldots, l_n)}{1 + \theta_i}$$</Paragraph> <Paragraph position="5"> for $i = m \ldots 0$, using the maximum likelihood estimates $\hat{P}$ from frequencies in the lexicon, weights $\theta_i$, and the initialization $P(t) = \hat{P}(t)$.</Paragraph> <Paragraph position="6"> The maximum likelihood estimate for a suffix of length $i$ is derived from corpus frequencies by</Paragraph> <Paragraph position="7"> $$\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) = \frac{f(t, l_{n-i+1}, \ldots, l_n)}{f(l_{n-i+1}, \ldots, l_n)}$$</Paragraph> <Paragraph position="8"> For the Markov model, we need the inverse conditional probabilities $P(l_{n-i+1}, \ldots, l_n \mid t)$, which are obtained by Bayesian inversion.</Paragraph> <Paragraph position="9"> A theoretically motivated argument uses the standard deviation of the maximum likelihood probabilities for the weights $\theta_i$ (Samuelsson, 1993). This leaves room for interpretation.</Paragraph> <Paragraph position="10"> 1) One has to identify a good value for $m$, the longest suffix used. The approach taken for TnT is the following: $m$ depends on the word in question. We use the longest suffix that we can find in the training set (i.e., for which the frequency is greater than or equal to 1), but at most 10 characters. This is an empirically determined choice.</Paragraph> <Paragraph position="11"> 2) We use a context-independent approach for $\theta_i$, as we did for the contextual weights $\lambda_i$. It turned out to be a good choice to set all $\theta_i$ to the standard deviation of the unconditioned maximum likelihood probabilities of the tags in the training corpus, i.e.,</Paragraph> <Paragraph position="12"> $$\theta_i = \sqrt{\frac{1}{s-1} \sum_{j=1}^{s} \left(\hat{P}(t_j) - \bar{P}\right)^2} \quad \text{for all } i, \qquad \bar{P} = \frac{1}{s} \sum_{j=1}^{s} \hat{P}(t_j),$$ where $s$ is the number of tags in the tagset.</Paragraph> <Paragraph position="13"> This usually yields values in the range 0.03 ... 0.10.</Paragraph> <Paragraph position="14"> 3) We use different estimates for uppercase and lowercase words, i.e., we maintain two different suffix tries depending on the capitalization of the word. This information improves the tagging results.</Paragraph> <Paragraph position="15"> 4) Another freedom concerns the choice of the words in the lexicon that should be used for suffix handling. Should we use all words, or are some of them better suited than others? Accepting that unknown words are most probably infrequent, one can argue that using suffixes of infrequent words in the lexicon is a better approximation for unknown words than using suffixes of frequent words. Therefore, we restrict the procedure of suffix handling to words with a frequency smaller than or equal to some threshold value. Empirically, 10 turned out to be a good choice for this threshold.</Paragraph> </Section>
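To illustrate the suffix handling just described, the following Python sketch computes a smoothed $P(t \mid \text{suffix})$ by successive abstraction, building up from the empty suffix. The data structures are assumptions: f_suffix_tag maps (suffix, tag) pairs to counts, f_tag maps tags to counts, and theta is the single context-independent weight of point 2). The separate uppercase and lowercase tries, the frequency threshold of 10, and the Bayesian inversion required by the Markov model are omitted for brevity.

def suffix_tag_prob(tag, word, f_suffix_tag, f_tag, theta, max_len=10):
    """P(tag | longest known suffix of word), smoothed by successive abstraction."""
    total_tokens = sum(f_tag.values())
    p = f_tag.get(tag, 0) / total_tokens              # initialization: P(t) = P^(t)
    for i in range(1, min(max_len, len(word)) + 1):
        suffix = word[-i:]
        denom = sum(f_suffix_tag.get((suffix, t), 0) for t in f_tag)
        if denom == 0:                                # suffix unseen in training: stop
            break
        p_ml = f_suffix_tag.get((suffix, tag), 0) / denom   # ML estimate for length i
        p = (p_ml + theta * p) / (1 + theta)          # recursion formula above
    return p

For use in the tagger these values would still have to be inverted, $P(l_{n-i+1}, \ldots, l_n \mid t) = P(t \mid l_{n-i+1}, \ldots, l_n)\, P(l_{n-i+1}, \ldots, l_n) / P(t)$, as noted above.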
<Section position="3" start_page="225" end_page="226" type="sub_section"> <SectionTitle> 2.4 Capitalization </SectionTitle> <Paragraph position="0"> Additional information that turned out to be useful for the disambiguation process for several corpora and tagsets is capitalization information. Tags are usually not informative about capitalization, but probability distributions of tags around capitalized words are different from those around non-capitalized words. The effect is larger for English, which only capitalizes proper names, and smaller for German, which capitalizes all nouns.</Paragraph> <Paragraph position="1"> We use flags $c_i$ that are true if $w_i$ is a capitalized word and false otherwise. These flags are added to the contextual probability distributions. Instead of</Paragraph> <Paragraph position="2"> $$P(t_3 \mid t_1, t_2)$$</Paragraph> <Paragraph position="3"> we use</Paragraph> <Paragraph position="4"> $$P(t_3, c_3 \mid t_1, c_1, t_2, c_2)$$</Paragraph> <Paragraph position="5"> and equations (3) to (5) are updated accordingly. This is equivalent to doubling the size of the tagset and using different tags depending on capitalization.</Paragraph> </Section> <Section position="4" start_page="226" end_page="226" type="sub_section"> <SectionTitle> 2.5 Beam Search </SectionTitle> <Paragraph position="0"> The processing time of the Viterbi algorithm (Rabiner, 1989) can be reduced by introducing a beam search. Each state that receives a $\delta$ value smaller than the largest $\delta$ divided by some threshold value $\theta$ is excluded from further processing. While the Viterbi algorithm is guaranteed to find the sequence of states with the highest probability, this is no longer true when beam search is added. Nevertheless, for practical purposes and the right choice of $\theta$, there is virtually no difference between the algorithm with and without a beam. Empirically, a value of $\theta = 1000$ turned out to approximately double the speed of the tagger without affecting the accuracy.</Paragraph> <Paragraph position="1"> The tagger currently tags between 30,000 and 60,000 tokens per second (including file I/O) on a Pentium 500 running Linux. The speed mainly depends on the percentage of unknown words and on the average ambiguity rate.</Paragraph> </Section> </Section> </Paper>