<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1018"> <Title>A Practical Part-of-Speech Tagger</Title> <Section position="3" start_page="0" end_page="133" type="metho"> <SectionTitle> 2 Methodology 2.1 Background </SectionTitle> <Paragraph position="0"> Several different approaches have been used for building text taggers. Greene and Rubin used a rule-based approach in the TAGGIT program \[Greene and Rubin, 1971\], which was an aid in tagging the Brown corpus \[Francis and Kučera, 1982\]. TAGGIT disambiguated 77% of the corpus; the rest was done manually over a period of several years. More recently, Koskenniemi also used a rule-based approach implemented with finite-state machines \[Koskenniemi, 1990\].</Paragraph> <Paragraph position="1"> Statistical methods have also been used (e.g., \[DeRose, 1988\], \[Garside et al., 1987\]). These provide the capability of resolving ambiguity on the basis of the most likely interpretation. A form of Markov model has been widely used that assumes that a word depends probabilistically on just its part-of-speech category, which in turn depends solely on the categories of the preceding two words.</Paragraph> <Paragraph position="2"> Two types of training (i.e., parameter estimation) have been used with this model. The first makes use of a tagged training corpus. Derouault and Merialdo use a bootstrap method for training \[Derouault and Merialdo, 1986\]. At first, a relatively small amount of text is manually tagged and used to train a partially accurate model. The model is then used to tag more text, and the tags are manually corrected and then used to retrain the model. Church uses the tagged Brown corpus for training \[Church, 1988\].</Paragraph> <Paragraph position="3"> These models involve probabilities for each word in the lexicon, so large tagged corpora are required for reliable estimation.</Paragraph> <Paragraph position="4"> The second method of training does not require a tagged training corpus. In this situation the Baum-Welch algorithm (also known as the forward-backward algorithm) can be used \[Baum, 1972\]. Under this regime the model is called a hidden Markov model (HMM), as state transitions (i.e., part-of-speech categories) are assumed to be unobservable. Jelinek has used this method for training a text tagger \[Jelinek, 1985\]. Parameter smoothing can be conveniently achieved using the method of deleted interpolation, in which weighted estimates are taken from second- and first-order models and a uniform probability distribution \[Jelinek and Mercer, 1980\]. Kupiec used word equivalence classes (referred to here as ambiguity classes) based on parts of speech to pool data from individual words \[Kupiec, 1989b\]. The most common words are still represented individually, as sufficient data exist for robust estimation.</Paragraph> <Paragraph position="5"> However, all other words are represented according to the set of possible categories they can assume. In this manner, the vocabulary of 50,000 words in the Brown corpus can be reduced to approximately 400 distinct ambiguity classes \[Kupiec, 1992\]. To further reduce the number of parameters, a first-order model can be employed (this assumes that a word's category depends only on the immediately preceding word's category). In \[Kupiec, 1989a\], networks are used to selectively augment the context in a basic first-order model, rather than using uniformly second-order dependencies.
</Paragraph> <Section position="1" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 2.2 Our approach </SectionTitle> <Paragraph position="0"> We next describe how our choice of techniques satisfies the criteria listed in section 1. The use of an HMM permits complete flexibility in the choice of training corpora. Text from any desired domain can be used, and a tagger can be tailored for use with a particular text database by training on a portion of that database. Lexicons containing alternative tag sets can be easily accommodated without any need for re-labeling the training corpus, affording further flexibility in the use of specialized tags. As the resources required are simply a lexicon and a suitably large sample of ordinary text, taggers can be built with minimal effort, even for other languages, such as French (e.g., \[Kupiec, 1992\]). The use of ambiguity classes and a first-order model reduces the number of parameters to be estimated without significant reduction in accuracy (discussed in section 5). This also enables a tagger to be reliably trained using only moderate amounts of text. We have produced reasonable results training on as few as 3,000 sentences. Fewer parameters also reduce the time required for training. Relatively few ambiguity classes are sufficient for wide coverage, so it is unlikely that adding new words to the lexicon requires retraining, as their ambiguity classes are already accommodated. Vocabulary independence is achieved by predicting categories for words not in the lexicon, using both context and suffix information. Probabilities corresponding to category sequences that never occurred in the training data are assigned small, non-zero values, ensuring that the model will accept any sequence of tokens, while still providing the most likely tagging. By using the fact that words are typically associated with only a few part-of-speech categories, and carefully ordering the computation, the algorithms have linear complexity (section 3.3).</Paragraph> </Section> </Section> <Section position="4" start_page="133" end_page="135" type="metho"> <SectionTitle> 3 Hidden Markov Modeling </SectionTitle> <Paragraph position="0"> The hidden Markov modeling component of our tagger is implemented as an independent module following the specification given in \[Levinson et al., 1983\], with special attention to space and time efficiency issues. Only first-order modeling is addressed and will be presumed for the remainder of this discussion.</Paragraph> <Section position="1" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 3.1 Formalism </SectionTitle> <Paragraph position="0"> In brief, an HMM is a doubly stochastic process that generates a sequence of symbols $S = \{S_1, S_2, \ldots, S_T\}$, $S_i \in W$, $1 \le i \le T$, where $W$ is some finite set of possible symbols, by composing an underlying Markov process with a state-dependent symbol generator (i.e., a Markov process with noise). The Markov process captures the notion of sequence dependency and is described by a set of $N$ states, a matrix of transition probabilities $A = \{a_{ij}\}$, $1 \le i, j \le N$, where $a_{ij}$ is the probability of moving from state $i$ to state $j$, and a vector of initial probabilities $\Pi = \{\pi_i\}$, $1 \le i \le N$, where $\pi_i$ is the probability of starting in state $i$.
The symbol generator is a state-dependent measure on $W$ described by a matrix of symbol probabilities $B = \{b_{jk}\}$, $1 \le j \le N$ and $1 \le k \le M$, where $M = |W|$ and $b_{jk}$ is the probability of generating symbol $s_k$ given that the Markov process is in state $j$. In part-of-speech tagging, we will model word order dependency through an underlying Markov process that operates in terms of lexical tags, yet we will only be able to observe the sets of tags, or ambiguity classes, that are possible for individual words. The ambiguity class of each word is the set of its permitted parts of speech, only one of which is correct in context. Given the parameters $A$, $B$ and $\Pi$, hidden Markov modeling allows us to compute the most probable sequence of state transitions, and hence the most likely sequence of lexical tags, corresponding to a sequence of ambiguity classes. In the following, $N$ can be identified with the number of possible tags, and $W$ with the set of all ambiguity classes.</Paragraph> <Paragraph position="1"> Applying an HMM consists of two tasks: estimating the model parameters $A$, $B$ and $\Pi$ from a training set; and computing the most likely sequence of underlying state transitions given new observations. Maximum likelihood estimates (that is, estimates that maximize the probability of the training set) can be found through application of alternating expectation in a procedure known as the Baum-Welch, or forward-backward, algorithm \[Baum, 1972\]. It proceeds by recursively defining two sets of probabilities, the forward probabilities, $$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\,a_{ij}\Big]\, b_j(S_{t+1}) \quad (1)$$ where $\alpha_1(i) = \pi_i\, b_i(S_1)$ for all $i$, and the backward probabilities, $$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j) \quad (2)$$ where $\beta_T(j) = 1$ for all $j$. The forward probability $\alpha_t(i)$ is the joint probability of the sequence up to time $t$, $\{S_1, S_2, \ldots, S_t\}$, and the event that the Markov process is in state $i$ at time $t$. Similarly, the backward probability $\beta_t(j)$ is the probability of seeing the sequence $\{S_{t+1}, S_{t+2}, \ldots, S_T\}$ given that the Markov process is in state $j$ at time $t$. It follows that the probability of the entire sequence is $$P = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$$ for any $t$ in the range $1 \le t \le T-1$. (For an introduction to hidden Markov modeling see \[Rabiner and Juang, 1986\].)</Paragraph> <Paragraph position="2"> Given an initial choice for the parameters $A$, $B$, and $\Pi$, the expected number of transitions, $\gamma_{ij}$, from state $i$ to state $j$ conditioned on the observation sequence $S$ may be computed as follows: $$\gamma_{ij} = \frac{1}{P} \sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j).$$ Hence we can estimate $a_{ij}$ by: $$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(S_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)} \quad (3)$$ Similarly, $b_{jk}$ and $\pi_i$ can be estimated as follows: $$\hat{b}_{jk} = \frac{\sum_{t:\,S_t = s_k} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)} \quad (4) \qquad \hat{\pi}_i = \frac{\alpha_1(i)\, \beta_1(i)}{\sum_{j=1}^{N} \alpha_1(j)\, \beta_1(j)} \quad (5)$$</Paragraph> <Paragraph position="3"> In summary, to find maximum likelihood estimates for $A$, $B$, and $\Pi$ via the Baum-Welch algorithm, one chooses some starting values, applies equations 3-5 to compute new values, and then iterates until convergence. It can be shown that this algorithm will converge, although possibly to a non-global maximum \[Baum, 1972\].</Paragraph> <Paragraph position="4"> Once a model has been estimated, selecting the most likely underlying sequence of state transitions corresponding to an observation $S$ can be thought of as a maximization over all sequences that might generate $S$. An efficient dynamic programming procedure, known as the Viterbi algorithm \[Viterbi, 1967\], arranges for this computation to proceed in time proportional to $T$. Suppose $V = \{v(t)\}$, $1 \le t \le T$, is a state sequence that generates $S$; then the probability that $V$ generates $S$ is $$P(V, S) = \pi_{v(1)}\, b_{v(1)}(S_1) \prod_{t=2}^{T} a_{v(t-1)v(t)}\, b_{v(t)}(S_t).$$ The Viterbi algorithm computes $$\phi_1(i) = \pi_i\, b_i(S_1), \qquad \phi_t(j) = \max_{1 \le i \le N}\big[\phi_{t-1}(i)\, a_{ij}\big]\, b_j(S_t) \quad (6)$$ for $2 \le t \le T$ and $1 \le j \le N$.
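Before turning to the reconstruction of the most probable state sequence below, the following is a minimal, unscaled sketch of one Baum-Welch re-estimation pass corresponding to equations 1-5. It is an illustration only, not the implementation described in this paper: the function name, array layout, and the use of NumPy are assumptions, and the rescaling of section 3.2 is omitted for brevity.

```python
import numpy as np

def baum_welch_pass(A, B, pi, obs):
    """One unscaled Baum-Welch pass over a single observation sequence.

    A:   (N, N) transition probabilities a_ij
    B:   (N, M) symbol probabilities b_jk
    pi:  (N,)   initial state probabilities
    obs: list of symbol indices S_1..S_T
    Returns re-estimated (A, B, pi).  Illustrative only; no scaling.
    """
    N, T = A.shape[0], len(obs)

    # Forward probabilities: alpha[t, i] = P(S_1..S_t, state i at time t)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward probabilities: beta[t, j] = P(S_{t+1}..S_T | state j at time t)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    # Expected transition counts (numerator of equation 3), summed over t
    trans = np.zeros((N, N))
    for t in range(T - 1):
        trans += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
    new_A = trans / trans.sum(axis=1, keepdims=True)

    # State-occupancy probabilities, used for the B and pi estimates (eqs. 4-5)
    occ = alpha * beta
    occ /= occ.sum(axis=1, keepdims=True)
    new_B = np.zeros_like(B)
    for t in range(T):
        new_B[:, obs[t]] += occ[t]
    new_B /= new_B.sum(axis=1, keepdims=True)
    new_pi = occ[0]
    return new_A, new_B, new_pi
```

In practice one would iterate this pass until the training-set likelihood converges, and apply the rescaling described in the next subsection to avoid underflow.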
The crucial observation is that for each time $t$ and each state $i$ one need only consider the most probable sequence arriving at state $i$ at time $t$. The probability of the most probable sequence is $\max_{1 \le i \le N}[\phi_T(i)]$, while the sequence itself can be reconstructed by defining $v(T) = \arg\max_{1 \le i \le N} \phi_T(i)$ and $$v(t) = \arg\max_{1 \le i \le N}\big[\phi_t(i)\, a_{i\,v(t+1)}\big] \quad \text{for } 1 \le t \le T-1.$$</Paragraph> </Section> <Section position="2" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 3.2 Numerical Stability </SectionTitle> <Paragraph position="0"> The Baum-Welch algorithm (equations 1-5) and the Viterbi algorithm (equation 6) involve operations on products of numbers constrained to be between 0 and 1. Since these products can easily underflow, measures must be taken to rescale. One approach premultiplies the $\alpha$ and $\beta$ probabilities with an accumulating product depending on $t$ \[Levinson et al., 1983\]. Let $\tilde{\alpha}_1(i) = \alpha_1(i)$ and define $$c_t = \Big[\sum_{i=1}^{N} \tilde{\alpha}_t(i)\Big]^{-1}, \quad 1 \le t \le T.$$ Now define $\hat{\alpha}_t(i) = c_t\, \tilde{\alpha}_t(i)$ and use $\hat{\alpha}$ in place of $\alpha$ in equation 1 to define $\tilde{\alpha}$ for the next iteration: $$\tilde{\alpha}_{t+1}(j) = \Big[\sum_{i=1}^{N} \hat{\alpha}_t(i)\, a_{ij}\Big]\, b_j(S_{t+1}).$$</Paragraph> <Paragraph position="1"> The scaled forward and backward probabilities, $\hat{\alpha}$ and $\hat{\beta}$, can be exchanged for the unscaled probabilities in equations 3-5 without affecting the value of the ratios. To see this, note that $\hat{\alpha}_t(i) = C_1^t\, \alpha_t(i)$ and $\hat{\beta}_t(i) = \beta_t(i)\, C_{t+1}^T$, where $$C_i^j = \prod_{t=i}^{j} c_t.$$ Now, in terms of the scaled probabilities, equation 5, for example, can be seen to be unchanged: $$\hat{\pi}_i = \frac{\hat{\alpha}_1(i)\, \hat{\beta}_1(i)}{\sum_{j=1}^{N} \hat{\alpha}_1(j)\, \hat{\beta}_1(j)} = \frac{C_1^1 C_2^T\, \alpha_1(i)\, \beta_1(i)}{C_1^1 C_2^T \sum_{j=1}^{N} \alpha_1(j)\, \beta_1(j)} = \frac{\alpha_1(i)\, \beta_1(i)}{\sum_{j=1}^{N} \alpha_1(j)\, \beta_1(j)}.$$ A slight difficulty occurs in equation 3 that can be cured by the addition of a new term, $c_{t+1}$, in each product of the upper sum: $$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, a_{ij}\, b_j(S_{t+1})\, \hat{\beta}_{t+1}(j)\, c_{t+1}}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i)\, \hat{\beta}_t(i)}.$$</Paragraph> <Paragraph position="2"> Numerical instability in the Viterbi algorithm can be ameliorated by operating on a logarithmic scale \[Levinson et al., 1983\]. That is, one maximizes the log probability of each sequence of state transitions, $$\log P(V, S) = \log \pi_{v(1)} + \log b_{v(1)}(S_1) + \sum_{t=2}^{T} \big[\log a_{v(t-1)v(t)} + \log b_{v(t)}(S_t)\big].$$ Hence, equation 6 is replaced by $$\phi_1(i) = \log \pi_i + \log b_i(S_1), \qquad \phi_t(j) = \max_{1 \le i \le N}\big[\phi_{t-1}(i) + \log a_{ij}\big] + \log b_j(S_t).$$ Care must be taken with zero probabilities. However, this can be elegantly handled through the use of IEEE negative infinity \[P754, 1981\].</Paragraph> </Section> <Section position="3" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 3.3 Reducing Time Complexity </SectionTitle> <Paragraph position="0"> As can be seen from equations 1-5, the time cost of training is $O(TN^2)$. Similarly, as given in equation 6, the Viterbi algorithm is also $O(TN^2)$. However, in part-of-speech tagging, the problem structure dictates that the matrix of symbol probabilities $B$ is sparsely populated. That is, $b_{ij} \neq 0$ iff the ambiguity class corresponding to symbol $j$ includes the part-of-speech tag associated with state $i$. In practice, the degree of overlap between ambiguity classes is relatively low; some tokens are assigned unique tags, and hence have only one non-zero symbol probability.</Paragraph> <Paragraph position="1"> The sparseness of $B$ leads one to consider restructuring equations 1-6 so a check for zero symbol probability can obviate the need for further computation. Equation 1 is already conveniently factored so that the dependence on $b_j(S_{t+1})$ is outside the inner sum. Hence, if $k$ is the average number of non-zero entries in each row of $B$, the cost of computing equation 1 can be reduced to $O(kTN)$.</Paragraph> <Paragraph position="2"> Equations 2-4 can be similarly reduced by switching the order of iteration.
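As an illustration of the log-domain decoding just described (and of factoring $b_j(S_t)$ out of the maximization, anticipating section 3.3), here is a small sketch; the function name, the NumPy representation, and the dense-matrix layout are assumptions of this example rather than the tagger's actual code.

```python
import numpy as np

def viterbi_log(A, B, pi, obs):
    """Most likely state sequence for obs, computed on a logarithmic scale.

    Zero probabilities map to IEEE negative infinity, which propagates
    correctly through addition and max.
    """
    with np.errstate(divide="ignore"):            # log(0) -> -inf, silently
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)

    N, T = A.shape[0], len(obs)
    phi = np.zeros((T, N))                         # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)             # backpointers for reconstruction

    phi[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        # scores[i, j] = phi[t-1, i] + log a_ij; log b_j(S_t) is factored
        # outside the maximization, as in section 3.3.
        scores = phi[t - 1][:, None] + logA
        back[t] = scores.argmax(axis=0)
        phi[t] = scores.max(axis=0) + logB[:, obs[t]]

    # Reconstruct the most probable sequence by following backpointers.
    states = [int(phi[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]
```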
For example, in equation 2, rather than computing $\beta_t(i)$ for each $i$ one at a time for a given $t$, one can accumulate terms for all $i$ in parallel. The net effect of this rewriting is to place a $b_j(S_{t+1}) = 0$ check outside the innermost iteration. Equations 3 and 4 submit to a similar approach. Equation 5 is already only $O(N)$. Hence, the overall cost of training can be reduced to $O(kTN)$, which, in our experience, amounts to an order of magnitude speedup. (An equivalent approach maintains a mapping from states $i$ to non-zero symbol probabilities and simply avoids, in the inner iteration, computing products which must be zero \[Kupiec, 1992\].) The time complexity of the Viterbi algorithm can also be reduced to $O(kTN)$ by noting that $b_j(S_t)$ can be factored out of the maximization of equation 6.</Paragraph> </Section> <Section position="4" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 3.4 Controlling Space Complexity </SectionTitle> <Paragraph position="0"> Adding up the sizes of the probability matrices $A$, $B$, and $\Pi$, it is easy to see that the storage cost for directly representing one model is proportional to $N(N + M + 1)$.</Paragraph> <Paragraph position="1"> Running the Baum-Welch algorithm requires storage for the sequence of observations, the $\alpha$ and $\beta$ probabilities, the vector $\{c_t\}$, and copies of the $A$ and $B$ matrices (since the originals cannot be overwritten until the end of each iteration). Hence, the grand total of space required for training is proportional to $T + 2N(T + N + M + 1)$.</Paragraph> <Paragraph position="2"> Since $N$ and $M$ are fixed by the model, the only parameter that can be varied to reduce storage costs is $T$. Now, adequate training requires processing from tens of thousands to hundreds of thousands of tokens \[Kupiec, 1989a\].</Paragraph> <Paragraph position="3"> The training set can be considered one long sequence, in which case $T$ is very large indeed, or it can be broken up into a number of smaller sequences at convenient boundaries. In first-order hidden Markov modeling, the stochastic process effectively restarts at unambiguous tokens, such as sentence and paragraph markers, hence these tokens are convenient points at which to break the training set.</Paragraph> <Paragraph position="4"> If the Baum-Welch algorithm is run separately (from the same starting point) on each piece, the resulting trained models must be recombined in some way. One obvious approach is simply to average. However, this fails if any two states are indistinguishable (in the sense that they had the same transition probabilities and the same symbol probabilities at start), because states are then not matched across trained models. It is therefore important that each state have a distinguished role, which is relatively easy to achieve in part-of-speech tagging.</Paragraph> <Paragraph position="5"> Our implementation of the Baum-Welch algorithm breaks up the input into fixed-sized pieces of training text.</Paragraph> <Paragraph position="6"> The Baum-Welch algorithm is then run separately on each piece and the results are averaged together.</Paragraph> <Paragraph position="7"> Running the Viterbi algorithm requires storage for the sequence of observations, a vector of current maxes, a scratch array of the same size, and a matrix of backpointer indices, for a total proportional to $T + N(2 + T)$ and a grand total (including the model) of $T + N(N + M + T + 3)$. Again, $N$ and $M$ are fixed.
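The piecewise training regime just described might be sketched as follows. This is an illustration under stated assumptions, not the tagger's code: the reserved boundary symbol and the train_piece helper (which stands in for running the Baum-Welch algorithm to convergence on one piece from the common starting model) are hypothetical names.

```python
import numpy as np

SENT_BOUNDARY = 0  # reserved, unambiguous symbol at which the process restarts

def split_at_boundaries(obs, boundary=SENT_BOUNDARY):
    """Break one long observation sequence into pieces at unambiguous tokens."""
    piece = []
    for sym in obs:
        piece.append(sym)
        if sym == boundary and len(piece) > 1:
            yield piece
            piece = []
    if piece:
        yield piece

def train_piecewise(A0, B0, pi0, obs, train_piece):
    """Train separately on each piece (from the same start) and average.

    train_piece(A, B, pi, piece) -> (A, B, pi) is assumed to run Baum-Welch
    to convergence on a single piece of training text.
    """
    models = [train_piece(A0, B0, pi0, piece)
              for piece in split_at_boundaries(obs)]
    A = np.mean([m[0] for m in models], axis=0)
    B = np.mean([m[1] for m in models], axis=0)
    pi = np.mean([m[2] for m in models], axis=0)
    return A, B, pi
```

As the text notes, this simple averaging is only safe when every state has a distinguished role, so that corresponding states can be matched across the separately trained models.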
However, $T$ need not be longer than a single sentence, since, as was observed above, the HMM, and hence the Viterbi algorithm, restarts at sentence boundaries.</Paragraph> </Section> <Section position="5" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 3.5 Model Tuning </SectionTitle> <Paragraph position="0"> An HMM for part-of-speech tagging can be tuned in a variety of ways. First, the choice of tagset and lexicon determines the initial model. Second, empirical and a priori information can influence the choice of starting values for the Baum-Welch algorithm. For example, counting instances of ambiguity classes in running text allows one to assign non-uniform starting probabilities in B for a particular tag's realization as a particular ambiguity class. Alternatively, one can state a priori that a particular ambiguity class is most likely to be the reflection of some subset of its component tags. For example, if an ambiguity class consisting of the open class tags is used for unknown words, one may encode the fact that most unknown words are nouns or proper nouns by biasing the initial probabilities in B.</Paragraph> <Paragraph position="1"> Another biasing of starting values can arise from noting that some tags are unlikely to be followed by others.</Paragraph> <Paragraph position="2"> For example, the lexical item &quot;to&quot; maps to an ambiguity class containing two tags, infinitive-marker and to-as-preposition, neither of which occurs in any other ambiguity class. If nothing more were stated, the HMM would have two states which were indistinguishable. This can be remedied by setting the initial transition probabilities from infinitive-marker to strongly favor transitions to such states as verb-uninflected and adverb.</Paragraph> <Paragraph position="3"> Our implementation allows for two sorts of biasing of starting values: ambiguity classes can be annotated with favored tags; and states can be annotated with favored transitions. These biases may be specified either as sets or as set complements. Biases are implemented by replacing the disfavored probabilities with a small constant (machine epsilon) and redistributing mass to the other possibilities.</Paragraph> <Paragraph position="4"> This has the effect of disfavoring the indicated outcomes without disallowing them; sufficient converse data can rehabilitate these values.</Paragraph> </Section> </Section> <Section position="5" start_page="135" end_page="137" type="metho"> <SectionTitle> 4 Architecture </SectionTitle> <Paragraph position="0"> In support of this and other work, we have developed a system architecture for text access \[Cutting et al., 1991\].</Paragraph> <Paragraph position="1"> This architecture defines five components for such systems: corpus, which provides text in a generic manner; analysis, which extracts terms from the text; index, which stores term occurrence statistics; and search, which utilizes these statistics to resolve queries.</Paragraph> <Paragraph position="2"> The part-of-speech tagger described here is implemented as an analysis module. Figure 1 illustrates the overall architecture, showing the tagger analysis implementation in detail. The tagger itself has a modular architecture, isolating behind standard protocols those elements which may vary, enabling easy substitution of alternate implementations.</Paragraph> <Paragraph position="3"> Also illustrated here are the data types which flow between tagger components. As an analysis implementation, the tagger must generate terms from text.
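Before describing the individual analysis modules, here is a brief aside on the biasing scheme of section 3.5 above: a minimal sketch, assuming a NumPy row representation of the model matrices. The function name is hypothetical, and the uniform redistribution of mass over the favored entries is a simplification of this illustration, not necessarily what the tagger itself does.

```python
import numpy as np

def bias_row(row, favored):
    """Bias one row of a stochastic matrix toward a set of favored indices.

    Disfavored entries are replaced by machine epsilon, so they are
    disfavored but not disallowed; the remaining probability mass is
    redistributed (uniformly here, for simplicity) over the favored entries.
    """
    eps = np.finfo(float).eps
    n, n_fav = len(row), len(favored)
    biased = np.full_like(row, eps)
    biased[list(favored)] = (1.0 - eps * (n - n_fav)) / n_fav
    return biased

# Hypothetical usage: make transitions out of the infinitive-marker state
# strongly favor (imaginary) state indices for uninflected-verb and adverb.
# A[INFINITIVE_MARKER] = bias_row(A[INFINITIVE_MARKER], [VERB_UNINFLECTED, ADVERB])
```

Because the disfavored entries remain non-zero, sufficient converse evidence during training can still rehabilitate them, exactly as the text describes.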
In this context, a term is a word stem annotated with part of speech.</Paragraph> <Paragraph position="3"> Text enters the analysis sub-system where the first processing module it encounters is the tokenizer, whose duty is to convert text (a sequence of characters) into a sequence of tokens. Sentence boundaries are also identified by the tokenizer and are passed as reserved tokens.</Paragraph> <Paragraph position="4"> The tokenizer subsequently passes tokens to the lexicon.</Paragraph> <Paragraph position="5"> Here tokens are converted into a set of stems, each annotated with a part-of-speech tag. The set of tags identifies an ambiguity class. The identification of these classes is also the responsibility of the lexicon. Thus the lexicon delivers a set of stems paired with tags, and an ambiguity class.</Paragraph> <Paragraph position="6"> The training module takes long sequences of ambiguity classes as input. It uses the Baum-Welch algorithm to produce a trained HMM, an input to the tagging module.</Paragraph> <Paragraph position="7"> Training is typically performed on a sample of the corpus at hand, with the trained HMM being saved for subsequent use on the corpus at large.</Paragraph> <Paragraph position="8"> The tagging module buffers sequences of ambiguity classes between sentence boundaries. These sequences are disambiguated by computing the maximal path through the HMM with the Viterbi algorithm. Operating at sentence granularity provides fast throughput without loss of accuracy, as sentence boundaries are unambiguous. The resulting sequence of tags is used to select the appropriate stems. Pairs of stems and tags are subsequently emitted.</Paragraph> <Paragraph position="9"> The tagger may function as a complete analysis component, providing tagged text to search and indexing components, or as a sub-system of a more elaborate analysis, such as phrase recognition.</Paragraph> <Section position="1" start_page="136" end_page="137" type="sub_section"> <SectionTitle> 4.1 Tokenizer Implementation </SectionTitle> <Paragraph position="0"> The problem of tokenization has been well addressed by much work in compilation of programming languages. The accepted approach is to specify token classes with regular expressions. These may be compiled into a single deterministic finite state automaton which partitions character streams into labeled tokens \[Aho et al., 1986, Lesk, 1975\].</Paragraph> <Paragraph position="1"> In the context of tagging, we require at least two token classes: sentence boundary and word. Other classes may include numbers, paragraph boundaries and various sorts of punctuation (e.g., braces of various types, commas). However, for simplicity, we will henceforth assume only words and sentence boundaries are extracted.</Paragraph> <Paragraph position="2"> Just as with programming languages, with text it is not always possible to unambiguously specify the required token classes with regular expressions. However, the addition of a simple lookahead mechanism which allows specification of right context ameliorates this \[Aho et al., 1986, Lesk, 1975\]. For example, a sentence boundary in English text might be identified by a period, followed by whitespace, followed by an uppercase letter. However, the uppercase letter must not be consumed, as it is the first component of the next token.
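The period-whitespace-uppercase example just given can be sketched with a regular-expression lookahead, here using Python's re module rather than the lex-style tokenizer generators cited above; the two token classes and their names are illustrative only, and a real tokenizer would handle many more classes. The lookahead mechanism itself is described next.

```python
import re

# Two token classes: words and sentence boundaries.  The (?=[A-Z]) lookahead
# checks for the uppercase letter that signals a new sentence without
# consuming it, so it remains available as the start of the next token.
TOKEN_RE = re.compile(r"""
    (?P<sent>\.\s+(?=[A-Z]))       # period + whitespace, followed by uppercase
  | (?P<word>[A-Za-z]+)            # a run of letters
""", re.VERBOSE)

def tokenize(text):
    """Yield ('word', w) and ('sent', None) tokens from a character stream."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup == "word":
            yield ("word", m.group("word"))
        else:
            yield ("sent", None)

# Example: list(tokenize("The dog barks. It runs."))
# -> [('word', 'The'), ('word', 'dog'), ('word', 'barks'), ('sent', None),
#     ('word', 'It'), ('word', 'runs')]
# (The final period is not followed by whitespace and an uppercase letter,
#  so this toy grammar simply skips it.)
```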
A lookahead mechanism allows us to specify in the sentence-boundary regular expression that the final character matched should not be considered a part of the token.</Paragraph> <Paragraph position="3"> This method meets our stated goals for the overall system. It is efficient, requiring that each character be examined only once (modulo lookahead). It is easily parameterizable, providing the expressive power to concisely define accurate and robust token classes.</Paragraph> </Section> <Section position="2" start_page="137" end_page="137" type="sub_section"> <SectionTitle> 4.2 Lexicon Implementation </SectionTitle> <Paragraph position="0"> The lexicon module is responsible for enumerating parts of speech and their associated stems for each word it is given.</Paragraph> <Paragraph position="1"> For the English word &quot;does,&quot; the lexicon might return &quot;do, verb&quot; and &quot;doe, plural-noun.&quot; It is also responsible for identifying ambiguity classes based upon sets of tags.</Paragraph> <Paragraph position="2"> We have employed a three-stage implementation: First, we consult a manually-constructed lexicon to find stems and parts of speech. Exhaustive lexicons of this sort are expensive, if not impossible, to produce. Fortunately, a small set of words accounts for the vast majority of word occurrences. Thus high coverage can be obtained without prohibitive effort.</Paragraph> <Paragraph position="3"> Words not found in the manually constructed lexicon are generally both open class and regularly inflected. As a second stage, a language-specific method can be employed to guess ambiguity classes for unknown words. For many languages (e.g., English and French), word suffixes provide strong cues to words' possible categories. Probabilistic predictions of a word's category can be made by analyzing suffixes in untagged text \[Kupiec, 1992, Meteer et al., 1991\].</Paragraph> <Paragraph position="4"> As a final stage, if a word is not in the manually constructed lexicon, and its suffix is not recognized, a default ambiguity class is used. This class typically contains all the open class categories in the language.</Paragraph> <Paragraph position="5"> Dictionaries and suffix tables are both efficiently implementable as letter trees, or tries \[Knuth, 1973\], which require that each character of a word be examined only once during a lookup.</Paragraph> </Section> </Section> </Paper>