<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0615"> <Title>HMM Specialization with Selective Lexicalization*</Title> <Section position="3" start_page="121" end_page="121" type="metho"> <SectionTitle> 2 &quot;Standard&quot; Part-of-Speech Tagging Model based on HMM </SectionTitle> <Paragraph position="0"> From the statistical point of view, the tagging problem can be defined as the problem of finding the proper sequence of categories c_{1,n} = c_1, c_2, ..., c_n (n >= 1) given the sequence of words w_{1,n} = w_1, w_2, ..., w_n (we denote the i-th word by w_i and the category assigned to w_i by c_i), which is formally defined by the following equation:</Paragraph> <Paragraph position="1"> \hat{c}_{1,n} = \arg\max_{c_{1,n}} P(c_{1,n} | w_{1,n})   (1)</Paragraph> <Paragraph position="2"> Charniak (Charniak et al., 1993) describes the &quot;standard&quot; HMM-based tagging model as Equation 2, which is a simplified version of Equation 1.</Paragraph> <Paragraph position="3"> \hat{c}_{1,n} = \arg\max_{c_{1,n}} \prod_{i=1}^{n} P(c_i | c_{i-1}) P(w_i | c_i)   (2)</Paragraph> <Paragraph position="4"> With this model, we select the proper category for each word by making use of the contextual probabilities, P(c_i | c_{i-1}), and the lexical probabilities, P(w_i | c_i). This model has the advantages of a solid theoretical framework, automatic learning facilities and relatively high performance. It is thereby at the basis of most tagging programs created over the last few years.</Paragraph> <Paragraph position="5"> For this model, the first-order Markov assumptions are made as follows:</Paragraph> <Paragraph position="6"> P(c_i | c_{1,i-1}, w_{1,i-1}) \approx P(c_i | c_{i-1})   (3)   and   P(w_i | c_{1,i}, w_{1,i-1}) \approx P(w_i | c_i)   (4)</Paragraph> <Paragraph position="7"> With Equation 3, we assume that the current category is independent of the previous words and depends only on the previous category.</Paragraph> <Paragraph position="8"> With Equation 4, we also assume that the correct word is independent of everything except the knowledge of its category. Through these assumptions, the Hidden Markov Models have the advantage of drastically reducing the number of parameters, thereby alleviating the sparse data problem. However, as mentioned above, this model consults only a single category as context and does not utilize enough constraints provided by the local context.</Paragraph> </Section>
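To make the standard model concrete, here is a minimal sketch of a bigram HMM tagger that selects the category sequence maximizing the product of Equation 2 with the Viterbi algorithm. It is an illustration rather than the authors' implementation: the function names, the toy probability tables and the example sentence are all hypothetical, initial-state probabilities are handled as a separate table, and smoothing of unseen events is omitted.

# Minimal Viterbi decoder for the "standard" bigram HMM tagging model (Equation 2).
# The probability tables are hypothetical; a real tagger estimates them from a
# tagged corpus and smooths unseen events.

def viterbi_tag(words, categories, p_context, p_lexical, p_initial):
    """Return the category sequence maximizing prod_i P(c_i | c_{i-1}) * P(w_i | c_i)."""
    # best[i][c] = (score, previous category) of the best path ending in c at position i
    best = [{} for _ in words]
    for c in categories:  # initialization: P(c_1) * P(w_1 | c_1)
        best[0][c] = (p_initial.get(c, 0.0) * p_lexical.get((words[0], c), 0.0), None)
    for i in range(1, len(words)):  # recursion over the remaining positions
        for c in categories:
            score, prev = max(
                (best[i - 1][p][0] * p_context.get((c, p), 0.0)
                 * p_lexical.get((words[i], c), 0.0), p)
                for p in categories)
            best[i][c] = (score, prev)
    last = max(categories, key=lambda c: best[-1][c][0])  # backtrack from the best final state
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        tags.append(last)
    return list(reversed(tags))

# Toy usage with hypothetical categories and probabilities:
cats = ["AT", "NN", "VB"]
p_ctx = {("NN", "AT"): 0.8, ("VB", "NN"): 0.6, ("NN", "NN"): 0.2}
p_lex = {("the", "AT"): 0.9, ("dog", "NN"): 0.01, ("barks", "VB"): 0.02, ("barks", "NN"): 0.001}
p_init = {"AT": 0.5, "NN": 0.3, "VB": 0.2}
print(viterbi_tag(["the", "dog", "barks"], cats, p_ctx, p_lex, p_init))  # -> ['AT', 'NN', 'VB']

In practice the contextual and lexical probabilities would be estimated from a tagged training corpus, as the experiments in Section 5 do.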
<Section position="4" start_page="121" end_page="122" type="metho"> <SectionTitle> 3 Some Refining Techniques for HMM </SectionTitle> <Paragraph position="0"> The first-order Hidden Markov Model described in the previous section provides only a single category as context. Sometimes, this first-order context is sufficient to predict the following parts-of-speech, but at other times (probably much more often) it is insufficient.</Paragraph> <Paragraph position="1"> The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model. We start with a careful observation of the assumptions made for the &quot;standard&quot; Hidden Markov Models. With Equation 3, we assume that the current category depends only on the preceding category. As we know, this is not always true, and this first-order Markov assumption restricts the disambiguation information to the first-order context.</Paragraph> <Paragraph position="2"> The immediate ways of enriching the context are as follows: * to lexicalize the context.</Paragraph> <Paragraph position="3"> * to extend the context to higher order.</Paragraph> <Paragraph position="4"> To lexicalize the context, we include the preceding word into the context. Contextual probabilities are then defined by P(c_i | c_{i-1}, w_{i-1}). Figure 1 illustrates the change of dependency when each method is applied. Figure 1(a) shows that each first-order contextual probability and lexical probability are independent of each other in the &quot;standard&quot; Hidden Markov Models, whereas Figure 1(b) shows that the lexical probability of the preceding word and the contextual probability of the current category are tied into a lexicalized contextual probability.</Paragraph> <Paragraph position="5"> To extend the context to higher order, we extend the contextual probability to the second order. Contextual probabilities are then defined by P(c_i | c_{i-1}, c_{i-2}). Figure 1(c) shows that the two adjacent contextual probabilities are tied into the second-order contextual probability. The simple way of enriching the context is to extend or lexicalize it uniformly. The uniform extension of context to the second order is feasible with an appropriate smoothing technique and is considered a state-of-the-art technique, though its complexity is very high: in the case of the Brown corpus, we need about 0.6 million trigrams. An alternative to the uniform extension of context is the selective extension of context. Brants (Brants, 1996) takes this approach and reports performance equivalent to the uniform extension with a much lower model complexity.</Paragraph> <Paragraph position="6"> The uniform lexicalization of context is computationally prohibitively expensive: in the case of the Brown corpus, we need almost 3 billion lexicalized bigrams. Moreover, many of these bigrams neither contribute to the performance of the model nor occur frequently enough to be estimated properly. An alternative to the uniform lexicalization is the selective lexicalization of context, which is the main topic of this paper.</Paragraph> </Section> <Section position="5" start_page="122" end_page="124" type="metho"> <SectionTitle> 4 Selective Lexicalization of HMM </SectionTitle> <Paragraph position="0"> This section describes a new technique for refining the Hidden Markov Model, which we call selective lexicalization. Our approach automatically finds syntactically uncommon words and makes a new state (we call it a lexicalized state) for each of these words.</Paragraph> <Paragraph position="1"> Given a fixed set of categories, {c^1, c^2, ..., c^C}, e.g., {adjective, ..., verb}, we assume a discrete random variable X_{c^j} whose domain is the set of categories and whose range is a set of conditional probabilities. The random variable X_{c^j} then represents a process of assigning a conditional probability P(c^i | c^j) to every category c^i (c^i ranges over c^1 ... c^C).</Paragraph> <Paragraph position="2"> We convert the process of X_{c^j} into the state transition vector, V_{c^j}, which consists of the corresponding conditional probabilities, e.g., V_prep = ( P(adjective | prep), ..., P(verb | prep) )^T. The (squared) distance between two arbitrary vectors is then computed as follows:</Paragraph> <Paragraph position="3"> d(V, V') = \sum_{i=1}^{C} ( V[i] - V'[i] )^2</Paragraph> <Paragraph position="4"> Similarly, we define the lexicalized state transition vector, V_{c^j,w^k}, which consists of the conditional probabilities given both the category c^j and the word w^k, e.g., V_{prep,in} = ( P(adjective | prep, in), ..., P(verb | prep, in) )^T.</Paragraph> <Paragraph position="5"> In this situation, it is possible to regard the lexicalized state transition vectors, V_{c^j,w^k}, of the same category c^j as members of a cluster whose centroid is the state transition vector, V_{c^j}. We can then compute the deviation of each lexicalized state transition vector, V_{c^j,w^k}, from its corresponding centroid:</Paragraph> <Paragraph position="6"> D(V_{c^j,w^k}) = d( V_{c^j,w^k}, V_{c^j} )</Paragraph>
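To illustrate these definitions, the following sketch estimates the centroids V_{c^j} and the lexicalized vectors V_{c^j,w^k} from bigram counts over a tagged corpus and computes the deviation of each lexicalized vector from its centroid. It is only a sketch under the definitions above; the toy corpus, the category list and all identifiers are hypothetical, and no smoothing is applied to the relative frequencies.

from collections import Counter, defaultdict

def transition_vectors(tagged_sents, categories):
    """Estimate V_{c^j} and V_{c^j,w^k} as vectors of P(c^i | context) and return
    the squared deviation of each lexicalized vector from its centroid."""
    ctx_counts = defaultdict(Counter)   # previous category          -> counts of next category
    lex_counts = defaultdict(Counter)   # (previous category, word)  -> counts of next category
    for sent in tagged_sents:
        for (w_prev, c_prev), (_, c_next) in zip(sent, sent[1:]):
            ctx_counts[c_prev][c_next] += 1
            lex_counts[(c_prev, w_prev)][c_next] += 1

    def normalize(counter):
        total = sum(counter.values())
        return [counter[c] / total for c in categories]

    centroids = {c: normalize(cnt) for c, cnt in ctx_counts.items()}   # V_{c^j}
    deviations = {}                                                    # D(V_{c^j,w^k})
    for (c_prev, word), cnt in lex_counts.items():
        v = normalize(cnt)                                             # V_{c^j,w^k}
        centroid = centroids[c_prev]
        deviations[(c_prev, word)] = sum((a - b) ** 2 for a, b in zip(v, centroid))
    return centroids, deviations

# Hypothetical toy corpus of (word, tag) sentences:
corpus = [[("in", "prep"), ("the", "AT"), ("park", "NN")],
          [("out", "prep"), ("of", "prep"), ("town", "NN")]]
cats = ["prep", "AT", "NN"]
centroids, deviations = transition_vectors(corpus, cats)
print(sorted(deviations.items(), key=lambda kv: -kv[1]))   # largest deviations first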
<Paragraph position="7"> Figure 2 shows the distribution of the lexicalized state transition vectors according to their deviations. As you can see in the figure, the majority of the vectors are near their centroids and only a small number of vectors are very far from their centroids. In the first-order context model (without considering lexicalized context), the centroids represent all the members belonging to them. In fact, the deviation of a vector is a kind of (squared) error for the vector. The error for a cluster is</Paragraph> <Paragraph position="8"> E(c^j) = \sum_{k} D(V_{c^j,w^k}) = \sum_{k} d( V_{c^j,w^k}, V_{c^j} )</Paragraph> <Paragraph position="9"> and the error for the overall model is simply the sum of the individual cluster errors:</Paragraph> <Paragraph position="10"> E = \sum_{j=1}^{C} E(c^j)</Paragraph> <Paragraph position="11"> Now, we can break out a few lexicalized state vectors which have large deviations (D > 0) and make them individual clusters to reduce the error of the given model.</Paragraph> <Paragraph position="12"> As an example, let us consider the preposition cluster. The value of each component of the centroid, V_prep, is illustrated in Figure 3(a), and those of the lexicalized vectors, V_{prep,in}, V_{prep,with} and V_{prep,out}, are in Figure 3(b), (c) and (d), respectively. As you can see in these figures, most of the prepositions, including in and with, are immediately followed by an article (AT), noun (NN) or pronoun (NP), but the word out as a preposition shows a completely different distribution. Therefore, it would be a good choice to break out the lexicalized vector, V_{prep,out}, from its centroid, V_prep.</Paragraph> <Paragraph position="13"> From the viewpoint of a network, the state representing prepositions is split into two states: one representing the ordinary prepositions except out, and the other representing the special preposition out, which we call a lexicalized state. This process of splitting is illustrated in Figure 4. Splitting a state results in some changes of the parameters. The changes of the parameters resulting from lexicalizing a word, w^k, in a category, c^j, are indicated in Table 1 (c^i ranges over c^1 ... c^C). This full splitting will increase the complexity of the model rapidly, so that estimating the parameters may suffer from the sparseness of the data.</Paragraph> <Paragraph position="14"> To alleviate this, we use pseudo splitting, which leads to a relatively small increase in the number of parameters. The changes of the parameters in pseudo splitting are indicated in Table 2.</Paragraph> </Section>
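A possible reading of the selection step is sketched below: rank the lexicalized state transition vectors by deviation and split off a state for each of the top-ranked (category, word) pairs. The deviation values, the budget and the state representation are hypothetical, and the exact parameter changes of Table 1 and Table 2 are not reproduced here.

# Sketch of selective lexicalization: pick the lexicalized state transition vectors
# with the largest deviations and split the corresponding states. The parameter
# re-estimation described in Table 1 / Table 2 is omitted.

def select_lexicalized_states(deviations, budget):
    """Return the (category, word) pairs with the largest deviations, up to `budget`."""
    ranked = sorted(deviations.items(), key=lambda kv: kv[1], reverse=True)
    return [pair for pair, _ in ranked[:budget]]

def split_states(categories, selected):
    """Full splitting: each selected (category, word) pair becomes its own state,
    alongside the original (now 'ordinary') category states."""
    states = [(c, None) for c in categories]   # ordinary states, e.g. ('prep', None)
    states += [(c, w) for c, w in selected]    # lexicalized states, e.g. ('prep', 'out')
    return states

# Hypothetical deviations, e.g. produced by the previous sketch:
deviations = {("prep", "out"): 0.67, ("prep", "in"): 0.05, ("AT", "the"): 0.01}
selected = select_lexicalized_states(deviations, budget=1)
print(split_states(["prep", "AT", "NN"], selected))
# -> [('prep', None), ('AT', None), ('NN', None), ('prep', 'out')]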
<Section position="6" start_page="124" end_page="125" type="metho"> <SectionTitle> 5 Experimental Result </SectionTitle> <Paragraph position="0"> We have tested our technique through part-of-speech tagging experiments with variously lexicalized Hidden Markov Models.</Paragraph> <Paragraph position="1"> In order to conduct the tagging experiments, we divided the whole Brown (tagged) corpus containing 53,887 sentences (1,113,191 words) into two parts. For the training set, 90% of the sentences were chosen at random, from which we collected all of the statistical data. We reserved the other 10% for testing. Table 3 lists the basic statistics of our corpus.</Paragraph> <Paragraph position="2"> We used a tag set containing 85 categories. The amount of ambiguity of the test set is summarized in Table 4. The second column shows that 52% of the words (57,808 words) are not ambiguous. The tagger attempts to resolve the ambiguity of the remaining words.</Paragraph> <Paragraph position="3"> [Table 4: distribution of test-set words by degree of ambiguity, given as ratios (%) summing to 100.]</Paragraph> <Paragraph position="4"> Figure 5 and Figure 6 show the results of our part-of-speech tagging experiments with the &quot;standard&quot; Hidden Markov Model and the variously lexicalized Hidden Markov Models, using the full splitting method and the pseudo splitting method respectively.</Paragraph> <Paragraph position="5"> We got 95.7858% of the tags correct when we applied the standard Hidden Markov Model without any lexicalized states. As the number of lexicalized states increases, the tagging accuracy increases until the number of lexicalized states reaches 160 (using full splitting) and 210 (using pseudo splitting). As you can see in these figures, full splitting improves the performance of the model more rapidly but suffers more severely from the sparseness of the training data. In this experiment, we employed MacKay and Peto's smoothing technique for estimating the parameters required for the models. The best precision, 95.9966%, was obtained by the model with 210 lexicalized states using the pseudo splitting method.</Paragraph> </Section> </Paper>