<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1030"> <Title>Mistake-Driven Mixture of Hierarchical Tag Context Trees</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper proposes a mistake-driven mixture method for learning a tag model. The method iteratively performs two procedures: 1. constructing a tag model based on the current data distribution and 2.</Paragraph> <Paragraph position="1"> updating the distribution by focusing on data that are not well predicted by the constructed model. The final tag model is constructed by mixing all the models according to their performance. To well reflect the data distribution, we represent each tag model as a hierarchical tag (i.e.,NTT 1 < proper noun < noun) context tree. By using the hierarchical tag context tree, the constituents of sequential tag models gradually change from broad coverage tags (e.g.,noun) to specific exceptional words that cannot be captured by generM tags. In other words, the method incorporates not only frequent connections but also infrequent ones that are often considered to be collocationah We evaluate several tag models by implementing Japanese part-of-speech taggers that share all other conditions (i.e.,dictionary and word model) other than their tag models. The experimental results show the proposed method significantly outperforms both hand-crafted and conventional statistical methods.</Paragraph> </Section> <Section position="4" start_page="0" end_page="230" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The last few years have seen the great success of stochastic part-of-speech (POS) taggers (Church, 1988: Kupiec, 1992; Charniak et M., 1993; Brill, 1992; Nagata, 1994). The stochastic approach generally attains 94 to 96% accuracy and replaces the labor-intensive compilation of linguistics rules by using an automated learning algorithm. However, practical systems require more accuracy because POS tagging is an inevitable pre-processing step for all practical systems.</Paragraph> <Paragraph position="1"> To derive a new stochastic tagger, we have two options since stochastic taggers generally comprise two components: word model and tag model. The word model is a set of probabilities that a word occurs with a tag (part-of-speech) when given the preceding words and their tags in a sentence. On the contrary, the tag model is a set of probabilities that a tag appears after the preceding words and their tags.</Paragraph> <Paragraph position="2"> The first option is to construct more sophisticated word models. (Charniak et al., 1993) reports that their model considers the roots and suffixes of words to greatly improve tagging accuracy for English corpora. However, the word model approach has the following shortcomings: * For agglutinative languages such as Japanese and Chinese, the simple Bayes transfer rule is inapplicable because the word length of a sentence is not fixed in all possible segmentations -~. We can only use simpler word models in these languages.</Paragraph> <Paragraph position="3"> * Sophisticated word models largely depend on the target language. It is time-consuming to compile fine-grained word models for each language. null The second option is to devise a new tag model.</Paragraph> <Paragraph position="4"> (Sch~tze and Singer. 1994) have introduced a variable-memory-length tag model. 
Unlike conventional bi-gram and tri-gram models, the method selects the optimal context length by using the context tree (Rissanen, 1983), which was originally introduced for use in data compression (Cover and Thomas, 1991). Although the variable-memory-length approach remarkably reduces the number of parameters, its tagging accuracy is only as good as that of conventional methods. Why did the method not achieve higher accuracy?</Paragraph> <Paragraph position="5"> The crucial problem for current tag models is the set of collocational sequences of words that cannot be captured by just their tags. Because the maximum likelihood estimator (MLE) emphasizes the most frequent connections, an exceptional connection is placed in the same class as a frequent connection.</Paragraph> <Paragraph position="6"> To tackle this problem, we introduce a new tag model based on the mistake-driven mixture of hierarchical tag context trees. Compared to Schütze and Singer's context tree (Schütze and Singer, 1994), the hierarchical tag context tree is extended in that the context is represented by a hierarchical tag set (i.e., NTT < proper noun < noun). This is extremely useful in capturing exceptional connections that can be detected only at the word level.</Paragraph> <Paragraph position="7"> To make the best use of the hierarchical context tree, the mistake-driven mixture method imitates the process in which linguists incorporate exceptional connections into hand-crafted rules: they first construct coarse rules that seem to cover a broad range of data; they then try to analyze data by using the rules and extract exceptions that the rules cannot handle; next they generalize the exceptions and refine the previous rules. The following two steps abstract this human algorithm for incorporating exceptional connections.</Paragraph> <Paragraph position="8"> 1. Construct temporary rules that seem to generalize the given data well.</Paragraph> <Paragraph position="9"> 2. Try to analyze the data by using the constructed rules and extract the exceptions that cannot be correctly handled; then return to the first step and focus on the exceptions.</Paragraph> <Paragraph position="10"> To put the above idea into our learning algorithm, the mistake-driven mixture method attaches a weight to each example and iteratively performs the following two procedures in the training phase: 1. constructing a context tree based on the current data distribution (weight vector), and 2. updating the distribution (weight vector) by focusing on data not well predicted by the constructed tree. More precisely, the algorithm reduces the weights of examples that are correctly handled.</Paragraph> <Paragraph position="11"> In the prediction phase, it then outputs a final tag model by mixing all the constructed models according to their performance. By using the hierarchical tag context tree, the constituents of a series of tag models gradually change from broad-coverage tags (e.g., noun) to specific exceptional words that cannot be captured by general tags. In other words, the method incorporates not only frequent connections but also infrequent ones that are often considered to be exceptional.</Paragraph>
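<Paragraph> To give a concrete picture of the context-tree lookup discussed above, the following minimal Python sketch walks a context tree backwards over the preceding tags and returns the distribution stored at the deepest matching node, i.e., a variable-memory-length context. The node representation and the tag names are illustrative assumptions, not the data structures used in the paper.

    # Minimal sketch of variable-memory-length prediction with a context tree.
    # Each node stores a probability table over next tags and children keyed by
    # the previous context symbol (one more step into the past per tree level).
    class CTNode:
        def __init__(self, table):
            self.table = table      # dict: next tag -> probability
            self.children = {}      # dict: previous context symbol -> CTNode

    def predict(root, history):
        """Return the tag distribution of the longest matching context.
        history lists the preceding tags, most recent first."""
        node = root
        for symbol in history:
            child = node.children.get(symbol)
            if child is None:
                break               # no deeper context stored: stop here
            node = child
        return node.table

    # Example: a tree that stores a longer (order-1) context only after 'noun'.
    root = CTNode({'noun': 0.5, 'verb': 0.5})
    root.children['noun'] = CTNode({'noun': 0.2, 'verb': 0.8})
    print(predict(root, ['noun', 'verb']))   # uses the order-1 context after 'noun'

</Paragraph>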
<Paragraph position="12"> The paper is organized as follows. Section 2 describes the stochastic POS tagging scheme and the hierarchical tag setting. Section 3 presents a new probability estimator that uses a hierarchical tag context tree, and Section 4 explains the mistake-driven mixture method. Section 5 reports a preliminary evaluation using Japanese newspaper articles, in which we tested several tag models while keeping all other conditions (i.e., dictionary and word model) identical; the experimental results show that the proposed method significantly outperforms both hand-crafted and conventional statistical methods. Section 6 concerns related work and Section 7 concludes the paper.</Paragraph> </Section> <Section position="5" start_page="230" end_page="231" type="metho"> <SectionTitle> 2 Preliminaries </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="230" end_page="230" type="sub_section"> <SectionTitle> 2.1 Basic Equation </SectionTitle> <Paragraph position="0"> In this section, we briefly review the basic equations for part-of-speech tagging and introduce the hierarchical tag setting.</Paragraph> <Paragraph position="1"> The tagging problem is formally defined as finding the sequence of tags t_{1,n} that maximizes the probability of the input string L:</Paragraph> <Paragraph position="2"> \[ \operatorname*{argmax}_{t_{1,n}} P(w_{1,n}, t_{1,n} \mid L) = \operatorname*{argmax}_{t_{1,n}} \frac{P(w_{1,n}, t_{1,n}, L)}{P(L)} \simeq \operatorname*{argmax}_{t_{1,n}} P(t_{1,n}, w_{1,n}). \] We break out P(t_{1,n}, w_{1,n}) as a sequence of the products of tag probability and word probability:</Paragraph> <Paragraph position="4"> \[ P(t_{1,n}, w_{1,n}) = \prod_{i=1}^{n} P(t_i \mid t_{1,i-1}, w_{1,i-1}) \, P(w_i \mid t_{1,i}, w_{1,i-1}). \] By approximating the word probability as constrained only by its tag, we obtain equation (1): \[ P(t_{1,n}, w_{1,n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{1,i-1}, w_{1,i-1}) \, P(w_i \mid t_i). \qquad (1) \] Equation (1) yields various types of stochastic taggers. For example, bi-gram and tri-gram models approximate their tag probability as P(t_i | t_{i-1}) and P(t_i | t_{i-2}, t_{i-1}), respectively. In the rest of the paper, we assume all tagging methods share the word model P(w_i | t_i) and differ only in the tag model P(t_i | t_{1,i-1}, w_{1,i-1}).</Paragraph> </Section> <Section position="2" start_page="230" end_page="231" type="sub_section"> <SectionTitle> 2.2 Hierarchical Tag Set </SectionTitle> <Paragraph position="0"> To construct a tag model that captures exceptional connections, we have to consider word-level context as well as tag-level context. In a more general form, we introduce a tag set that has a hierarchical structure. Our tag set has a three-level structure, as shown in Figure 1. The topmost and the second levels of the hierarchy are the part-of-speech level and the part-of-speech subdivision level, respectively. Although stochastic taggers usually make use of the subdivision level, the part-of-speech level is remarkably robust against data sparseness. The bottom level is the word level and is indispensable in coping with exceptional and collocational sequences of words. Our objective is to construct a tag model that precisely evaluates P(t_i | t_{1,i-1}, w_{1,i-1}) (in equation (1)) by using the three-level tag set.</Paragraph> <Paragraph position="1"> To construct this model, we have to answer the following questions.</Paragraph> <Paragraph position="2"> 1. Which level is appropriate for t_i? 2. Which length is to be considered for t_{1,i-1} and w_{1,i-1}? 3. Which level is appropriate for t_{1,i-1} and w_{1,i-1}? To resolve the first question, we fix t_i at the subdivision level, as is done in other tag models. The second and third questions are resolved by introducing the hierarchical tag context tree and the mistake-driven mixture method, which are described in Sections 3 and 4, respectively.</Paragraph>
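<Paragraph> The three questions above amount to choosing, for each context element, one of the three representations in the hierarchy of Figure 1. The short Python sketch below only makes this three-level choice explicit; the token and tag names are hypothetical and not taken from the paper.

    # A token carries three representations, from most specific to most general.
    HIERARCHY = ('word', 'subdivision', 'pos')

    def context_element(token, level):
        """Return the context element of a (word, subdivision, pos) triple
        at the requested level of the hierarchy."""
        word, subdivision, pos = token
        return {'word': word, 'subdivision': subdivision, 'pos': pos}[level]

    token = ('NTT', 'proper-noun', 'noun')
    candidates = [context_element(token, level) for level in HIERARCHY]
    print(candidates)   # ['NTT', 'proper-noun', 'noun']

</Paragraph>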
<Paragraph position="5"> Before moving to the next section, let us define the basic tag set. If all words are considered context candidates, the search space will be enormous. Thus, it is reasonable for the tagger to constrain the candidates to frequent open-class words and closed-class words. The basic tag set is the set of the most detailed context elements; it comprises the words selected above and the part-of-speech subdivision level.</Paragraph> </Section> </Section> <Section position="6" start_page="231" end_page="233" type="metho"> <SectionTitle> 3 Hierarchical Tag Context Tree </SectionTitle> <Paragraph position="0"> A hierarchical tag context tree is constructed by a two-step methodology. The first step produces a context tree by using the basic tag set. The second step then produces the hierarchical tag context tree: it generalizes the basic tag context tree and avoids over-fitting the data by replacing excessively specific contexts in the tree with more general tags.</Paragraph> <Paragraph position="1"> Finally, the generated tree is transformed into a finite automaton to improve tagging efficiency (Ron et al., 1997).</Paragraph> <Section position="1" start_page="231" end_page="232" type="sub_section"> <SectionTitle> 3.1 Constructing a Basic Tag Context Tree </SectionTitle> <Paragraph position="0"> In this section, we construct a basic tag context tree.</Paragraph> <Paragraph position="1"> Before going into the details of the algorithm, we briefly explain the context tree by using a simple binary case. The context tree was originally introduced in the field of data compression (Rissanen, 1983; Willems et al., 1995; Cover and Thomas, 1991) to represent how many times and in what context each symbol appeared in a sequence of symbols. Figure 2 exemplifies two context trees comprising the binary symbols 'a' and 'b'. T(4) is constructed from the sequence 'baab' and T(6) from 'baabab'. The root node of T(4) shows that both 'a' and 'b' appeared twice in 'baab' when no consideration is taken of previous symbols. The nodes of depth 1 represent an order-1 (bi-gram) model. The left node of T(4) represents that both 'a' and 'b' appeared only once after the symbol 'a', while the right node of T(4) represents that only 'a' occurred (once) after 'b'. In the same way, the nodes of depth 2 in T(6) represent an order-2 (tri-gram) context model.</Paragraph> <Paragraph position="2"> It is straightforward to extend this binary tree to a basic tag context tree. In this case, the context symbols 'a' and 'b' are replaced by elements of the basic tag set, and the frequency table of each node then consists of the part-of-speech subdivision set.</Paragraph> <Paragraph position="3"> The procedure construct-btree, which constructs a basic tag context tree, is given below. Let the set of subdivision tags be s_1, ..., s_n. Let weight[t] be the weight attached to the t-th example x(t).</Paragraph> <Paragraph position="4"> Initial values of weight[t] are set to 1.</Paragraph> <Paragraph position="5"> 1. The only node, the root, is marked with the count table (c(s_1), ..., c(s_n)) = (0, ..., 0).</Paragraph> <Paragraph position="6"> 2. Apply the following recursively. Let T(t-1) be the last constructed tree, with counts (c(s_1, z), ..., c(s_n, z)) at each node z. After the next symbol, whose subdivision is x(t), is observed, generate the next tree T(t) as follows: follow T(t-1), starting at the root and taking the branch indicated by each successive symbol of the past sequence at the basic tag level. For each node z visited, increment the component count c(x(t), z) by weight[t]. Continue until the node w reached is a leaf.</Paragraph> <Paragraph position="7"> 3. If w is a leaf, extend the tree by creating new leaves ws_1, ..., ws_n: the counts of x(t) at the new leaves are set to weight[t] and all other counts are set to 0.</Paragraph> <Paragraph position="8"> Define the resulting tree to be T(t).</Paragraph>
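<Paragraph> A minimal Python sketch of the construct-btree update is given below. The node representation, the handling of the basic tag alphabet, and the exact leaf initialization are our assumptions for illustration, not the paper's implementation.

    from collections import defaultdict

    class Node:
        def __init__(self):
            self.counts = defaultdict(float)   # subdivision tag -> weighted count
            self.children = {}                 # basic-tag symbol -> Node

    def update_btree(root, past, subdiv, weight, basic_tags):
        """One construct-btree update: past lists the preceding basic-tag symbols
        (most recent first), subdiv is x(t), weight is weight[t]."""
        node = root
        node.counts[subdiv] += weight          # count at the root (order-0 context)
        for symbol in past:
            if symbol not in node.children:    # reached a leaf w: extend it by one level
                for s in basic_tags:           # create new leaves ws_1, ..., ws_n
                    node.children[s] = Node()
                node.children[symbol].counts[subdiv] += weight
                return
            node = node.children[symbol]
            node.counts[subdiv] += weight      # increment c(x(t), z) at each visited node

    # After all examples are processed, each node's count table gives the weighted
    # frequency of every subdivision tag in that context.

</Paragraph>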
</Section> <Section position="2" start_page="232" end_page="233" type="sub_section"> <SectionTitle> 3.2 Constructing a Hierarchical Tag Context Tree </SectionTitle> <Paragraph position="0"> This section delineates how a hierarchical tag context tree is constructed from a basic tag context tree.</Paragraph> <Paragraph position="1"> Before describing the algorithm, we prepare some definitions and notation.</Paragraph> <Paragraph position="2"> Let A be the part-of-speech subdivision set. As described in the previous section, the frequency table of each node is defined over the set A. At any node s of a context tree, let n(a|s) and \hat{P}(a|s) be the count of element a and its empirical probability, respectively, i.e., \hat{P}(a|s) = n(a|s) / \sum_{a' \in A} n(a'|s).</Paragraph> <Paragraph position="4"> We introduce an information-theoretic criterion Δ(sb) (Weinberger et al., 1995) to evaluate the gain of expanding a node s by its daughter sb.</Paragraph> <Paragraph position="6"> Δ(sb) is the difference in optimal code lengths when the symbols at node sb are compressed by using the probability distribution \hat{P}(·|s) at node s and by using \hat{P}(·|sb) at node sb. Thus, the larger Δ(sb) is, the more meaningful it is to expand node s by sb.</Paragraph> <Paragraph position="7"> Now, we go back to the hierarchical tag context tree construction. As illustrated in Figure 3, the generation process amounts to the iterative selection of b out of word level, subdivision level, part-of-speech level and null (no expansion). Let us look at the procedure from the information-theoretic viewpoint. Breaking out equation (2) as equation (3), Δ(sb) is represented as the product of the total frequency of the subdivision symbols at node sb and a Kullback-Leibler (KL) divergence.</Paragraph> <Paragraph position="9"> Because the KL divergence defines a distance measure between the probability distributions \hat{P}(·|sb) and \hat{P}(·|s), there is the following trade-off between the two terms of equation (3).</Paragraph> <Paragraph position="10"> * The more general b is, the more subdivision symbols appear at node sb.</Paragraph> <Paragraph position="11"> * The more specific b is, the more \hat{P}(·|s) and \hat{P}(·|sb) differ.</Paragraph> <Paragraph position="12"> By using this trade-off, the optimal level of b is selected.</Paragraph> <Paragraph position="13"> Table 1 summarizes the algorithm construct-htree that constructs the hierarchical tag context tree. First, construct-htree generates a basic tag context tree by calling construct-btree. Assume that the training examples consist of a sequence of triples < p_t, s_t, w_t >, in which p_t, s_t and w_t represent the part-of-speech, subdivision and word, respectively. Each time the algorithm reads an example, it first reaches the current leaf node s by following the past sequence, computes Δ(sb) for each candidate level b, and then selects the optimal one. The initially constructed basic tag context tree is used to compute the Δ(sb) values.</Paragraph>
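<Paragraph> The displayed equations (2) and (3) are missing here. The following LaTeX forms are consistent with the description above and with the pruning criterion of Weinberger et al. (1995); they are offered as a plausible reconstruction rather than as the paper's original equations.

\[
\Delta(sb) \;=\; \sum_{a \in A} n(a \mid sb)\,
  \log\frac{\hat{P}(a \mid sb)}{\hat{P}(a \mid s)}
\qquad (2)
\]
\[
\Delta(sb) \;=\; n(sb)\; D\bigl(\hat{P}(\cdot \mid sb)\,\|\,\hat{P}(\cdot \mid s)\bigr),
\qquad n(sb) = \sum_{a \in A} n(a \mid sb),
\qquad (3)
\]

where D(p \| q) = \sum_a p(a) \log (p(a)/q(a)) denotes the KL divergence between distributions p and q. </Paragraph>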
</Section> </Section> <Section position="7" start_page="233" end_page="233" type="metho"> <SectionTitle> 4 Mistake-Driven Mixture of Hierarchical Tag Context Trees </SectionTitle> <Paragraph position="0"> Up to this section, we have introduced a new tag model that uses a single hierarchical tag context tree to cope with the exceptional connections that cannot be captured at just the part-of-speech level. However, this approach has a clear limitation: exceptional connections that do not occur very often cannot be detected by the single-tree model. In such a case, the first term n(sb) in equation (3) is enormous for general b, and the tree is expanded by using more general symbols.</Paragraph> <Paragraph position="1"> To overcome this limitation, we devised the mistake-driven mixture algorithm summarized in Table 4, which constructs T context trees and outputs the final tag model.</Paragraph> <Paragraph position="2"> The mistake-driven mixture algorithm sets the weights of all examples to 1 and repeats the following procedures T times. The algorithm first constructs a hierarchical context tree by using the current weight vector. The example data are then tagged by the tree, and the weights of correctly handled examples are reduced according to equation (4). Finally, the final tag model is constructed by mixing the T trees according to equation (5).</Paragraph> <Paragraph> The algorithm proceeds as follows. Input: a sequence of N examples < p_1, d_1, w_1 >, ..., < p_N, d_N, w_N >, in which p_i, d_i and w_i represent the part-of-speech, subdivision and word of the i-th example, respectively. Initialize the weight vector weight[i] = 1 for i = 1, ..., N. Do for t = 1, 2, ..., T: (a) call construct-htree, providing it with the weight vector weight[], and construct a part-of-speech tagger h_t; (b) let Error be the set of examples that are not correctly identified by h_t; (c) reduce the weights of the examples outside Error according to equation (4). Finally, output the final tag model by mixing h_1, ..., h_T according to equation (5).</Paragraph> <Paragraph position="3"> By using the mistake-driven mixture method, the constituents of a series of hierarchical tag context trees gradually change from broad-coverage tags (e.g., noun) to specific exceptional words that cannot be captured by part-of-speech or subdivision tags. The method, by mixing different levels of trees, incorporates not only frequent connections but also infrequent ones that are often considered to be collocational, without over-fitting the data.</Paragraph>
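<Paragraph> Since equations (4) and (5) are not reproduced above, the Python sketch below uses an AdaBoost.M1-style weight update and mixing rule purely as an illustrative stand-in for them; construct_htree, tag_with and the example representation are likewise hypothetical.

    import math

    def mistake_driven_mixture(examples, gold_tags, T, construct_htree, tag_with):
        """Sketch of the training loop: build T weighted trees, shrink the weights
        of correctly tagged examples, and mix the trees by their reliability."""
        n = len(examples)
        weight = [1.0] * n
        trees, mix_weights = [], []
        for _ in range(T):
            tree = construct_htree(examples, weight)                 # procedure 1
            errors = {i for i in range(n)
                      if tag_with(tree, examples[i]) != gold_tags[i]}
            eps = sum(weight[i] for i in errors) / sum(weight)       # weighted error
            beta = eps / (1.0 - eps)    # assumed update factor; guards for eps of 0 or 1 omitted
            for i in range(n):
                if i not in errors:
                    weight[i] *= beta                                # procedure 2
            trees.append(tree)
            mix_weights.append(math.log(1.0 / beta))                 # assumed mixing weight
        return trees, mix_weights

Shrinking the weights of correctly handled examples has, after renormalization, the same effect as boosting the mistakes: the next tree concentrates on the examples that the previous trees got wrong. </Paragraph>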
</Section> <Section position="8" start_page="233" end_page="234" type="metho"> <SectionTitle> 5 Preliminary Evaluation </SectionTitle> <Paragraph position="0"> We performed a preliminary evaluation using the first 8939 Japanese sentences in a year's volume of newspaper articles (Mainichi, 1993). We first automatically segmented and tagged these sentences and then revised them by hand. The total number of words in the hand-revised corpus was 226162. We trained our tag models on the corpus with every tenth sentence removed (starting with the first sentence) and then tested on the removed sentences. There were 22937 words in the test corpus.</Paragraph> <Paragraph position="1"> As the first milestone of performance, we tested the hand-crafted tag model of JUMAN (Kurohashi et al., 1994), the most widely used Japanese part-of-speech tagger. The tagging accuracy of JUMAN on the test corpus was only 92.0%. This shows that our corpus is difficult to tag, because it contains various genres of text, from obituaries to poetry.</Paragraph> <Paragraph position="2"> Next, we compared the mixture of bi-grams and the mixture of hierarchical tag context trees. In this experiment, only post-positional particles and auxiliaries were word-level elements of the basic tag set, and all other elements were at the subdivision level. In contrast, the bi-gram model was constructed by using the subdivision level.</Paragraph> <Paragraph position="3"> We set the iteration number T to 5. The results of our experiments are summarized in Figure 4.</Paragraph> <Paragraph position="4"> As a single-tree estimator (Number of Mixture = 1), the hierarchical tag context tree attained 94.1% accuracy, while the bi-gram model yielded 93.1%. A hierarchical tag context tree thus offers a slight improvement, but not a great deal. This conclusion agrees with Schütze and Singer's experiments, which used a context tree of usual part-of-speech tags.</Paragraph> <Paragraph position="9"> When we turn to the mixture estimator, a great difference is seen between hierarchical tag context trees and bi-grams. The hierarchical tag context trees produced by the mistake-driven mixture method greatly improved the accuracy, and over-fitting the data was not serious. The best and worst performances were 96.1% (Number of Mixture = 3) and 94.1% (Number of Mixture = 1), respectively.</Paragraph> <Paragraph position="10"> On the other hand, the performance of the bi-gram mixture was not satisfactory. The best and worst performances were 93.8% (Number of Mixture = 2) and 90.8% (Number of Mixture = 5), respectively.</Paragraph> <Paragraph position="11"> From these results, we may say that exceptional connections are well captured by hierarchical tag context trees but not by bi-grams. Bi-grams over subdivisions are too general to selectively detect exceptions.</Paragraph> </Section> </Paper>