File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/w99-0617_metho.xml
Size: 19,101 bytes
Last Modified: 2025-10-06 14:15:31
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0617"> <Title>POS Tags and Decision Trees for Language Modeling</Title>
<Section position="4" start_page="130" end_page="130" type="metho"> <SectionTitle> 2 Redefining the Problem </SectionTitle>
<Paragraph position="0"> To add POS tags into the language model, we refrain from simply summing over all POS sequences as prior approaches have done. Instead, we redefine the speech recognition problem so that it finds the best word and POS sequence. Let P be a POS sequence for the word sequence W. The goal of the speech recognizer is now to solve the following.</Paragraph>
<Paragraph position="2"> The first term Pr(A|WP) is the acoustic model, which traditionally excludes the category assignment. In fact, the acoustic model can probably be reasonably approximated by Pr(A|W). The second term Pr(WP) is the POS-based language model and accounts for both the sequence of words and their POS assignment. We rewrite the sequence WP explicitly in terms of the N words and their corresponding POS tags, thus giving the sequence W1,N P1,N. The probability Pr(W1,N P1,N) forms the basis for POS taggers, with the exception that POS taggers work from a sequence of given words.</Paragraph>
<Paragraph position="3"> As in Equation 2, we rewrite Pr(W1,N P1,N) using the definition of conditional probability.</Paragraph>
<Paragraph position="5"> Equation 7 involves two probability distributions that need to be estimated. Previous attempts at using POS tags in a language model as well as POS taggers (i.e. (Charniak et al., 1993)) simplify these probability distributions, as given in Equations 8 and 9.</Paragraph>
<Paragraph position="7"> However, to successfully incorporate POS information, we need to account for the full richness of the probability distributions. Hence, as we will show in Table 1, we cannot use these two assumptions when learning the probability distributions.</Paragraph>
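The equations referred to above are not reproduced in this text; the following LaTeX sketch reconstructs their likely form from the surrounding prose (assuming amsmath). The decomposition of Pr(WP) into a word distribution and a POS distribution is what Section 3 estimates; the exact conditioning in the two simplified distributions is an assumption based on the description of prior work, not the paper's verbatim equations.

```latex
% Reconstructed from the prose above; not the paper's verbatim equations.
\begin{align*}
  \hat{W}\hat{P} &= \operatorname*{arg\,max}_{WP} \Pr(WP \mid A)
                  = \operatorname*{arg\,max}_{WP} \Pr(A \mid WP)\,\Pr(WP) \\
  \Pr(W_{1,N}P_{1,N})
     &= \prod_{i=1}^{N} \Pr(W_iP_i \mid W_{1,i-1}P_{1,i-1}) \\
     &= \prod_{i=1}^{N} \Pr(W_i \mid W_{1,i-1}P_{1,i})\,
                        \Pr(P_i \mid W_{1,i-1}P_{1,i-1})
\end{align*}
% The simplifications referred to as Equations 8 and 9:
\begin{align*}
  \Pr(W_i \mid W_{1,i-1}P_{1,i})   &\approx \Pr(W_i \mid P_i) \\
  \Pr(P_i \mid W_{1,i-1}P_{1,i-1}) &\approx \Pr(P_i \mid P_{1,i-1})
\end{align*}
```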
</Section>
<Section position="5" start_page="130" end_page="133" type="metho"> <SectionTitle> 3 Estimating the Probabilities </SectionTitle>
<Paragraph position="0"> To estimate the probability distributions, we follow the approach of Bahl et al. (1989) and use a decision tree learning algorithm (Breiman et al., 1984) to partition the context into equivalence classes.</Paragraph>
<Section position="1" start_page="130" end_page="131" type="sub_section"> <SectionTitle> 3.1 POS Probabilities </SectionTitle>
<Paragraph position="0"> For estimating the POS probability distribution, the algorithm starts with a single node containing all of the training data. It then finds a question to ask about the POS tags and word identities of the preceding words (P1,i-1 W1,i-1) in order to partition the node into two leaves, each being more informative as to which POS tag occurred than the parent node. Information-theoretic metrics, such as minimizing entropy, are used to decide which question to propose. The proposed question is then verified using heldout data: if the split does not lead to a decrease in entropy according to the heldout data, the split is rejected and the node is not further explored (Bahl et al., 1989). This process continues with the new leaves and results in a hierarchical partitioning of the context.</Paragraph>
<Paragraph position="1"> After growing a tree, the next step is to use the partitioning of the context induced by the decision tree to determine the probability estimates. Using the relative frequencies in each node would bias the estimates towards the training data that was used in choosing the questions. Hence, Bahl et al. smooth these probabilities with the probabilities of the parent node using interpolated estimation with a second heldout dataset. Using the decision tree algorithm to estimate probabilities is attractive since the algorithm can choose which parts of the context are relevant, and in what order. Hence, this approach lends itself more readily to allowing extra contextual information to be included, such as both the word identities and POS tags, and even hierarchical clusterings of them. If the extra information is not relevant, it will not be used.</Paragraph>
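A minimal Python sketch of the tree-growing loop just described, under simplifying assumptions (the context representation, candidate questions, depth limit, and stopping rule are illustrative, not the authors' implementation): each node proposes the question that most reduces the entropy of the POS distribution over its training examples, and keeps the split only if the entropy also drops on heldout data.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the empirical distribution over POS tags."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_entropy(examples, question):
    """Weighted entropy of the two leaves induced by a yes/no question."""
    yes = [tag for ctx, tag in examples if question(ctx)]
    no = [tag for ctx, tag in examples if not question(ctx)]
    if not yes or not no:
        return float("inf")              # degenerate split: never choose it
    n = len(examples)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)

def grow(train, heldout, questions, depth=0, max_depth=10):
    """Grow a POS-prediction tree over (context, pos_tag) pairs.

    'context' stands in for the preceding words and POS tags
    (P_{1,i-1} W_{1,i-1}); 'questions' is a non-empty list of yes/no
    functions over that context.  Returns a nested dict.
    """
    node = {"dist": Counter(tag for _, tag in train)}
    if depth >= max_depth or len(train) < 2:
        return node

    # Propose the question that minimizes entropy on the training data.
    best_q = min(questions, key=lambda q: split_entropy(train, q))
    gain_train = entropy([t for _, t in train]) - split_entropy(train, best_q)

    # Verify the proposed split on heldout data; reject it otherwise.
    if heldout:
        gain_heldout = (entropy([t for _, t in heldout])
                        - split_entropy(heldout, best_q))
    else:
        gain_heldout = 0.0               # nothing to verify against: stop here
    if gain_train <= 0 or gain_heldout <= 0:
        return node

    node["question"] = best_q
    node["yes"] = grow([e for e in train if best_q(e[0])],
                       [e for e in heldout if best_q(e[0])],
                       questions, depth + 1, max_depth)
    node["no"] = grow([e for e in train if not best_q(e[0])],
                      [e for e in heldout if not best_q(e[0])],
                      questions, depth + 1, max_depth)
    return node
```

The relative frequencies in each node's `dist` would then be interpolated with the parent node's distribution using a second heldout set, as described above.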
</Section>
<Section position="2" start_page="131" end_page="131" type="sub_section"> <SectionTitle> 3.2 Word Probabilities </SectionTitle>
<Paragraph position="0"> The procedure for estimating the word probability is almost identical to the above. However, rather than starting with all of the training data in a single node, we first partition the data by the POS tag of the word being estimated. Hence, we start with the probability Pr(Wi | Pi) as estimated by relative frequency. This is the same value with which non-decision-tree approaches start (and end). We then use the decision tree algorithm to further refine the equivalence contexts by allowing it to ask questions about the preceding words and POS tags.</Paragraph>
<Paragraph position="1"> Starting the decision tree algorithm with a separate root node for each POS tag has the following advantages. Words only take on a small set of POS tags. For instance, a word that is a superlative adjective cannot be a relative adjective. For the Wall Street Journal, each token on average takes on 1.22 of the 46 POS tags.</Paragraph>
<Paragraph position="2"> If we start with all training data in a single root node, the smoothing (no matter how small) will end up assigning some probability to each word occurring as every POS tag, leading to less exact probability estimates. Second, if we start with a root node for each POS tag, the number of words that need to be distinguished at each node in the tree is much smaller than the full vocabulary size. For the Wall Street Journal corpus, there are approximately 42,700 different words in the training data, but the most common POS tag, proper nouns (NNP), only has 12,000 different words. Other POS tags have far fewer, such as the personal pronouns with only 36 words. Making use of this smaller vocabulary size results in a faster algorithm that uses less memory.</Paragraph>
<Paragraph position="3"> A significant number of words in the training corpus have a small number of occurrences. Such words will prove problematic for the decision tree algorithm to predict. For each POS tag, we group the rarely occurring words into a single token for the decision tree to predict.</Paragraph>
<Paragraph position="4"> This not only leads to better probability estimates, but also reduces the number of parameters in the decision tree. For the Wall Street Journal corpus, excluding words that occur fewer than five times reduces the vocabulary size to 14,000 and the number of proper nouns to 3,126.</Paragraph>
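A short sketch of the per-POS initialization and rare-word grouping described in this subsection. The token name, the cutoff of five, and the choice to count words separately under each tag are illustrative assumptions rather than the authors' exact settings.

```python
from collections import Counter, defaultdict

def per_pos_word_distributions(tagged_corpus, min_count=5, rare_token="<rare>"):
    """Relative-frequency estimates Pr(w | p), one root distribution per POS tag.

    tagged_corpus: iterable of (word, pos_tag) pairs.  Words seen fewer than
    min_count times under a given tag are folded into rare_token for that tag.
    """
    counts = defaultdict(Counter)            # pos -> Counter over words
    for word, pos in tagged_corpus:
        counts[pos][word] += 1

    dists = {}
    for pos, word_counts in counts.items():
        grouped = Counter()
        for word, c in word_counts.items():
            grouped[word if c >= min_count else rare_token] += c
        total = sum(grouped.values())
        dists[pos] = {w: c / total for w, c in grouped.items()}
    return dists

# dists["NNP"] would map each frequent proper noun (plus "<rare>") to its
# relative frequency -- the starting point that the decision tree then
# refines by asking about the preceding words and POS tags.
```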
</Section>
<Section position="3" start_page="131" end_page="132" type="sub_section"> <SectionTitle> 3.3 Questions about POS Tags </SectionTitle>
<Paragraph position="0"> The context that we use for estimating the probabilities includes both word identities and POS tags. To make effective use of this information, we need to allow the decision tree algorithm to generalize between words and POS tags that behave similarly. To learn which words behave similarly, Black et al. (1989) and Magerman (1994) used the clustering algorithm of Brown et al. (1992) to build a hierarchical classification tree. Figure 1 gives the classification tree that we built for the POS tags from the Trains corpus. The algorithm starts with each token in a separate class and iteratively finds the two classes whose merger results in the smallest loss of information about POS adjacency. Rather than stopping at a certain number of classes, one continues until only a single class remains. However, the order in which classes were merged gives a hierarchical binary tree with the root corresponding to the entire tagset, each leaf to a single POS tag, and intermediate nodes to groupings of tags that occur in statistically similar contexts. The path from the root to a tag gives the binary encoding for the tag. The decision tree algorithm can ask which partition a POS tag belongs to by asking questions about the binary encoding. Of course, it does not make sense to ask about lower-level bits before the higher-level bits have been asked about. But we do allow it to ask complex bit-encoding questions so that it can find more optimal questions.</Paragraph>
</Section>
<Section position="4" start_page="132" end_page="133" type="sub_section"> <SectionTitle> 3.4 Questions about Word Identities </SectionTitle>
<Paragraph position="0"> For handling word identities, one could follow the approach used for handling the POS tags (e.g. (Black et al., 1992; Magerman, 1994)) and view the POS tags and word identities as two separate sources of information. Instead, we view the word identities as a further refinement of the POS tags. We start the clustering algorithm with a separate class for each word and each POS tag that it takes on, and only allow it to merge classes if the POS tags are the same. This results in a word classification tree for each POS tag. Using POS tags in word clustering means that words that take on different POS tags can be better modeled (Heeman, 1997). For instance, the word &quot;load&quot; can be used as a verb (VB) or as a noun (NN), and this usage affects which words it is similar to. Furthermore, restricting merges to those of the same POS tag allows us to make use of the hand-annotated linguistic knowledge for clustering words, which allows more effective trees to be built. It also significantly speeds up the clustering algorithm. For the Wall Street Journal, only 13% of all merges are between words of the same POS tag; hence, the remaining merges do not need to be considered.</Paragraph>
<Paragraph position="3"> To deal with low-occurring words in the training data, we follow the same approach as we do in building the classification tree. We group all words that occur fewer than a threshold number of times into a single token for each POS tag before clustering. This not only significantly reduces the input size to the clustering algorithm, but also relieves the clustering algorithm from trying to statistically cluster words for which there is not enough training data. Since low-occurring words are grouped by POS tag, we handle this data better than if all low-occurring words were grouped into a single token.</Paragraph>
<Paragraph position="4"> Figure 2 shows the classification tree for the personal pronouns (PRP) from the Trains corpus. For reference, we list the number of occurrences of each word. Notice that the algorithm distinguished between the subjective pronouns &quot;I&quot;, &quot;we&quot;, and &quot;they&quot;, and the objective pronouns &quot;me&quot;, &quot;us&quot; and &quot;them&quot;. The pronouns &quot;you&quot; and &quot;it&quot; take both cases and were probably clustered according to their most common usage in the corpus. Although we could have added extra POS tags to distinguish between these two types of pronouns, it seems that the clustering algorithm can make up for some of the shortcomings of the POS tagset.</Paragraph>
<Paragraph position="5"> Since words are viewed as a further refinement of POS information, we restrict the decision tree algorithm from asking about the word identity until the POS tag of the word is uniquely identified. We also restrict the decision tree from asking more specific bit questions until the less specific bits are uniquely determined.</Paragraph>
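A sketch of the bookkeeping behind the classification trees of Sections 3.3 and 3.4: greedy bottom-up merging followed by reading binary codes off the merge history. The cost function below is an arbitrary stand-in for Brown et al.'s loss of mutual information about adjacency, so this shows only the scaffolding, not the clustering criterion itself. It would be run once over the POS tags (Figure 1), and once per POS tag over that tag's words (Figure 2).

```python
from itertools import combinations

def cluster_with_codes(items, merge_cost):
    """Agglomeratively merge items into a binary hierarchy; return bit codes.

    items: the initial classes, e.g. the words occurring with one POS tag.
    merge_cost(a, b): cost of merging classes a and b (frozensets); in
    Brown et al.'s algorithm this would be the loss of mutual information
    about which classes occur adjacently.
    """
    # 'trees' maps each current class (a frozenset of items) to its merge
    # history: either a leaf item or a (left, right) pair of subtrees.
    trees = {frozenset([it]): it for it in items}

    while len(trees) > 1:
        # Greedily pick the cheapest pair of classes to merge.
        a, b = min(combinations(trees, 2), key=lambda pair: merge_cost(*pair))
        merged_tree = (trees[a], trees[b])
        del trees[a], trees[b]
        trees[a | b] = merged_tree

    (root,) = trees.values()

    codes = {}
    def assign(node, prefix=""):
        if isinstance(node, tuple):          # internal node: recurse left/right
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: an original item
            codes[node] = prefix
    assign(root)
    return codes

# Illustrative call for one POS tag (the cost here is a dummy placeholder):
prp = ["I", "we", "they", "you", "it", "me", "us", "them"]
codes = cluster_with_codes(prp, merge_cost=lambda a, b: len(a | b))
# The decision tree can then ask about the code bits from the highest
# (coarsest) bit downward, e.g. "does the PRP code start with 0?".
```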
</Section> </Section>
<Section position="6" start_page="133" end_page="134" type="metho"> <SectionTitle> 4 Results on Trains Corpus </SectionTitle>
<Paragraph position="0"> We ran our first set of experiments on the Trains corpus, a corpus of human-human task-oriented dialogs (Heeman and Allen, 1995).</Paragraph>
<Section position="1" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 4.1 Experimental Setup </SectionTitle>
<Paragraph position="0"> To make the best use of the limited size of the Trains corpus, we used a six-fold cross-validation procedure: each sixth of the data was tested using the rest of the data for training.</Paragraph>
<Paragraph position="1"> This was done for both the acoustic and language models. Dialogs for each pair of speakers were distributed as evenly as possible among the six partitions in order to minimize the new speaker problem.</Paragraph>
<Paragraph position="2"> For our perplexity results, we ran the experiments on the hand-collected transcripts.</Paragraph>
<Paragraph position="3"> Changes in speaker are marked in the word transcription with the token <turn>. Contractions, such as &quot;that'll&quot; and &quot;gonna&quot;, are treated as separate words: &quot;that&quot; and &quot;'ll&quot; for the first example, and &quot;going&quot; and &quot;ta&quot; for the second. All word fragments were changed to the token <fragment>. In searching for the best sequence of POS tags for the transcribed words, we follow the technique proposed by Chow and Schwartz (1989) and only keep a small number of alternative paths, pruning the low-probability paths after processing each word.</Paragraph>
<Paragraph position="4"> For our speech recognition results, we used OGI's large vocabulary speech recognizer (Yan et al., 1998; Wu et al., 1999), using acoustic models trained from the Trains corpus. We ran the decoder in a single pass using cross-word acoustic modeling and a trigram word-based backoff model (Katz, 1987) built with the CMU toolkit (Rosenfeld, 1995). For the first pass, contracted words were treated as single tokens in order to improve acoustic recognition of them. The result of the first pass was a word graph, which we rescored in a second pass using our other trigram language models.</Paragraph>
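An illustrative sketch of the pruned search over POS sequences mentioned above, in the spirit of Chow and Schwartz (1989). The beam width and the probability interfaces are assumptions; the actual system would plug in the decision-tree estimates of Pr(P_i | context) and Pr(W_i | P_i, context).

```python
def tag_search(words, tagset, log_prob_pos, log_prob_word, beam=10):
    """Find a high-probability POS sequence for a transcribed word sequence.

    log_prob_pos(tag, prev_tags, prev_words): log Pr(P_i | context)
    log_prob_word(word, tag, prev_tags, prev_words): log Pr(W_i | P_i, context)
    After each word, all but the 'beam' best partial hypotheses are pruned.
    """
    hyps = [([], 0.0)]                        # (tag sequence, log probability)
    for i, word in enumerate(words):
        extended = []
        for tags, score in hyps:
            for tag in tagset:
                s = (score
                     + log_prob_pos(tag, tags, words[:i])
                     + log_prob_word(word, tag, tags, words[:i]))
                extended.append((tags + [tag], s))
        # Prune: keep only the 'beam' highest-scoring partial paths.
        extended.sort(key=lambda h: h[1], reverse=True)
        hyps = extended[:beam]
    return hyps[0]                            # best (tags, log probability)
```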
</Section>
<Section position="2" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 4.2 Comparison with Word-Based Model </SectionTitle>
<Paragraph position="0"> Column two of Table 1 gives the results of the word-based backoff model and column three gives the results of our POS-based model. Both models were restricted to looking at only the previous two words (and POS tags) in the context, and hence are trigram models. Our POS-based model gives a perplexity reduction of 8.9% and an absolute word error rate reduction of 1.1%, which was found significant by the Wilcoxon test on the 34 different speakers in the Trains corpus (Z-score of -4.64). The POS-based model also achieves an absolute sentence error rate reduction of 1.3%, which was found significant by the McNemar test.</Paragraph>
<Paragraph position="1"> One reason for the good performance of our POS-based model is that we use all of the information in the context in estimating the word and POS probabilities. To show this effect, we contrast the results of our model, which uses the full context, with the results, given in column four, of a model that uses the simpler context afforded by the approximations given in Equations 8 and 9, which ignore word co-occurrence information. This simpler model uses the same decision tree techniques to estimate the probability distributions, but the decision tree can only ask questions of the simpler context, rather than the full context. In terms of POS tagging results, we see that using the full context leads to a POS error rate reduction of 8.4%. But more importantly, using the full context gives a 46.7% reduction in perplexity, and a 4.0% absolute reduction in the word error rate. In fact, the simpler model does not even perform as well as the word-based model.</Paragraph>
<Paragraph position="2"> Hence, to use POS tags in speech recognition, one must use a richer context for estimating the probabilities than what has been traditionally used.</Paragraph>
</Section>
<Section position="3" start_page="134" end_page="134" type="sub_section"> <SectionTitle> 4.3 Other Decision Tree Models </SectionTitle>
<Paragraph position="0"> The differences between our POS-based model and the backoff word-based model are partially due to the extra power of the decision tree approach in estimating the probabilities. To factor out this difference, we compare our POS-based model to word- and class-based models built using our decision tree approach for estimating the probabilities. For the word-based model, we treated all words as having the same POS tag and hence built a trivial POS classification tree and a single word hierarchical classification tree, and then estimated the probabilities using our decision tree algorithm.</Paragraph>
<Paragraph position="1"> We also built a class-based model to test whether a model with automatically learned unambiguous classes could perform as well as our POS-based model. The classes were obtained from our word clustering algorithm, but stopping once a certain number of classes has been reached. Unfortunately, the clustering algorithm of Brown et al. does not have a mechanism to decide an optimal number of word classes (cf. Kneser and Ney, 1993). Hence, to give an optimal evaluation of the class-based approach, we chose the number of classes that gave the best word error rate, which was 30 classes. We then used this class assignment instead of the POS tags, and used our existing algorithms to build our decision tree models.</Paragraph>
<Paragraph position="2"> The results of the three decision tree models are given in Table 2, along with the results from the backoff word-based model. First, our word-based decision tree model outperforms the word backoff model, giving an absolute word error rate reduction of 0.5%, which was found significant by the Wilcoxon test (Z-score -3.26). Hence, some of the improvement of our POS-based model comes from using decision trees with word clustering to estimate the probabilities. Second, there is little improvement from using unambiguous word classes. This is because we are already using a word hierarchical classification tree, which allows the decision tree algorithm to make generalizations between words in the same way that classes do (which explains why so few classes give the optimal word error rate). Third, using POS tags does lead to an improvement over the class-based model, with an absolute reduction in word error rate of 0.5%, an improvement found significant by the Wilcoxon test (Z-score -2.73). Hence, using shallow syntactic information, in the form of POS tags, does improve speech recognition since it allows syntactic knowledge to be used in predicting the subsequent words. This syntactic knowledge is also used to advantage in building the classification trees, since we can use the hand-coded knowledge present in the POS tags in our classification and we can better classify words that can be used in different ways.</Paragraph>
</Section> </Section> </Paper>