File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/99/w99-0617_evalu.xml
Size: 3,838 bytes
Last Modified: 2025-10-06 14:00:41
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0617"> <Title>POS Tags and Decision Trees for Language Modeling</Title> <Section position="8" start_page="134" end_page="135" type="evalu"> <SectionTitle> 5 Results on Wall Street Journal </SectionTitle>
<Paragraph position="0"> In order to show that our model scales up to larger training data sizes and larger vocabulary sizes, we ran perplexity experiments on the Wall Street Journal corpus in the Penn Treebank, which is annotated with POS tags. We used one-eighth of the corpus as our test set, and the rest for training.</Paragraph>
<Paragraph position="1"> Figure 3 gives the results of varying the amount of training data from approximately 45,000 words up to 1.1 million words. We show the perplexity of both the POS-based model and the word-based backoff model.2 We see that the POS-based model shows a consistent perplexity reduction over the word-based model. When using all of the available training data, the POS-based model achieves a perplexity of 165.9, in comparison to 216.6 for the word-based backoff model, an improvement of 23.4%.</Paragraph>
<Paragraph position="2"> 2 As the amount of training data increases, the vocabulary increases from approximately 7,500 to 42,700; hence, fewer words of the test data are being excluded from the perplexity measure.</Paragraph>
<Paragraph position="3"> For the POS-based model, all word-POS combinations that occurred fewer than five times in the training data were grouped together for clustering the words and for building the decision tree. Thus, we built the word classification tree using 14,000 word/POS tokens, rather than the full set of 52,100 that occurred in the training data. Furthermore, the decision tree algorithm was not allowed to split a leaf with fewer than 6 datapoints. This gave us 103,000 leaf nodes (contexts) for the word tree, each with an average of 1277 probabilities, and 111,000 leaf nodes for the POS tree, each with 47 probabilities, for a total of 136 million parameters. In contrast, the word-based model was composed of 795K trigrams, 376K bigrams, and 43K unigrams, and used a total of 2.8 million parameters.3</Paragraph>
<Paragraph position="4"> 3 The count of 2.8 million parameters includes 795K trigram probabilities, 376K bigram probabilities, 376K bigram backoff weights, 43K unigram probabilities, and 43K unigram backoff weights. Since the trigrams and bigrams are sparse, we include 795K parameters to indicate which trigrams are included, and 376K to indicate which bigrams are included.</Paragraph>
<Paragraph position="5"> In the above, we compared our decision-tree-based approach against the backoff approach. Although our approach gives a 23.4% reduction in perplexity, it also gives a 49-fold increase in the size of the language model. We have done some preliminary experiments in reducing the model size. The word and POS trees can be reduced by decreasing the number of leaf nodes. The word decision tree can also be reduced by decreasing the number of probabilities in each leaf, which can be done by increasing the number of words put into the low-occurring group. We built a language model using our decision tree approach that uses only 2.8 million parameters by grouping all words that occur 40 times or fewer into the low-occurring class, disallowing nodes to be split if they have 50 or fewer datapoints, and pruning back nodes that give the smallest improvement in node impurity.</Paragraph>
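The size-reduction recipe just described has three knobs: collapsing words that occur 40 times or fewer into a single low-occurring class, refusing to split nodes with 50 or fewer datapoints, and pruning away the splits that improve node impurity the least. The sketch below is not the authors' implementation: it uses Gini impurity over boolean feature dicts, approximates the post-hoc pruning with a minimum-improvement threshold, and all names and thresholds other than 40 and 50 are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of the three size-reduction knobs:
#   (1) merge words occurring 40 times or fewer into one low-occurring class,
#   (2) refuse to split tree nodes holding 50 or fewer datapoints,
#   (3) drop splits whose impurity improvement is too small (a pre-pruning
#       stand-in for the paper's post-hoc pruning).
from collections import Counter


def group_low_count_words(tokens, max_count=40, low_class="<low>"):
    """Replace every word occurring `max_count` times or fewer with one class."""
    counts = Counter(tokens)
    return [w if counts[w] > max_count else low_class for w in tokens]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow_tree(data, labels, min_datapoints=50, min_improvement=1e-3):
    """Greedily grow a binary decision tree over boolean feature dicts.

    data   : list of dicts mapping feature name -> bool (candidate questions)
    labels : parallel list of outcomes (e.g. the next word or POS tag)
    A node becomes a leaf (holding a count distribution) if it has
    `min_datapoints` or fewer examples, or if no question improves the
    weighted impurity by at least `min_improvement`.
    """
    if len(data) <= min_datapoints:
        return {"leaf": Counter(labels)}

    parent = gini(labels)
    best = None
    for feature in sorted({f for d in data for f in d}):
        yes = [i for i, d in enumerate(data) if d.get(feature)]
        if not yes or len(yes) == len(data):
            continue  # the question does not partition this node
        yes_set = set(yes)
        no = [i for i in range(len(data)) if i not in yes_set]
        weighted = (len(yes) * gini([labels[i] for i in yes]) +
                    len(no) * gini([labels[i] for i in no])) / len(data)
        gain = parent - weighted
        if best is None or gain > best[0]:
            best = (gain, feature, yes, no)

    if best is None or best[0] < min_improvement:
        return {"leaf": Counter(labels)}  # prune: the best split is not worth it

    _, feature, yes, no = best
    return {"question": feature,
            "yes": grow_tree([data[i] for i in yes], [labels[i] for i in yes],
                             min_datapoints, min_improvement),
            "no": grow_tree([data[i] for i in no], [labels[i] for i in no],
                            min_datapoints, min_improvement)}


# With thresholds this high, a small toy input simply stays a single leaf:
# grow_tree([{"prev_is_det": True}] * 10, ["cat"] * 10)
#   -> {"leaf": Counter({"cat": 10})}
```

Raising the low-count threshold shrinks the probability table stored at each word-tree leaf, while the datapoint and improvement thresholds cap the number of leaves; these are the same trade-offs used in the paragraph above to bring the model from 136 million down to 2.8 million parameters.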
<Paragraph position="6"> The resulting word tree has 13,700 leaf nodes, each with an average of 80 probabilities, and the POS tree has 12,800 leaf nodes, each with 47 probabilities. This model achieves a perplexity of 191.7, which is still an 11.5% improvement over the word backoff approach. Hence, even for the same model size, the decision tree approach gives a perplexity reduction over the word backoff approach.4</Paragraph> </Section> </Paper>
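The headline figures in this section can be reproduced from the quantities quoted in the text; the short check below is a sketch of that arithmetic, with the rounding conventions being our own assumption.

```python
# Arithmetic behind the figures quoted in this section (rounding is ours).

def reduction(baseline_ppl, model_ppl):
    """Relative perplexity reduction over the baseline, in percent."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl

# Full decision-tree model vs. word-based backoff model.
print(round(reduction(216.6, 165.9), 1))       # 23.4 (% reduction)

# Reduced (2.8M-parameter) decision-tree model vs. the same backoff model.
print(round(reduction(216.6, 191.7), 1))       # 11.5 (% reduction)

# Backoff-model parameter count from footnote 3, in thousands of parameters:
# trigram probs + trigram indicators + bigram probs + bigram backoff weights
# + bigram indicators + unigram probs + unigram backoff weights.
print(795 + 795 + 376 + 376 + 376 + 43 + 43)   # 2804, i.e. about 2.8 million

# Size ratio of the full decision-tree model to the backoff model.
print(round(136e6 / 2.8e6))                    # 49 (the "49-fold increase")
```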