<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1080">
  <Title>A Pylonic Decision-Tree Language Model with Optimal Question Selection</Title>
  <Section position="3" start_page="606" end_page="608" type="metho">
    <SectionTitle>
2 Description of the Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="606" end_page="606" type="sub_section">
      <SectionTitle>
2.1 The Decision-Tree Classifier
</SectionTitle>
      <Paragraph position="0"> The purpose of the decision-tree classifier is to cluster the word history $w_1, w_2, \ldots, w_{n-1}$ into a manageable number of classes $C_i$, and to estimate for each class the next-word conditional distribution $P\{w_n \mid C_i\}$. The classifier, together with the collection of conditional probabilities, is the resulting LM.</Paragraph>
      <Paragraph position="1"> The general methodology of decision tree construction is well known (e.g., see (Jelinek, 1998)). The following issues need to be addressed for our specific application.</Paragraph>
      <Paragraph position="2">
* A tree-growing criterion, often called the measure of purity;
* A set of permitted questions (partitions) to be considered at each node;
* A stopping rule, which decides the number of distinct classes.
</Paragraph>
      <Paragraph position="3"> These are discussed below. Once the tree has been grown, we address one other issue: the estimation of the language model at each leaf of the resulting tree classifier.</Paragraph>
      <Paragraph position="4"> We view the training corpus as a set of ordered pairs of the following word $w_n$ and its word history $(w_1, w_2, \ldots, w_{n-1})$. We seek a classification of the space of all histories (not just those seen in the corpus) such that a good conditional probability $P\{w_n \mid C(w_1, w_2, \ldots, w_{n-1})\}$ can be estimated for each class of histories. Since several vocabulary items may potentially follow any history, perfect "classification" or prediction of the word that follows a history is out of the question, and the classifier must partition the space of all word histories so as to maximize the probability $P\{w_n \mid C(w_1, w_2, \ldots, w_{n-1})\}$ assigned to the pairs in the corpus.</Paragraph>
      <Paragraph position="5"> We seek a history classification such that $C(w_1, w_2, \ldots, w_{n-1})$ is as informative as possible about the distribution of the next word. Thus, from an information-theoretical point of view, a natural cost function for choosing questions is the empirical conditional entropy of the training data with respect to the tree:</Paragraph>
      <Paragraph position="7"> Each question in the tree is chosen so as to minimize the conditional entropy, or, equivalently, to maximize the mutual information between the class of a history and the predicted word.</Paragraph>
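      <Paragraph> As an illustration of this criterion, the following Python sketch computes the empirical conditional entropy of next-word counts grouped by history class. It assumes the standard definition, H(w | C) = -sum over classes C and words w of f(w, C) log2 f(w | C), with f denoting relative frequencies in the training data. The function name `conditional_entropy`, the input layout (a mapping from class label to next-word counts) and the toy counts are assumptions made for illustration only.
```python
import math
from collections import Counter


def conditional_entropy(class_counts):
    """Empirical H(w | C) in bits.

    class_counts maps a class label to a Counter of next-word counts
    observed for the histories that fall into that class.
    """
    total = sum(sum(c.values()) for c in class_counts.values())
    entropy = 0.0
    for counts in class_counts.values():
        class_total = sum(counts.values())
        for n in counts.values():
            if n > 0:
                # subtract f(w, C) * log2 f(w | C)
                entropy -= (n / total) * math.log2(n / class_total)
    return entropy


# Hypothetical toy data: one leaf before and after a candidate question.
before = {"all": Counter({"a": 4, "b": 4})}
after = {"yes": Counter({"a": 4}), "no": Counter({"b": 4})}

print(conditional_entropy(before))  # 1.0
print(conditional_entropy(after))   # 0.0
```
In this toy case the question separates the two next words perfectly, so the entropy drops from one bit to zero; a tree-growing procedure would prefer it to any question that leaves the entropy higher.</Paragraph>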
    </Section>
    <Section position="2" start_page="606" end_page="607" type="sub_section">
      <SectionTitle>
Decision Pylons
</SectionTitle>
      <Paragraph position="0"> Although a tree with general questions can represent any classification of the histories, some restrictions must be made in order to make the selection of an optimal question computationally feasible. We consider elementary questions of the type $w_{-k} \in S$, where $w_{-k}$ refers to the k-th position before the word to be predicted, and S is a subset of the vocabulary. However, this kind of elementary question is rather simplistic, as one node in the tree cannot refer to two different history positions. A conjunction of elementary questions can still be implemented over a few nodes, but similar histories become unnecessarily fragmented. Therefore a node in the tree is not implemented as a single elementary question, but as a modified decision tree in itself, called a pylon (Bahl et al., 1989). The topology of the pylon, shown in Figure 1, allows us to combine answers from elementary questions without increasing the number of classes. A pylon may be of any size, and it is grown as a standard decision tree.</Paragraph>
      <Paragraph position="1">  For each leaf node and position k the problem is to find the subset S of the vocabulary that minimizes the entropy of the split $w_{-k} \in S$.</Paragraph>
      <Paragraph position="2"> The best question over all k's will eventually be selected. We will use a greedy optimization algorithm developed by Chou (1991). Given a partition $P = \{\beta_1, \beta_2, \ldots, \beta_k\}$ of the vocabulary, the method finds a subset S of P for which the reduction of entropy after the split is nearly optimal.</Paragraph>
      <Paragraph position="3"> The algorithm is initialized with a random partition $S \cup \bar{S}$ of P. At each iteration every atom $\beta$ is examined and redistributed into a new partition $S' \cup \bar{S}'$, according to the following rule: place $\beta$ into $S'$ when $$\sum_w f(w \mid w_{-k} \in \beta) \log \frac{f(w \mid w_{-k} \in \beta)}{f(w \mid w_{-k} \in S)} \le \sum_w f(w \mid w_{-k} \in \beta) \log \frac{f(w \mid w_{-k} \in \beta)}{f(w \mid w_{-k} \in \bar{S})}$$ and into $\bar{S}'$ otherwise, where the f's are word frequencies computed relative to the given leaf. This selection criterion ensures a decreasing empirical entropy of the tree. The iteration stops when $S = S'$ and $\bar{S} = \bar{S}'$. If questions on the same level in the pylon are constructed independently with the Chou algorithm, the overall entropy may increase. That is why nodes whose children are merged must be jointly optimized. In order to reduce complexity, questions on the same level in the pylon are asked with respect to the same position in the history.</Paragraph>
      <Paragraph position="4"> The Chou algorithm is not accurate when the training data is sparse. For instance, when no history at the leaf has $w_{-k} \in \beta$, the atom is invariably placed in $S'$. Because such a choice of a question is not based on evidence, it is not expected to generalize to unseen data. As the tree is growing, data is fragmented among the leaves, and this issue becomes unavoidable. To deal with this problem, we choose the atomic partition P so that each atom gets a history count above a threshold.</Paragraph>
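      <Paragraph> The reassignment loop described in this paragraph can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: it reads the rule as "place an atom on the side whose pooled next-word distribution is closer in Kullback-Leibler divergence, with ties going to S". The names `pooled`, `divergence` and `chou_split`, the explicit (rather than random) initial subset, the `max_iter` safeguard and the toy counts are all assumptions.
```python
import math
from collections import Counter


def pooled(atoms, members):
    """Sum the next-word counts of all atoms in `members`."""
    total = Counter()
    for a in members:
        total.update(atoms[a])
    return total


def divergence(atom_counts, side_counts):
    """KL divergence (bits) from an atom's next-word distribution
    to the pooled distribution of one side of the split."""
    n_atom = sum(atom_counts.values())
    n_side = sum(side_counts.values())
    if n_atom == 0:
        return 0.0  # no evidence: the atom is equally close to both sides
    d = 0.0
    for w, n in atom_counts.items():
        if n == 0:
            continue
        if side_counts[w] == 0:
            return math.inf  # this side never predicts w
        d += (n / n_atom) * math.log2((n / n_atom) / (side_counts[w] / n_side))
    return d


def chou_split(atoms, initial_s, max_iter=100):
    """Reassign atoms until the two-way split stops changing."""
    s = set(initial_s)
    for _ in range(max_iter):
        s_bar = set(atoms) - s
        f_s, f_bar = pooled(atoms, s), pooled(atoms, s_bar)
        # An atom goes to S when it is at least as close to S as to S-bar.
        new_s = {a for a in atoms
                 if divergence(atoms[a], f_bar) >= divergence(atoms[a], f_s)}
        if new_s == s:
            break
        s = new_s
    return s, set(atoms) - s


# Hypothetical atoms: next-word counts for histories with w_{-k} in the atom.
atoms = {
    "A": Counter({"x": 9, "y": 1}),
    "B": Counter({"x": 8, "y": 2}),
    "C": Counter({"x": 1, "y": 9}),
    "D": Counter({"x": 2, "y": 8}),
}
s, s_bar = chou_split(atoms, initial_s={"A", "D"})
print(sorted(s), sorted(s_bar))
```
Note that an atom with no counts has zero divergence to both sides and therefore always lands in S, which is the sparse-data weakness that motivates a count threshold on atoms.</Paragraph>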
      <Paragraph position="5"> The choice of such an atomic partition is a complex problem, as words composing an atom must have similar predictive power. Our approach is to consider a hierarchical classification of the words, and prune it to a level at which each atom gets sufficient history counts. The word hierarchy is generated from training data with an information-theoretical algorithm (Lucassen and Mercer, 1984) detailed in Section 2.2.  A common problem of all decision trees is the lack of a clear rule for when to stop growing new nodes. The split of a node always brings a reduction in the estimated entropy, but that might not hold for the true entropy. We use a simplified version of cross-validation (Breiman et al., 1984) to test for the significance of the reduction in entropy. If the entropy on a held-out data set is not reduced, or the reduction on the held-out text is less than 10% of the entropy reduction on the training text, the leaf is not split, because the reduction in entropy has failed to generalize to the unseen data.</Paragraph>
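      <Paragraph> The stopping test stated above reduces to a small predicate, sketched here in Python. The function name `should_split`, its argument names and the example entropies are hypothetical; only the two conditions (a positive held-out reduction, and a held-out reduction of at least 10% of the training reduction) come from the text.
```python
def should_split(train_before, train_after, heldout_before, heldout_after,
                 min_ratio=0.10):
    """Accept a split only if the held-out entropy drops, and drops by
    at least `min_ratio` of the drop observed on the training text."""
    train_gain = train_before - train_after
    heldout_gain = heldout_before - heldout_after
    return heldout_gain > 0 and heldout_gain >= min_ratio * train_gain


# Hypothetical entropies (bits) before and after a candidate split.
print(should_split(5.0, 4.0, 5.2, 4.9))   # held-out gain 0.3 of 1.0: split
print(should_split(5.0, 4.0, 5.2, 5.15))  # held-out gain about 0.05: no split
print(should_split(5.0, 4.0, 5.2, 5.3))   # held-out entropy rose: no split
```
</Paragraph>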
    </Section>
    <Section position="3" start_page="607" end_page="607" type="sub_section">
      <SectionTitle>
2.1.5 Estimating the Language Model
at Each Leaf
</SectionTitle>
      <Paragraph position="0"> Once an equivalence classification of all histories is constructed, additional training data is used to estimate the conditional probabilities required for each node, as described in (Bahl et al., 1989). Smoothing as well as interpolation with a standard trigram model eliminates the zero probabilities.</Paragraph>
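      <Paragraph> One simple way to picture how interpolation removes zero probabilities is a fixed-weight linear mixture of the leaf distribution and a trigram distribution, sketched below in Python. The linear form, the single weight, the function name `interpolate` and the toy distributions are assumptions for illustration; the text does not specify the exact smoothing or interpolation scheme.
```python
def interpolate(p_tree, p_trigram, weight=0.5):
    """Linearly mix two next-word distributions given as dicts."""
    if weight > 1.0 or 0.0 > weight:
        raise ValueError("weight must lie in [0, 1]")
    vocab = set(p_tree) | set(p_trigram)
    return {w: weight * p_tree.get(w, 0.0)
               + (1.0 - weight) * p_trigram.get(w, 0.0)
            for w in vocab}


# Hypothetical distributions: the leaf never saw "dog", the trigram did.
p_leaf = {"cat": 1.0}
p_tri = {"cat": 0.5, "dog": 0.5}
print(interpolate(p_leaf, p_tri, weight=0.5))
```
Any word with non-zero trigram probability keeps a non-zero probability in the mixture as long as the weight on the tree model is below one.</Paragraph>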
    </Section>
    <Section position="4" start_page="607" end_page="608" type="sub_section">
      <SectionTitle>
2.2 The Hierarchical Classification of
Words
</SectionTitle>
      <Paragraph position="0"> The goal is to build a binary tree with the words of the vocabulary as leaves, such that similar words correspond to closely related leaves. A partition of the vocabulary can be derived from such a hierarchy by taking a cut through the tree to obtain a set of subtrees. The reason for keeping a hierarchy instead of a fixed partition of the vocabulary is to be able to dynamically adjust the partition to accommodate for training data fragmentation.</Paragraph>
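      <Paragraph> To make the idea of a count-dependent cut concrete, the Python sketch below walks a word hierarchy from the root and stops descending wherever a child would fall below a history-count threshold. The nested-tuple tree encoding, the top-down policy, the names `leaves` and `cut`, and the example words and counts are all assumptions; the text only requires that each resulting atom has sufficient history counts.
```python
def leaves(node):
    """Words under a node; a node is a word or a (left, right) pair."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def cut(node, counts, threshold):
    """Top-down cut: split a node only while both children keep
    a history count of at least `threshold`."""
    if isinstance(node, str):
        return [{node}]
    left, right = node
    enough = all(sum(counts.get(w, 0) for w in leaves(child)) >= threshold
                 for child in (left, right))
    if not enough:
        return [set(leaves(node))]
    return cut(left, counts, threshold) + cut(right, counts, threshold)


# Hypothetical hierarchy and history counts at one leaf of the decision tree.
hierarchy = ((("mon", "tue"), "wed"), ("red", "blue"))
counts = {"mon": 3, "tue": 1, "wed": 2, "red": 10, "blue": 12}
print(cut(hierarchy, counts, threshold=5))
```
With these counts the three weekday words stay together as one atom, while the two frequent colour words each become an atom of their own; at a leaf with less data the same hierarchy would yield a coarser partition.</Paragraph>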
      <Paragraph position="1"> The hierarchical classification of words was built with an entirely data-driven method. The motivation is that even though an expert could exhibit some strong classes by looking at parts of speech and synonyms, it is hard to produce a full hierarchy of a large vocabulary. Perhaps a combination of the expert and data-driven approaches would give the best result. Nevertheless, the algorithm that has been used in deriving the hierarchy can be initialized with classes based on parts of speech or meaning, thus taking account of prior expert information.</Paragraph>
      <Paragraph position="2"> The approach is to construct the tree backwards. Starting with single-word classes, each iteration consists of merging the two classes most similar in predicting the word that follows them. The process continues until the entire vocabulary is in one class. The binary tree is then obtained from the sequence of merge operations.</Paragraph>
      <Paragraph position="3"> To quantify the predictive power of a partition $P = \{\beta_1, \beta_2, \ldots, \beta_k\}$ of the vocabulary we look at the conditional entropy of the vocabulary with respect to the class of the previous word: $$H(w \mid P) = \sum_{\beta \in P} p(\beta) H(w \mid w_{-1} \in \beta) = -\sum_{\beta \in P} p(\beta) \sum_{w \in V} p(w \mid \beta) \log p(w \mid \beta)$$ At each iteration we merge the two classes that minimize $H(w \mid P') - H(w \mid P)$, where $P'$ is the partition after the merge. In information-theoretical terms we seek the merge that brings the least reduction in the information provided by P about the distribution of the current word.</Paragraph>
      <Paragraph position="4"> [Caption fragment: ... partition of a 5000-word vocabulary (each column is a different class).] The algorithm produced satisfactory results on a 5000-word vocabulary. One can see from the sample classes that the automatically built hierarchy accounts for similarity both in meaning and in part of speech.</Paragraph>
      <Paragraph position="5"> the vocabulary is significantly larger, making the estimation of N-gram models for N &gt; 3 impossible. However, we expect that, owing to the good smoothing of the trigram probabilities, a combination of the decision-tree and N-gram models will give the best results.</Paragraph>
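      <Paragraph> The merging procedure can be sketched in Python as a brute-force loop that evaluates every pair of classes at each step. It is meant for toy vocabularies only and is not the author's implementation: the names `contribution`, `merge_loss` and `build_hierarchy`, the nested-tuple tree encoding and the example bigram counts are assumptions. The quantity minimized is the increase in H(w | P) defined above, computed from the per-class terms p(beta) H(w | beta).
```python
import math
from collections import Counter
from itertools import combinations


def contribution(counts, total):
    """One class's term p(beta) * H(w | beta) of H(w | P), in bits."""
    n_class = sum(counts.values())
    return -sum((n / total) * math.log2(n / n_class)
                for n in counts.values() if n > 0)


def merge_loss(classes, a, b, total):
    """Increase in H(w | P) caused by merging classes a and b."""
    merged = classes[a] + classes[b]
    return (contribution(merged, total)
            - contribution(classes[a], total)
            - contribution(classes[b], total))


def build_hierarchy(bigrams):
    """bigrams maps each previous word to counts of the following word.
    Returns a nested-tuple binary tree over the previous words."""
    total = sum(sum(c.values()) for c in bigrams.values())
    classes = {w: Counter(c) for w, c in bigrams.items()}  # subtree -> counts
    while len(classes) > 1:
        a, b = min(combinations(classes, 2),
                   key=lambda pair: merge_loss(classes, pair[0], pair[1], total))
        classes[(a, b)] = classes.pop(a) + classes.pop(b)
    return next(iter(classes))


# Hypothetical bigram counts: previous word -> following-word counts.
bigrams = {
    "mon": {"morning": 5, "night": 5},
    "tue": {"morning": 4, "night": 6},
    "red": {"car": 9, "wine": 1},
    "blue": {"car": 8, "sky": 2},
}
print(build_hierarchy(bigrams))
```
On this toy input the two weekday words are merged first because they predict nearly the same following words, then the two colour words, and the final merge joins the two groups at the root.</Paragraph>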
    </Section>
  </Section>
  <Section position="4" start_page="608" end_page="608" type="metho">
    <SectionTitle>
4 Summary
</SectionTitle>
    <Paragraph position="0"> In this paper we have developed a decision-tree method for building a language model that predicts words given their previous history. We have described a powerful question search algorithm that guarantees the local optimality of the selection and that has not been applied before to word language models. We expect that the model will perform significantly better than the standard N-gram approach.</Paragraph>
  </Section>
  <Section position="5" start_page="608" end_page="608" type="metho">
    <SectionTitle>
5 Acknowledgments
</SectionTitle>
    <Paragraph position="0"> I would like to thank Prof. Frederick Jelinek and Sanjeev Khudanpur from the Center for Language and Speech Processing, Johns Hopkins University, for their help related to this work and for providing the computer resources. I also wish to thank Prof. Graeme Hirst from the University of Toronto for his useful advice in all the stages of this project.</Paragraph>
  </Section>
</Paper>