<?xml version="1.0" standalone="yes"?>
<Paper uid="J96-2003">
  <Title>Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies</Title>
  <Section position="2" start_page="0" end_page="222" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Many applications that process natural language can be enhanced by incorporating information about the probabilities of word strings; that is, by using statistical language model information (Church et al. 1991; Church and Mercer 1993; Gale, Church, and Yarowsky 1992; Liddy and Paik 1992). For example, speech recognition systems often require some model of the prior likelihood of a given utterance (Jelinek 1976). For convenience, the quality of these components can be measured by test set perplexity, PP (Bahl, Jelinek, and Mercer 1983; Bahl et al. 1989; Jelinek, Mercer, and Roukos 1990), in spite of some limitations (Ueberla 1994): PP = P(wlN) - ~, where there are N words in the word stream (w~/and ib is some estimate of the probability of that word stream.</Paragraph>
    <Paragraph position="1"> Perplexity is related to entropy, so our goal is to find models that estimate a low perplexity for some unseen representative sample of the language being modeled.</Paragraph>
    <Paragraph position="2"> Also, since entropy provides a lower bound on the average code length, the project of statistical language modeling makes some connections with text compression--good compression algorithms correspond to good models of the source that generated the text in the first place. With an arbitrarily chosen standard test set, statistical language models can be compared (Brown, Della Pietra, Mercer, Della Pietra, and Lai 1992). This allows researchers to make incremental improvements to the models (Kuhn and Mori 1990). It is in this context that we investigate automatic word classification; also, some cognitive scientists are interested in those features of automatic word classification that have implications for language acquisition (Elman 1990; Redington, Chater, and Finch 1994).</Paragraph>
    <Paragraph position="3"> One common model of language calculates the probability of the ith word wi * Department of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland. E-maih {J.McMahon,FJ.Smith}@qub.ac.uk @ 1996 Association for Computational Linguistics Computational Linguistics Volume 22, Number 2 in a test set by considering the n - 1 most recent words (Wi_n+l,Wi_n .... ,Wi_l&gt;, or i-1 (wi_,+l&gt; in a more compact notation. The model is finitary (according to the Chomsky hierarchy) and linguistically naive, but it has the advantage of being easy to construct and its structure allows the application of Markov model theory (Rabiner and Juang 1986).</Paragraph>
    <Paragraph position="4"> Much work has been carried out on word-based n-gram models, although there are recognized weaknesses in the paradigm. One such problem concerns the way that n-grams partition the space of possible word contexts. In estimating the probability of the ith word in a word stream, the model considers all previous word contexts to be identical if and only if they share the same final n - 1 words. This simultaneously fails to differentiate some linguistically important contexts and unnecessarily fractures others. For example, if we restrict our consideration to the two previous words in a stream--that is, to the trigram conditional probability estimate P(wilwi-~)--then the sentences:  (1) a. The boys eat the sandwiches quickly.</Paragraph>
    <Paragraph position="5"> and (2) a. The cheese in the sandwiches is delicious.</Paragraph>
    <Paragraph position="6"> contain points where the context is inaccurately considered identical. We can illustrate the danger of conflating the two sentence contexts by considering the nonsentences: (1) b. *The boys eat the sandwiches is delicious.</Paragraph>
    <Paragraph position="7"> and (2) b. &amp;quot;The cheese in the sandwiches quickly.</Paragraph>
    <Paragraph position="8">  There are some techniques to alleviate this problem--for example O'Boyle's n-gram (n &gt; 3) weighted average language model (O'Boyle, Owens, and Smith 1994). A second weakness of word-based language models is their unnecessary fragmentation of contexts--the familiar sparse data problem. This is a main motivation for the multilevel class-based language models we shall introduce later. Successful approaches aimed at trying to overcome the sparse data limitation include backoff (Katz 1987), Turing-Good variants (Good 1953; Church and Gale 1991), interpolation (Jelinek 1985), deleted estimation (Jelinek 1985; Church and Gale 1991), similarity-based models (Dagan, Pereira, and Lee 1994; Essen and Steinbiss 1992), Pos-language models (Derouault and Merialdo 1986) and decision tree models (Bahl et al. 1989; Black, Garside, and Leech 1993; Magerman 1994). We present an approach to the sparse data problem that shares some features of the similarity-based approach, but uses a binary tree representation for words and combines models using interpolation.</Paragraph>
    <Paragraph position="9"> Consider the word &lt;boys&gt; in (la) above. We would like to structure our entire vocabulary around this word as a series of similarity layers. A linguistically significant layer around the word &lt;boys&gt; is one containing all plural nouns; deeper layers contain more semantic similarities.</Paragraph>
    <Paragraph position="10"> If sentences (la) and (2a) are converted to the word-class streams &lt;determiner noun verb determiner noun adverb&gt; and &lt;determiner noun preposition determiner noun verb adjective&gt; respectively, then bigram, trigram, and possibly even  McMahon and Smith Improving Statistical Language Models higher n-gram statistics may become available with greater reliability for use as context differentiators (although Sampson \[1987\] suggests that no amount of word-class n-grams may be sufficient to characterize natural language fully). Of course, this still fails to differentiate many contexts beyond the scope of n-grams; while n-gram models of language may never fully model long-distance linguistic phenomena, we argue that it is still useful to extend their scope.</Paragraph>
    <Paragraph position="11"> In order to make these improvements, we need access to word-class information (Pos information \[Johansson et al. 1986; Black, Garside, and Leech 1993\] or semantic information \[Beckwith et al. 1991\]), which is usually obtained in three main ways: Firstly, we can use corpora that have been manually tagged by linguistically informed experts (Derouault and Merialdo 1986). Secondly, we can construct automatic part-of-speech taggers and process untagged corpora (Kupiec 1992; Black, Garside, and Leech 1993); this method boasts a high degree of accuracy, although often the construction of the automatic tagger involves a bootstrapping process based on a core corpus which has been manually tagged (Church 1988). The third option is to derive a fully automatic word-classification system from untagged corpora. Some advantages of this last approach include its applicability to any natural language for which some corpus exists, independent of the degree of development of its grammar, and its parsimonious commitment to the machinery of modern linguistics. One disadvantage is that the classes derived usually allow no linguistically sensible summarizing label to be attached (Schfitze \[1995\] is an exception). Much research has been carried out recently in this area (Hughes and Atwell 1994; Finch and Chater 1994; Redington, Chater, and Finch 1993; Brill et al. 1990; Kiss 1973; Pereira and Tishby 1992; Resnik 1993; Ney, Essen, and Kneser 1994; Matsukawa 1993). The next section contains a presentation of  a top-down automatic word-classification algorithm.</Paragraph>
    <Paragraph position="12"> 2. Word Classification and Structural Tags  Most statistical language models making use of class information do so with a single layer of word classes--often at the level of common linguistic classes: nouns, verbs, etc. (Derouault and Merialdo 1986). In contrast, we present the structural tag representation, where the symbol representing the word simultaneously represents the classification of that word (McMahon and Smith \[1994\] make connections between this and other representations; Black et al. \[1993\] contains the same idea applied to the field of probabilistic parsing; also structural tags can be considered a subclass of the more general tree-based statistical language model of Bahl et al. \[1989\]). In our model, each word is represented by an s-bit number the most significant bits of which correspond to various levels of classification; so given some word represented as structural tag w, we can gain immediate access to all s levels of classification of that word. Generally, the broader the classification granularity we chose, the more confident we can be about the distribution of classes at that level, but the less information this distribution offers us about next-word prediction. This should be useful for dealing with the range of frequencies of n-grams in a statistical language model. Some n-grams occur very frequently, so word-based probability estimates can be used. However, as n-grams become less frequent, we would prefer to sacrifice predictive specificity for reliability. Ordinary Pos-language models offer a two-level version of this ideal; it would be preferable if we could defocus our predictive machinery to some stages between all-word n-grams and Pos n-grams when, for example, an n-gram distribution is not quite representative enough to rely on all-word n-grams but contains predictively significant divisions that would be lost at the relatively coarse POS level. Also, for rare n-grams, even Pos distributions succumb to the sparse data problem (Sampson  Computational Linguistics Volume 22, Number 2 1987); if very broad classification information was available to the language-modeling system, coarse-grained predictions could be factored in, which might improve the overall performance of the system in just those circumstances.</Paragraph>
    <Paragraph position="13"> In many word-classification systems, the hierarchy is not explicitly represented and further processing, often by standard statistical clustering techniques, is required; see, for example, Elman (1990), Schtitze (1993), Brill et al. (1990), Finch and Chater (1994), Hughes and Atwell (1994), and Pereira and Tishby (1992). With the structural tag representation, each tag contains explicitly represented classification information; the position of that word in class-space can be obtained without reference to the positions of other words. Many levels of classification granularity can be made available simultaneously, and the weight which each of these levels can be given in, for example, a statistical language model, can alter dynamically. Using the structural tag representation, the computational overheads for using class information can be kept to a minimum. Furthermore, it is possible to organize an n-gram frequency database so that close structural tags are stored near to each other; this could be exploited to reduce the search space explored in speech recognition systems. For example, if the system is searching for the frequency of a particular noun in an attempt to find the most likely next word, then alternative words should already be nearby in the n-gram database. Finally, we note that in the current implementation of the structural tag representation we allow only one tag per orthographic word-form; although many of the current word-classification systems do the same, we would prefer a structural tag implementation that models the multimodal nature of some words more successfully.</Paragraph>
    <Paragraph position="14"> For example, (light) can occur as a verb and as a noun, whereas our classification system currently forces it to reside in a single location.</Paragraph>
    <Paragraph position="15"> Consider sentences (la) and (2a) again; we would like to construct a clustering algorithm that assigns some unique s-bit number to each word in our vocabulary so that the words are distributed according to some approximation of the layering described above that is, (boys) should be close to (people) and (is) should be close to (eat). We would also like semantically related words to cluster, so that, although (boys) may be near (sandwiches) because both are nouns, (girls) should be even closer to (boys) because both are human types. In theory, structural tag representations can be dynamically updated--for example, (bank) might be close to (river) in some contexts and closer to (money) in others. Although we could ,construct a useful set of structural tags manually (McMahon 1994), we prefer to design an algorithm that builds such a classification.</Paragraph>
    <Paragraph position="16"> For a given vocabulary V, the mapping t initially translates words into their corresponding unique structural tags. This mapping is constructed by making random word-to-tag assignments.</Paragraph>
    <Paragraph position="17"> The mutual information (Cover and Thomas 1991) between any two events x and</Paragraph>
    <Paragraph position="19"> If the two events x and y stand for the occurrence of certain word-class unigrams in a sample, say ci and cj, then we can estimate the mutual information between the two classes. In these experiments, we use maximum likelihood probability estimates based on a training corpus. In order to estimate the average class mutual information for a classification depth of s bits, we compute the average class mutual information:</Paragraph>
    <Paragraph position="21"> McMahon and Smith Improving Statistical Language Models where ci and cj are word classes and Ms(t) is the average class mutual information for structural tag classification t at bit depth s. This criterion is the one used by Brown, Della Pietra, DeSouza, Lai, and Mercer (1992); Kneser and Ney (1993) show how it is equivalent to maximizing the bi-Pos-language model probability. We are interested in that classification which maximizes the average class mutual information; we call this t o and it is found by computing:</Paragraph>
    <Paragraph position="23"> Currently, no method exists that can find the globally optimal classification, but sub-optimal strategies exist that lead to useful classifications. The suboptimal strategy used in the current automatic word-classification system involves selecting the locally optimal structure between t and t', which differ only in their classification of a single word. An initial structure is built by using the computer's pseudorandom number generator to produce a random word hierarchy. Its M(t) value is calculated. Next, another structure, t r is created as a copy of the main one, with a single word moved to a different place in the classification space. Its M(t t) value is calculated. This second calculation is repeated for each word in the vocabulary and we keep a record of the transformation which leads to the highest M(t'). After an iteration through the vocabulary, we select that t' having the highest M(t ~) value and continue until no single move leads to a better classification. With this method, words which at one time are moved to a new region in the classification hierarchy can move back at a later time, if licensed by the mutual information metric. In practice, this does happen.</Paragraph>
    <Paragraph position="24"> Therefore, each transformation performed by the algorithm is not irreversible within a level, which should allow the algorithm to explore a larger space of possible word classifications.</Paragraph>
    <Paragraph position="25"> The algorithm is embedded in a system that calculates the best classifications for all levels beginning with the highest classification level. Since the structural tag representation is binary, this first level seeks to find the best distribution of words into two classes. Other versions of the top-down approach are used by Pereira and Tishby (1992) and Kneser and Ney (1993) to classify words; top-down procedures are also used in other areas (Kirkpatrick, Gelatt, and Vecchi 1983). The system of Pereira and Tishby (1992; Pereira, Tishby, and Lee 1993) has the added advantage that class membership is probabilistic rather than fixed.</Paragraph>
    <Paragraph position="26"> When the locally optimal two-class hierarchy has been discovered by maximizing Ml(t), whatever later reclassifications occur at finer levels of granularity, words will always remain in the level 1 class to which they now belong. For example, if many nouns now belong to class 0 and many verbs to class 1, later subclassifications will not influence the M1 (t) value. This reasoning also applies to all classes s = 2, 3... 16 (see Figure 1).</Paragraph>
    <Paragraph position="27"> We note that, in contrast with a bottom-up approach, a top-down system makes its first decisions about class structure at the root of the hierarchy; this constrains the kinds of classification that may be made at lower levels, but the first clustering decisions made are based on healthy class frequencies; only later do we start noticing the effects of the sparse data problem. We therefore expect the topmost classifications to be less constrained, and hopefully more accurate. With a bottom-up approach, the reverse may be the case. The tree representation also imposes its own constraints, mentioned later.</Paragraph>
    <Paragraph position="28"> This algorithm, which is O(V 3) for vocabulary size V, works well with the most  Therefore, a word in class M may only move to class N to maximize the mutual information--any other move would violate a previous level's classification.</Paragraph>
    <Paragraph position="29"> frequent words from a corpus1; however, we have developed a second algorithm, to be used after the first, to allow vocabulary coverage in the range of tens of thousands of word types. This second algorithm exploits Zipf's law (1949)--the most frequent words account for the majority of word tokens--by adding in low-frequency words only after the first algorithm has finished processing high-frequency ones. We make the assumption that any influence that these infrequent words have on the first set of frequent words can be discounted. The algorithm is an order of magnitude less computationally intensive and so can process many more words in a given time. By this method, we can also avoid modeling only a simplified subset of the phenomena in which we are interested and hence avoid the danger of designing systems that do not scale-up adequately (Elman 1990). Once the positions of high-frequency words has been fixed by the first algorithm, they are not changed again; we can add any new word, in order of frequency, to the growing classification structure by making 16 binary decisions: Should its first bit be a 0 or a 1? And its second? Of our 33,360 word vocabulary, we note that the most frequent 569 words are clustered using the main 1 In a worst case analysis, the mutual information metric will be O(V 2) and we need to evaluate the tree on V occasions--~ach time with one word reclassified; lower order terms (for example, the number of iterations at each level) can be ignored. In practice, the mutual information calculation is much less than O(V 2) since there are far fewer than V 2 bigrams observed in our training text.</Paragraph>
    <Paragraph position="30">  McMahon and Smith Improving Statistical Language Models algorithm; the next 15,000 are clustered by our auxiliary algorithm and the remaining 17,791 words are added to the tree randomly. We add these words randomly due to hardware limitations, though we notice that the 15,000th most frequent word in our vocabulary occurs twice only--a very difficult task for any classification system. The main algorithm takes several weeks to cluster the most frequent 569 words on a Sparc-IPC and several days for the supplementary algorithm.</Paragraph>
  </Section>
class="xml-element"></Paper>