<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1022">
  <Title>Automatic Learning of Language Model Structure</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Factored Language Models
</SectionTitle>
    <Paragraph position="0"> A standard statistical language model computes the probability of a word sequence W = w1; w2; :::; wT as a product of conditional probabilities of each word wi given its history, which is typically approximated by just one or two preceding words (leading to bigrams, and trigrams, respectively). Thus, a trigram language model is described by</Paragraph>
    <Paragraph position="2"> Even with this limitation, the estimation of the required probabilities is challenging: many word contexts may be observed infrequently or not at all, leading to unreliable probability estimates under maximum likelihood estimation. Several techniques have been developed to address this problem, in particular smoothing techniques (Chen and Goodman, 1998) and class-based language models (Brown and others, 1992). In spite of such parameter reduction techniques, language modeling remains a di cult task, in particular for morphologically rich languages, e.g. Turkish, Russian, or Arabic.</Paragraph>
    <Paragraph position="3"> Such languages have a large number of word types in relation to the number of word tokens in a given text, as has been demonstrated in a number of previous studies (Geutner, 1995; Kiecza et al., 1999; Hakkani-T ur et al., 2002; Kirchho et al., 2003). This in turn results in a high perplexity and in a large number of out-of-vocabulary (OOV) words when applying a trained language model to a new unseen text.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Factored Word Representations
</SectionTitle>
      <Paragraph position="0"> A recently developed approach that addresses this problem is that of Factored Language Models (FLMs) (Kirchho et al., 2002; Bilmes and Kirchho , 2003), whose basic idea is to decompose words into sets of features (or factors) instead of viewing them as unanalyzable wholes.</Paragraph>
      <Paragraph position="1"> Probabilistic language models can then be constructed over (sub)sets of word features instead of, or in addition to, the word variables themselves. For instance, words can be decomposed into stems/lexemes and POS tags indicating their morphological features, as shown below: Word: Stock prices are rising Stem: Stock price be rise Tag: Nsg N3pl V3pl Vpart Such a representation serves to express lexical and syntactic generalizations, which would otherwise remain obscured. It is comparable to class-based representations employed in standard class-based language models; however, in FLMs several simultaneous class assignments are allowed instead of a single one. In general, we assume that a word is equivalent to a xed number (K) of factors, i.e. W f1:K. The task then is to produce a statistical model over the resulting representation - using a trigram approximation, the resulting probability model is as follows:</Paragraph>
      <Paragraph position="3"> (2) Thus, each word is dependent not only on a single stream of temporally ordered word variables, but also on additional parallel (i.e. simultaneously occurring) features. This factored representation can be used in two di erent ways to improve over standard LMs: by using a product model or a backo model. In a product model, Equation 2 can be simpli ed by nding conditional independence assumptions among sub-sets of conditioning factors and computing the desired probability as a product of individual models over those subsets. In this paper we only consider the second option, viz. using the factors in a backo procedure when the word n-gram is not observed in the training data.</Paragraph>
      <Paragraph position="4"> For instance, a word trigram that is found in an unseen test set may not have any counts in the training set, but its corresponding factors (e.g. stems and morphological tags) may have been observed since they also occur in other words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Generalized parallel backoff
</SectionTitle>
      <Paragraph position="0"> Backo is a common smoothing technique in language modeling. It is applied whenever the count for a given n-gram in the training data falls below a certain threshold . In that case, the maximum-likelihood estimate of the n-gram probability is replaced with a probability derived from the probability of the lower-order (n 1)-gram and a backo weight. N-grams whose counts are above the threshold retain their maximum-likelihood estimates, discounted by a factor that re-distributes probability mass to the lower-order distribution:</Paragraph>
      <Paragraph position="2"> where c is the count of (wt; wt 1; wt 2), pML denotes the maximum-likelihood estimate and dc is a discounting factor that is applied to the higher-order distribution. The way in which the discounting factor is estimated determines the actual smoothing method (e.g. Good-Turing, Kneser-Ney, etc.) The normalization factor (wt 1; wt 2) ensures that the entire distribution sums to one. During standard backo , the most distant conditioning variable (in this case wt 2) is dropped rst, then the second most distant variable etc. until the unigram is reached.</Paragraph>
      <Paragraph position="3"> This can be visualized as a backo path (Figure 1(a)). If the only variables in the model are words, such a backo procedure is reasonable.</Paragraph>
      <Paragraph position="5"> guage model over words (left) and backo graph for 4-gram over factors (right).</Paragraph>
      <Paragraph position="6"> However, if variables occur in parallel, i.e. do not form a temporal sequence, it is not immediately obvious in which order they should be dropped. In this case, several backo paths are possible, which can be summarized in a backo graph (Figure 1(b)). In principle, there are several di erent ways of choosing among di erent paths in this graph:  1. Choose a xed, predetermined backo path based on linguistic knowledge, e.g. always drop syntactic before morphological variables.</Paragraph>
      <Paragraph position="7"> 2. Choose the path at run-time based on statistical criteria.</Paragraph>
      <Paragraph position="8"> 3. Choose multiple paths and combine their probability estimates.</Paragraph>
      <Paragraph position="9">  The last option, referred to as parallel backo , is implemented via a new, generalized backo function (here shown for a 4-gram):</Paragraph>
      <Paragraph position="11"> where c is the count of (f; f1; f2; f3), pML(fjf1; f2; f3) is the maximum likelihood distribution, 4 is the count threshold, and (f1; f2; f3) is the normalization factor.</Paragraph>
      <Paragraph position="12"> The function g(f; f1; f2; f3) determines the backo strategy. In a typical backo procedure g(f; f1; f2; f3) equals pBO(fjf1; f2). In generalized parallel backo , however, g can be any non-negative function of f; f1; f2; f3. In our implementation of FLMs (Kirchho et al., 2003) we consider several di erent g functions, including the mean, weighted mean, product, and maximum of the smoothed probability distributions over all subsets of the conditioning factors. In addition to di erent choices for g, di erent discounting parameters can be chosen at di erent levels in the backo graph. For instance, at the topmost node, Kneser-Ney discounting might be chosen whereas at a lower node Good-Turing might be applied.</Paragraph>
      <Paragraph position="13"> FLMs have been implemented as an add-on to the widely-used SRILM toolkit1 and have been used successfully for the purpose of morpheme-based language modeling (Bilmes and Kirchho , 2003), multi-speaker language modeling (Ji and Bilmes, 2004), and speech recognition (Kirchho et al., 2003).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning FLM Structure
</SectionTitle>
    <Paragraph position="0"> In order to use an FLM, three types of parameters need to be speci ed: the initial conditioning factors, the backo graph, and the smoothing options. The goal of structure learning is to nd the parameter combinations that create FLMs that achieve a low perplexity on unseen test data. The resulting model space is extremely large: given a factored word representation with a total of k factors, there areP</Paragraph>
    <Paragraph position="2"> possible subsets of initial conditioning factors. For a set of m conditioning factors, there are up to m! backo paths, each with its own smoothing options. Unless m is very small, exhaustive search is infeasible. Moreover, non-linear interactions between parameters make it di cult to guide the search into a particular direction, and parameter sets that work well for one corpus cannot necessarily be expected to perform well on another. We therefore need an automatic way of identifying the best model structure. In the following section, we describe the application of genetic-based search to this problem.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Genetic Algorithms
</SectionTitle>
      <Paragraph position="0"> Genetic Algorithms (GAs) (Holland, 1975) are a class of evolution-inspired search/optimization techniques. They perform particularly well in problems with complex, poorly understood search spaces. The fundamental idea of GAs is to encode problem solutions as (usually binary) strings (genes), and to evolve and test successive populations of solutions through the use of genetic operators applied to the encoded strings.</Paragraph>
      <Paragraph position="1"> Solutions are evaluated according to a tness function which represents the desired optimization criterion. The individual steps are as fol1We would like to thank Je Bilmes for providing and supporting the software.</Paragraph>
      <Paragraph position="2"> lows: Initialize: Randomly generate a set (population) of strings.</Paragraph>
      <Paragraph position="3"> While tness improves by a certain threshold: Evaluate tness: calculate each string's tness Apply operators: apply the genetic operators to create a new population.</Paragraph>
      <Paragraph position="4"> The genetic operators include the probabilistic selection of strings for the next generation, crossover (exchanging subparts of di erent strings to create new strings), and mutation (randomly altering individual elements in strings). Although GAs provide no guarantee of nding the optimal solution, they often nd good solutions quickly. By maintaining a population of solutions rather than a single solution, GA search is robust against premature convergence to local optima. Furthermore, solutions are optimized based on a task-speci c tness function, and the probabilistic nature of genetic operators helps direct the search towards promising regions of the search space.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Structure Search Using GA
</SectionTitle>
      <Paragraph position="0"> In order to use GAs for searching over FLM structures (i.e. combinations of conditioning variables, backo paths, and discounting options), we need to nd an appropriate encoding of the problem.</Paragraph>
      <Paragraph position="1"> Conditioning factors The initial set of conditioning factors F are encoded as binary strings. For instance, a trigram for a word representation with three factors (A,B,C) has six conditioning variables: fA 1; B 1; C 1; A 2; B 2; C 2g which can be represented as a 6-bit binary string, with a bit set to 1 indicating presence and 0 indicating absence of a factor in F. The string 10011 would correspond to F = fA 1; B 2; C 2g.</Paragraph>
      <Paragraph position="2"> Backo graph The encoding of the backo graph is more difcult because of the large number of possible paths. A direct approach encoding every edge as a bit would result in overly long strings, rendering the search ine cient. Our solution is to encode a binary string in terms of graph grammar rules (similar to (Kitano, 1990)), which can be used to describe common regularities in backo graphs. For instance, a node with m factors can only back o to children nodes with m 1 factors. For m = 3, the choices for proceeding to the next-lower level in the backo</Paragraph>
      <Paragraph position="4"/>
      <Paragraph position="6"> Here xi corresponds to the factor at the ith position in the parent node. Rule 1 indicates a backo that drops the third factor, Rule 2 drops the second factor, etc. The choice of rules used to generate the backo graph is encoded in a binary string, with 1 indicating the use and 0 indicating the non-use of a rule, as shown schematically in Figure 2. The presence of two di erent rules at the same level in the backo graph corresponds to parallel backo ; the absence of any rule (strings consisting only of 0 bits) implies that the corresponding backo graph level is skipped and two conditioning variables are dropped simultaneously. This allows us to encode a graph using few bits but does not represent all possible graphs. We cannot selectively apply di erent rules to di erent nodes at the same level { this would essentially require a context-sensitive grammar, which would in turn increase the length of the encoded strings.</Paragraph>
      <Paragraph position="7"> This is a fundamental tradeo between the most general representation and an encoding that is tractable. Our experimental results described below con rm, however, that su ciently good results can be obtained in spite of the above limitation.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Smoothing options
</SectionTitle>
      <Paragraph position="0"> Smoothing options are encoded as tuples of integers. The rst integer speci es the discounting method while second indicates the minimum count required for the n-gram to be included in the FLM. The integer string consists of successive concatenated tuples, each representing the smoothing option at a node in the graph. The GA operators are applied to concatenations of all three substrings describing the set of factors, backo graph, and smoothing options, such that all parameters are optimized jointly.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Data
</SectionTitle>
    <Paragraph position="0"> We tested our language modeling algorithms on two di erent data sets from two di erent languages, Arabic and Turkish.</Paragraph>
    <Paragraph position="1"> The Arabic data set was drawn from the</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
CallHome Egyptian Conversational Arabic
</SectionTitle>
      <Paragraph position="0"> (ECA) corpus (LDC, 1996). The training, development, and evaluation sets contain approximately 170K, 32K, and 18K words, respectively. The corpus was collected for the purpose of speech recognizer development for conversational Arabic, which is mostly dialectal and does not have a written standard. No additional text material beyond transcriptions is available in this case; it is therefore important to use language models that perform well in sparse data conditions. The factored representation was constructed using linguistic information from the corpus lexicon, in combination with automatic morphological analysis tools. It includes, in addition to the word, the stem, a morphological tag, the root, and the pattern. The latter two are components which when combined form the stem. An example of this factored word representation is shown below: Word:il+dOr/Morph:noun+masc-sg+article/ Stem:dOr/Root:dwr/Pattern:CCC For our Turkish experiments we used a morphologically annotated corpus of Turkish (Hakkani-T ur et al., 2000). The annotation was performed by applying a morphological analyzer, followed by automatic morphological disambiguation as described in (Hakkani-T ur et al., 2002). The morphological tags consist of the initial root, followed by a sequence of in ectional groups delimited by derivation boundaries (^DB). A sample annotation (for the word yararlanmak, consisting of the root yarar plus three in ectional groups) is shown  We removed segmentation marks (for titles and paragraph boundaries) from the corpus but included punctuation. Words may have di erent numbers of in ectional groups, but the FLM representation requires the same number of factors for each word; we therefore had to map the original morphological tags to a xed-length factored representation. This was done using linguistic knowledge: according to (O azer, 1999), the nal in ectional group in each dependent word has a special status since it determines in ectional markings on head words following the dependent word.</Paragraph>
      <Paragraph position="1"> The nal in ectional group was therefore analyzed into separate factors indicating the number (N), case (C), part-of-speech (P) and all other information (O). Additional factors for the word are the root (R) and all remaining information in the original tag not subsumed by the other factors (G). The word itself is used as another factor (W). Thus, the above example would be factorized as follows:  Other factorizations are certainly possible; however, our primary goal is not to nd the best possible encoding for our data but to demonstrate the e ectiveness of the FLM approach, which is largely independent of the choice of factors. For our experiments we used subsets of 400K words for training, 102K words for development and 90K words for evaluation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>