<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1031">
  <Title>Bayesian Grammar Induction for Language Modeling</Title>
  <Section position="3" start_page="0" end_page="228" type="metho">
    <SectionTitle>
2 Grammar Induction as Search
</SectionTitle>
    <Paragraph position="0"> Grammar induction can be framed as a search problem, and has been framed as such almost without exception in past research (?). The search space is taken to be some class of grammars; for example, in our work we search within the space of probabilistic context-free grammars. The objective function is taken to be some measure dependent on the training data; one generally wants to find a grammar that in some sense accurately models the training data.</Paragraph>
    <Paragraph position="1"> Most work in language modeling, including n-gram models and the Inside-Outside algorithm, falls under the maximum-likelihood paradigm, where one takes the objective function to be the likelihood of the training data given the grammar. However, the optimal grammar under this objective function is one which generates only strings in the training data and no other strings. Such grammars are poor language models, as they overfit the training data and do not model the language at large. In n-gram models and the Inside-Outside algorithm, this issue is evaded by bounding the size and form of the grammars considered, so that the "optimal" grammar cannot be expressed. However, in our work we do not wish to limit the size of the grammars considered. The basic shortcoming of the maximum-likelihood objective function is that it does not encompass the compelling intuition behind Occam's Razor, that simpler (or smaller) grammars are preferable over complex (or larger) grammars. A factor in the objective function that favors smaller grammars over</Paragraph>
    <Paragraph position="3"> large grammars can prevent the objective function from preferring grammars that overfit the training data. Solomonoff (?) presents a Bayesian grammar induction framework that includes such a factor in a motivated manner.</Paragraph>
    <Paragraph position="4"> The goal of grammar induction is taken to be finding the grammar with the largest a posteriori probability given the training data, that is, finding the grammar G* = arg max_G p(G|O),</Paragraph>
    <Paragraph position="6"> where we denote the training data as O, for observations. As it is unclear how to estimate p(G|O) directly, we apply Bayes' Rule and get G* = arg max_G p(O|G) p(G) / p(O) = arg max_G p(O|G) p(G). Hence, we can frame the search for G* as a search with the objective function p(O|G)p(G), the likelihood of the training data multiplied by the prior probability of the grammar.</Paragraph>
    <Paragraph position="7"> We satisfy the goal of favoring smaller grammars by choosing a prior that assigns higher probabilities to such grammars. In particular, Solomonoff proposes the use of the universal a priori probability (?), which is closely related to the minimum description length principle later proposed by (?). In the case of grammatical language modeling, this corresponds to taking</Paragraph>
    <Paragraph position="9"> p(G) = 2^(-l(G)), where l(G) is the length of the description of the grammar in bits. The universal a priori probability has many elegant properties, the most salient of which is that it dominates all other enumerable probability distributions multiplicatively.1</Paragraph>
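As a rough illustration (not the paper's implementation), such an objective can be scored in log space with the description-length prior p(G) = 2^(-l(G)); the numbers below are invented solely to show the trade-off between likelihood and grammar size:

```python
import math

def log_objective(log_p_O_given_G, l_G_bits):
    """log [p(O|G) p(G)] with the description-length prior p(G) ~ 2^(-l(G)),
    i.e. log p(G) = -l(G) * ln 2 when working in nats."""
    return log_p_O_given_G - l_G_bits * math.log(2.0)

# A larger grammar must buy enough extra likelihood to pay for its longer
# description before it is preferred (numbers are purely illustrative).
small = log_objective(log_p_O_given_G=-12000.0, l_G_bits=300)
large = log_objective(log_p_O_given_G=-11900.0, l_G_bits=700)
print(small > large)  # True: 400 extra bits outweigh the 100-nat likelihood gain
```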
  </Section>
  <Section position="4" start_page="228" end_page="230" type="metho">
    <SectionTitle>
3 Search Algorithm
</SectionTitle>
    <Paragraph position="0"> As described above, we take grammar induction to be the search for the grammar G* that optimizes the objective function p(O|G)p(G). While this framework does not restrict us to a particular grammar formalism, in our work we consider only probabilistic context-free grammars.</Paragraph>
    <Paragraph position="1"> probability is given by (?).</Paragraph>
    <Paragraph position="2"> We assume a simple greedy search strategy. We maintain a single hypothesis grammar which is initialized to a small, trivial grammar. We then try to find a modification to the hypothesis grammar, such as the addition of a grammar rule, that results in a grammar with a higher score on the objective function. When we find a superior grammar, we make this the new hypothesis grammar. We repeat this process until we can no longer find a modification that improves the current hypothesis grammar.</Paragraph>
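The paper gives no pseudocode; the following is a minimal sketch of such a greedy loop, where propose_moves, apply_move, and score are hypothetical helpers standing in for the move set, the grammar modification, and the objective p(O|G)p(G):

```python
def greedy_induction(initial_grammar, propose_moves, apply_move, score):
    """Hill-climb on the objective: adopt any modification that improves the
    current hypothesis grammar, and stop when no proposed move improves it."""
    grammar = initial_grammar
    best = score(grammar)
    improved = True
    while improved:
        improved = False
        for move in propose_moves(grammar):
            candidate = apply_move(grammar, move)
            candidate_score = score(candidate)
            if candidate_score > best:
                grammar, best = candidate, candidate_score
                improved = True
                break  # restart the proposals from the new hypothesis grammar
    return grammar
```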
    <Paragraph position="3"> For our initial grammar, we choose a grammar that can generate any string, to assure that the grammar can cover the training data. The initial grammar is listed in Table ??. The sentential symbol S expands to a sequence of X's, where X expands to every other nonterminal symbol in the grammar.</Paragraph>
    <Paragraph position="4"> Initially, the set of nonterminal symbols consists of a different nonterminal symbol expanding to each terminal symbol.</Paragraph>
    <Paragraph position="5"> Notice that this grammar models a sentence as a sequence of independently generated nonterminal symbols. We maintain this property throughout the search process, that is, for every symbol A' that we add to the grammar, we also add a rule X → A'. This assures that the sentential symbol can expand to every symbol; otherwise, adding a symbol will not affect the probabilities that the grammar assigns to strings.</Paragraph>
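A sketch of such an initial grammar over a toy vocabulary, using an ad hoc (left-hand side, right-hand side, probability) rule encoding; since Table ?? is not reproduced here, the split of probability mass on the S rules and the symbol names A_w are assumptions:

```python
def initial_grammar(vocabulary, epsilon=0.01):
    """S generates a sequence of X's; X expands to every other nonterminal;
    each terminal w gets its own preterminal A_w -> w."""
    rules = [("S", ("X", "S"), 1.0 - epsilon),  # continue the X sequence
             ("S", ("X",), epsilon)]            # end the X sequence
    preterminals = ["A_%s" % w for w in vocabulary]
    for nt, w in zip(preterminals, vocabulary):
        rules.append(("X", (nt,), 1.0 / len(preterminals)))  # X -> A_w
        rules.append((nt, (w,), 1.0))                        # A_w -> w
    return rules

for rule in initial_grammar(["bob", "mary", "talks", "slowly"]):
    print(rule)
```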
    <Paragraph position="6"> We use the term move set to describe the set of modifications we consider applying to the current hypothesis grammar in the hope of producing a superior grammar.</Paragraph>
    <Paragraph position="7"> Our move set includes the following moves:
Move 1: Create a rule of the form A → BC
Move 2: Create a rule of the form A → B|C
For any context-free grammar, it is possible to express a weakly equivalent grammar using only rules of these forms. As mentioned before, with each new symbol A we also create a rule X → A.</Paragraph>
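In the same ad hoc encoding, the two move types amount to the following rule constructions (a sketch; the treatment of rule probabilities is deferred to the parameter-setting step described later):

```python
def move_1(A, B, C):
    """Move 1: a binary rule A -> B C, plus the companion rule X -> A."""
    return [(A, (B, C)), ("X", (A,))]

def move_2(A, B, C):
    """Move 2: A -> B | C, i.e. the pair of unary rules A -> B and A -> C,
    again with the companion rule X -> A."""
    return [(A, (B,)), (A, (C,)), ("X", (A,))]

print(move_1("B", "A_talks", "A_slowly"))
# [('B', ('A_talks', 'A_slowly')), ('X', ('B',))]
```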
    <Section position="1" start_page="228" end_page="230" type="sub_section">
      <SectionTitle>
3.1 Evaluating the Objective Function
</SectionTitle>
      <Paragraph position="0"> Consider the task of calculating the objective function p(O|G)p(G) for some grammar G. Calculating</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> p(O|G) requires a parsing of the entire training data.</Paragraph>
      <Paragraph position="5"> We cannot afford to parse the training data for each grammar considered; indeed, to ever be practical for data sets of millions of words, it seems likely that we can only afford to parse the data once.</Paragraph>
      <Paragraph position="6"> To achieve this goal, we employ several approximations. First, notice that we do not ever need to calculate the actual value of the objective function; we need only to be able to distinguish when a move applied to the current hypothesis grammar produces a grammar that has a higher score on the objective function, that is, we need only to be able to calculate the difference in the objective function resulting from a move. This can be done efficiently if we can quickly approximate how the probability of the training data changes when a move is applied.</Paragraph>
      <Paragraph position="7"> To make this possible, we approximate the probability of the training data p(O|G) by the probability of the single most probable parse, or Viterbi parse, of the training data. Furthermore, instead of recalculating the Viterbi parse of the training data from scratch when a move is applied, we use heuristics to predict how a move will change the Viterbi parse.</Paragraph>
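A sketch of the resulting accept/reject test for a single move, assuming a hypothetical heuristic has already estimated the change in the Viterbi-parse log probability and the change in the grammar's description length:

```python
import math

def move_is_improvement(delta_viterbi_logprob, delta_l_G_bits):
    """Accept a move iff the estimated change in log p(O|G) (approximated by
    the Viterbi parse) exceeds the cost in log p(G) of the longer description."""
    delta_log_prior = -delta_l_G_bits * math.log(2.0)
    return delta_viterbi_logprob + delta_log_prior > 0.0

# e.g. a rule that gains 35 nats of parse probability but costs 40 extra bits:
print(move_is_improvement(35.0, 40.0))  # True, since 40 bits is about 27.7 nats
```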
      <Paragraph position="8"> For example, consider the case where the training data consists of the two sentences O = {Bob talks slowly, Mary talks slowly} 2Due to space limitations, we do not specify our method for encoding grammars, i.e., how we calculate l(G) for a given G. However, this will be described in the author's forthcoming Ph.D. dissertation.</Paragraph>
      <Paragraph position="9"> In Figure ??, we display the Viterbi parse of this data under the initial hypothesis grammar used in our algorithm.</Paragraph>
      <Paragraph position="10"> Now, let us consider the move of adding the rule B → A_talks A_slowly</Paragraph>
      <Paragraph position="12"> to the initial grammar (as well as the concomitant rule X → B). A reasonable heuristic for predicting how the Viterbi parse will change is to replace adjacent X's that expand to A_talks and A_slowly respectively with a single X that expands to B, as displayed in Figure ??. This is the actual heuristic we use for moves of the form A → BC, and we have analogous heuristics for each move in our move set.</Paragraph>
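A sketch of that heuristic, representing each sentence's Viterbi parse simply as the list of symbols its X's expand to (the flat representation and the symbol names are illustrative assumptions):

```python
def predict_parse_after_binary_rule(x_children, new_symbol, left, right):
    """Heuristic for a new rule new_symbol -> left right: replace each adjacent
    pair (left, right) among the X expansions with the single new symbol."""
    out, i = [], 0
    while i < len(x_children):
        if (i + 1 < len(x_children)
                and x_children[i] == left and x_children[i + 1] == right):
            out.append(new_symbol)
            i += 2
        else:
            out.append(x_children[i])
            i += 1
    return out

print(predict_parse_after_binary_rule(
    ["A_bob", "A_talks", "A_slowly"], "B", "A_talks", "A_slowly"))
# ['A_bob', 'B']
```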
      <Paragraph position="13"> By predicting the differences in the Viterbi parse resulting from a move, we can quickly estimate the change in the probability of the training data.</Paragraph>
      <Paragraph position="14"> Notice that our predicted Viterbi parse can stray a great deal from the actual Viterbi parse, as errors can accumulate as move after move is applied. To minimize these effects, we process the training data incrementally. Using our initial hypothesis grammar, we parse the first sentence of the training data and search for the optimal grammar over just that one sentence using the described search framework.</Paragraph>
      <Paragraph position="15"> We use the resulting grammar to parse the second sentence, and then search for the optimal grammar over the first two sentences using the last grammar as the starting point. We repeat this process, parsing the next sentence using the best grammar found on the previous sentences and then searching for the  best grammar taking into account this new sentence, until the entire training corpus is covered.</Paragraph>
      <Paragraph position="16"> Delaying the parsing of a sentence until all of the previous sentences are processed should yield more accurate Viterbi parses during the search process than if we simply parse the whole corpus with the initial hypothesis grammar. In addition, we still achieve the goal of parsing each sentence but once.</Paragraph>
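Schematically, the incremental regime looks roughly like the following, with viterbi_parse and search_over as hypothetical stand-ins for the parser and the greedy search described above:

```python
def incremental_induction(sentences, grammar, viterbi_parse, search_over):
    """Parse each sentence exactly once with the best grammar found so far,
    then resume the greedy search over all sentences seen up to that point."""
    parses = []
    for sentence in sentences:
        parses.append(viterbi_parse(grammar, sentence))
        grammar, parses = search_over(grammar, parses)  # may revise predicted parses
    return grammar
```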
    </Section>
    <Section position="2" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
3.2 Parameter Training
</SectionTitle>
      <Paragraph position="0"> In this section, we describe how the parameters of our grammar, the probabilities associated with each grammar rule, are set. Ideally, in evaluating the objective function for a particular grammar we should use its optimal parameter settings given the training data, as this is the full score that the given grammar can achieve. However, searching for optimal parameter values is extremely expensive computationally. Instead, we grossly approximate the optimal values by deterministically setting parameters based on the Viterbi parse of the training data parsed so far. We rely on the post-pass, described later, to refine parameter values.</Paragraph>
      <Paragraph position="1"> Referring to the rules in Table ??, the parameter ε is set to an arbitrary small constant. The values of the parameters p(A) are set to the (smoothed) frequency of the X → A reduction in the Viterbi parse of the data seen so far. The remaining symbols are set to expand uniformly among their possible expansions.</Paragraph>
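A sketch of this deterministic setting of the p(A) parameters, with the X → A reductions extracted from the Viterbi parses seen so far and simple add-one smoothing standing in for the unspecified smoothing:

```python
from collections import Counter

def set_x_parameters(x_reductions, nonterminals, smoothing=1.0):
    """p(A), the weight of X -> A, is the smoothed relative frequency of the
    X -> A reduction in the Viterbi parses of the data seen so far."""
    counts = Counter(x_reductions)
    total = len(x_reductions) + smoothing * len(nonterminals)
    return {A: (counts[A] + smoothing) / total for A in nonterminals}

print(set_x_parameters(["A_talks", "A_slowly", "A_talks", "A_bob"],
                       ["A_bob", "A_mary", "A_talks", "A_slowly"]))
# {'A_bob': 0.25, 'A_mary': 0.125, 'A_talks': 0.375, 'A_slowly': 0.25}
```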
    </Section>
    <Section position="3" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
3.3 Constraining Moves
</SectionTitle>
      <Paragraph position="0"> Consider the move of creating a rule of the form A → BC. This corresponds to k^3 different specific rules that might be created, where k is the current number of symbols in the grammar. As it is too computationally expensive to consider each of these rules at every point in the search, we use heuristics to constrain which moves are appraised.</Paragraph>
      <Paragraph position="1"> For the left-hand side of a rule, we always create a new symbol. This heuristic selects the optimal choice the vast majority of the time; however, under this constraint the moves described earlier in this section cannot yield arbitrary context-free languages. To partially address this, we add the move
Move 3: Create a rule of the form A → AB|B
With this iteration move, we can construct grammars that generate arbitrary regular languages. As yet, we have not implemented moves that enable the construction of arbitrary context-free grammars; this belongs to future work.</Paragraph>
      <Paragraph position="2"> To constrain the symbols we consider on the right-hand side of a new rule, we use what we call triggers.3 A trigger is a phenomenon in the Viterbi parse of a sentence that is indicative that a particular move might lead to a better grammar. For example, 3This is not to be confused with the use of the term triggers in dynamic language modeling.</Paragraph>
      <Paragraph position="3"> in Figure ??, the fact that the symbols A_talks and A_slowly occur adjacently is indicative that it could be profitable to create a rule B → A_talks A_slowly. We have developed a set of triggers for each move in our move set, and only consider a specific move if it is triggered in the sentence currently being parsed in the incremental processing.</Paragraph>
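A sketch of this trigger: the adjacent symbol pairs in the predicted Viterbi parse of the current sentence each trigger one candidate Move 1 rule (the exact bookkeeping is an assumption):

```python
def binary_rule_triggers(x_children):
    """Adjacent symbol pairs in the Viterbi parse; each pair (B, C) triggers
    consideration of a new rule A -> B C for a fresh symbol A."""
    return {(x_children[i], x_children[i + 1])
            for i in range(len(x_children) - 1)}

print(sorted(binary_rule_triggers(["A_bob", "A_talks", "A_slowly"])))
# [('A_bob', 'A_talks'), ('A_talks', 'A_slowly')]
```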
    </Section>
    <Section position="4" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
3.4 Post-Pass
</SectionTitle>
      <Paragraph position="0"> A conspicuous shortcoming in our search framework is that the grammars in our search space are fairly unexpressive. Firstly, recall that our grammars model a sentence as a sequence of independently generated symbols; however, in language there is a large dependence between adjacent constituents. Furthermore, the only free parameters in our search are the parameters p(A); all other symbols (except S) are fixed to expand uniformly. These choices were necessary to make the search tractable.</Paragraph>
      <Paragraph position="1"> To address this issue, we use an Inside-Outside algorithm post-pass. Our methodology is derived from that described by (?). We create n new nonterminal symbols {X1, ..., Xn}, and create all rules of the form:</Paragraph>
      <Paragraph position="3"> A ∈ N_old - {S, X}, where N_old denotes the set of nonterminal symbols acquired in the initial grammar induction phase, and X1 is taken to be the new sentential symbol. These new rules replace the first three rules listed in Table ??. The parameters of these rules are initialized randomly. Using this grammar as the starting point, we run the Inside-Outside algorithm on the training data until convergence.</Paragraph>
      <Paragraph position="4"> In other words, instead of using the naive S → SX|X rule to attach symbols together in parsing data, we now use the Xi rules and depend on the Inside-Outside algorithm to train these randomly initialized rules intelligently. This post-pass allows us to express dependencies between adjacent symbols. In addition, it allows us to train parameters that were fixed during the initial grammar induction phase.</Paragraph>
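The exact rule schema is not recoverable from this extraction; as a hedged sketch, assume binary rules Xi → Xj Xk among the new symbols and unary rules Xi → A attaching the acquired nonterminals, all randomly initialized before the Inside-Outside training:

```python
import random

def post_pass_grammar(n, acquired_nonterminals, seed=0):
    """Hypothetical post-pass rule set: X1 is the new sentential symbol,
    Xi -> Xj Xk attaches constituents, and Xi -> A reaches the symbols A
    acquired in the first phase (excluding S and X)."""
    rng = random.Random(seed)
    xs = ["X%d" % i for i in range(1, n + 1)]
    attachable = [A for A in acquired_nonterminals if A not in ("S", "X")]
    rules = [(xi, (xj, xk), rng.random()) for xi in xs for xj in xs for xk in xs]
    rules += [(xi, (A,), rng.random()) for xi in xs for A in attachable]
    return rules  # weights are then trained by the Inside-Outside algorithm

print(len(post_pass_grammar(2, ["S", "X", "A_bob", "A_talks", "A_slowly", "B"])))
# 2*2*2 + 2*4 = 16 rules
```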
    </Section>
  </Section>
  <Section position="5" start_page="230" end_page="231" type="metho">
    <SectionTitle>
4 Previous Work
</SectionTitle>
    <Paragraph position="0"> As mentioned, this work employs the Bayesian grammar induction framework described by Solomonoff (?; ?). However, Solomonoff does not specify a concrete search algorithm and only makes suggestions as to its nature.</Paragraph>
    <Paragraph position="1"> Similar research includes work by Cook et al.</Paragraph>
    <Paragraph position="2"> (1976) and Stolcke and Omohundro (1994). This work also employs a heuristic search within a Bayesian framework. However, a different prior probability on grammars is used, and the algorithms are only efficient enough to be applied to small data sets. The grammar induction algorithms most successful in language modeling include the Inside-Outside algorithm (?; ?; ?), a special case of the Expectation-Maximization algorithm (?), and work by McCandless (?). In the latter work, McCandless uses a heuristic search procedure similar to ours, but a very different search criterion. To our knowledge, neither algorithm has surpassed the performance of n-gram models in a language modeling task of substantial scale.</Paragraph>
  </Section>
  <Section position="6" start_page="231" end_page="232" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> To evaluate our algorithm, we compare the performance of our algorithm to that of n-gram models and the Inside-Outside algorithm.</Paragraph>
    <Paragraph position="1"> For n-gram models, we tried n = 1, ..., 10 for each domain. For smoothing a particular n-gram model, we took a linear combination of all lower order n-gram models. In particular, we follow standard practice (?; ?; ?) and take the smoothed i-gram probability to be a linear combination of the i-gram frequency in the training data and the smoothed (i-1)-gram probability, that is,</Paragraph>
    <Paragraph position="3"> p_s(w_j | w_{j-i+1} ... w_{j-1}) = λ_{i,c} c(w_{j-i+1} ... w_j) / c(w_{j-i+1} ... w_{j-1}) + (1 - λ_{i,c}) p_s(w_j | w_{j-i+2} ... w_{j-1}), where c(W) denotes the count of the word sequence W in the training data. The smoothing parameters λ_{i,c} are trained through the Forward-Backward algorithm (?) on held-out data. Parameters λ_{i,c} are tied together for similar c to prevent data sparsity.</Paragraph>
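A sketch of this interpolation for a toy bigram/unigram case, with a single interpolation weight standing in for the count-bucketed λ_{i,c} parameters and a uniform distribution terminating the recursion:

```python
from collections import Counter

def smoothed_bigram(prev, w, unigrams, bigrams, lam, vocab_size):
    """p_s(w | prev) = lam * c(prev, w)/c(prev) + (1 - lam) * p_s(w);
    the unigram level interpolates the raw frequency with a uniform model."""
    total = sum(unigrams.values())
    p_unigram = lam * unigrams[w] / total + (1.0 - lam) / vocab_size
    if unigrams[prev] == 0:
        return p_unigram
    return lam * bigrams[(prev, w)] / unigrams[prev] + (1.0 - lam) * p_unigram

words = "bob talks slowly mary talks slowly".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
print(smoothed_bigram("talks", "slowly", unigrams, bigrams, lam=0.8, vocab_size=4))
```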
    <Paragraph position="4"> For the Inside-Outside algorithm, we follow the methodology described by Lari and Young. For a given n, we create a probabilistic context-free grammar consisting of all Chomsky normal form rules over the n nonterminal symbols {X1, ..., Xn} and the given terminal symbols, that is, all rules
Xi → Xj Xk    i, j, k ∈ {1, ..., n}
Xi → a        i ∈ {1, ..., n}, a ∈ T
where T denotes the set of terminal symbols in the domain. All parameters are initialized randomly.</Paragraph>
    <Paragraph position="5"> From this starting point, the Inside-Outside algorithm is run until convergence.</Paragraph>
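A sketch of constructing that starting grammar; normalizing the random weights into proper expansion distributions per left-hand side is omitted for brevity:

```python
import itertools
import random

def lari_young_grammar(n, terminals, seed=0):
    """All Chomsky-normal-form rules Xi -> Xj Xk plus all lexical rules
    Xi -> a, with randomly initialized parameters."""
    rng = random.Random(seed)
    xs = ["X%d" % i for i in range(1, n + 1)]
    binary = [(xi, (xj, xk), rng.random())
              for xi, xj, xk in itertools.product(xs, repeat=3)]
    lexical = [(xi, (a,), rng.random()) for xi in xs for a in terminals]
    return binary + lexical

print(len(lari_young_grammar(3, ["bob", "mary", "talks", "slowly"])))  # 27 + 12 = 39
```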
    <Paragraph position="6"> For smoothing, we combine the expansion distribution of each symbol with a uniform distribution, that is, we take the smoothed parameter p_s(A → α) to be</Paragraph>
    <Paragraph position="8"> where p~(A → α) denotes the unsmoothed parameter. The value n^3 + n|T| is the number of different ways a symbol expands under the Lari and Young methodology. The parameter λ is trained through the Inside-Outside algorithm on held-out data. This smoothing is also performed on the Inside-Outside post-pass of our algorithm. For each domain, we tried n = 3, ..., 10.</Paragraph>
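A sketch of the smoothing of one rule parameter; whether λ weights the uniform component or the unsmoothed parameter p~ is not recoverable from this extraction, so the mixing direction below is an assumption:

```python
def smooth_rule_parameter(p_unsmoothed, lam, n, num_terminals):
    """Mix the unsmoothed expansion probability with a uniform distribution
    over the n^3 + n|T| expansions of the Lari and Young grammar."""
    uniform = 1.0 / (n ** 3 + n * num_terminals)
    return (1.0 - lam) * p_unsmoothed + lam * uniform

print(smooth_rule_parameter(0.25, lam=0.1, n=3, num_terminals=4))
# 0.9 * 0.25 + 0.1 / 39 = 0.2276 (approximately)
```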
    <Paragraph position="9"> Because of the computational demands of our algorithm, it is currently impractical to apply it to large vocabulary or large training set problems. However, we present the results of our algorithm in three medium-sized domains. In each case, we use 4500 sentences for training, with 500 of these sentences held out for smoothing. We test on 500 sentences, and measure performance by the entropy of the test data.</Paragraph>
    <Paragraph position="10"> In the first two domains, we created the training and test data artificially so as to have an ideal grammar in hand to benchmark results. In particular, we used a probabilistic grammar to generate the data.</Paragraph>
    <Paragraph position="11"> In the first domain, we created this grammar by hand; the grammar was a small English-like probabilistic context-free grammar consisting of roughly 10 nonterminal symbols, 20 terminal symbols, and 30 rules. In the second domain, we derived the grammar from manually parsed text. From a million words of parsed Wall Street Journal data from the Penn treebank, we extracted the 20 most frequently occurring symbols, and the 10 most frequently occurring rules expanding each of these symbols. For each symbol that occurs on the right-hand side of a rule but which was not one of the most frequent 20 symbols, we create a rule that expands that symbol to a unique terminal symbol. After removing unreachable rules, this yields a grammar of roughly 30 nonterminals, 120 terminals, and 160 rules. Parameters are set to reflect the frequency of the corresponding rule in the parsed corpus.</Paragraph>
    <Paragraph position="12"> For the third domain, we took English text and reduced the size of the vocabulary by mapping each word to its part-of-speech tag. We used tagged Wall Street Journal text from the Penn treebank, which has a tag set size of about fifty.</Paragraph>
    <Paragraph position="13"> In Tables ??-??, we summarize our results. The ideal grammar denotes the grammar used to generate the training and test data. For each algorithm, we list the best performance achieved over all n tried, and the best n column states which value realized this performance.</Paragraph>
    <Paragraph position="14"> We achieve a moderate but significant improvement in performance over n-gram models and the Inside-Outside algorithm in the first two domains, while in the part-of-speech domain we are outperformed by n-gram models but we vastly outperform the Inside-Outside algorithm.</Paragraph>
    <Paragraph position="15"> In Table ??, we display a sample of the number of parameters and execution time (on a Decstation 5000/33) associated with each algorithm. We choose n to yield approximately equivalent performance for each algorithm. The first pass row refers to the main grammar induction phase of our algorithm, and the post-pass row refers to the Inside-Outside post-pass.</Paragraph>
    <Paragraph position="16">  Notice that our algorithm produces a significantly more compact model than the n-gram model, while running significantly faster than the Inside-Outside algorithm even though we use an Inside-Outside post-pass. Part of this discrepancy is due to the fact that we require a smaller number of new nonterminal symbols to achieve equivalent performance, but we have also found that our post-pass converges more quickly even given the same number of nonterminal symbols.</Paragraph>
  </Section>
  <Section position="7" start_page="232" end_page="232" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Our algorithm consistently outperformed the Inside-Outside algorithm in these experiments. While we partially attribute this difference to using a Bayesian instead of maximum-likelihood objective function, we believe that part of this difference results from a more effective search strategy. In particular, though both algorithms employ a greedy hill-climbing strategy, our algorithm gains an advantage by being able to add new rules to the grammar.</Paragraph>
    <Paragraph position="1"> In the Inside-Outside algorithm, the gradient descent search discovers the "nearest" local minimum in the search landscape to the initial grammar. If there are k rules in the grammar and thus k parameters, then the search takes place in a fixed k-dimensional space R^k. In our algorithm, it is possible to expand the hypothesis grammar, thus increasing the dimensionality of the parameter space that is being searched. An apparent local minimum in the space R^k may no longer be a local minimum in the space R^(k+1); the extra dimension may provide a pathway for further improvement of the hypothesis grammar.</Paragraph>
    <Paragraph position="2"> Hence, our algorithm should be less prone to sub-optimal local minima than the Inside-Outside algorithm. Outperforming n-gram models in the first two domains demonstrates that our algorithm is able to take advantage of the grammatical structure present in data. However, the superiority of n-gram models in the part-of-speech domain indicates that to be competitive in modeling naturally-occurring data, it is necessary to model collocational information accurately. We need to modify our algorithm to more aggressively model n-gram information.</Paragraph>
  </Section>
</Paper>