<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-2002">
<Title>A General Technique to Train Language Models on Language Models</Title>
<Section position="2" start_page="0" end_page="174" type="abstr">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> In this article, the term language model is used to refer to any description that assigns probabilities to strings over a certain alphabet. Language models have important applications in natural language processing and, in particular, in speech recognition systems (Manning and Schütze 1999).</Paragraph>
<Paragraph position="1"> Language models often consist of a symbolic description of a language, such as a finite automaton (FA) or a context-free grammar (CFG), extended by a probability assignment to, for example, the transitions of the FA or the rules of the CFG, by which we obtain a probabilistic finite automaton (PFA) or a probabilistic context-free grammar (PCFG), respectively. For certain applications, one may first determine the symbolic part of the automaton or grammar and, in a second phase, try to find reliable probability estimates for the transitions or rules. The current article is concerned with the second problem, that of extending FAs or CFGs to become PFAs or PCFGs. We refer to this process as training.</Paragraph>
<Paragraph position="2"> Training is often done on the basis of a corpus of actual language use in a certain domain. If each sentence in this corpus is annotated with a list of transitions of an FA recognizing the sentence or with a parse tree for a CFG generating the sentence, then training may consist simply of relative frequency estimation. This means that we estimate the probabilities of transitions or rules by counting their frequencies in the corpus, relative to the frequencies of the start states of transitions or to the frequencies of the left-hand-side nonterminals of rules, respectively. By this estimation, the likelihood of the corpus is maximized.</Paragraph>
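As a concrete illustration of relative frequency estimation for the CFG case, the following Python sketch counts rule occurrences in a toy treebank and divides by the counts of the left-hand-side nonterminals. The tree encoding (nested tuples), the toy data, and all function names are assumptions made for this sketch, not anything taken from the article.

from collections import defaultdict

def count_rules(tree, counts):
    """Recursively count CFG rule occurrences in a parse tree.

    Assumed encoding for this sketch: a tree is either a terminal string or a
    tuple (nonterminal, child_1, ..., child_k), e.g.
    ("S", ("NP", "she"), ("VP", "sleeps")).
    """
    if isinstance(tree, str):          # terminal leaf: no rule to count
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(lhs, rhs)] += 1            # one occurrence of the rule lhs -> rhs
    for child in children:
        count_rules(child, counts)

def relative_frequency_estimate(treebank):
    """Relative frequency estimation: p(lhs -> rhs) = c(lhs -> rhs) / c(lhs)."""
    counts = defaultdict(int)
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

if __name__ == "__main__":
    # Hypothetical two-tree corpus, used only to exercise the functions above.
    treebank = [
        ("S", ("NP", "she"), ("VP", "sleeps")),
        ("S", ("NP", "he"), ("VP", ("V", "sees"), ("NP", "her"))),
    ]
    for rule, p in sorted(relative_frequency_estimate(treebank).items()):
        print(rule, p)

Estimating transition probabilities of an FA from transition-annotated sentences works the same way, with counts normalized per start state rather than per left-hand-side nonterminal.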
<Paragraph position="3"> The technique we introduce in this article is different in that training is done on the basis not of a finite corpus, but of an input language model. Our goal is to find estimates for the probabilities of transitions or rules of the input FA or CFG such that the resulting PFA or PCFG approximates the input language model as well as possible or, more specifically, such that the Kullback-Leibler (KL) distance (or relative entropy) between the input model and the trained model is minimized. The input FA or CFG to be trained may be structurally unrelated to the input language model.</Paragraph>
<Paragraph position="5"> This technique has several applications. One is a probabilistic extension of existing work on the approximation of CFGs by means of FAs (Nederhof 2000). The motivation for that work was that application of FAs is generally less costly than application of CFGs, which is an important benefit when the input is very large, as is often the case in, for example, speech recognition systems. The practical relevance of that work was limited, however, by the fact that in practice one is more interested in the probabilities of sentences than in a purely Boolean distinction between grammatical and ungrammatical sentences.</Paragraph>
<Paragraph position="6"> Several approaches were discussed by Mohri and Nederhof (2001) for extending this work to the approximation of PCFGs by means of PFAs. A first approach is to map rules with attached probabilities directly to transitions with attached probabilities. Although this is computationally the easiest approach, the resulting PFA may be a very inaccurate approximation of the probability distribution described by the input PCFG. In particular, there may be assignments of probabilities to the transitions of the same FA that lead to more accurate approximating language models.</Paragraph>
<Paragraph position="7"> A second approach is to train the approximating FA by means of a corpus. If the input PCFG was itself obtained by training on a corpus, then we already possess training material. However, this may not always be the case, and no training material may be available. Furthermore, as a determinized approximating FA may be much larger than the input PCFG, the sparse-data problem may be more severe for the automaton than it was for the grammar.</Paragraph>
<Paragraph position="8"> Hence, even if sufficient material were available to train the CFG, it may not be sufficient to accurately train the FA.</Paragraph>
<Paragraph position="9"> A third approach is to construct a training corpus from the PCFG by means of a (pseudo)random generator of sentences, such that sentences that are more likely according to the PCFG are generated with greater likelihood. This was proposed by Jurafsky et al. (1994) for the special case of bigrams, extending a nonprobabilistic technique by Zue et al. (1991). It is not clear, however, whether this idea is feasible for training finite-state models that are larger than bigrams. The reason is that very large corpora would have to be generated in order to obtain accurate probability estimates for the PFA. Note that the number of parameters of a bigram model is bounded by the square of the size of the lexicon; no such bound exists for general PFAs.</Paragraph>
<Paragraph position="10"> The current article discusses a fourth approach. In the limit, it is equivalent to the third approach above, as if an infinite corpus were constructed on which the PFA is trained, but we have found a way to avoid considering sentences individually. The key idea that allows us to handle an infinite set of strings generated by the PCFG is that we construct a new grammar that represents the intersection of the languages described by the input PCFG and the FA. Within this new grammar, we can compute the expected frequencies of transitions of the FA, using a fairly standard analysis of PCFGs. These expected frequencies then allow us to determine the assignment of probabilities to transitions of the FA that minimizes the KL distance between the PCFG and the resulting PFA. The only requirement is that the FA to be trained be unambiguous, by which we mean that each input string can be recognized by at most one computation of the FA. The special case of n-grams has already been formulated by Stolcke and Segal (1994), realizing an idea previously envisioned by Rimon and Herz (1991). An n-gram model is here seen as a (P)FA that contains exactly one state for each possible history of the n − 1 previously read symbols. It is clear that such an FA is unambiguous (even deterministic) and that our technique therefore properly subsumes the technique by Stolcke and Segal (1994), although the way the two techniques are formulated is rather different. Also note that the FA underlying an n-gram model accepts any input string over the alphabet, which does not hold for general (unambiguous) FAs.</Paragraph>
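To make this concrete, the objective and the resulting estimate sketched in the preceding paragraph can be written roughly as follows. The notation is introduced here for illustration only and is not the article's: p_G is the distribution defined by the input PCFG, p_M the distribution defined by the trained PFA, c_t(w) the number of times transition t is used in the unique accepting computation of the unambiguous FA on string w, and from(t) the state in which t originates; final-state (termination) probabilities are ignored for simplicity.

\[
  D(p_G \,\|\, p_M) \;=\; \sum_{w} p_G(w) \log \frac{p_G(w)}{p_M(w)}
  \quad\text{is minimized by}\quad
  \hat{p}(t) \;=\; \frac{\mathbb{E}_{p_G}[c_t]}{\sum_{t'\,:\,\mathrm{from}(t') = \mathrm{from}(t)} \mathbb{E}_{p_G}[c_{t'}]},
\]
\[
  \text{where}\quad \mathbb{E}_{p_G}[c_t] \;=\; \sum_{w} p_G(w)\, c_t(w).
\]

In other words, this is relative frequency estimation with expected rather than observed counts; the point of the article is that the expectations can be computed exactly from the intersection grammar instead of being approximated from sampled sentences.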
<Paragraph position="11"> Another application of our work involves determinization and minimization of PFAs. As shown by Mohri (1997), PFAs cannot always be determinized, and no practical algorithms are known for minimizing arbitrary nondeterministic (P)FAs. This can be a problem when deterministic or small PFAs are required. We can, however, always compute a minimal deterministic FA equivalent to an input FA. The new results in this article offer a way to extend this determinized FA to a PFA such that it approximates the probability distribution described by the input PFA as well as possible, in terms of the KL distance.</Paragraph>
<Paragraph position="12"> Although the proposed technique has some limitations, in particular the requirement that the model to be trained be unambiguous, it is by no means restricted to language models based on finite automata or context-free grammars, as several other probabilistic grammatical formalisms can be treated in a similar manner.</Paragraph>
<Paragraph position="13"> The structure of this article is as follows. We provide some preliminary definitions in Section 2. Section 3 discusses how the expected frequency of a rule in a PCFG can be computed; this is an auxiliary step in the algorithms to be discussed below. Section 4 defines a way to combine a PFA and a PCFG into a new PCFG that extends a well-known representation of the intersection of a regular and a context-free language. Thereby we merge the input model and the model to be trained into a single structure. This structure is the foundation for a number of algorithms, presented in Section 5, which allow, respectively, training of an unambiguous FA on the basis of a PCFG (Section 5.1), training of an unambiguous CFG on the basis of a PFA (Section 5.2), and training of an unambiguous FA on the basis of a PFA (Section 5.3).</Paragraph>
</Section>
</Paper>