File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/p92-1017_metho.xml
Size: 14,936 bytes
Last Modified: 2025-10-06 14:13:13
<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1017"> <Title>eling for speech recognition. In Sadaoki Furui and M. Mohan Sondhi, editors, Advances in Speech</Title> <Section position="3" start_page="0" end_page="128" type="metho"> <SectionTitle> 1. MOTIVATION </SectionTitle> <Paragraph position="0"> The most successful stochastic language models have been based on finite-state descriptions such as n-grams or hidden Markov models (HMMs) (Jelinek et al., 1992). However, finite-state models cannot represent the hierarchical structure of natural language and are thus ill-suited to tasks in which that structure is essential, such as language understanding or translation. It is then natural to consider stochastic versions of more powerful grammar formalisms and their grammatical inference problems. For instance, Baker (1979) generalized the parameter estimation methods for HMMs to stochastic context-free grammars (SCFGs) (Booth, 1969) as the inside-outside algorithm. Unfortunately, the application of SCFGs and the original inside-outside algorithm to natural-language modeling has so far been inconclusive (Lari and Young, 1990; Jelinek et al., 1990; Lari and Young, 1991).</Paragraph> <Paragraph position="1"> Several reasons can be adduced for the difficulties. First, each iteration of the inside-outside algorithm on a grammar with $n$ nonterminals may require $O(n^3 |w|^3)$ time per training sentence $w$, while each iteration of its finite-state counterpart training an HMM with $s$ states requires at worst $O(s^2 |w|)$ time per training sentence. That complexity makes the training of sufficiently large grammars computationally impractical.</Paragraph> <Paragraph position="2"> Second, the convergence properties of the algorithm sharply deteriorate as the number of nonterminal symbols increases. This fact can be intuitively understood by observing that the algorithm searches for the maximum of a function whose number of local maxima grows with the number of nonterminals.
Finally, while SCFGs do provide a hierarchical model of the language, that structure is undetermined by raw text and only by chance will the inferred grammar agree with qualitative linguistic judgments of sentence structure. For example, since in English texts pronouns are very likely to immediately precede a verb, a grammar inferred from raw text will tend to make a constituent of a subject pronoun and the following verb.</Paragraph> <Paragraph position="3"> We describe here an extension of the inside-outside algorithm that infers the parameters of a stochastic context-free grammar from a partially parsed corpus, thus providing a tighter connection between the hierarchical structure of the inferred SCFG and that of the training corpus. The algorithm takes advantage of whatever constituent information is provided by the training corpus bracketing, ranging from a complete constituent analysis of the training sentences to the unparsed corpus used for the original inside-outside algorithm. In the latter case, the new algorithm reduces to the original one.</Paragraph> <Paragraph position="4"> Using a partially parsed corpus has several advantages. First, the resulting grammars yield constituent boundaries that cannot be inferred from raw text. In addition, the number of iterations needed to reach a good grammar can be reduced; in extreme cases, a good solution is found from parsed text but not from raw text. Finally, the new algorithm has better time complexity when sufficient bracketing information is provided.</Paragraph> </Section> <Section position="4" start_page="128" end_page="128" type="metho"> <SectionTitle> 2. PARTIALLY BRACKETED TEXT </SectionTitle> <Paragraph position="0"> Informally, a partially bracketed corpus is a set of sentences annotated with parentheses marking constituent boundaries that any analysis of the corpus should respect.
More precisely, we start from a corpus $C$ consisting of bracketed strings, which are pairs $c = (w, B)$ where $w$ is a string and $B$ is a bracketing of $w$. For convenience, we will define the length of the bracketed string $c$ by $|c| = |w|$.</Paragraph> <Paragraph position="1"> Given a string $w = w_1 \cdots w_{|w|}$, a span of $w$ is a pair of integers $(i, j)$ with $0 \le i &lt; j \le |w|$, which delimits a substring ${}_i w_j = w_{i+1} \cdots w_j$ of $w$. The abbreviation ${}_i w$ will stand for ${}_i w_{|w|}$.</Paragraph> <Paragraph position="2"> A bracketing $B$ of a string $w$ is a finite set of spans on $w$ (that is, a finite set of pairs of integers $(i, j)$ with $0 \le i &lt; j \le |w|$) satisfying a consistency condition that ensures that each span $(i, j)$ can be seen as delimiting a string ${}_i w_j$ consisting of a sequence of one or more. The consistency condition is simply that no two spans in a bracketing may overlap, where two spans $(i, j)$ and $(k, l)$ overlap if either $i &lt; k &lt; j &lt; l$ or $k &lt; i &lt; l &lt; j$.</Paragraph> <Paragraph position="3"> Two bracketings of the same string are said to be compatible if their union is consistent. A span $s$ is valid for a bracketing $B$ if $\{s\}$ is compatible with $B$.</Paragraph> <Paragraph position="4"> Note that there is no requirement that a bracketing of $w$ describe fully a constituent structure of $w$. In fact, some or all sentences in a corpus may have empty bracketings, in which case the new algorithm behaves like the original one.</Paragraph> <Paragraph position="5"> To present the notion of compatibility between a derivation and a bracketed string, we need first to define the span of a symbol occurrence in a context-free derivation. Let $(w, B)$ be a bracketed string, and $\alpha_0 \Rightarrow \alpha_1 \Rightarrow \cdots \Rightarrow \alpha_m = w$ be a derivation of $w$ for (S)CFG $G$. The span of a symbol occurrence in $\alpha_j$ is defined inductively as follows: * If $j = m$, $\alpha_j = w \in \Sigma^*$, and the span of $w_i$ in $\alpha_j$ is $(i-1, i)$.</Paragraph> <Paragraph position="6"> * If $j &lt; m$, then $\alpha_j = \beta A \gamma$, $\alpha_{j+1} = \beta X_1 \cdots X_k \gamma$, where $A \to X_1 \cdots X_k$ is a rule of $G$.
Then the span of $A$ in $\alpha_j$ is $(i_1, j_k)$, where for each $1 \le l \le k$, $(i_l, j_l)$ is the span of $X_l$ in $\alpha_{j+1}$. The spans in $\alpha_j$ of the symbol occurrences in $\beta$ and $\gamma$ are the same as those of the corresponding symbols in $\alpha_{j+1}$.</Paragraph> <Paragraph position="7"> A derivation of $w$ is then compatible with a bracketing $B$ of $w$ if the span of every symbol occurrence in the derivation is valid for $B$.</Paragraph> </Section> <Section position="5" start_page="128" end_page="131" type="metho"> <SectionTitle> 3. GRAMMAR REESTIMATION </SectionTitle> <Paragraph position="0"> The inside-outside algorithm (Baker, 1979) is a reestimation procedure for the rule probabilities of a Chomsky normal-form (CNF) SCFG. It takes as inputs an initial CNF SCFG and a training corpus of sentences and it iteratively reestimates rule probabilities to maximize the probability that the grammar used as a stochastic generator would produce the corpus.</Paragraph> <Paragraph position="1"> A reestimation algorithm can be used either to refine the parameter estimates for a CNF SCFG derived by other means (Fujisaki et al., 1989) or to infer a grammar from scratch. In the latter case, the initial grammar for the inside-outside algorithm consists of all possible CNF rules over given sets $N$ of nonterminals and $\Sigma$ of terminals, with suitably assigned nonzero probabilities. In what follows, we will take $N$, $\Sigma$ as fixed, $n = |N|$, $t = |\Sigma|$, and assume enumerations $N = \{A_1, \ldots, A_n\}$ and $\Sigma = \{b_1, \ldots, b_t\}$, with $A_1$ the grammar start symbol. A CNF SCFG over $N$, $\Sigma$ can then be specified by the $n^3 + nt$ probabilities $B_{p,q,r}$ of each possible binary rule $A_p \to A_q A_r$ and $U_{p,m}$ of each possible unary rule $A_p \to b_m$.
Since for each $p$ the parameters $B_{p,q,r}$ and $U_{p,m}$ are supposed to be the probabilities of different ways of expanding $A_p$, we must have for all $1 \le p \le n$ $$\sum_{q,r} B_{p,q,r} + \sum_{m} U_{p,m} = 1 \qquad (7)$$</Paragraph> <Paragraph position="3"> For grammar inference, we give random initial values to the parameters $B_{p,q,r}$ and $U_{p,m}$ subject to the constraints (7).</Paragraph> <Paragraph position="4"> The intended meaning of rule probabilities in an SCFG is directly tied to the intuition of context-freeness: a derivation is assigned a probability which is the product of the probabilities of the rules used in each step of the derivation. Context-freeness together with the commutativity of multiplication thus allow us to identify all derivations associated with the same parse tree, and we will</Paragraph> <Paragraph position="6"> speak indifferently below of derivation and analysis (parse tree) probabilities. Finally, the probability of a sentence or sentential form is the sum of the probabilities of all its analyses (equivalently, the sum of the probabilities of all of its leftmost derivations from the start symbol).</Paragraph> <Section position="1" start_page="129" end_page="130" type="sub_section"> <SectionTitle> 3.1. The Inside-Outside Algorithm </SectionTitle> <Paragraph position="0"> The basic idea of the inside-outside algorithm is to use the current rule probabilities and the training set $W$ to estimate the expected frequencies of certain types of derivation step, and then compute new rule probability estimates as appropriate ratios of those expected frequency estimates. Since these are most conveniently expressed as relative frequencies, they are a bit loosely referred to as inside and outside probabilities.
More precisely, for each $w \in W$, the inside probability $I_p^w(i, j)$ estimates the likelihood that $A_p$ derives ${}_i w_j$, while the outside probability $O_p^w(i, j)$ estimates the likelihood of deriving the sentential form ${}_0 w_i \, A_p \, {}_j w$ from the start symbol $A_1$.</Paragraph> </Section> <Section position="2" start_page="130" end_page="131" type="sub_section"> <SectionTitle> 3.2. The Extended Algorithm </SectionTitle> <Paragraph position="0"> In adapting the inside-outside algorithm to partially bracketed training text, we must take into account the constraints that the bracketing imposes on possible derivations, and thus on possible phrases. Clearly, nonzero values for $I_p^w(i, j)$ or $O_p^w(i, j)$ should only be allowed if ${}_i w_j$ is compatible with the bracketing of $w$, or, equivalently, if $(i, j)$ is valid for the bracketing of $w$. Therefore, we will in the following assume a corpus $C$ of bracketed strings $c = (w, B)$, and will modify the standard formulas for the inside and outside probabilities and rule probability reestimation (Baker, 1979; Lari and Young, 1990; Jelinek et al., 1990) to involve only constituents whose spans are compatible with string bracketings. For this purpose, for each bracketed string $c = (w, B)$ we define the auxiliary function $\delta(i, j)$, with $\delta(i, j) = 1$ if $(i, j)$ is valid for $B$ and $\delta(i, j) = 0$ otherwise.</Paragraph> <Paragraph position="2"> The reestimation formulas for the extended algorithm are shown in Table 1. For each bracketed sentence $c$ in the training corpus, the inside probabilities of longer spans of $c$ are computed from those for shorter spans with the recurrence given by equations (1) and (2). Equation (2) calculates the expected relative frequency of derivations of ${}_i w_k$ from $A_p$ compatible with the bracketing $B$ of $c = (w, B)$.
The multiplier $\delta(i, k)$ is 1 just in case $(i, k)$ is valid for $B$, that is, when $A_p$ can derive ${}_i w_k$ compatibly with $B$.</Paragraph> <Paragraph position="3"> Similarly, the outside probabilities for shorter spans of $c$ can be computed from the inside probabilities and the outside probabilities for longer spans with the recurrence given by equations (3) and (4). Once the inside and outside probabilities have been computed for each sentence in the corpus, the reestimated probability of binary rules, $\hat{B}_{p,q,r}$, and the reestimated probability of unary rules, $\hat{U}_{p,m}$, are computed by the reestimation formulas (5) and (6), which are just like the original ones (Baker, 1979; Jelinek et al., 1990; Lari and Young, 1990) except for the use of bracketed strings instead of unbracketed ones.</Paragraph> <Paragraph position="4"> The denominator of ratios (5) and (6) estimates the probability that a compatible derivation of a bracketed string in $C$ will involve at least one expansion of nonterminal $A_p$. The numerator of (5) estimates the probability that a compatible derivation of a bracketed string in $C$ will involve rule $A_p \to A_q A_r$, while the numerator of (6) estimates the probability that a compatible derivation of a string in $C$ will rewrite $A_p$ to $b_m$. Thus (5) estimates the probability that a rewrite of $A_p$ in a compatible derivation of a bracketed string in $C$ will use rule $A_p \to A_q A_r$, and (6) estimates the probability that an occurrence of $A_p$ in a compatible derivation of a string in $C$ will be rewritten to $b_m$. These are the best current estimates for the binary and unary rule probabilities.</Paragraph> <Paragraph position="5"> The process is then repeated with the reestimated probabilities until the increase in the estimated probability of the training text given the model becomes negligible, or, what amounts to the same, the decrease in the cross entropy estimate (negative log probability)</Paragraph> <Paragraph position="7"> becomes negligible.
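</Paragraph> <Paragraph position="8"> As an illustration only, the bracket-constrained inside computation can be sketched in Python as follows. Table 1 is not reproduced in this text, so the sketch assumes that equations (1) and (2) are the standard CNF inside recurrence, with the right-hand side of (2) multiplied by the validity indicator of the span (i, k). The names overlaps, delta and inside, the dictionary layout of the rule probabilities, and the toy grammar are hypothetical choices made for this example.

```python
def overlaps(s, t):
    """True if one span starts strictly inside the other and ends strictly after it."""
    (i, j), (k, l) = s, t
    return (l > j > k > i) or (j > l > i > k)


def delta(span, bracketing):
    """1.0 if `span` overlaps no span of `bracketing` (the span is valid), else 0.0."""
    return 0.0 if any(overlaps(span, b) for b in bracketing) else 1.0


def inside(words, bracketing, binary, unary):
    """Bracket-constrained inside pass for a CNF SCFG (assumed data layout).

    binary maps (p, q, r) to the probability of A_p -> A_q A_r;
    unary maps (p, word) to the probability of A_p -> word.
    Returns a dict from (p, i, k) to the inside probability of A_p over span (i, k).
    """
    length = len(words)
    table = {}
    # Base case: unit spans (i - 1, i) are produced by unary rules.
    for i in range(1, length + 1):
        for (p, word), prob in unary.items():
            if word == words[i - 1]:
                table[(p, i - 1, i)] = prob
    # Recursive case: a longer span is split into two adjacent shorter spans.
    for width in range(2, length + 1):
        for i in range(length - width + 1):
            k = i + width
            if delta((i, k), bracketing) == 0.0:
                continue  # span crosses a bracket, so its inside probability stays 0
            for (p, q, r), prob in binary.items():
                total = sum(
                    table.get((q, i, j), 0.0) * table.get((r, j, k), 0.0)
                    for j in range(i + 1, k)
                )
                table[(p, i, k)] = table.get((p, i, k), 0.0) + prob * total
    return table


# Toy grammar with one nonterminal A_0: A_0 -> A_0 A_0 (0.5) and A_0 -> "a" (0.5).
binary = {(0, 0, 0): 0.5}
unary = {(0, "a"): 0.5}
words = ["a", "a", "a"]
unbracketed = inside(words, set(), binary, unary)[(0, 0, 3)]
bracketed = inside(words, {(0, 2)}, binary, unary)[(0, 0, 3)]
print(unbracketed, bracketed)  # prints 0.0625 0.03125
```

On this toy input the bracket (0, 2) rules out the right-branching analysis, so only one of the two parses contributes and the inside probability of the whole string halves.</Paragraph> <Paragraph position="9">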
Note that for comparisons with the original algorithm, we should use the cross-entropy estimate $\hat{H}(W, G)$ of the unbracketed text $W$ with respect to the grammar $G$, not (8).</Paragraph> </Section> <Section position="3" start_page="131" end_page="131" type="sub_section"> <SectionTitle> 3.3. Complexity </SectionTitle> <Paragraph position="0"> Each of the three steps of an iteration of the original inside-outside algorithm (computation of inside probabilities, computation of outside probabilities, and rule probability reestimation) takes time $O(|w|^3)$ for each training sentence $w$. Thus, the whole algorithm is $O(|w|^3)$ on each training sentence.</Paragraph> <Paragraph position="1"> However, the extended algorithm performs better when bracketing information is provided, because it does not need to consider all possible spans for constituents, but only those compatible with the training set bracketing. In the limit, when the bracketing of each training sentence comes from a complete binary-branching analysis of the sentence (a full binary bracketing), the time of each step reduces to $O(|w|)$. This can be seen from the following three facts about any full binary bracketing $B$ of a string $w$: 1. $B$ has $O(|w|)$ spans; 2. For each $(i, k)$ in $B$ there is exactly one split point $j$ such that both $(i, j)$ and $(j, k)$ are in $B$; 3. Each valid span with respect to $B$ must already be a member of $B$.</Paragraph> <Paragraph position="2"> Thus, in equation (2) for instance, the number of spans $(i, k)$ for which $\delta(i, k) \ne 0$ is $O(|c|)$, and there is a single $j$ between $i$ and $k$ for which $\delta(i, j) \ne 0$ and $\delta(j, k) \ne 0$. Therefore, the total time to compute all the $I_p^c(i, k)$ is $O(|c|)$. A similar argument applies to equations (4) and (5).
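</Paragraph> <Paragraph position="3"> The three facts above can be checked by brute force on a small example. The following Python sketch is only an illustration: the function names and the example bracketing are hypothetical, fact 2 is checked only for spans longer than one word, and the exhaustive enumeration used here is not the efficient preprocessing that an implementation would need.

```python
def overlaps(s, t):
    """True if one span starts strictly inside the other and ends strictly after it."""
    (i, j), (k, l) = s, t
    return (l > j > k > i) or (j > l > i > k)


def valid_spans(length, bracketing):
    """All spans (i, k) of a string of the given length that overlap no bracket."""
    return {
        (i, k)
        for i in range(length)
        for k in range(i + 1, length + 1)
        if not any(overlaps((i, k), b) for b in bracketing)
    }


def split_points(span, valid):
    """Positions j strictly inside the span such that both halves are valid spans."""
    i, k = span
    return [j for j in range(i + 1, k) if (i, j) in valid and (j, k) in valid]


# Full binary bracketing of a four-word string analysed as ((w1 w2) (w3 w4)).
B = {(0, 4), (0, 2), (2, 4), (0, 1), (1, 2), (2, 3), (3, 4)}
valid = valid_spans(4, B)
splits = {s: split_points(s, valid) for s in sorted(valid) if s[1] - s[0] > 1}

print(len(B))      # fact 1: 7 spans, linear in the string length
print(valid == B)  # fact 3: True, every valid span is already in B
print(splits)      # fact 2: {(0, 2): [1], (0, 4): [2], (2, 4): [3]}
```

With an empty bracketing the same function returns all ten spans of a four-word string, which is the quadratic search space that the original algorithm has to consider.</Paragraph> <Paragraph position="4">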
Note that to achieve the above bound as well as to take advantage of whatever bracketing is available to improve performance, the implementation must preprocess the training set appropriately so that the valid spans and their split points are efficiently enumerated.</Paragraph> </Section> </Section> </Paper>