<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1024">
  <Title>Inside-Outside Reestimation from Partially Bracketed Corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. MOTIVATION
</SectionTitle>
    <Paragraph position="0"> Grammar inference is a challenging problem for statistical approaches to natural-language processing. The most successful grammar inference techniques involve stochastic finite-state language models such as hidden Markov models (HMMs) \[1\]. However, finite-state language models fail to represent the hierarchical structure of natural language. Therefore, stochastic versions of grammar formalisms structurally more expressive are worth investigating. Baker \[2\] generalized the parameter estimation methods for HMMs to stochastic context-free grammars (SCFGs) \[3\] as the inside-outside algorithm.</Paragraph>
    <Paragraph position="1"> Unfortunately, the application of SCFGs and the inside-outside algorithm to natural-language modeling \[4, 5, 6\] has so far been inconclusive.</Paragraph>
    <Paragraph position="2"> Several reasons can be adduced for the difficulties. First, each iteration of the inside-outside algorithm on a grammar with n nonterminals may require O(nalwl 3) time per training sentence w, while each iteration of its finite-state counterpart training an HMM with s states requires at worst O(s2lwD time per training sentence. Second, the convergence properties of the algorithm sharply deteriorate as the number of nonterminal symbols increases.</Paragraph>
    <Paragraph position="3"> This fact can be intuitively understood by observing that the algorithm searches for the maximum of a function whose number of local maxima grows with the number of nonterminMs. Finally, although SCFGs provide a hierarchical model of the language, that structure is undetermined by raw text and only by chance will the inferred grammar agree with qualitative linguistic judgments of sentence structure. For example, since in English texts pronouns are very likely to immediately precede a verb, a grammar inferred from raw text will tend to together the subject pronoun with the verb.</Paragraph>
    <Paragraph position="4"> We describe here an extension of the inside-outside algorithm that infers the parameters of a stochastic context-free grammar from a partially parsed corpus, thus providing a tighter connection between the hierarchical structure of the inferred SCFG and that of the training corpus. The Mgorithm takes advantage of whatever constituent information is provided by the training corpus bracketing, ranging from a complete constituent analysis of the training sentences to the unparsed corpus used for the original inside-outside algorithm. In the latter case, the new algorithm reduces to the original one.</Paragraph>
    <Paragraph position="5"> Using a partiMly parsed corpus has several important advantages. We empirically show that the use of partially parsed corpus can decrease the number of iterations needed to reach a solution. We also exhibit cases where a good solution is found from partially parsed corpus but not from raw text. Most importantly, the use of partially parsed corpus enables the Mgorithm to infer grammars that derive constituent boundaries that cannot be inferred from raw text.</Paragraph>
    <Paragraph position="6"> We first outline our extension of the inside-outside algorithm to partially parsed text, and then report preliminary experiments illustrating the advantages of the extended algorithm.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="122" type="metho">
    <SectionTitle>
2. PARTIALLY BRACKETED TEXT
</SectionTitle>
    <Paragraph position="0"> Informally, a partially bracketed corpus is a set of sentences annotated with parentheses marking constituent boundaries that any analysis of the corpus should respect. More precisely, we start from a corpus C consisting of bracketed strings, which are pairs c = (w, B) where w is a string and B is a bracketing of w. For convenience, we will define the length of the bracketed string cby\[c\[=\[w I.</Paragraph>
    <Paragraph position="1"> Given a string w = wl ...wlw \[, a span ofw is a pair of integers (i, j) with 0 _~ i &lt; j _~ \[w\[. By convention, span (i,j) delimits substring iwj = wi+l ...wj of w. We also  use the abbreviation itv for iWlto I.</Paragraph>
    <Paragraph position="2"> A bracketing B of a string w is a finite set of spans on w (that is, a finite set of pairs or integers (i, j) with 0 _&lt; i &lt; j _&lt; \[wl) satisfying a consistency condition that ensures that each span (i, j) can be seen as delimiting a (sequence of) constituents iwi. The consistency condition is simply that no two spans in a bracketing may overlap, where two spans (i, j) and (k, I) overlap if either i &lt; k &lt; j &lt; i or k &lt; i &lt; l &lt; j. We also say that two bracketings of the same string are compatible if their union is consistent.</Paragraph>
    <Paragraph position="3"> Note that there is no requirement that a bracketing of w describe fully the constituent structure of w. In fact, some or all sentences in a corpus may have empty bracketings, in which case the new algorithm behaves like the original one.</Paragraph>
    <Paragraph position="4"> To present the notion of compatibility between a derivation and a bracketed string, we need first to define the span of a symbol occurrence in a context-free derivation. Let (w, B) be a bracketed string, and a0 ~ al =::&gt; ...</Paragraph>
    <Paragraph position="5"> am = w be a derivation of w for (S)CFG G. The span of a symbol occurrence in aj is defined inductively as follows:</Paragraph>
    <Paragraph position="7"> where A ~ Xi ..- Xk is a production of G. Then the span of A in aj is (il,jk), where for each 1 &lt; 1 &lt; k, (iz,jt) is the span of Xz in aj+l. The spans in aj of the symbol occurrences in/~ and 7 are the same as those of the corresponding symbols in otj+t.</Paragraph>
    <Paragraph position="8"> A derivation of w is then compatible with a bracketing B of w if no span of a symbol occurrence in the derivation overlaps a span in B.</Paragraph>
  </Section>
  <Section position="5" start_page="122" end_page="123" type="metho">
    <SectionTitle>
3. THE INSIDE-OUTSIDE
ALGORITHM
</SectionTitle>
    <Paragraph position="0"> The inside-outside algorithm \[2\] is a reestimation procedure for the rule probabilities of a Chomsky normal-form (CNF) SCFG. It takes as inputs an initial CNF SCFG and a training corpus of sentences and it iteratively reestimates rule probabilities to maximize the probability that the grammar used as a stochastic generator would produce the corpus.</Paragraph>
    <Paragraph position="1"> A reestimation algorithm can be used both to refine the parameter estimates for a CNF SCFG derived by other means \[7\] or to infer a grammar from scratch. In the latter case, the initial grammar for the inside-outside algorithm consists of all possible CNF rules over given sets N of nonterminals and E of terminals, with suitable assigned nonzero probabilities. In what follows, we will take N, E as fixed, n = \[NI, t = I~1, and assume enumerations N = {A1,...,An} and E = {bl,...,bt}, with A1 the grammar start symbol. A CNF SCFG over N, E can then be specified by the n s + nt probabilities Bp,q,r of each possible binary rule Ap ~ Aq Ar and Up,m of each possible unary rule A n ~ bin. Since for each p the parameters Bp.q,r and Up,m are supposed to be the probabilities of different ways of expanding Ap, we must have the for all 1 _&lt; p_&lt; n</Paragraph>
    <Paragraph position="3"> For grammar inference, we give random initial values to the parameters Bp,q,r and Up,m subject to the constraints (I).</Paragraph>
    <Paragraph position="4"> The intended meaning of rule probabilities in a SCFG is directly tied to the intuition of context-freeness: a derivation is assigned a probability which is the product of the probabilities of the rules used in each step of the derivation. Context-freeness together with the commutativity of multiplication thus allow us to identify all derivations associated to the same parse tree, and we will speak indifferently below of derivation and analysis (parse tree) probabilities. Finally, the probability of a sentence or sentential form is the sum of the probabilities of all its analyses (equivalently, the sum of the probabilities of all of its leftmost derivations from the start symbol).</Paragraph>
    <Paragraph position="5"> The basic idea of the inside-outside algorithm is to use the current rule probabilities to estimate from the training text the expected frequencies of certain derivation steps, and then compute new rule probability estimates as appropriate frequency ratios. Therefore, each iteration of the algorithm starts by calculating estimates of the number of occurrences of the relevant configurations in each of the sentences tv in the training corpus W.</Paragraph>
    <Paragraph position="6"> Because the frequency estimates are most conveniently computed as ratios of other frequencies, they are a bit loosely referred to as inside and outside probabilities.</Paragraph>
    <Paragraph position="7"> In the original inside-outside algorithm, for each tv E W, the inside probability I~(i,j) estimates the likelihood that Ap derives iwj, while the outside probability O~(i,j) estimates the likelihood of deriving sentential form owi Apjw from the start symbol A1. In adapting the algorithm to partially bracketed strings we must take into account the constraints that the bracketing imposes on possible derivations, and thus on possible phrases.</Paragraph>
    <Paragraph position="8"> Clearly, nonzero values for I~(i,j) or O~(i,j) should only be allowed if iwj is compatible with the bracketing of w, or, equivalently, if (i, j) does not overlap any span  in the bracketing of w. Therefore, we will in the following assume a bracketed corpus C, which as described above is a set of bracketed strings c = (w, B), and will modify the standard formulae for the inside and outside probabilities and rule probability reestimation \[2, 4, 5\] to involve only constituents whose spans are compatible with string bracketings. For this purpose, for each bracketed string c = (w, B) we define the auxiliary function</Paragraph>
    <Paragraph position="10"> For each bracketed sentence c in the training corpus, the inside probabilities of longer spans of c can be computed from those for shorter spans by the following recurrence equations:</Paragraph>
    <Paragraph position="12"> Equation (3) computes the expected relative frequency of derivations of ~wk from Ap compatible with the bracketing B of c = (w, B). The multiplier 5(i, k) is 0 just in case (i, k) overlaps some span in B, which is exactly when Ap cannot derive iwk compatibly with B.</Paragraph>
    <Paragraph position="13"> Similarly, the outside probabilities for shorter spans of c can be computed from the inside probabilities and the outside probabilities for longer spans by the following recurrence:</Paragraph>
    <Paragraph position="15"> Once the inside and outside probabilities computed for each sentence in the corpus, the reestimated probability of binary rules, /Jp,q,r, and the reestimated probability of unary rules, (/p,m, are computed using the following reestimation formulae, which are just like the standard ones \[2, 5, 4\] except for the use of bracketed strings instead of unbracketed ones:</Paragraph>
    <Paragraph position="17"> where Pc is the probability assigned by the current model to bracketed string c</Paragraph>
    <Paragraph position="19"> and P~ is the probability assigned by the current model to the set of derivations compatible with c involving some instance of nonterminal Ap</Paragraph>
    <Paragraph position="21"> The denominator of ratios (6) and (7) estimates the probability that a compatible derivation of a bracketed string in C will involve at least one expansion of nonterminal Av. The numerator of (6) estimates the probability that a compatible derivation of a bracketed string in C will involve rule Ap --~ Aq At, while the numerator of (7) estimates the probability that a compatible derivation of a string in C will rewrite Ap to bin. Thus (6) estimates the probability that a rewrite of Ap in a compatible derivation of a bracketed string in C will use rule Ap ~ Aq At, and (7) estimates the probability that an occurrence of Ap in a compatible derivation of a string in in C will be rewritten to bin. Clearly, these are the best current estimates for the binary and unary rule probabilities. null The process is then repeated with the reestimated probabilities until the increase in the estimated probability of the training text given the model becomes negligible, or, what amounts to the same, the decrease in the cross entropy estimate (log probability)</Paragraph>
    <Paragraph position="23"> becomes negligible. Note that for comparisons with the original algorithm, we should use the cross entropy of the unbracketed text with respect to the grammar, not (8).</Paragraph>
  </Section>
  <Section position="6" start_page="123" end_page="126" type="metho">
    <SectionTitle>
4. EXPERIMENTAL EVALUATION
</SectionTitle>
    <Paragraph position="0"> The following experiments, although preliminary, give some support to our earlier suggested advantages of the inside-outside algorithm for partially bracketed corpora.</Paragraph>
    <Paragraph position="1"> We start with a formal-language example used by Lari and Young \[4\] in a previous evaluation of the inside-outside algorithm. In this case, training on a bracketed  corpus can lead to a good solution while no reasonable solution is found training on raw text only.</Paragraph>
    <Paragraph position="2"> Then, using a naturally occurring corpus and its partially bracketed version provided by the Penn Treebank, we compare the bracketings assigned by grammars inferred from raw and from bracketed training material with the Penn Treebank bracketings.</Paragraph>
    <Paragraph position="3"> Together, the experiments support the view that training on bracketed corpora can lead to better convergence, and the resulting grammars agree better with linguistic judgments of sentence structure.</Paragraph>
    <Paragraph position="4"> 4.1. Inferring the Palindrome Language We consider first an artificial language discussed by Lari and Young \[4\]. Our training corpus consists of 100 sentences in the palindrome language L over two symbols a  The initial grammar consists of all possible CNF rules over five nonterminals and the terminals a and b (135 rules), with a random assignment of initial probabilities. As shown in Figure 1, with an unbracketed training set the log probability remains almost unchanged after 40 iterations (from 1.57 to 1.43) and no useful solution is found. In contrast, with the same training set fully bracketed, the log probability of the inferred grammar computed on the raw text decreases rapidly (1.57 initially, 0.87 after 22 iterations). Similarly, the cross entropy estimate of the bracketed text with respect to the grammar improves rapidly (2.85 initially, 0.87 after 22 iterations).</Paragraph>
    <Paragraph position="5"> The inferred grammar models correctly the palindrome language. Its high probability rules (p &gt; 0.1, p/p' &gt; 30  which is a close to optimal CNF CFG for the palindrome language.</Paragraph>
    <Paragraph position="6"> The results on this grammar are quite sensitive to the size and statistics of the training corpus and the initial rule probability assignment. In fact, for a couple of choices of initial grammar and corpus, the original algorithm yields somewhat better results than the new one. However, in no experiment did the training on unparsed text achieve nearly as good a result as that shown above for parsed text.</Paragraph>
    <Paragraph position="7"> 4.2. Experiments on the ATIS Corpus We also conducted an experiment on inferring grammars for the language consisting of part-of-speech sequences of spoken-language transcriptions in the Texas Instruments subset of the Air Travel Information System (ATIS) corpus \[8\]. We take advantage of the availability of the hand-parsed version of the ATIS corpus provided by the Penn Treebank project \[9\] and use the corresponding bracketed corpus over parts of speech as training data. Out of the 770 bracketed sentences (7812 words) in the corpus, we used 700 as training data and 70 (901 words) as test set. The following is an example training string</Paragraph>
    <Paragraph position="9"> corresponding to the parsed sentence (((\[List (the fares (for ((flight) (number 891)))))) .) The initial grammar consists of all possible CNF rules (4095 rules) over 15 nonterminals (the same number as in the tree bank) and 48 terminals corresponding to the parts of speech used in the tree bank.</Paragraph>
    <Paragraph position="10"> We trained a random initial grammar twice, on the unbracketed version of the training corpus yielding grammar GR, and on the bracketed training set, yielding grammar GB.</Paragraph>
    <Paragraph position="11"> Figure 2 shows that the convergence to GB is faster than the convergence to GR. Even though the cross-entropy estimates for the raw training text with both grammars are not that different after 50 iterations (3.0 for GB, 3.02 for GR), the analyses assigned by the resulting grammars to the test set are drastically different.</Paragraph>
    <Paragraph position="12"> To evaluate objectively the quality of the analyses yielded by a grammar G, we used a Viterbi-style parser to find the most likely analyses of the test set according to G, and computed the proportion of phrases in those analyses that are compatible in the sense defined in Section 2 with the tree bank bracketings of the test set. This criterion is closely related to the &amp;quot;crossing parentheses&amp;quot; score of Black et al. \[10\]. We found that that only 35% of the constituents in the most likely GR analyses of the test set are compatible with tree bank bracketing, in contrast to 88% of the constituents in the most likely GB analysis.</Paragraph>
    <Paragraph position="13"> As a first example, GB gives the following bracketings:  It is interesting to look at some the differences between GR and GB, as seen from the most likely analyses they assign to certain sentences. For readability, we give the analyses in terms of the original words rather than part of speech tags.</Paragraph>
    <Paragraph position="14"> ((Tell me (about (the public transportation ((from SF0) (to San Francisco))))). ) However, the most likely GR analysis has nine constituents incompatible with the tree bank: (Tell ((me (((about the) public) tramsportation)) ((from SF0) ((to Sam) (Francisco . ) ) ) ) ) In this analysis, a Francisco and the final punctuation are places in a lowest-level constituent. Since final punctuation is quite often preceded by a noun, a grammar inferred from raw text will tend to bracket the noun with the punctuation mark.</Paragraph>
    <Paragraph position="15"> Even better results can be obtained by continuing the reestimation on bracketed text. After 78 iterations, 91% of the constituents of the most likely parse of the test set are compatible with the tree bank bracketing.</Paragraph>
    <Paragraph position="16">  This experiment illustrates the fact that although SCFGs provide a hierarchical model of the language, that structure is undetermined by raw text and only by chance will the inferred grammar agree with qualitative linguistic judgments of sentence structure. This problem has also been previously observed with linguistic structure inference methods based on mutual information. Magerman and Marcus \[11\] propose to alleviate this behavior by enforcing that a predetermined list of pairs of words (such as verb-preposition, pronoun-verb) are never embraced by a constituent. However, these constraints are stipulated in advance rather than being automatically derived from the training material, in contrast with what we have shown to be possible with the inside-outside algorithm for partially bracketed corpora.</Paragraph>
  </Section>
  <Section position="7" start_page="126" end_page="126" type="metho">
    <SectionTitle>
5. CONCLUSIONS AND FURTHER
WORK
</SectionTitle>
    <Paragraph position="0"> We have introduced a modification of the well-known inside-outside algorithm for inferring the parameters of a stochastic context-free grammar that can take advantage of constituent information (constituent bracketing) in a partially bracketed corpus.</Paragraph>
    <Paragraph position="1"> The method has been successfully applied to SCFG inference for formal languages and for part-of-speech sequences derived from the ATIS spoken-language corpus.</Paragraph>
    <Paragraph position="2"> The use of partially bracketed corpus can reduce the number of iterations required for convergence of parameter reestimation. In some cases, a good solution is found from a bracketed corpus but not from raw text. Most importantly, the use of partially bracketed natural corpus enables the algorithm to infer grammars specifying linguistically reasonable constituent boundaries that cannot be inferred by the inside-outside algorithm on raw text.</Paragraph>
    <Paragraph position="3"> These preliminary investigations could be extended in several ways. First, it is important to determine the sensitivity of the training algorithm to the initial probability assignments and training corpus, as well as to lack or misplacement of brackets. We have started experiments in this direction, but reasonable statistical models of bracket elision and misplacement are lacking.</Paragraph>
    <Paragraph position="4"> Second, we would like to extend our experimvnts to larger terminal vocabularies. As is well-known, this raises both computational and data sparseness problems, so clustering of terminal symbols will be essential.</Paragraph>
    <Paragraph position="5"> Finally, this work does not address a central weakness of SCFGs, their inability to represent lexical influences on distribution except by a statistically and computationally impractical proliferation of nonterminal symbols. One might instead look into versions of the current algorithm for more lexically-oriented formalisms such as stochastic lexicalized tree-adjoining grammars \[12\].</Paragraph>
  </Section>
  <Section position="8" start_page="126" end_page="126" type="metho">
    <SectionTitle>
ACKNOWLEDGMENTS
</SectionTitle>
    <Paragraph position="0"> We thank Aravind Joshi and Stuart Shieber for useful discussions. The second author is partially supported by</Paragraph>
  </Section>
</Paper>