<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0102"> <Title>Lexical Heads, Phrase Structure and the Induction of Grammar</Title> <Section position="4" start_page="15" end_page="16" type="metho"> <SectionTitle> 3. INCORPORATING HEADEDNESS INTO LANGUAGE MODELS </SectionTitle>
<Paragraph position="0"> The conclusion that the above section might lead us to is that basing phrase-structure grammar induction on minimization of entropy is a poor idea. However, in this paper we will not discuss whether statistical induction is the proper way to view language acquisition: our current goal is only to better understand why current statistical methods produce the "wrong" answer and to explore ways of fixing them.</Paragraph>
<Paragraph position="1"> Let us look again at (A), reproduced below, and center discussion on an extended stochastic context-free grammar model in which a binary context-free rule Z ⇒ A B with terminal parts-of-speech on the right hand side first generates a word a chosen from a distribution pA(a), then generates a word b from a distribution pB(b | a). If we call these two random variables A and B, then the entropy of the sequence AB is H(A) + H(B|A) = H(A) + H(B) - I(A, B), where H(X) is the entropy of a random variable X and I(X, Y) is the mutual information between two random variables X and Y. The point here is that using such a context-free rule to model a sequence of two words reduces the entropy of the language, relative to a model that treats the two words as independent, by precisely the mutual information between the two words.</Paragraph>
<Paragraph position="3"> In English, verbs and prepositions in configuration (A) are closely coupled semantically, probably more closely than prepositions and nouns, and we would expect the mutual information between the verb and preposition to be greater than that between the preposition and noun, and greater still than that between the verb and the noun:</Paragraph>
<Paragraph position="4"> I(V, P) > I(P, N) > I(V, N). Under our hypothesized model, structure (A) has entropy H(V) + H(P) + H(N|P) = H(V) + H(P) + H(N) - I(P, N), which is higher than the entropy of structures (E-H), H(V) + H(P) + H(N) - I(V, P), and we would not expect a learning mechanism based on such a model to settle on (A).</Paragraph>
<Paragraph position="5"> However, this simple class of models only uses phrases to capture relations between adjacent words. In (A), it completely ignores the relation between the verb and the prepositional phrase, save to predict that a prepositional phrase (any prepositional phrase) will follow the verb. We therefore modify our language model, assuming that nonterminals exhibit the distributional properties of their heads. We will write a phrase Z that is headed by a word z as (Z, z). Each grammar rule will look like either (Z', z) ⇒ (Z, z)(Y, y) or (Z', z) ⇒ (Y, y)(Z, z) (abbreviated Z' ⇒ Z Y and Z' ⇒ Y Z), and the probability model (equations (1) and (2)) multiplies the probability of the context-free rule by the probability of the non-head word conditioned on the head word.</Paragraph>
<Paragraph position="9"> Of course, this class of models is strongly equivalent to ordinary context-free grammars. We could substitute, for every rule Z' ⇒ Z Y, a large number of word-specific rules (Z', zi) ⇒ (Z, zi)(Y, yj) with probabilities p(Z' ⇒ Z Y) · pY(yj | zi).</Paragraph>
<Paragraph position="10"> Using our new formalism, the head properties of (A) look like</Paragraph>
<Paragraph position="12"> and the entropy is H(V) + H(P|V) + H(N|P) = H(V) + H(P) + H(N) - I(V, P) - I(P, N). The grammar derived from (A) is optimal under this model of language, though (C), (F), and (H) are equally good. They could be distinguished from (A) in longer sentences because they pass different head information out of the phrase. In fact, the grammar model derived from (A) is as good as any possible model that does not condition N on V. Under this class of models there is no benefit to grouping two words with high mutual information together in the same minimal phrase; it is sufficient for both to be the heads of phrases that are adjacent at some level.</Paragraph>
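To make the head-conditioned bookkeeping concrete, here is a minimal sketch (not from the paper; the vocabulary, probabilities, and function names are invented for illustration) that computes the entropy of a verb-preposition-noun sequence under a word-independent model and under the head-driven analysis of structure (A), and checks that the difference is exactly I(V, P) + I(P, N):

    # Minimal sketch of the head-driven model's entropy accounting.
    # All distributions below are toy values chosen only for illustration.
    import math
    from collections import defaultdict

    def entropy(p):
        """Entropy in bits of a distribution given as {event: prob}."""
        return -sum(q * math.log2(q) for q in p.values() if q > 0)

    def conditional_entropy(joint):
        """H(Y|X) for a joint distribution {(x, y): prob}."""
        px = defaultdict(float)
        for (x, _), q in joint.items():
            px[x] += q
        return -sum(q * math.log2(q / px[x]) for (x, _), q in joint.items() if q > 0)

    # Toy vocabulary: two verbs, two prepositions, two nouns.
    p_v = {"saw": 0.5, "ran": 0.5}
    p_p_given_v = {"saw": {"with": 0.9, "to": 0.1},   # verb strongly selects the preposition
                   "ran": {"with": 0.1, "to": 0.9}}
    p_n_given_p = {"with": {"scope": 0.7, "park": 0.3},  # weaker preposition-noun coupling
                   "to":   {"scope": 0.3, "park": 0.7}}

    # Joint and marginal distributions needed for the entropy terms.
    joint_vp = {(v, p): p_v[v] * p_p_given_v[v][p] for v in p_v for p in p_p_given_v[v]}
    p_p = defaultdict(float)
    for (_, p), q in joint_vp.items():
        p_p[p] += q
    joint_pn = {(p, n): p_p[p] * p_n_given_p[p][n] for p in p_p for n in p_n_given_p[p]}
    p_n = defaultdict(float)
    for (_, n), q in joint_pn.items():
        p_n[n] += q

    flat = entropy(p_v) + entropy(p_p) + entropy(p_n)            # independent words
    headed = entropy(p_v) + conditional_entropy(joint_vp) \
             + conditional_entropy(joint_pn)                     # H(V) + H(P|V) + H(N|P)
    i_vp = entropy(p_p) - conditional_entropy(joint_vp)
    i_pn = entropy(p_n) - conditional_entropy(joint_pn)

    print(f"flat model      : {flat:.3f} bits per sentence")
    print(f"head-driven (A) : {headed:.3f} bits per sentence")
    print(f"reduction       : {flat - headed:.3f} = I(V,P) + I(P,N) = {i_vp + i_pn:.3f}")

The same accounting applies to the other structures (A)-(H); which conditional entropies (and hence which mutual informations) appear is what distinguishes them.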
<Paragraph position="13"> There is of course no reason why the best head-driven statistical model of a given language must coincide with a grammar derived by a linguist. The above class of models makes no mention of deletion or movement of phrases, and only information about the head of a phrase is passed beyond that phrase's borders. The government-binding framework usually supposes that an inflection phrase is formed from inflection and the verb phrase; but the verb is likely to have higher mutual information with the subject than inflection does, so it seems unlikely that this structure would be learned under our scheme. The effectiveness of the class of models can only be verified by empirical tests.</Paragraph> </Section>
<Section position="5" start_page="16" end_page="16" type="metho"> <SectionTitle> 4. SOME EXPERIMENTS </SectionTitle>
<Paragraph position="0"> We have built a stochastic, feature-based Earley parser (de Marcken, 1995) that can be trained using the inside-outside algorithm. Here we describe some tests that explore the interaction of the head-driven language models described above with this parser and training method.</Paragraph>
<Paragraph position="1"> For all the tests described here, we learn a grammar by starting with an exhaustive set of stochastic context-free rules of a certain form, and estimate probabilities for these rules from a test corpus. This is the same general procedure as used by (Lari and Young, 1990; Briscoe and Waegner, 1992; Pereira and Schabes, 1992) and others. For parts-of-speech Y and Z, the rules we include in our base grammar are</Paragraph> </Section>
<Section position="6" start_page="16" end_page="21" type="metho"> <SectionTitle> S ⇒ ZP;  ZP ⇒ ZP YP;  ZP ⇒ YP ZP;  ZP ⇒ Z YP;  ZP ⇒ YP Z;  ZP ⇒ Z </SectionTitle>
<Paragraph position="0"> where S is the root nonterminal. As is usual with stochastic context-free grammars, every rule has an associated probability, and the probabilities of all the rules that expand a single nonterminal must sum to one. Furthermore, each word and phrase has an associated head word (represented as a feature value that is propagated from the Z or ZP on the right hand side of the above rules to the left hand side). The parser is given the part of speech of each word.</Paragraph>
<Paragraph position="1"> For binary rules, as per equations (1) and (2), the distribution of the non-head word is conditioned on the head (a bigram). Initially, all word bigrams are initialized to uniform distributions, and context-free rule probabilities are initialized to a small random perturbation of a uniform distribution.</Paragraph>
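As a rough sketch of the setup just described (the six rule schemata, the near-uniform rule initialization, and the uniform head bigrams), the following shows one way the base grammar could be instantiated. The data structures and function names are illustrative assumptions, not the parser's actual implementation:

    # Sketch of the base-grammar construction: instantiate the six rule schemata
    # for every pair of part-of-speech tags, give the expansions of each
    # left-hand side a slightly perturbed uniform distribution, and start the
    # head bigrams out uniform.
    import random
    from collections import defaultdict

    def build_base_grammar(pos_tags, seed=0):
        rng = random.Random(seed)
        rules = defaultdict(list)          # left-hand side -> list of right-hand sides
        for Z in pos_tags:
            rules["S"].append((f"{Z}P",))                  # S  => ZP
            rules[f"{Z}P"].append((Z,))                    # ZP => Z
            for Y in pos_tags:
                rules[f"{Z}P"].append((f"{Z}P", f"{Y}P"))  # ZP => ZP YP
                rules[f"{Z}P"].append((f"{Y}P", f"{Z}P"))  # ZP => YP ZP
                rules[f"{Z}P"].append((Z, f"{Y}P"))        # ZP => Z  YP
                rules[f"{Z}P"].append((f"{Y}P", Z))        # ZP => YP Z
        # Uniform probabilities plus a small random perturbation, renormalized so
        # the expansions of each left-hand side sum to one.
        probs = {}
        for lhs, rhss in rules.items():
            weights = [1.0 + 0.01 * rng.random() for _ in rhss]
            total = sum(weights)
            probs[lhs] = [(rhs, w / total) for rhs, w in zip(rhss, weights)]
        return rules, probs

    def uniform_bigrams(vocab):
        # p(non-head word | head word) starts out uniform over the vocabulary.
        p = 1.0 / len(vocab)
        return {head: {w: p for w in vocab} for head in vocab}

    rules, probs = build_base_grammar(["A", "B", "C"])
    print(sum(len(r) for r in rules.values()), "rules")   # 4*3*3 + 2*3 = 42
    bigrams = uniform_bigrams(["a1", "a2", "b1", "b2"])

With the 25 collapsed Treebank tags of Section 4.3, this schema count comes to 4·25² + 2·25 = 2550, which matches the rule count quoted there (assuming the schemata are instantiated for every ordered pair of tags, duplicates included).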
<Section position="1" start_page="16" end_page="21" type="sub_section"> <SectionTitle> 4.1. A Very Simple Sentence </SectionTitle>
<Paragraph position="0"> We created a test corpus of 1000 sentences, each 3 words long with a constant part-of-speech pattern ABC. Using 8 equally probable words per part-of-speech, we chose a word distribution over the sentences with the following characteristics: I(A,B) = 1 bit, I(B,C) = 0.188 bits, I(A,C) = 0 bits.</Paragraph>
<Paragraph position="1"> In other words, given knowledge of the first word in the sentence, predicting the second word is as difficult as guessing between four equally-likely words, and knowing the second word makes predicting the third as difficult as guessing between seven words. Knowing the first gives no information about the third. This is qualitatively similar to the distribution we assumed for verbs, nouns, and prepositions in configuration (A), and has entropy rate (3 + (3 - 1) + (3 - 0.188)) / 3 = 2.604 bits per word. Across 20 runs, the training algorithm converged to three different grammars (that is, after the cross-entropy had ceased to decrease on a given run, the parser settled on one of these structures as the Viterbi parse of each sentence in the corpus; the cross-entropy rate of the two best grammars is lower than the source entropy rate because the corpus is finite and randomly generated, and has been overfitted). One fact is immediately striking: even with such simple sentences and rule sets, more often than not the inside-outside algorithm converges to a suboptimal grammar. To understand why, let us ignore recursive rules (ZP ⇒ ZP YP) for the moment. Then there are four possible parses of ABC (cross-entropy rate with the source given below; lower indicates a better model):</Paragraph>
<Paragraph position="3"> During the first pass of the inside-outside algorithm, assuming near-uniform initial rule probabilities, each of these parses will have equal posterior probabilities. They are equally probable because they use the same number of expansions (this is why we can safely ignore recursive rules in this discussion: any parse that involves one will have a bigger tree and be significantly less probable) and because word bigrams are uniform at the start of the parsing process. Thus, the estimated probability of a rule after the first pass is directly proportional to how many of these parse trees the rule features in. The rules that occur more than one time are:</Paragraph>
<Paragraph position="5"> Therefore, on the second iteration, these three rules will have higher probabilities than the others and will cause parses J and K to be favored over I and L (with K favored over J because I(A, B) + I(A, C) > I(B, C) + I(A, C)). It is to be expected, then, that the inside-outside algorithm favors the suboptimal parse K: at its start the inside-outside algorithm is guided by tree-counting arguments, not mutual information between words. This suggests that the inside-outside algorithm is likely to be highly sensitive to the form of the grammar and to how many different analyses it permits of a sentence.</Paragraph>
<Paragraph position="6"> Why, later, does the algorithm not move towards a global optimum? The answer is that the inside-outside algorithm is supremely unsuited to learning with this representation. To understand this, notice that to move from the initially favored parse (K) to one of the optimal ones (I and L), three nonterminals must have their most probable rules switched.</Paragraph>
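The figure listing the four parses and the rules they share is not part of this extract, so the sketch below reconstructs them from the rule schemata and from the parse probabilities quoted in the derivation that follows. The labels I-L and the exact rule inventories are therefore an assumption, though they are consistent with the posterior expressions given in the text:

    # Sketch of the tree-counting effect on the first inside-outside pass.
    # The four non-recursive parses of an A B C sentence are reconstructed here;
    # the paper's own figure listing them is omitted from this extract.
    from collections import Counter

    PARSES = {
        "I": ["S->AP", "AP->A BP", "BP->B CP", "CP->C"],   # bigrams p(b|a), p(c|b)
        "J": ["S->AP", "AP->A CP", "CP->BP C", "BP->B"],   # bigrams p(c|a), p(b|c)
        "K": ["S->CP", "CP->AP C", "AP->A BP", "BP->B"],   # bigrams p(a|c), p(b|a)
        "L": ["S->CP", "CP->BP C", "BP->AP B", "AP->A"],   # bigrams p(b|c), p(a|b)
    }

    # With near-uniform rule probabilities and uniform bigrams, every parse of a
    # sentence gets the same posterior (1/4 here, ignoring recursive parses), so
    # the expected count of a rule is proportional to how many parses contain it.
    counts = Counter(rule for rules in PARSES.values() for rule in rules)
    # (The two S rules also occur twice each, but they only compete with one
    # another inside S's distribution, so they favour no parse.)
    boosted = {r for r, c in counts.items() if c > 1 and not r.startswith("S->")}
    print("rules occurring in more than one parse:", sorted(boosted))
    # -> AP->A BP, BP->B, CP->BP C

    # After re-estimation these rules are favoured, so parses built from them win
    # on the next pass: J and K each contain two of them, I and L only one.
    score = {name: sum(r in boosted for r in rules) for name, rules in PARSES.items()}
    print("boosted-rule count per parse:", score)

    # Moving from the favoured parse K to the optimal parse L requires AP, BP and
    # CP to each switch their most probable expansion (the three switches
    # referred to above).
    for nt in ("AP", "BP", "CP"):
        k_rule = next(r for r in PARSES["K"] if r.startswith(nt + "->"))
        l_rule = next(r for r in PARSES["L"] if r.startswith(nt + "->"))
        print(f"{nt}: {k_rule}  becomes  {l_rule}")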
<Paragraph position="9"> To simplify the present analysis, let us assume the probability of S ⇒ CP is held constant at 1, and that the rules not listed above have probability 0. In this case, we can write the probabilities of the left three rules as pA, pB and pC, and the probabilities of the right three rules as 1 - pA, 1 - pB and 1 - pC. Now, for a given sentence abc there are only two parses with non-zero probability, K and L. The probability of abc under parse K is pA pB pC p(c) p(a|c) p(b|a), and the probability under parse L is (1 - pA)(1 - pB)(1 - pC) p(c) p(b|c) p(a|b). Thus (noting that for word bigrams p(a|b) = p(b|a), since p(a) = p(b) = 1/8), the posterior probability of parse K is p(K | abc) = pA pB pC p(c) p(a|c) p(b|a) / [pA pB pC p(c) p(a|c) p(b|a) + (1 - pA)(1 - pB)(1 - pC) p(c) p(b|c) p(a|b)]. Writing α for the mean value of p(c|b)/p(c|a) in the above test, Figure 4.1 graphically depicts the evolution of this dynamical system. (Figure 4.1 caption: The vectors represent the motion of the parameters from one iteration to the next when α = p(c|b)/p(c|a) = 2 and pC = .5. Notice that the upper right corner (grammar K) and the lower left (grammar L) are stationary points (local maxima), and that the region of attraction for the global optimum L is bigger than for K, but that there is still a very substantial set of starting points from which the algorithm will converge to the suboptimal grammar. α = 2 is plotted instead of the test's actual value of α because this better depicts the asymmetry that mutual information between words introduces; with the actual α the two regions of attraction would be of almost equal area.) What is striking in this figure is that the inside-outside algorithm is so attracted to grammars whose nonterminals concentrate probability on small numbers of rules that it is incapable of performing real search. Instead, it zeros in on the nearest such grammar, only slightly biased by its relative merits. We now have an explanation for why the inside-outside algorithm converges to the suboptimal parse K so often: the first, ignorant iteration of the algorithm biases the parameters towards K, and subsequently there is an overwhelming tendency to move to the nearest deterministic grammar. This is a strong indication that the algorithm is a poor choice for estimating grammars that have competing rule hypotheses.</Paragraph>
<Paragraph position="10"> 4.2. Multiple Expansions of a Nonterminal For this test, the sentences were four words long (ABCD), and we chose a word distribution with the following characteristics: I(A,B) = 1 bit, I(A,C) = 1 bit, I(A,D) = 1 bit, I(B,C) = 0 bits, I(B,D) = 0 bits, I(C,D) = 0 bits.</Paragraph>
<Paragraph position="12"> It might seem that a minimal-entropy grammar for this corpus would be</Paragraph>
<Paragraph position="16"> since this grammar makes the head A available to predict B, C, and D. Without multiple expansion rules for AP, it is impossible to get this. But the gain of one bit in word prediction is offset by a loss of at least two bits from uncertainty in the expansion of AP. Even if p(AP ⇒ A BP) = p(AP ⇒ AP CP) = 1/2, the probability of the structure ABCD under the above grammar is one quarter that assigned by a grammar with no expansion ambiguity. So, the grammar</Paragraph> </Section> </Section>
<Section position="7" start_page="21" end_page="23" type="metho"> <SectionTitle> S ⇒ DP;  DP ⇒ CP D;  CP ⇒ AP C;  AP ⇒ A BP;  BP ⇒ B </SectionTitle>
<Paragraph position="0"> assigns higher probabilities to the corpus, even though it fails to model the dependency between A and D. This is a general problem with SCFGs: there is no way to optimally model multiple ordered adjunction without increasing the number of nonterminals. Not surprisingly, the learning algorithm never converges to the recursive grammar during test runs on this corpus.</Paragraph>
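The bit accounting in the preceding paragraphs can be checked with a short calculation. The recursive grammar used below is a guess at the elided minimal-entropy candidate (this extract omits it); the right-spine grammar and the corpus statistics are the ones given in the text:

    # Back-of-the-envelope check of the bit accounting above.  Word entropies
    # follow the stated corpus: 8 equally likely words per tag (3 bits) and the
    # listed mutual informations.  The recursive grammar is a hypothetical
    # reconstruction, not taken from the paper.
    H = 3.0                                   # bits per word, unconditioned
    I = {("A", "B"): 1, ("A", "C"): 1, ("A", "D"): 1,
         ("B", "C"): 0, ("B", "D"): 0, ("C", "D"): 0}

    def cond(w, head):
        """H(w | head) = H(w) - I(w, head)."""
        return H - I[tuple(sorted((w, head)))]

    # Right-spine grammar: S=>DP, DP=>CP D, CP=>AP C, AP=>A BP, BP=>B.
    # Every nonterminal has a single expansion, so structure costs nothing.
    right_spine = H + cond("C", "D") + cond("A", "C") + cond("B", "A")

    # Hypothetical recursive head-A grammar (e.g. AP => A BP | AP CP | AP DP):
    # A heads the whole sentence and conditions B, C and D, but each parse must
    # also pay for the choice among AP's expansions, at least 2 bits per
    # sentence even under the generous 1/2-1/2 assumption used in the text.
    recursive_words = H + cond("B", "A") + cond("C", "A") + cond("D", "A")
    recursive = recursive_words + 2.0

    print(f"right-spine grammar : {right_spine:.1f} bits per sentence")   # 10.0
    print(f"recursive grammar   : at least {recursive:.1f} bits per sentence")

The one bit gained on word prediction (9 versus 10 bits) is outweighed by the two or more bits of expansion uncertainty, which is why the deterministic right-spine grammar scores better.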
<Paragraph position="1"> What broader implication does this deficiency of SCFGs have for context-free grammar based language learning? It suggests that a learned noun phrase grammar is unlikely to use multiple expansions of the noun phrase category, and therefore that, for many subject and object noun phrases, the noun will never enter into a bigram relationship with the verb. Obviously, sufficient mutual information between nouns and verbs, adjectives, and determiners would force the global optimum to include multiple expansions of the noun phrase category, but it seems likely (given the characteristics of the inside-outside algorithm) that before such mutual information could be inferred from text, the algorithm would enter a local optimum that does not pass the noun feature out.</Paragraph>
<Paragraph position="2"> 4.3. Testing on the Penn Treebank To test whether head-driven language models do indeed converge to linguistically-motivated grammars better than SCFGs, we replicated the experiment of (Pereira and Schabes, 1992) on the ATIS section of the Penn Treebank. The 48 parts-of-speech in the Treebank were collapsed to 25, resulting in 2550 grammar rules.</Paragraph>
<Paragraph position="3"> Word head features were created by assigning numbers a common feature; other words found in any case variation in the CELEX English-language database were given a feature particular to their lemma (thus mapping car and cars to the same feature); and all other (case-sensitive) words received their own unique feature. Treebank part-of-speech specifications were not used to constrain parses. Bigrams were estimated using a backoff to a unigram (see (de Marcken, 1995)), with unigrams backing off to a uniform distribution over all the words in the ATIS corpus. The backoff parameter was not optimized. Sentences 25 words or longer were skipped.</Paragraph>
<Paragraph position="4"> We ran four experiments, training a grammar with and without bracketing and with and without use of features. Without features, we are essentially replicating the two experiments run by (Pereira and Schabes, 1992), except that they use a different set of initial rules (all 4095 CNF grammar rules over 15 nonterminals and the 48 Treebank terminal categories). Every tenth sentence of the 1129 sentences in the ATIS portion of the Treebank was set aside for testing. Training was over 960 sentences (1017, of which 57 were skipped because of length), 5895 words; testing was over 98 sentences (112, of which 14 were skipped), 911 words.</Paragraph>
<Paragraph position="5"> After training, all but the 500 most probable rules were removed from the grammar, and probabilities renormalized. The statistics for these smaller grammars are given below.</Paragraph>
<Paragraph position="6"> As in the experiments of (Pereira and Schabes, 1992), unbracketed training does improve bracketing performance (from a baseline of about 50% to 72.7% without features and 74.8% with features). Unfortunately, this performance is achieved by settling on an uninteresting right-branching rule set (save for sentence-final punctuation). Note that our figures for bracketed training closely match the 90.36% bracketing accuracy reported in their paper.</Paragraph>
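For reference, the sketch below shows the kind of bracketing-accuracy measure used in these comparisons, assuming the non-crossing-brackets criterion of (Pereira and Schabes, 1992): a predicted constituent counts as correct if it does not cross any constituent in the treebank analysis. The helper names and example spans are invented for illustration:

    # Sketch of a non-crossing-brackets bracketing-accuracy measure.
    # Spans are (start, end) word indices over a sentence.
    def crosses(span, gold_span):
        (a, b), (c, d) = span, gold_span
        return (a < c < b < d) or (c < a < d < b)

    def bracketing_accuracy(predicted, gold):
        """Fraction of predicted spans that cross no gold span."""
        ok = sum(not any(crosses(s, g) for g in gold) for s in predicted)
        return ok / len(predicted) if predicted else 1.0

    # "show me the flights to boston": a right-branching guess vs. a
    # treebank-style analysis with an NP over "the flights" and a PP over
    # "to boston" (both analyses invented for this example).
    predicted = [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)]       # right-branching
    gold      = [(0, 6), (2, 4), (4, 6), (2, 6)]
    print(f"{bracketing_accuracy(predicted, gold):.0%} of predicted brackets are consistent")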
<Paragraph position="7"> Of greater interest is that although use of head features improves bracketing performance, it does so only by an insignificant amount (though obviously it greatly reduces perplexity). There are many possible explanations for this result, but the two we prefer are that either the inside-outside algorithm, as might be expected given our arguments, failed to find a grammar that propagated head features optimally, or that there was insufficient mutual information in the small corpus for our enhancement to traditional SCFGs to have much impact.</Paragraph>
<Paragraph position="8"> We have replicated the above experiments on the first 2000 sentences of the Wall Street Journal section of the Treebank, which has a substantially different character from the ATIS text. However, the vocabulary is so much larger that it is not possible to gather useful statistics over such a small sample. The reason we have not tested extensively on much larger corpora is that, using head features but no bracketing constraint, statistics must be recorded for every word pair in every sentence. The number of such statistics grows quadratically with sentence length, and is prohibitive over large corpora using our current techniques. More recent experiments, however, indicate that expanding the corpus size by an order of magnitude has little effect on our results.</Paragraph> </Section> </Paper>