<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1006"> <Title>Using an Annotated Corpus as a Stochastic Grammar</Title> <Section position="4" start_page="38" end_page="39" type="metho"> <SectionTitle> 2 The Model </SectionTitle> <Paragraph position="0"> As might be clear by now, a IX)P-model is characterized by a corpus of tree structures, together with a set of operations that combine subtrees from the corpus into new trees. In this section we explain more precisely what we mean by subtree, operations etc., in order to arrive at definitions of a parse and the probability of a parse with respect to a corpus. For a treatment of DOP in more formal terms we refer to (Bod, 1992a).</Paragraph> <Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.1 Subtree </SectionTitle> <Paragraph position="0"> A subtree of a tree T is a connected subgraph S of T such that for every node in S holds that if it has daughter nodes, then these are equal to the daughter nodes of the corresponding node in T. It is trivial to see that a subuee is also a tree. In the following example T 1 and T2 are subtrees of T, whereas T 3</Paragraph> <Paragraph position="2"> V NP likes John NP The general definition above also includes subUees consisting of one node. Since such subtrees do not contribute to the parsing process, we exclude these pathological cases and consider as the set of sublrees the non-trivial ones consisting of more than one node. We shall use the following notation to indicate that a tree t is a non-trivial subtree of a tree in a corpus C: t e C =oer 3 T 6 C: t is a non-trivial subtree of T</Paragraph> </Section> <Section position="2" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.2 Operations </SectionTitle> <Paragraph position="0"> In this article we will limit ourselves to the basic operation of substitution. Other possible operations are left to future research. If t and u are trees, such that the leftmost non-terminal leaf of t is equal to the root of u, then tou is the tree that results from substituting this non-terminal leaf in t by tree u. The partial function o is called substitution. We will write (tou)ov as touov, and in general (..((tlot2)ot3)o..)otn as tlot2ot3o...otn. The restriction lePStmost in the definition is motivated by the fact that it eliminates different derivations consisting of the same subtrees.</Paragraph> </Section> <Section position="3" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.3 Parse </SectionTitle> <Paragraph position="0"> Tree Tis a parse of input string s with respect to a corpus C, iffthe yieldof Tis equal to s and there are subtrees tI,...,tn e C, such that T-- tlO.., otn. The set of parses of s with respect to C, is thus given by:</Paragraph> <Paragraph position="2"> The definition correctly includes the trivial case of a subtree from the corpus whose yield is equal to the complete input string.</Paragraph> </Section> <Section position="4" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.4 Derivation </SectionTitle> <Paragraph position="0"> A derivation of a parse T with respect to a corpus C, is a tuple of subtrees (tl ..... ta) such that tl ..... tne C and tlo...otn = T. 
<Section position="3" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.3 Parse </SectionTitle> <Paragraph position="0"> Tree T is a parse of input string s with respect to a corpus C, iff the yield of T is equal to s and there are subtrees t1,...,tn ∈ C such that T = t1∘...∘tn. The set of parses of s with respect to C is thus given by: Parses(s,C) = {T | yield(T) = s and ∃ t1,...,tn ∈ C: t1∘...∘tn = T}. The definition correctly includes the trivial case of a subtree from the corpus whose yield is equal to the complete input string.</Paragraph> </Section> <Section position="4" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 2.4 Derivation </SectionTitle> <Paragraph position="0"> A derivation of a parse T with respect to a corpus C is a tuple of subtrees (t1,...,tn) such that t1,...,tn ∈ C and t1∘...∘tn = T. The set of derivations of T with respect to C is thus given by: Derivations(T,C) = {(t1,...,tn) | t1,...,tn ∈ C and t1∘...∘tn = T}.</Paragraph> </Section> <Section position="5" start_page="38" end_page="39" type="sub_section"> <SectionTitle> 2.5 Probability </SectionTitle> <Paragraph position="0"> Given a subtree t1 ∈ C, a function root that yields the root of a tree, and a node labeled X, the conditional probability P(t=t1 | root(t)=X) denotes the probability that t1 is substituted on X. If root(t1) ≠ X, this probability is 0. If root(t1) = X, this probability can be estimated as the ratio between the number of occurrences of t1 in C and the total number of occurrences of subtrees t' in C for which root(t') = X holds. Evidently, Σi P(t=ti | root(t)=X) = 1 holds.</Paragraph> <Paragraph position="1"> The probability of a derivation (t1,...,tn) is equal to the probability that the subtrees t1,...,tn are combined. This probability can be computed as the product of the conditional probabilities of the subtrees t1,...,tn. Let lnl(x) be the leftmost non-terminal leaf of tree x; then: P((t1,...,tn)) = P(t=t1 | root(t)=root(t1)) · Πi=2..n P(t=ti | root(t)=lnl(t1∘...∘ti-1)).</Paragraph> <Paragraph position="3"> The probability of a parse is equal to the probability that any of its derivations occurs. Since the derivations are mutually exclusive, the probability of a parse T is the sum of the probabilities of all its derivations. Let Derivations(T,C) = {d1,...,dn}; then: P(T) = Σi P(di). The conditional probability of a parse T given input string s can be computed as the ratio between the probability of T and the sum of the probabilities of all parses of s.</Paragraph> <Paragraph position="4"> The probability of a string is equal to the probability that any of its parses occurs. Since the parses are mutually exclusive, the probability of a string s can be computed as the sum of the probabilities of all its parses. Let Parses(s,C) = {T1,...,Tn}; then: P(s) = Σi P(Ti). It can be shown that Σi P(si) = 1 holds.</Paragraph> </Section> </Section>
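To make the probability model of section 2.5 concrete, here is a small continuation of the previous sketch (again ours, with illustrative names; it reuses subtrees() and substitute() from above). It estimates the conditional subtree probabilities by relative frequency in the corpus and computes the probability of a derivation and of a parse as defined in the text.

```python
from collections import Counter
from fractions import Fraction

def subtree_probabilities(corpus):
    """P(t = ti | root(t) = X): occurrences of ti in C divided by the total number
    of occurrences of subtrees in C whose root is X."""
    counts = Counter(s for T in corpus for s in subtrees(T))
    per_root = Counter()
    for s, n in counts.items():
        per_root[s[0]] += n
    return {s: Fraction(n, per_root[s[0]]) for s, n in counts.items()}

def derivation_probability(derivation, probs):
    """P((t1, ..., tn)): the product of the conditional probabilities of the
    subtrees, each conditioned on the label of its substitution site (which, by
    the definition of substitution, equals the subtree's own root label)."""
    p = Fraction(1)
    for t in derivation:
        p *= probs[t]
    return p

def compose(derivation):
    """t1 o t2 o ... o tn, associating to the left; None if undefined."""
    tree = derivation[0]
    for t in derivation[1:]:
        tree = substitute(tree, t)
        if tree is None:
            return None
    return tree

def parse_probability(parse, candidate_derivations, probs):
    """P(T): the sum of the probabilities of all derivations that produce T
    (the candidate derivations must be enumerated elsewhere, e.g. by a chart)."""
    return sum(derivation_probability(d, probs)
               for d in candidate_derivations if compose(d) == parse)
```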
<Section position="5" start_page="39" end_page="40" type="metho"> <SectionTitle> 3 Superstrong Equivalence </SectionTitle> <Paragraph position="0"> There is an important question as to whether it is possible to create for every DOP-model a strongly equivalent stochastic CFG which also assigns the same probabilities to the parses. In order to discuss this question, we introduce the notion of superstrong equivalence. Two stochastic grammars are called superstrongly equivalent if they are strongly equivalent (i.e. they generate the same strings with the same trees) and they generate the same probability distribution over the trees.</Paragraph> <Paragraph position="1"> The question as to whether for every DOP-model there exists a strongly equivalent stochastic CFG is rather trivial, since every subtree can be decomposed into rewrite rules describing exactly every level of constituent structure of that subtree. The question as to whether for every DOP-model there exists a superstrongly equivalent stochastic CFG can also be answered without too much difficulty. We shall give a counter-example, showing that there exists a DOP-model for which there is no superstrongly equivalent stochastic CFG.</Paragraph> <Paragraph position="3"> [The corpus of this counter-example and its three subtrees t1, t2 and t3 are omitted in this extraction.] The conditional probabilities of the subtrees are: P(t=t1 | root(t)=S) = P(t=t2 | root(t)=S) = P(t=t3 | root(t)=S) = 1/3, so that Σi P(t=ti | root(t)=S) = 1 holds. The language generated by this model is {ab*}. Let us consider the probabilities of the parses of the strings a and ab. The parse of string a can be generated by exactly one derivation: by applying subtree t3. The probability of this parse is hence equal to 1/3.</Paragraph> <Paragraph position="6"> The parse of ab can be generated by two derivations: by applying subtree t1, or by combining subtrees t2 and t3. The probability of this parse is equal to the sum of the probabilities of its two derivations: P(t=t1 | root(t)=S) + P(t=t2 | root(t)=S) · P(t=t3 | root(t)=S) = 1/3 + 1/3 · 1/3 = 4/9.</Paragraph> <Paragraph position="8"> If we now want to construct a superstrongly equivalent stochastic CFG, it should assign the same probabilities to these parses. We will show that this is impossible. A CFG which is strongly equivalent with the DOP-model above should contain the following rewrite rules. [Rewrite rules (1) and (2) omitted in this extraction.] There may be other rules as well, but they should not modify the language or structures generated by the CFG above. Thus, the rewrite rule S → A may be added to the rules, as well as A → B, whereas the rewrite rule S → ab may not be added.</Paragraph> <Paragraph position="11"> Our problem is now whether we can assign probabilities to these rules such that the probability of the parse of a equals 1/3, and the probability of the parse of ab equals 4/9. The parse of a can exhaustively be generated by applying rule (2), while the parse of ab can exhaustively be generated by applying rules (1) and (2). Thus the following should hold: P(2) = 1/3 and P(1) · P(2) = 4/9. This implies that P(1) · 1/3 = 4/9, thus P(1) = 4/9 · 3 = 4/3. This means that the probability of rule (1) would have to be larger than 1, which is not allowed. Thus, we have proved that not every DOP-model has a superstrongly equivalent stochastic CFG. In (Bod, 1992b) superstrong equivalence relations between other stochastic grammars are studied.</Paragraph> </Section>
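The corpus trees t1, t2, t3 and the two rewrite rules of the counter-example did not survive the extraction. The numbers in the surrounding text are consistent with a corpus consisting of the single tree S(S(a), b), whose non-trivial subtrees are t1 = S(S(a), b), t2 = S(S, b) and t3 = S(a), each with conditional probability 1/3 given root S, and with the CFG rules (1) S → S b and (2) S → a. That reconstruction is our assumption, not a quotation from the paper; under it, the arithmetic of the argument can be re-checked mechanically:

```python
from fractions import Fraction

# Hypothetical reconstruction of the counter-example (the original figure is
# missing): the corpus is assumed to be the single tree S(S(a), b), so each of
# its three non-trivial subtrees has conditional probability 1/3 given root S.
p_t1 = p_t2 = p_t3 = Fraction(1, 3)

p_parse_a  = p_t3                        # only derivation of the parse of "a": t3
p_parse_ab = p_t1 + p_t2 * p_t3          # derivations of the parse of "ab": t1, and t2 o t3
assert p_parse_a == Fraction(1, 3) and p_parse_ab == Fraction(4, 9)

# A strongly equivalent CFG needs (assumed) rules (1) S -> S b and (2) S -> a.
# The parse of "a" uses only rule (2); the parse of "ab" uses rules (1) and (2).
p2 = p_parse_a                           # forced: P(2) = 1/3
p1 = p_parse_ab / p2                     # forced: P(1) * P(2) = 4/9
assert p1 == Fraction(4, 3)              # P(1) > 1: no valid probability assignment exists
```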
<Section position="6" start_page="40" end_page="675" type="metho"> <SectionTitle> 4 Monte Carlo Parsing </SectionTitle> <Paragraph position="0"> It is easy to show that an input string can be parsed with conventional parsing techniques, by applying subtrees instead of rules to the input string (Bod, 1992a). Every subtree t can be seen as a production rule root(t) → yield(t), where the non-terminals of the yield of the right hand side constitute the symbols to which new rules/subtrees are applied. Given a polynomial time parsing algorithm, a derivation of the input string, and hence a parse, can be calculated in polynomial time. But if we calculate the probability of a parse by exhaustively calculating all its derivations, the time complexity becomes exponential, since the number of derivations of a parse of an input string grows exponentially with the length of the input string.</Paragraph> <Paragraph position="1"> Nevertheless, by applying Monte Carlo techniques (Hammersley and Handscomb, 1964), we can estimate the probability of a parse and make its error arbitrarily small in polynomial time. The essence of Monte Carlo is very simple: it estimates a probability distribution of events by taking random samples. The larger the samples we take, the higher the reliability. For DOP this means that, instead of exhaustively calculating all parses with all their derivations, we randomly calculate N parses of an input string (by taking random samples from the subtrees that can be substituted on a specific node in the parsing process). The estimated probability of a certain parse given the input string is then equal to the number of times that parse occurred, normalized with respect to N. We can estimate a probability as accurately as we want by choosing N as large as we want, since according to the Strong Law of Large Numbers the estimated probability converges to the actual probability. From a classical result of probability theory (Chebyshev's inequality) it follows that the time complexity of achieving a maximum error ε is given by O(ε^-2). Thus the error of the probability estimation can be made arbitrarily small in polynomial time - provided that the parsing algorithm is not worse than polynomial.</Paragraph> <Paragraph position="2"> Obviously, probable parses of an input string are more likely to be generated than improbable ones. Thus, in order to estimate the maximum probability parse, it suffices to sample until the top of the parse distribution becomes stable. The parse which is generated most often is then the maximum probability parse.</Paragraph> <Paragraph position="3"> We now show that the probability that a certain parse is generated by Monte Carlo is exactly the probability of that parse according to the DOP-model. First, the probability that a subtree t ∈ C is sampled at a certain point in the parsing process (where a non-terminal X is to be substituted) is equal to P(t | root(t) = X).</Paragraph> <Paragraph position="4"> Secondly, the probability that a certain sequence t1,...,tn of subtrees that constitutes a derivation of a parse T is sampled, is equal to the product of the conditional probabilities of these subtrees. Finally, the probability that any sequence of subtrees that constitutes a derivation of a certain parse T is sampled, is equal to the sum of the probabilities that these derivations are sampled. This is the probability that a certain parse T is sampled, which is equivalent to the probability of T according to the DOP-model.</Paragraph> <Paragraph position="5"> We shall call a parser which applies this Monte Carlo technique a Monte Carlo parser. With respect to the theory of computation, a Monte Carlo parser is a probabilistic algorithm which belongs to the class of Bounded error Probabilistic Polynomial time (BPP) algorithms. BPP-problems are characterized by the following: it may take exponential time to solve them exactly, but there exists an estimation algorithm with a probability of error that becomes arbitrarily small in polynomial time.</Paragraph>
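As a toy illustration of the sampling idea (ours, not the paper's parser; the actual Monte Carlo parser restricts the samples to subtrees that fit the input string during chart parsing, which this sketch does not attempt), one can sample complete parses top-down from the DOP distribution and read the most frequent one off the sample counts. It reuses leftmost_open_leaf(), substitute() and subtree_probabilities() from the earlier sketches.

```python
import random
from collections import Counter

def sample_parse(start_label, probs, max_steps=100):
    """Sample one parse top-down: repeatedly draw a subtree for the leftmost open
    non-terminal leaf, with probability P(t | root(t) = label of that leaf)."""
    tree = (start_label, None)
    for _ in range(max_steps):
        path = leftmost_open_leaf(tree)
        if path is None:
            return tree                                  # no open leaves: a complete parse
        node = tree
        for i in path:                                   # find the label of the open leaf
            node = node[1][i]
        candidates = [(t, p) for t, p in probs.items() if t[0] == node[0]]
        if not candidates:
            return None
        ts, ws = zip(*candidates)
        chosen = random.choices(ts, weights=[float(w) for w in ws])[0]
        tree = substitute(tree, chosen)                  # leftmost substitution as in 2.2
    return None                                          # derivation grew too deep; give up

def monte_carlo_parse(start_label, probs, n_samples=100):
    """Estimate the parse distribution by sampling N parses and counting them;
    the most frequently sampled parse approximates the maximum probability parse."""
    counts = Counter()
    for _ in range(n_samples):
        parse = sample_parse(start_label, probs)
        if parse is not None:
            counts[parse] += 1
    return counts.most_common(1)[0][0] if counts else None
```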
<Paragraph position="6"> Experiments on the ATIS corpus. For our experiments we used part-of-speech sequences of spoken-language transcriptions from the Air Travel Information System (ATIS) corpus (Hemphill et al., 1990), with the labeled bracketings of those sequences in the Penn Treebank (Marcus, 1991). The 750 labeled bracketings were divided at random into a DOP-corpus of 675 trees and a test set of 75 part-of-speech sequences. The following tree is an example from the DOP-corpus, where for reasons of readability the lexical items are added to the part-of-speech tags. [Example tree omitted in this extraction.] As a measure for parsing accuracy we took the percentage of the test sentences for which the maximum probability parse derived by the Monte Carlo parser (for a sample size N) is identical to the Treebank parse.</Paragraph> <Paragraph position="7"> It is one of the most essential features of the DOP approach that arbitrarily large subtrees are taken into consideration. In order to test the usefulness of this feature, we performed different experiments constraining the depth of the subtrees. The depth of a tree is defined as the length of its longest path. The following table shows the results of seven experiments. The accuracy refers to the parsing accuracy at sample size N=100, and is rounded off to the nearest integer.</Paragraph> <Paragraph position="8"> [Table omitted in this extraction: parsing accuracy per maximum subtree depth. Caption: Parsing accuracy for the ATIS corpus, sample size N=100.]</Paragraph> <Paragraph position="11"> The table shows that there is a relatively rapid increase in parsing accuracy when enlarging the maximum depth of the subtrees to 3. The accuracy keeps increasing, at a slower rate, when the depth is enlarged further. The highest accuracy is obtained by using all subtrees from the corpus: 72 out of the 75 sentences from the test set are parsed correctly.</Paragraph> <Paragraph position="12"> In the following figure, parsing accuracy is plotted against the sample size N for three of our experiments: the experiments where the depth of the subtrees is constrained to 2 and 3, and the experiment where the depth is unconstrained. [Figure omitted in this extraction. Caption: Parsing accuracy for the ATIS corpus, with depth ≤ 2, with depth ≤ 3 and with unbounded depth.]</Paragraph> <Paragraph position="13"> In (Pereira and Schabes, 1992), 90.36% bracketing accuracy was reported using a stochastic CFG trained on bracketings from the ATIS corpus. Though we cannot make a direct comparison, our pilot experiment suggests that our model may perform better than a stochastic CFG. However, there is still an error rate of 4%. Although there is no reason to expect 100% accuracy in the absence of any semantic or pragmatic analysis, it seems that the accuracy might be further improved. Three limitations of the current experiments are worth mentioning. First, the Treebank annotations are not rich enough.</Paragraph> <Paragraph position="14"> Although the Treebank uses a relatively rich part-of-speech system (48 terminal symbols), there are only 15 non-terminal symbols. Especially the internal structure of noun phrases is very poor. Semantic annotations are completely absent.</Paragraph> <Paragraph position="15"> Secondly, it could be that subtrees which occur only once in the corpus give bad estimates of their actual probabilities. The question as to whether re-estimation techniques would further improve the accuracy must be considered in future research.</Paragraph> <Paragraph position="16"> Thirdly, it could be that our corpus is not large enough. This brings us to the question of how much parsing accuracy depends on the size of the corpus. To study this question, we performed additional experiments with different corpus sizes.</Paragraph> <Paragraph position="17"> Starting with a corpus of only 50 parse trees (randomly chosen from the initial DOP-corpus of 675 trees), we increased its size in intervals of 50. As our test set, we took the same 75 part-of-speech sequences as used in the previous experiments. In the next figure the parsing accuracy, for sample size N = 100, is plotted against the corpus size, using all corpus subtrees.</Paragraph> <Paragraph position="18"> The figure shows the increase in parsing accuracy. For a corpus size of 450 trees, the accuracy already reaches 88%. After this, the growth slows down, but the accuracy is still increasing at corpus size 675. Thus, we would expect a higher accuracy if the corpus were further enlarged.</Paragraph> </Section> </Paper>