<?xml version="1.0" standalone="yes"?> <Paper uid="J99-4004"> <Title>Semiring Parsing</Title> <Section position="3" start_page="578" end_page="591" type="metho"> <SectionTitle> 2. Semiring Parsing </SectionTitle> <Paragraph position="0"> In this section we first describe the inputs to a semiring parser: a semiring, an item-based description, and a grammar. Next, we give the conditions under which a semiring parser gives correct results. At the end of this section we discuss three especially complicated and interesting semirings.</Paragraph> <Section position="1" start_page="578" end_page="579" type="sub_section"> <SectionTitle> 2.1 Semiring </SectionTitle> <Paragraph position="0"> In this subsection, we define and discuss semirings (see Kuich [1997] for an introduction). A semiring has two operations, $\oplus$ and $\otimes$, that intuitively have most (but not necessarily all) of the properties of the conventional $+$ and $\times$ operations on the positive integers. In particular, we require the following properties: $\oplus$ is associative and commutative; $\otimes$ is associative and distributes over $\oplus$. If $\otimes$ is commutative, we will say that the semiring is commutative. We assume an additive identity element, which we write as 0, and a multiplicative identity element, which we write as 1. Both addition and multiplication can be defined over finite sets of elements; if the set is empty, then the value is the respective identity element, 0 or 1. We also assume that $x \otimes 0 = 0 \otimes x = 0$ for all $x$. In other words, a semiring is just like a ring, except that the additive operator need not have an inverse. We will write $\langle A, \oplus, \otimes, 0, 1 \rangle$ to indicate a semiring over the set $A$ with additive operator $\oplus$, multiplicative operator $\otimes$, additive identity 0, and multiplicative identity 1.</Paragraph> <Paragraph position="1"> For parsers with loops, i.e., those in which an item can be used to derive itself, we will also require that sums of an infinite number of elements be well defined. 
In particular, we will require that the semirings be complete (Kuich 1997, 611). This means that sums of an infinite number of elements should be associative and commutative, just like finite sums, and that multiplication should distribute over infinite sums, just as it does over finite ones. All of the semirings we will deal with in this paper are complete.$^{2}$ All of the semirings we discuss here are also $\omega$-continuous. Intuitively, this means that if any partial sum of an infinite sequence is less than or equal to some value, then the infinite sum is also less than or equal to that value.$^{3}$ This important property makes it easy to compute, or at least approximate, infinite sums.</Paragraph> <Paragraph position="3"> [Figure 5: Semirings used and the value each computes. Boolean: $\langle \{\mathrm{TRUE}, \mathrm{FALSE}\}, \vee, \wedge, \mathrm{FALSE}, \mathrm{TRUE} \rangle$; inside: $\langle \mathbb{R}_0^\infty, +, \times, 0, 1 \rangle$ (string probability); Viterbi: $\langle \mathbb{R}_0^1, \max, \times, 0, 1 \rangle$ (prob. of best derivation); counting (number of derivations); derivation forest (set of derivations); Viterbi-derivation (best derivation); Viterbi-n-best (best n derivations).] There will be several especially useful semirings in this paper, which are defined in Figure 5. We will write $\mathbb{R}_a^b$ to indicate the set of real numbers from $a$ to $b$ inclusive, with similar notation for the natural numbers, $\mathbb{N}$. We will write $\mathbb{E}$ to indicate the set of all derivations in some canonical form, and $2^{\mathbb{E}}$ to indicate the set of all sets of derivations in canonical form. There are three derivation semirings: the derivation forest semiring, the Viterbi-derivation semiring, and the Viterbi-n-best semiring. The operators used in the derivation semirings ($\cdot$, $\max_{Vit}$, $\times_{Vit}$, $\max_{Vit\text{-}n}$, and $\times_{Vit\text{-}n}$) will be described later, in Section 2.5.</Paragraph> <Paragraph position="4"> The inside semiring includes all nonnegative real numbers, to be closed under addition, and includes infinity to be closed under infinite sums, while the Viterbi semiring contains only numbers up to 1, since under max this still leads to closure. 
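The semiring tuples discussed above can be sketched in code. This is only an illustrative sketch: the class name `Semiring`, its methods `sum_of` and `product_of`, and the four constants are hypothetical names introduced here, and infinite sums (completeness) are not modeled. It shows the convention that an empty sum is the additive identity and an empty product is the multiplicative identity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Semiring:
    """A semiring given by its two operations and their identities."""
    plus: Callable[[Any, Any], Any]    # additive operator
    times: Callable[[Any, Any], Any]   # multiplicative operator
    zero: Any                          # additive identity
    one: Any                           # multiplicative identity

    def sum_of(self, values: Iterable[Any]) -> Any:
        """Finite sum; the empty sum is the additive identity."""
        total = self.zero
        for v in values:
            total = self.plus(total, v)
        return total

    def product_of(self, values: Iterable[Any]) -> Any:
        """Finite product, taken in order; the empty product is the multiplicative identity."""
        total = self.one
        for v in values:
            total = self.times(total, v)
        return total


# Four of the commutative semirings named in the text (finite values only).
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)

print(VITERBI.sum_of([0.25, 0.5]))   # 0.5
print(COUNTING.sum_of([]))           # 0
```

Passing the operations as plain functions keeps the parser code independent of the semiring, which is the separation the text argues for.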
The three derivation semirings can be used to find especially important values: the derivation forest semiring computes all derivations of a sentence; the Viterbi-derivation semiring computes the most probable derivation; and the Viterbi-n-best semiring computes the n most probable derivations. A derivation is simply a list of rules from the grammar. From a derivation, a parse tree can be derived, so the derivation forest semiring is analogous to conventional parse forests. Unlike the other semirings, all three of these semirings are noncommutative. The additive operation of these semirings is essentially union or maximum, while the multiplicative operation is essentially concatenation. These semirings are described in more detail in Section 2.5.</Paragraph> </Section> <Section position="2" start_page="579" end_page="580" type="sub_section"> <SectionTitle> 2.2 Item-based Description </SectionTitle> <Paragraph position="0"> A semiring parser requires an item-based description of the parsing algorithm, in the form given earlier. So far, we have skipped one important detail of semiring parsing. In a simple recognition system, as used in deduction systems, all that matters is whether an item can be deduced or not. Thus, in these simple systems, the order of processing items is relatively unimportant, as long as some simple constraints are met. On the other hand, for a semiring such as the inside semiring, there are important ordering constraints: we cannot compute the inside value of an item until the inside values of all of its children have been computed.</Paragraph> </Section> <Section position="3" start_page="580" end_page="580" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle>
<Paragraph position="0"> Footnote 3: To be more precise, all semirings we discuss here are naturally ordered, meaning that we can define a partial ordering, $\sqsubseteq$, such that $x \sqsubseteq y$ if and only if there exists $z$ such that $x \oplus z = y$. We call a naturally ordered complete semiring $\omega$-continuous (Kuich 1997, 612) if for any sequence $x_1, x_2, \ldots$ and for any constant $y$, if for all $n$, $\bigoplus_{0 \le i \le n} x_i \sqsubseteq y$, then $\bigoplus_i x_i \sqsubseteq y$.</Paragraph> <Paragraph position="1"> Thus, we need to impose an ordering on the items, in such a way that no item precedes any item on which it depends. We will assign each item $x$ to a &quot;bucket&quot; $B$, writing $\mathrm{bucket}(x) = B$ and saying that item $x$ is associated with $B$. We order the buckets in such a way that if item $y$ depends on item $x$, then $\mathrm{bucket}(x) \le \mathrm{bucket}(y)$. For some pairs of items, it may be that both depend, directly or indirectly, on each other; we associate these items with special &quot;looping&quot; buckets, whose values may require infinite sums to compute. We will also call a bucket looping if an item associated with it depends on itself.</Paragraph> <Paragraph position="2"> One way to achieve a bucketing with the required ordering constraints (suggested by Fernando Pereira) is to create a graph of the dependencies, with a node for each item, and an edge from each item $x$ to each item $b$ that depends on it. We then separate the graph into its strongly connected components (maximal sets of nodes all reachable from each other), and perform a topological sort. Items forming singleton strongly connected components are associated with their own buckets; items forming nonsingleton strongly connected components are associated with the same looping bucket. See also Section 5.</Paragraph> <Paragraph position="3"> Later, when we discuss algorithms for interpreting an item-based description, we will need another concept. Of all the items associated with a bucket $B$, we will be able to find derivations for only a subset. If we can derive an item $x$ associated with bucket $B$, we write $x \in B$, and say that item $x$ is in bucket $B$. 
For example, the goal item of a parser will almost always be associated with the last bucket; if the sentence is grammatical, the goal item will be in the last bucket, and if it is not grammatical, it will not be.</Paragraph> <Paragraph position="4"> It will be useful to assume that there is a single, variable-free goal item, and that this goal item does not occur as a condition for any rules. We could always add a new goal item [goal] and a rule $\frac{[\text{old-goal}]}{[\text{goal}]}$, where [old-goal] is the goal in the original description.</Paragraph> </Section> <Section position="4" start_page="580" end_page="580" type="sub_section"> <SectionTitle> 2.3 The Grammar </SectionTitle> <Paragraph position="0"> A semiring parser also requires a grammar as input. We will need a list of rules in the grammar, and a function, $R(\text{rule})$, that gives the value for each rule in the grammar.</Paragraph> <Paragraph position="1"> This latter function will be semiring-specific. For instance, for computing the inside and Viterbi probabilities, the value of a grammar rule is just the conditional probability of that rule, or 0 if it is not in the grammar. For the Boolean semiring, the value is TRUE if the rule is in the grammar, FALSE otherwise. $R(\text{rule})$ replaces the set of rules $R$ of a conventional grammar description; a rule is in the grammar if $R(\text{rule}) \neq 0$.</Paragraph> </Section> <Section position="5" start_page="580" end_page="584" type="sub_section"> <SectionTitle> 2.4 Conditions for Correct Processing </SectionTitle> <Paragraph position="0"> We will say that a semiring parser works correctly if, for any grammar, input, and semiring, the value of the input according to the grammar equals the value of the input using the parser. In this subsection, we will define the value of an input according to the grammar, define the value of an input using the parser, and give a sufficient condition for a semiring parser to work correctly. 
From this point onwards, unless we specifically mention otherwise, we will assume that some fixed semiring, item-based description, and grammar have been given, without specifically mentioning which ones.</Paragraph> <Paragraph position="1"> [Computational Linguistics Volume 25, Number 4] 2.4.1 Value According to Grammar. Consider a derivation $E$, consisting of grammar rules $e_1, e_2, \ldots, e_m$. We define the value of the derivation according to the grammar to be simply the product (in the semiring) of the values of the rules used in $E$: $V(E) = \bigotimes_{i=1}^{m} R(e_i)$</Paragraph> <Paragraph position="3"> Then we can define the value of a sentence that can be derived using grammar derivations $E^1, E^2, \ldots, E^k$ to be: $\bigoplus_{j=1}^{k} V(E^j)$</Paragraph> <Paragraph position="5"> where $k$ is potentially infinite. In other words, the value of the sentence according to the grammar is the sum of the values of all derivations. We will assume that in each grammar formalism there is some way to define derivations uniquely; for instance, in CFGs, one way would be using left-most derivations. For simplicity, we will simply refer to derivations, rather than, for example, left-most derivations, since we are never interested in nonunique derivations.</Paragraph> <Paragraph position="6"> A short example will help clarify. We consider the following grammar:</Paragraph> <Paragraph position="8"> $S \to AA$ with value $R(S \to AA)$; $A \to AA$ with value $R(A \to AA)$; $A \to a$ with value $R(A \to a)$ (2) and the input string $aaa$. There are two grammar derivations, the first of which is $S \xRightarrow{S \to AA} AA \xRightarrow{A \to AA} AAA \xRightarrow{A \to a} aAA \xRightarrow{A \to a} aaA \xRightarrow{A \to a} aaa$, which has value $R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a)$. Notice that the rules in the value are the same rules in the same order as in the derivation. The other grammar derivation is $S \xRightarrow{S \to AA} AA \xRightarrow{A \to a} aA \xRightarrow{A \to AA} aAA \xRightarrow{A \to a} aaA \xRightarrow{A \to a} aaa$, which has value $R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)$. 
The value of the sentence is the sum of the values of the two derivations, $[R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a)] \oplus [R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)]$. 2.4.2 Item Derivations. Next, we define item derivations, i.e., derivations using the item-based description of the parser. We define item derivation in such a way that for a correct parser description, there is exactly one item derivation for each grammar derivation. The value of a sentence using the parser is the sum of the value of all item derivations of the goal item. Just as with grammar derivations, individual item derivations are finite, but there may be infinitely many item or grammar derivations of a sentence.</Paragraph> <Paragraph position="9"> We say that $\frac{a_1 \ldots a_k}{b}\; c_1 \ldots c_j$ is an instantiation of deduction rule $\frac{A_1 \ldots A_k}{B}\; C_1 \ldots C_j$ whenever the first expression is a variable-free instance of the second; that is, the first expression is the result of consistently substituting constant terms for each variable in the second. Now, we can define an item derivation tree. Intuitively, an item derivation tree for $x$ just gives a way of deducing $x$ from the grammar rules. [Figure 6: Grammar derivation, grammar derivation tree, item derivation tree, and derivation value.] We define an item derivation tree recursively. The base case is rules of the grammar: $\langle r \rangle$ is an item derivation tree, where $r$ is a rule of the grammar. Also, if $D_{a_1}, \ldots, D_{a_k}, D_{c_1}, \ldots, D_{c_j}$ are derivation trees headed by $a_1 \ldots a_k, c_1 \ldots c_j$ respectively, and if $\frac{a_1 \ldots a_k}{b}\; c_1 \ldots c_j$ is the instantiation of a deduction rule, then $\langle b: D_{a_1}, \ldots, D_{a_k} \rangle$ is also a derivation tree. Notice that the $D_{c_1} \ldots D_{c_j}$ do not occur in this tree: they are side conditions, and although their existence is required to prove that $c_1 \ldots c_j$ could be derived, they do not contribute to the value of the tree. We will write $\frac{a_1 \ldots a_k}{b}$ to indicate that there is an item derivation tree of the form $\langle b: D_{a_1}, \ldots, D_{a_k} \rangle$. 
As mentioned in Section 2.2, we will write $x \in B$ if $\mathrm{bucket}(x) = B$ and there is an item derivation tree for $x$.</Paragraph> <Paragraph position="10"> We can continue the example of parsing $aaa$, now using the item-based CKY parser of Figure 3. There are two item derivation trees for the goal item; in Figure 6, we give the first as an example, displaying it as a tree, rather than with angle bracket notation, for simplicity.</Paragraph> <Paragraph position="11"> Notice that an item derivation is a tree, not a directed graph. Thus, an item sub-derivation could occur multiple times in a given item derivation. This means that we can have a one-to-one correspondence between item derivations and grammar derivations; loops in the grammar lead to an infinite number of grammar derivations, and an infinite number of corresponding item derivations.</Paragraph> <Paragraph position="12"> A grammar including rules such as $S \to AAA$, $A \to B$, and $B \to \epsilon$ would allow derivations such as $S \Rightarrow AAA \Rightarrow BAA \Rightarrow AA \Rightarrow BA \Rightarrow A \Rightarrow B \Rightarrow \epsilon$.</Paragraph> <Paragraph position="13"> We would include the exact same item derivation showing $A \Rightarrow B \Rightarrow \epsilon$ three times. Similarly, for a derivation such as $A \Rightarrow B \Rightarrow A \Rightarrow B \Rightarrow A \Rightarrow a$, we would have a corresponding item derivation tree that included multiple uses of the $A \to B$ and $B \to A$ rules.</Paragraph> <Paragraph position="14"> 2.4.3 Value of Item Derivation. The value of an item derivation $D$, $V(D)$, is the product of the value of its rules, $R(r)$, in the same order that they appear in the item derivation tree. Since rules occur only in the leaves of item derivation trees, the order is precisely determined. For an item derivation tree $D$ with rule values $d_1, d_2, \ldots,$ 
$d_j$ as its leaves, $V(D) = \bigotimes_{i=1}^{j} d_i$</Paragraph> <Paragraph position="16"> Alternatively, we can write this equation recursively as $V(D) = R(D)$ if $D$ is a rule, and $V(D) = \bigotimes_{i=1}^{k} V(D_i)$ if $D = \langle b: D_1, \ldots, D_k \rangle$</Paragraph> <Paragraph position="18"> Continuing our example, the value of the item derivation tree of Figure 6 is $R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)$, the same as the value of the first grammar derivation.</Paragraph> <Paragraph position="19"> Let $\mathrm{inner}(x)$ represent the set of all item derivation trees headed by an item $x$. Then the value of $x$ is the sum of all the values of all item derivation trees headed by $x$. Formally, $V(x) = \bigoplus_{D \in \mathrm{inner}(x)} V(D)$</Paragraph> <Paragraph position="21"> The value of a sentence is just the value of the goal item, $V(\text{goal})$.</Paragraph> <Paragraph position="22"> 2.4.4 Iso-valued Derivations. In certain cases, a particular grammar derivation and a particular item derivation will have the same value for any semiring and any rule value function $R$. In this case, we say that the two derivations are iso-valued. In particular, if and only if the same rules occur in the same order in both derivations, then their values will always be the same, and they are iso-valued. In Figure 6, the grammar derivation and item derivation meet this condition. In some cases, a grammar derivation and an</Paragraph> </Section> <Section position="6" start_page="584" end_page="584" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> item derivation will have the same value for any commutative semiring and any rule value function. In this case, we say that the derivations are commutatively iso-valued.</Paragraph> <Paragraph position="1"> Finishing our example, the value of the goal item given our example sentence is just the sum of the values of the two item-based derivations, $[R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a)] \oplus [R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)]$. This value is the same as the value of the sentence according to the grammar. 
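The running example can be checked numerically. The sketch below is a hedged illustration, not code from the paper: the function name `sentence_value` and the rule probabilities in `PROB` are assumptions made up for the example. It sums, in several semirings, the values of the two derivations of aaa, where each derivation value is the in-order product of its rule values.

```python
from functools import reduce

# The two derivations of "aaa", written as the rule sequences given in the example.
D1 = ["S->AA", "A->AA", "A->a", "A->a", "A->a"]
D2 = ["S->AA", "A->a", "A->AA", "A->a", "A->a"]


def sentence_value(derivations, rule_value, plus, times, zero, one):
    """Sum over derivations of the in-order product of rule values."""
    values = [reduce(times, (rule_value(r) for r in d), one) for d in derivations]
    return reduce(plus, values, zero)


# Hypothetical rule probabilities, chosen only for illustration.
PROB = {"S->AA": 1.0, "A->AA": 0.25, "A->a": 0.75}

inside = sentence_value([D1, D2], PROB.get,
                        lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
viterbi = sentence_value([D1, D2], PROB.get,
                         max, lambda a, b: a * b, 0.0, 1.0)
count = sentence_value([D1, D2], lambda r: 1,
                       lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# Derivation forest: union as addition, pairwise concatenation as multiplication.
forest = sentence_value([D1, D2], lambda r: {(r,)},
                        lambda x, y: x | y,
                        lambda x, y: {d + e for d in x for e in y},
                        set(), {()})

print(inside)    # 0.2109375
print(viterbi)   # 0.10546875
print(count)     # 2
```

In the derivation forest case the result is the set containing both rule sequences, which makes the noncommutativity of concatenation visible: the two derivations use the same rules but in different orders.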
2.4.5 Conditions for Correctness. We can now specify the conditions for an item-based description to be correct.</Paragraph> <Paragraph position="2"> Theorem 1 Given an item-based description $I$, if for every grammar $G$, there exists a one-to-one correspondence between the item derivations using $I$ and the grammar derivations, and the corresponding derivations are iso-valued, then for every complete semiring, the value of a given input $w_1 \ldots w_n$ is the same according to the grammar as the value of the goal item. (If the semiring is commutative, then the corresponding derivations need only be commutatively iso-valued.) Proof The proof is very simple; essentially, each term in each sum occurs in the other. By hypothesis, for a given input, there are grammar derivations $E_1 \ldots E_k$ (for $0 \le k \le \infty$) and corresponding item derivation trees $D_1 \ldots D_k$ of the goal item. Since corresponding items are iso-valued, for all $i$, $V(E_i) = V(D_i)$. (If the semiring is commutative, then since the items are commutatively iso-valued, it is still the case that for all $i$, $V(E_i) = V(D_i)$.) Now, since the value of the string according to the grammar is just $\bigoplus_i V(E_i) = \bigoplus_i V(D_i)$, and the value of the goal item is $\bigoplus_i V(D_i)$, the value of the string according to the grammar equals the value of the goal item. $\Box$ There is one additional condition for an item-based description to be usable in practice, which is that there be only a finite number of derivable items for a given input sentence; there may, however, be an infinite number of derivations of any item.</Paragraph> </Section> <Section position="7" start_page="584" end_page="588" type="sub_section"> <SectionTitle> 2.5 The Derivation Semirings </SectionTitle> <Paragraph position="0"> All of the semirings we use should be familiar, except for the derivation semirings, which we now describe. 
These semirings, unlike the other semirings described in Figure 5, are not commutative under their multiplicative operator, concatenation.</Paragraph> <Paragraph position="1"> In many parsers, it is conventional to compute parse forests: compact representations of the set of trees consistent with the input. We will use a related concept, derivation forests, a compact representation of the set of derivations consistent with the input, which corresponds to the parse forest for CFGs, but is easily extended to other formalisms.</Paragraph> <Paragraph position="2"> Often, we will not be interested in the set of all derivations, but only in the most probable derivation. The Viterbi-derivation semiring computes this value. Alternatively, we might want the n best derivations, which would be useful if the output of the parser were passed to another stage, such as semantic disambiguation; this value is computed by the Viterbi-n-best derivation semiring.</Paragraph> <Paragraph position="3"> Notice that each of the derivation semirings can also be used to create transducers. That is, we simply associate strings rather than grammar rules with each rule value. Instead of grammar rule concatenation, we perform string concatenation. The derivation forest semiring then corresponds to nondeterministic transductions; the Viterbi-derivation semiring corresponds to a weighted or probabilistic transducer; and the Viterbi-n-best semiring could be used to get n-best lists from probabilistic transducers. The derivation forest semiring consists of sets of derivations, where a derivation is a list of rules of the grammar.$^{4}$ Sets containing one rule, such as $\{\langle X \to YZ \rangle\}$ for a CFG, constitute the primitive elements of the semiring. The additive operator $\cup$ produces a union of derivations, and the multiplicative operator $\cdot$ produces the concatenation, one derivation concatenated with the next. The concatenation operation ($\cdot$) 
is defined on both derivations and sets of derivations; when applied to a set of derivations, it produces the set of pairwise concatenations. The additive identity is simply the empty set, $\emptyset$: union with the empty set is an identity operation. The multiplicative identity is the set containing the empty derivation, $\{\langle\rangle\}$: concatenation with the empty derivation is an identity operation. Derivations need not be complete. For instance, for CFGs, $\{\langle X \to YZ, Y \to y \rangle\}$ is a valid element, as is $\{\langle Y \to y, X \to x \rangle\}$. In fact, $\{\langle X \to A, B \to b \rangle\}$ is a valid element, although it could not occur in a valid grammar derivation, or in a correctly functioning parser. An example of concatenation of sets is $\{\langle A \to a \rangle, \langle B \to b \rangle\} \cdot \{\langle C \to c \rangle, \langle D \to d \rangle\} = \{\langle A \to a, C \to c \rangle, \langle A \to a, D \to d \rangle, \langle B \to b, C \to c \rangle, \langle B \to b, D \to d \rangle\}$.</Paragraph> <Paragraph position="4"> Potentially, derivation forests are sets of infinitely many items. However, it is still possible to store them using finite-sized representations. Elsewhere (Goodman 1998), we show how to implement derivation forests efficiently, using pointers, in a manner analogous to the typical implementation of parse forests, and also similar to the work of Billot and Lang (1989). Using these techniques, both union and concatenation can be implemented in constant time, and even infinite unions will be reasonably efficient. The Viterbi-derivation semiring computes the most probable derivation of the sentence, given a probabilistic grammar. Elements of this semiring are a pair, a real number $v$ and a derivation forest $E$, i.e., the set of derivations with score $v$. We define $\max_{Vit}$, the additive operator, as $\max_{Vit}(\langle v, E \rangle, \langle w, D \rangle) = \begin{cases} \langle v, E \rangle \text{ if } v > w \\ \langle w, D \rangle \text{ if } w > v \\ \langle v, E \cup D \rangle \text{ if } v = w \end{cases}$</Paragraph> <Paragraph position="6"> In typical practical Viterbi parsers, when two derivations have the same value, one of the derivations is arbitrarily chosen. In practice, this is usually a fine solution, and one that could be used in a real-world implementation of the ideas in this paper, but from a theoretical viewpoint, the arbitrary choice destroys the associative property of the additive operator, $\max_{Vit}$. 
To preserve associativity, we keep derivation forests of all elements that tie for best.</Paragraph> <Paragraph position="7"> The definition for $\max_{Vit}$ is only defined for two elements. Since the operator is associative, it is clear how to define $\max_{Vit}$ for any finite number of elements, but we also need infinite summations to be defined. We use the supremum, sup: the supremum of a set is the smallest value at least as large as all elements of the set; that is, it is a maximum that is defined in the infinite case. We can now define $\max_{Vit}$ for the case of infinite sums. Let</Paragraph> <Paragraph position="9"> ... theory, and will not occur in practice. We define $\times_{Vit}$ as $\langle v, E \rangle \times_{Vit} \langle w, D \rangle = \langle v \times w, E \cdot D \rangle$, where $E \cdot D$ represents the concatenation of the two derivation forests. The last of the derivation semirings is the Viterbi-n-best semiring, which is used for constructing n-best lists. Intuitively, the value of a string using this semiring will be the n most likely derivations of that string (unless there are fewer than n total derivations). In practice, this is actually how a Viterbi-n-best semiring would typically be implemented. From a theoretical viewpoint, however, this implementation is inadequate, since we must also define infinite sums and be sure that the distributive property holds. Elsewhere (Goodman 1998), we give a mathematically precise definition of the semiring that handles these cases.</Paragraph> <Paragraph position="10"> 3. Efficient Computation of Item Values Recall that the value of an item $x$ is just $V(x) = \bigoplus_{D \in \mathrm{inner}(x)} V(D)$, the sum of the values of all derivation trees headed by $x$. This definition may require summing over exponentially many or even infinitely many terms. In this section, we give relatively efficient formulas for computing the values of items. There are three cases that must be handled. First is the base case, when $x$ is a rule. In this case, $\mathrm{inner}(x)$ is trivially $\{\langle x \rangle\}$, the set containing the single derivation tree $x$. 
Thus, V(x) = (~Dcinner(x) V(D) =</Paragraph> <Paragraph position="12"> The second and third cases occur when x is an item. Recall that each item is associated with a bucket, and that the buckets are ordered. Each item x is either associated with a nonlooping bucket, in which case its value depends only on the values of items in earlier buckets; or with a looping bucket, in which case its value depends potentially on the values of other items in the same bucket. In the case when the item is associated with a nonlooping bucket, if we compute items in the same order as their buckets, we can assume that the values of items al ... ak contributing to the value of item b are known. We give a formula for computing the value of item b that depends only on the values of items in earlier buckets.</Paragraph> <Paragraph position="13"> For the final case, in which x is associated with a looping bucket, infinite loops may occur, when the value of two items in the same bucket are mutually dependent, or an item depends on its own value. These infinite loops may require computation of infinite sums. Still, we can express these infinite sums in a relatively simple form, allowing them to be efficiently computed or approximated.</Paragraph> <Paragraph position="15"> Let us expand our notion of inner to include deduction rules: inner(~) is the set of all derivation trees of the form (b: (al.../(a2.../-.. (ak...11&quot; For any item derivation tree that is not a simple rule, there is some al...ak, b such that D E inner(~).</Paragraph> <Paragraph position="16"> Thus, for any item x,</Paragraph> <Paragraph position="18"> al...alC/ s.t. al'~c, ak DEinner(aI&quot;x&quot; ak) Consider item derivation trees Dal ... Dak headed by items al ... ak such that ~g~. Recall that (x: Da, .... 
, Dakl is the item derivation tree formed by combining each of these trees into a full tree, and notice that U (x: Dal,..., Dakl = inner(~).</Paragraph> <Paragraph position="19"> Da I ff inner( al ) .....</Paragraph> <Paragraph position="20"> Da k ff inner (ak )</Paragraph> <Paragraph position="22"> Substituting this back into Equation 6, we get</Paragraph> <Paragraph position="24"> completing the proof. \[\] Now, we address the case in which x is an item in a looping bucket. This case requires computation of an infinite sum. We will write out this infinite sum, and discuss how to compute it exactly in all cases, except for one, where we approximate it. Consider the derivable items xl... Xm in some looping bucket B. If we build up derivation trees incrementally, when we begin processing bucket B, only those trees with no items from bucket B will be available, what we will call zeroth generation derivation trees. We can put these zeroth generation trees together to form first generation trees, headed by elements in B. We can combine these first generation trees with each other and with zeroth generation trees to form second generation trees, and so on. Formally, we define the generation of a derivation tree headed by x in bucket B to be the largest number of items in B we can encounter on a path from the root to a leaf.</Paragraph> <Paragraph position="25"> Consider the set of all trees of generation at most g headed by x. Call this set inner<_~(x, B). We can define the Kg generation value of an item x in bucket B, V<_~(x, B):</Paragraph> <Paragraph position="27"> Intuitively, as g increases, for x E B, inner<~(x, B) becomes closer and closer to inner(x). That is, the finite sum of values in the former approaches the infinite sum of values in the latter. For w-continuous semirings (which includes all of the semirings considered in this paper), an infinite sum is equal to the supremum of the partial sums (Kuich 1997, 613). 
Thus, $V(x) = \sup_g V_{\le g}(x, B)$.</Paragraph> <Paragraph position="29"> It will be easier to compute the supremum if we find a simple formula for $V_{\le g}(x, B)$.</Paragraph> <Paragraph position="30"> Notice that for items $x \in B$, there will be no generation 0 derivations, so $V_{\le 0}(x, B) = 0$. Thus, generation 0 makes a trivial base for a recursive formula. Now, we can consider the general case: Theorem 3 For $x$ an item in a looping bucket $B$, and for $g \ge 1$,</Paragraph> <Paragraph position="32"> $V_{\le g}(x, B) = \bigoplus_{a_1 \ldots a_k \text{ s.t. } \frac{a_1 \ldots a_k}{x}} \bigotimes_{i=1}^{k} \begin{cases} V(a_i) \text{ if } a_i \notin B \\ V_{\le g-1}(a_i, B) \text{ if } a_i \in B \end{cases}$ (7) The proof parallels that of Theorem 2 (Goodman 1998).</Paragraph> </Section> <Section position="8" start_page="588" end_page="590" type="sub_section"> <SectionTitle> 3.2 Solving the Infinite Summation </SectionTitle> <Paragraph position="0"> A formula for $V_{\le g}(x, B)$ is useful, but what we really need is specific techniques for computing the supremum, $V(x) = \sup_g V_{\le g}(x, B)$. For all $\omega$-continuous semirings, the supremum of iteratively approximating the value of a set of polynomial equations, as we are essentially doing in Equation 7, is equal to the smallest solution to the equations (Kuich 1997, 622). In particular, consider the equations: $V_{\le \infty}(x, B) = \bigoplus_{a_1 \ldots a_k \text{ s.t. } \frac{a_1 \ldots a_k}{x}} \bigotimes_{i=1}^{k} \begin{cases} V(a_i) \text{ if } a_i \notin B \\ V_{\le \infty}(a_i, B) \text{ if } a_i \in B \end{cases}$ (8)</Paragraph> <Paragraph position="2"> where $V_{\le \infty}(x, B)$ can be thought of as indicating $|B|$ different variables, one for each item $x$ in the looping bucket $B$. Equation 7 represents the iterative approximation of Equation 8, and therefore the smallest solution to Equation 8 represents the supremum of Equation 7.</Paragraph> <Paragraph position="3"> One fact will be useful for several semirings: whenever the values of all items $x \in B$ at generation $g + 1$ are the same as the values of all items in the preceding generation, $g$, they will be the same at all succeeding generations, as well. Thus, the value at generation $g$ will be the value of the supremum. 
Elsewhere (Goodman 1998), we give a trivial proof of this fact.</Paragraph> <Paragraph position="4"> Now, we can consider various semiring-specific algorithms for computing the supremum. Most of these algorithms are well known, and we have simply extended them from specific parsers (described in Section 7) to the general case, or from one semiring to another.</Paragraph> <Paragraph position="5"> Notice in this section the wide variety of different algorithms, one for each semiring, and some of them fairly complicated. In a conventional system, these algorithms are interwoven with the parsing algorithm, conflating computation of infinite sums with parsing. The result is algorithms that are both harder to understand and less portable to other semirings.</Paragraph> <Paragraph position="6"> We first examine the simplest case, the Boolean semiring. Notice that whenever a particular item has value TRUE at generation $g$, it must also have value TRUE at generation $g + 1$, since if the item can be derived in at most $g$ generations then it can certainly be derived in at most $g + 1$ generations. Thus, since the number of TRUE valued items is nondecreasing, and is at most $|B|$, eventually the values of all items must not change from one generation to the next. Therefore, for the Boolean semiring, a simple algorithm suffices: keep computing successive generations, until no change is detected in some generation; the result is the supremum. We can perform this computation efficiently if we keep track of items that change value in generation $g$ and only examine items that depend on them in generation $g + 1$. This algorithm is then similar to the algorithm of Shieber, Schabes, and Pereira (1993).</Paragraph> <Paragraph position="7"> For the counting semiring, the Viterbi semiring, and the derivation forest semiring, we need the concept of a derivation subgraph. 
In Section 2.2 we considered the strongly connected components of the dependency graph, consisting of items that for some sentence could possibly depend on each other, and we put these possibly interdependent items together in looping buckets. For a given sentence and grammar, not all items will have derivations. We will find the subgraph of the dependency graph of items with derivations, and compute the strongly connected components of this subgraph. The strongly connected components of this subgraph correspond to loops that actually occur given the sentence and the grammar, as opposed to loops that might occur for some sentence and grammar, given the parser alone. We call this subgraph the derivation subgraph, and we will say that items in a strongly connected component of the derivation subgraph are part of a loop.</Paragraph> <Paragraph position="8"> Now, we can discuss the counting semiring (integers under $+$ and $\times$). In the counting semiring, for each item, there are three cases: the item can be in a loop; the item can depend (directly or indirectly) on an item in a loop; or the item does not depend on loops. If the item is in a loop or depends on a loop, its value is infinite. If the item does not depend on a loop in the current bucket, then its value becomes fixed after some generation. We can now give the algorithm: first, compute successive generations until the set of items in $B$ does not change from one generation to the next. Next, compute the derivation subgraph, and its strongly connected components. Items in a strongly connected component (a loop) have an infinite</Paragraph> </Section> <Section position="9" start_page="590" end_page="591" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> number of derivations, and thus an infinite value. Compute items that depend directly or indirectly on items in loops: these items also have infinite value. 
Any other items can only be derived in finitely many ways using items in the current bucket, so compute successive generations until the values of these items do not change.</Paragraph> <Paragraph position="1"> The method for solving the infinite summation for the derivation forest semiring depends on the implementation of derivation forests. Essentially, that representation will use pointers to efficiently represent derivation forests. Pointers, in various forms, allow one to efficiently represent infinite circular references, either directly (Goodman 1999), or indirectly (Goodman 1998). Roughly, the algorithm we will use is to compute the derivation subgraph, and then create pointers analogous to the directed edges in the derivation subgraph, including pointers in loops whenever there is a loop in the derivation subgraph (corresponding to an infinite number of derivations). Details are given elsewhere (Goodman 1998). As in the finite case, this representation is equivalent to that of Billot and Lang (1989).</Paragraph> <Paragraph position="2"> For the Viterbi semiring, the algorithm is analogous to the Boolean case. Derivations using loops in these semirings will always have values no greater than derivations not using loops, since the value with the loop will be the same as some value without the loop, multiplied by some set of rule probabilities that are at most 1. Since the additive operation is max, these lower (or at most equal) looping derivations do not change the value of an item. Therefore, we can simply compute successive generations until values fail to change from one iteration to the next.</Paragraph> <Paragraph position="3"> Now, consider implementations of the Viterbi-derivation semiring in practice, in which we keep only a representative derivation, rather than the whole derivation forest. In this case, loops do not change values, and we use the same algorithm as for the Viterbi semiring. 
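As a minimal illustration of this generation-by-generation computation for the Viterbi semiring, consider the Python sketch below. It is not the implementation used in the paper: the function name `viterbi_looping_bucket`, the encoding of deduction-rule instances as `(consequent, antecedents)` pairs, and the dictionary `earlier_values` are assumptions made only for this example.

```python
def viterbi_looping_bucket(bucket, rules, earlier_values):
    """Iterate generations of one looping bucket in the Viterbi semiring.

    bucket: set of item names in the looping bucket B (hypothetical encoding).
    rules: list of (x, [a1, ..., ak]) pairs, one per deduction-rule instance.
    earlier_values: values of grammar rules and of items from earlier buckets.
    """
    values = {x: 0.0 for x in bucket}              # generation 0: additive identity
    while True:
        new_values = {x: 0.0 for x in bucket}
        for x, antecedents in rules:
            product = 1.0                          # multiplicative identity
            for a in antecedents:
                # Items in B contribute their previous-generation value;
                # everything else contributes its already final value.
                product *= values[a] if a in bucket else earlier_values[a]
            new_values[x] = max(new_values[x], product)   # additive operator is max
        if new_values == values:                   # no change: supremum reached
            return values
        values = new_values
```

For a toy bucket with items A and B, rule instances deriving A from (r1, B), B from (r2, A), and B from (r3, tok), and earlier values r1 = r2 = r3 = 0.5 and tok = 1.0, the iteration stops once the third generation repeats the second, with A = 0.25 and B = 0.5; the looping derivations never raise a value.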
In an implementation of the Viterbi-n-best semiring, in practice, loops can change values, but at most $n$ times, so the same algorithm used for the Viterbi semiring still works. Elsewhere (Goodman 1998), we describe theoretically correct implementations for both the Viterbi-derivation and Viterbi-n-best semirings that keep all values in the event of ties, preserving addition's associativity.</Paragraph> <Paragraph position="4"> The last semiring we consider is the inside semiring. This semiring is the most difficult. There are two cases of interest, one of which we can solve exactly, and the other of which requires approximations. In many cases involving looping buckets, all deduction rules will be of the form $\frac{a_1 \; x}{b}$, where $a_1$ and $b$ are items in the looping bucket, and $x$ is either a rule, or an item in a previously computed bucket. This case corresponds to the items used for deducing singleton productions, such as those Earley's algorithm uses for rules of the form $A \to B$ and $B \to A$. In this case, Equation 8 forms a set of linear equations that can be solved by matrix inversion. In the more general case, as is likely to happen with epsilon rules, we get a set of nonlinear equations, and must solve them by approximation techniques, such as simply computing successive generations for many iterations (footnote 5). Stolcke (1993) provides an excellent discussion of these cases, including a discussion of sparse matrix inversion, useful for speeding up some computations.</Paragraph> <Paragraph position="5"> Footnote 5: Note that even in the case where we can only use approximation techniques, this algorithm is relatively efficient. 
By assumption, in this case, there is at least one deduction rule with two items in the current generation; thus, the number of deduction trees over which we are summing grows exponentially with the number of generations: a linear amount of computation yields the sum of the values of exponentially many trees.</Paragraph> </Section> </Section> <Section position="4" start_page="591" end_page="596" type="metho"> <SectionTitle> 4. Reverse Values </SectionTitle> <Paragraph position="0"> The previous section showed how to compute several of the most commonly used values for parsers, including Boolean, inside, Viterbi, counting, and derivation forest values, among others. Noticeably absent from the list are the outside probabilities, which we define below. In general, computing outside probabilities is significantly more complicated than computing inside probabilities.</Paragraph> <Paragraph position="1"> In this section, we show how to compute outside probabilities from the same item-based descriptions used for computing inside values. Outside probabilities have many uses, including for reestimating grammar probabilities (Baker 1979), for improving parser performance on some criteria (Goodman 1996b), for speeding parsing in some formalisms, such as data-oriented parsing (Goodman 1996a), and for good thresholding algorithms (Goodman 1997).</Paragraph> <Paragraph position="2"> We will show that by substituting other semirings, we can get values analogous to the outside probabilities for any commutative semiring; elsewhere (Goodman 1998) we have shown that we can get similar values for many noncommutative semirings as well. We will refer to these analogous quantities as reverse values. For instance, the quantity analogous to the outside value for the Viterbi semiring will be called the reverse Viterbi value. 
Notice that the inside semiring values of a hidden Markov model (HMM) correspond to the forward values of HMMs, and the reverse inside values of an HMM correspond to the backward values.</Paragraph> <Paragraph position="3"> Compare the outside algorithm (Baker 1979; Lari and Young 1990), given in Figure 7, to the inside algorithm of Figure 2. Notice that while the inside and recognition algorithms are very similar, the outside algorithm is quite a bit different. In particular, while the inside and recognition algorithms loop over items from shortest to longest, the outside algorithm loops over items in the reverse order, from longest to shortest.</Paragraph> <Paragraph position="4"> Also, compare the inside algorithm's main loop formula to the outside algorithm's main loop formula. While there is clearly a relationship between the two equations, the exact pattern of the relationship is not obvious. Notice that the outside formula is about twice as complicated as the inside formula. This doubled complexity is typical of outside formulas, and partially explains why the item-based description format is so useful: descriptions for the simpler inside values can be developed with relative ease, and then automatically used to compute the twice-as-complicated outside values (footnote 6). Figure 8: Item derivation tree of [goal] and outer tree of [b]. Footnote 6: Jumping ahead a bit, compare Equation 13 for reverse values to Equation 5 for forward values. Let $k$ be the number of terms above the line. Notice that the reverse values equation sums over $k$ times as many terms as the forward values equation. Parsers where all rules have $k = 1$ terms above the line can only</Paragraph> <Paragraph position="5"> For a context-free grammar, using the CKY parser of Figure 3, recall that the inside probability for an item $[i, A, j]$ is $P(A \stackrel{*}{\Rightarrow} w_i \ldots w_{j-1})$. The outside probability for the same item is $P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_{i-1} A w_j \ldots w_n)$. 
Thus, the outside probability has the property that when multiplied by the inside probability, it gives the probability that the start symbol generates the sentence using the given item, $P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_{i-1} A w_j \ldots w_n \stackrel{*}{\Rightarrow} w_1 \ldots w_n)$.</Paragraph> <Paragraph position="6"> This probability equals the sum of the probabilities of all derivations using the given item. Formally, letting $P(D)$ represent the probability of a particular derivation, and $C(D, [i, X, j])$ represent the number of occurrences of item $[i, X, j]$ in derivation $D$ (which for some parsers could be more than one if $X$ were part of a loop), $$\mathrm{inside}(i, X, j) \times \mathrm{outside}(i, X, j) = \sum_{D \text{ a derivation}} P(D)\, C(D, [i, X, j])$$ The reverse values in general have an analogous meaning. Let $C(D, x)$ represent the number of occurrences (the count) of item $x$ in item derivation tree $D$. Then, for an item $x$, the reverse value $Z(x)$ should have the property $$V(x) \otimes Z(x) = \bigoplus_{D \text{ a derivation}} V(D)\, C(D, x) \qquad (9)$$</Paragraph> <Paragraph position="8"> Notice that we have multiplied an element of the semiring, $V(D)$, by an integer, $C(D, x)$.</Paragraph> <Paragraph position="9"> This multiplication is meant to indicate repeated addition, using the additive operator of the semiring. Thus, for instance, in the Viterbi semiring, multiplying by a count other than 0 has no effect, since $x \oplus x = \max(x, x) = x$, while in the inside semiring, it corresponds to actual multiplication. This value represents the sum of the values of all derivation trees that the item $x$ occurs in; if an item $x$ occurs more than once in a derivation tree $D$, then the value of $D$ is counted more than once.</Paragraph> <Paragraph position="10"> To formally define the reverse value of an item $x$, we must first define the outer trees $\mathrm{outer}(x)$. Consider an item derivation tree of the goal item, containing one or more instances of item $x$. Remove one of these instances of $x$, and its children too, leaving a gap in its place. This tree is an outer tree of $x$. 
Figure 8 shows an item derivation tree of the goal item, including a subderivation of an item $b$, derived from terms $a_1, \ldots, a_k$. It also shows an outer tree of $b$, with $b$ and its children removed; the spot $b$ was removed from is labeled (b).</Paragraph> <Paragraph position="11"> (Footnote 6, continued: parse regular grammars, and tend to be less useful. Thus, in most parsers of interest, $k > 1$, and the complexity of (at least some) outside equations, when the sum is written out, is at least doubled.) For an outer tree $D \in \mathrm{outer}(x)$, we define its value, $Z(D)$, to be the product of the values of all rules in $D$, $\bigotimes_{r \in D} R(r)$. Then, the reverse value of an item can be formally defined as $$Z(x) = \bigoplus_{D \in \mathrm{outer}(x)} Z(D) \qquad (10)$$</Paragraph> <Paragraph position="13"> That is, the reverse value of $x$ is the sum of the values of each outer tree of $x$.</Paragraph> <Paragraph position="14"> Now, we show that this definition of reverse values has the property described by Equation 9 (footnote 7). By the definitions of $V(x)$ and $Z(x)$, and by distributivity, $$V(x) \otimes Z(x) = \Big( \bigoplus_{I \in \mathrm{inner}(x)} V(I) \Big) \otimes \Big( \bigoplus_{O \in \mathrm{outer}(x)} Z(O) \Big) = \bigoplus_{I \in \mathrm{inner}(x)} \bigoplus_{O \in \mathrm{outer}(x)} V(I) \otimes Z(O) \qquad (11)$$</Paragraph> <Paragraph position="16"> Next, we argue that this last expression equals the expression on the right-hand side of Equation 9, $\bigoplus_{D} V(D)\, C(D, x)$. For an item $x$, any outer part of an item derivation tree for $x$ can be combined with any inner part to form a complete item derivation tree. That is, any $O \in \mathrm{outer}(x)$ and any $I \in \mathrm{inner}(x)$ can be combined to form an item derivation tree $D$ containing $x$, and any item derivation tree $D$ containing $x$ can be decomposed into such outer and inner trees. Thus, the list of all combinations of outer and inner trees corresponds exactly to the list of all item derivation trees containing $x$. In fact, for an item derivation tree $D$ containing $C(D, x)$ instances of $x$, there are $C(D, x)$ ways to form $D$ from combinations of outer and inner trees. Also, notice that for $D$ combined from $O$ and $I$, $$V(I) \otimes Z(O) = V(D) \qquad (12)$$</Paragraph> <Paragraph position="18"> Combining Equation 11 with Equation 12, we see that</Paragraph> <Paragraph position="20"> $$V(x) \otimes Z(x) = \bigoplus_{D \text{ a derivation}} V(D)\, C(D, x),$$ completing the proof. 
□ Footnote 7: We note that satisfying Equation 9 is a useful but not sufficient condition for using reverse inside values for grammar reestimation. While this definition will typically provide the necessary values for the E step of an E-M algorithm, additional work will typically be required to prove this fact; Equation 9 should be useful in such a proof.</Paragraph> <Section position="1" start_page="594" end_page="596" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> There is a simple, recursive formula for efficiently computing reverse values. Recall that the basic equation for computing forward values not involved in loops was</Paragraph> <Paragraph position="2"> $$V(x) = \bigoplus_{a_1 \ldots a_k \text{ s.t. } \frac{a_1 \ldots a_k}{x}} \; \bigotimes_{i=1}^{k} V(a_i)$$ At this point, for conciseness, we introduce a nonstandard notation. We will soon be using many sequences of the form $1, 2, \ldots, j-2, j-1, j+1, j+2, \ldots, k-1, k$. We denote such sequences by $1, \overset{-j}{\ldots}, k$. By extension, we will also write $f(1), \overset{-j}{\ldots}, f(k)$ to indicate a sequence of the form $f(1), f(2), \ldots, f(j-2), f(j-1), f(j+1), f(j+2), \ldots, f(k-1), f(k)$. Now, we can give a simple formula for computing reverse values $Z(x)$ not involved in loops: Theorem 5 For items $x \in B$ where $B$ is nonlooping, $$Z(x) = \bigoplus_{j, a_1 \ldots a_k, b \text{ s.t. } \frac{a_1 \ldots a_k}{b} \wedge x = a_j} Z(b) \otimes \bigotimes_{i=1, \overset{-j}{\ldots}, k} V(a_i) \qquad (13)$$</Paragraph> <Paragraph position="4"> unless $x$ is the goal item, in which case $Z(x) = 1$, the multiplicative identity of the semiring.</Paragraph> <Paragraph position="5"> Proof The simple case is when $x$ is the goal item. Since an outer tree of the goal item is a derivation of the goal item, with the goal item and its children removed, and since we assumed in Section 2.2 that the goal item can only appear in the root of a derivation tree, the outer trees of the goal item are all empty. Thus, $$Z(\text{goal}) = \bigoplus_{D \in \mathrm{outer}(\text{goal})} Z(D) = \bigotimes_{r \in \{\}} R(r) = 1$$</Paragraph> <Paragraph position="7"> As mentioned in Section 2.1, the value of the empty product is the multiplicative identity.</Paragraph> <Paragraph position="8"> Now, we consider the general case. 
We need to expand our concept of outer to include deduction rules, where $\mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})$ is an item derivation tree of the goal item with one subtree removed, a subtree headed by $a_j$ whose parent is $b$ and whose siblings are headed by $a_1, \overset{-j}{\ldots}, a_k$. Notice that for every outer tree $D \in \mathrm{outer}(x)$, there is exactly one $j$, $a_1, \ldots, a_k$, and $b$ such that $x = a_j$ and $D \in \mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})$: this corresponds to the deduction rule used at the spot in the tree where the subtree headed by $x$ was deleted. Figure 9 illustrates the idea of putting together an outer tree of $b$ with inner trees for $a_1, \overset{-j}{\ldots}, a_k$ to form an outer tree of $x = a_j$. Using this observation, $$Z(x) = \bigoplus_{D \in \mathrm{outer}(x)} Z(D) = \bigoplus_{j, a_1 \ldots a_k, b \text{ s.t. } \frac{a_1 \ldots a_k}{b} \wedge x = a_j} \; \bigoplus_{D \in \mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})} Z(D) \qquad (14)$$</Paragraph> <Paragraph position="10"> Figure 9: Combining an outer tree with inner trees to form an outer tree.</Paragraph> <Paragraph position="11"> Now, consider all of the outer trees $\mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})$. For each item derivation tree $D_{a_1} \in \mathrm{inner}(a_1), \overset{-j}{\ldots}, D_{a_k} \in \mathrm{inner}(a_k)$ and for each outer tree $D_b \in \mathrm{outer}(b)$, there will be one outer tree in the set $\mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})$. Similarly, each tree in $\mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})$ can be decomposed into an outer tree in $\mathrm{outer}(b)$ and derivation trees for $a_1, \overset{-j}{\ldots}, a_k$. Then, $$\bigoplus_{D \in \mathrm{outer}(j, \frac{a_1 \ldots a_k}{b})} Z(D) = Z(b) \otimes \bigotimes_{i=1, \overset{-j}{\ldots}, k} V(a_i) \qquad (15)$$</Paragraph> <Paragraph position="13"> Substituting Equation 15 into Equation 14, we conclude that</Paragraph> <Paragraph position="15"> $$Z(x) = \bigoplus_{j, a_1 \ldots a_k, b \text{ s.t. } \frac{a_1 \ldots a_k}{b} \wedge x = a_j} Z(b) \otimes \bigotimes_{i=1, \overset{-j}{\ldots}, k} V(a_i),$$ 
completing the general case.</Paragraph> <Paragraph position="16"> Computing the reverse values for loops is somewhat more complicated, and as in the forward case, requires an infinite sum, and the use of the concept of generation.</Paragraph> </Section> <Section position="2" start_page="596" end_page="596" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> We define the generation $g$ of an outer tree $D$ of item $x$ in bucket $B$ to be the number of items in bucket $B$ on the path between the root and the removal point, inclusive.</Paragraph> <Paragraph position="1"> We can then let $Z_{\le g}(x, B)$ represent the sum of the values of all trees headed by $x$ of generation at most $g$. In the base case, $Z_{\le 0}(x, B) = 0$. For $\omega$-continuous semirings, $Z_{\le g}(x, B)$ approaches $Z(x)$ as $g$ approaches $\infty$. We can give a recursive equation for $Z_{\le g}(x, B)$ as follows, using a proof similar to that of Theorem 5 (Goodman 1998): Theorem 6 For items $x \in B$ and $g \ge 1$,</Paragraph> <Paragraph position="3"> $$Z_{\le g}(x, B) = \bigoplus_{j, a_1 \ldots a_k, b \text{ s.t. } \frac{a_1 \ldots a_k}{b} \wedge x = a_j} \Big( \bigotimes_{i=1, \overset{-j}{\ldots}, k} V(a_i) \Big) \otimes \begin{cases} Z(b) \text{ if } b \notin B \\ Z_{\le g-1}(b, B) \text{ if } b \in B \end{cases}$$</Paragraph> </Section> </Section> <Section position="5" start_page="596" end_page="598" type="metho"> <SectionTitle> 5. Semiring Parser Execution </SectionTitle> <Paragraph position="0"> Executing a semiring parser is fairly simple. There is, however, one issue that must be dealt with before we can actually begin parsing. A semiring parser computes the values of items in the order of the buckets they fall into. Thus, before we can begin parsing, we need to know which items fall into which buckets, and the ordering of those buckets. There are three approaches to determining the buckets and ordering that we will discuss in this section. The first approach is a simple, brute-force enumeration of all items, derivable or not, followed by a topological sort. This approach will have suboptimal time and space complexity for some item-based descriptions. 
The second approach is to use an agenda parser in the Boolean semiring to determine the derivable items and their dependencies, and to then perform a topological sort. This approach has optimal time complexity, but typically suboptimal space complexity. The final approach is to use bucketing code specific to the item-based interpreter. This achieves optimal performance at the cost of additional programming effort.</Paragraph> <Paragraph position="1"> The simplest way to determine the bucketing is to enumerate all possible items for the given item-based description, grammar, and input sentence. Then, we compute the strongly connected components and a partial ordering; both steps can be done in time proportional to the number of items plus the number of dependencies (Cormen, Leiserson, and Rivest 1990, Chap. 23). For some parsers, this technique has optimal time complexity, although poor space complexity. In particular, for the CKY algorithm, the time complexity is optimal, but since it requires computing and storing all possible $O(n^3)$ dependencies between the items, it takes significantly more space than the $O(n^2)$ space required in the best implementation. In general, the brute-force technique raises the space complexity to be the same as the time complexity. Furthermore, for some algorithms, such as Earley's algorithm, there could be a significant time complexity added as well. In particular, Earley's algorithm may not need to examine all possible items. For certain grammars, Earley's algorithm examines only a linear number of items and a linear number of dependencies, even though there are $O(n^2)$ possible items, and $O(n^3)$ possible dependencies. 
Thus the brute-force approach would require $O(n^3)$ time and space instead of $O(n)$ time and space, for these grammars.</Paragraph> <Paragraph position="2"> The next approach to finding the bucketing solves the time complexity problem.</Paragraph> <Paragraph position="3"> In this approach, we first parse in the Boolean semiring, using the agenda parser described by Shieber, Schabes, and Pereira (1995), and then we perform a topological sort. The techniques that Shieber, Schabes, and Pereira use work well for the Boolean semiring, where items only have value TRUE or FALSE, but cannot be used directly for other semirings.</Paragraph> <Paragraph position="5"> Figure 10: Forward semiring parser interpreter. Surviving pseudocode fragment: for current := first bucket to last bucket; if current is a looping bucket /* replace with semiring-specific code */; for $x \in$ current: $V[x, 0] = 0$; for $g := 1$ to $\infty$; for each $x \in$ current, $a_1 \ldots a_k$ s.t. $\frac{a_1 \ldots a_k}{x}$, with separate cases for $a_i \notin$ current and $a_i \in$ current.</Paragraph> <Paragraph position="6"> For other semirings, we need to make sure that the values of items are not computed until after the values of all items they depend on are computed. However, we can use the algorithm of Shieber, Schabes, and Pereira to compute all of the items that are derivable, and to store all of the dependencies between the items. Then we perform a topological sort on the items. The time complexity of both the agenda parser and the topological sort will be proportional to the number of dependencies, which will be proportional to the optimal time complexity. Unfortunately, we still have the space complexity problem, since again, the space used will be proportional to the number of dependencies, rather than to the number of items.</Paragraph> <Paragraph position="7"> The third approach to bucketing is to create algorithm-specific bucketing code; this results in parsers with both optimal time and optimal space complexity. 
For instance, in a CKY-style parser, we can simply create one bucket for each length, and place each item into the bucket for its length. For some algorithms, such as Earley's algorithm, special-purpose code for bucketing might have to be combined with code to make sure all and only derivable items are considered (using triggering techniques described by Shieber, Schabes, and Pereira) in order to achieve optimal performance. Once we have the bucketing, the parsing step is fairly simple. The basic algorithm appears in Figure 10. We simply loop over each item in each bucket. There are two types of buckets: looping buckets, and nonlooping buckets. If the current bucket is a looping bucket, we compute the infinite sum needed to determine the bucket's values; in a working system, we substitute semiring-specific code for this section, as described in Section 3.2. If the bucket is not a looping bucket, we simply compute all of the possible instantiations that could contribute to the values of items in that bucket. Finally, we return the value of the goal item.</Paragraph> <Paragraph position="8"> The reverse semiring parser interpreter is very similar to the forward semiring parser interpreter. The differences are that in the reverse semiring parser interpreter, we traverse the buckets in reverse order, and we use the formulas for the reverse values, rather than the forward values. Elsewhere (Goodman 1998), we give a simple inductive proof to show that both interpreters compute the correct values.</Paragraph> <Section position="1" start_page="598" end_page="598" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> There are two other implementation issues. First, for some parsers, it will be possible to discard some items. That is, some items serve the role of temporary variables, and can be discarded after they are no longer needed, especially if only the forward values are going to be computed. 
Also, some items do not depend on the input string, but only on the rule value function of the grammar. The values of these items can be precomputed.</Paragraph> </Section> </Section> <Section position="6" start_page="598" end_page="600" type="metho"> <SectionTitle> 6. Examples </SectionTitle> <Paragraph position="0"> In this section, we survey other results that are described in more detail elsewhere (Goodman 1998), including examples of formalisms that can be parsed using item-based descriptions, and other uses for the technique of semiring parsing.</Paragraph> <Section position="1" start_page="598" end_page="598" type="sub_section"> <SectionTitle> 6.1 Finite State Automata and Hidden Markov Models </SectionTitle> <Paragraph position="0"> Nondeterministic finite-state automata (NFAs) and HMMs turn out to be examples of the same underlying formalism, whose values are simply computed in different semirings. Other semirings lead to other interesting values. For HMMs, notice that the forward values are simply the forward inside values; the backward values are the reverse values of the inside semiring; and Viterbi values are the forward values of the Viterbi semiring. For NFAs, we can use the Boolean semiring to determine whether a string is in the language of an NFA; we can use the counting semiring to determine how many state sequences there are in the NFA for a given string; and we can use the derivation forest semiring to get a compact representation of all state sequences in an NFA for an input string. A single item-based description can be used to find all of these values.</Paragraph> </Section> <Section position="2" start_page="598" end_page="599" type="sub_section"> <SectionTitle> 6.2 Prefix Values </SectionTitle> <Paragraph position="0"> For language modeling, it may be useful to compute the prefix probability of a string.</Paragraph> <Paragraph position="1"> That is, given a string $w_1 \ldots w_n$, 
we may wish to know the total probability of all sentences beginning with that string, $$\sum_{k \ge 0,\, v_1, \ldots, v_k} P(S \stackrel{*}{\Rightarrow} w_1 \ldots w_n v_1 \ldots v_k)$$ where $v_1 \ldots v_k$ represent words that could possibly follow $w_1 \ldots w_n$. Jelinek and Lafferty (1991) and Stolcke (1993) both give algorithms for computing these prefix probabilities. Elsewhere (Goodman 1998), we show how to produce an item-based description of a prefix parser. There are two main advantages to using an item-based description: ease of derivation, and reusability.</Paragraph> <Paragraph position="2"> First, the conventional derivations are somewhat complex, requiring a fair amount of inside-semiring-specific mathematics. In contrast, using item-based descriptions, we only need to derive a parser that has the property that there is one item derivation for each (complete) grammar derivation that would produce the prefix. The value of any prefix given the parser will then automatically be the sum of the values of all grammar derivations that include that prefix.</Paragraph> <Paragraph position="3"> The other advantage is that the same description can be used to compute many values, not just the prefix probability. For instance, we can use this description with the Viterbi-derivation semiring to find the most likely derivation that includes this prefix. With this most likely derivation, we could begin interpretation of a sentence even before the sentence had been completely spoken to a speech recognition system. We could even use the Viterbi-n-best semiring to find the $n$ most likely derivations that include this prefix, if we wanted to take into account ambiguities present in parses of the prefix. 
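The reusability argument can be made concrete on the simpler automata of Section 6.1, where one forward computation, parameterized by a semiring, yields acceptance, the number of state sequences, the total probability, or the best-path probability. The Python sketch below is illustrative only: the `Semiring` tuple, the function `forward_value`, and the encoding of arcs as `(state, symbol, next_state)` keys are assumptions introduced for this example, not an interface from the paper.

```python
from collections import namedtuple

# A semiring given as (plus, times, zero, one); field names are illustrative.
Semiring = namedtuple("Semiring", "plus times zero one")

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)


def forward_value(semiring, start, finals, arcs, string):
    """Value of `string` under an automaton whose arcs carry semiring values.

    arcs maps (state, symbol, next_state) to a value in the chosen semiring.
    """
    current = {start: semiring.one}            # forward values before any input
    for symbol in string:
        following = {}
        for (state, arc_symbol, target), weight in arcs.items():
            if arc_symbol == symbol and state in current:
                contribution = semiring.times(current[state], weight)
                previous = following.get(target, semiring.zero)
                following[target] = semiring.plus(previous, contribution)
        current = following
    total = semiring.zero
    for state in finals:                       # sum over accepting states
        total = semiring.plus(total, current.get(state, semiring.zero))
    return total
```

For example, with start state 0, final state 1, and arcs (0, a, 0), (0, a, 1), and (1, a, 1) valued 0.5, 0.25, and 0.5, the string aa receives 0.25 in the inside semiring and 0.125 in the Viterbi semiring; revaluing every arc as 1 or TRUE gives 2 in the counting semiring and TRUE in the Boolean semiring.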
</Paragraph> </Section> <Section position="3" start_page="599" end_page="599" type="sub_section"> <SectionTitle> 6.3 Beyond Context-Free </SectionTitle> <Paragraph position="0"> There has been quite a bit of previous work on the intersection of formal language theory and algebra, as described by Kuich (1997), among others. This previous work has made heavy use of the fact that there is a strong correspondence between algebraic equations in certain noncommutative semirings, and CFGs. This correspondence has made it possible to manipulate algebraic systems, rather than grammar systems, simplifying many operations.</Paragraph> <Paragraph position="1"> On the other hand, there is an inherent limit to such an approach, namely a limit to context-free systems. It is then perhaps slightly surprising that we can avoid these limitations, and create item-based descriptions of parsers for weakly context-sensitive grammars, such as tree adjoining grammars (TAGs). We avoid the limitations of previous approaches using two techniques. One technique is to compute derivation trees, rather than parse trees, for TAGs. Computing derivation trees for TAGs is significantly easier than computing parse trees, since the derivation trees are context-free. The other trick we use is to create a set of equations for each grammar and string length rather than creating a set of equations for each grammar, as earlier formulations did. Because the number of equations grows with the string length with our technique, we can recognize strings in weakly context-sensitive languages. 
Goodman (1998) gives a further explication of this subject, including an item-based description for a simple TAG parser.</Paragraph> </Section> <Section position="4" start_page="599" end_page="599" type="sub_section"> <SectionTitle> 6.4 Tomita Parsing </SectionTitle> <Paragraph position="0"> Our goal in this section has been to show that item-based descriptions can be used to simply describe almost all parsers of interest. One parsing algorithm that would seem particularly difficult to describe is Tomita's graph-structured-stack LR parsing algorithm. This algorithm at first glance bears little resemblance to other parsing algorithms. Despite this lack of similarity, Sikkel (1993) gives an item-based description for a Tomita-style parser for the Boolean semiring, which is also more efficient than Tomita's algorithm. Sikkel's parser can be easily converted to our format, where it can be used for $\omega$-continuous semirings in general.</Paragraph> </Section> <Section position="5" start_page="599" end_page="599" type="sub_section"> <SectionTitle> 6.5 Graham Harrison Ruzzo (GHR) Parsing </SectionTitle> <Paragraph position="0"> Graham, Harrison, and Ruzzo (1980) describe a parser similar to Earley's, but with several speedups that lead to significant improvements. Essentially, there are three improvements in the GHR parser. First, epsilon productions are precomputed; second, unary productions are precomputed; and, finally, completion is separated into two steps, allowing better dynamic programming.</Paragraph> <Paragraph position="1"> Goodman (1998) gives a full item-based description of a GHR parser. The forward values of many of the items in our parser related to unary and epsilon productions can be computed off-line, once per grammar, which is an idea due to Stolcke (1993).</Paragraph> <Paragraph position="2"> Since reverse values require entire strings, the reverse values of these items cannot be computed until the input string is known. 
Because we use a single item-based description for precomputed items and nonprecomputed items, and for forward and reverse values, this combination of off-line and on-line computation is easily and compactly specified.</Paragraph> </Section> <Section position="6" start_page="599" end_page="600" type="sub_section"> <SectionTitle> 6.6 Grammar Transformations </SectionTitle> <Paragraph position="0"> We can apply the same techniques to grammar transformations that we have so far applied to parsing. Consider a grammar transformation, such as the Chomsky normal form (CNF) grammar transformation, which takes a grammar with epsilon, unary, and n-ary branching productions, and converts it into one in which all productions are of the form $A \to BC$ or $A \to a$. For any sentence $w_1 \ldots w_n$ its value under the</Paragraph> </Section> <Section position="7" start_page="600" end_page="600" type="sub_section"> <SectionTitle> Goodman Semiring Parsing </SectionTitle> <Paragraph position="0"> original grammar in the Boolean semiring (TRUE if the sentence can be generated by the grammar, FALSE otherwise) is the same as its value under a transformed grammar. Therefore, we say that this grammar transformation is value preserving under the Boolean semiring. We can generalize this concept of value preserving to other semirings.</Paragraph> <Paragraph position="1"> Elsewhere (Goodman 1998), we show that using essentially the same item-based descriptions we have used for parsing, we can specify grammar transformations. The concept of value preserving grammar transformation is already known in the intersection of formal language theory and algebra (Kuich 1997; Kuich and Salomaa 1986; Teitelbaum 1973). 
Our contribution is to show that these value preserving transformations can be written as simple item-based descriptions, allowing the same computational machinery to be used for grammar transformations as is used for parsing, and to some extent showing the relationship between certain grammar transformations and certain parsers, such as that of Graham, Harrison, and Ruzzo (1980). This uniform method of specifying grammar transformations is similar to, but clearer than, the techniques used with covering grammars (Nijholt 1980; Leermakers 1989).</Paragraph> </Section> </Section> <Section position="7" start_page="600" end_page="601" type="metho"> <SectionTitle> 7. Previous Work </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="600" end_page="601" type="sub_section"> <SectionTitle> 7.1 Historical Work </SectionTitle> <Paragraph position="0"> The previous work in this area is extensive, including work in deductive parsing, work in statistical parsing, and work in the combination of formal language theory and algebra. This paper can be thought of as synthetic, combining the work in all three areas, although in the course of synthesis, several general formulas have been found, most notably the general formula for reverse values. A comprehensive examination of all three areas is beyond the scope of this paper, but we can touch on a few significant areas of each.</Paragraph> <Paragraph position="1"> First, there is the work in deductive parsing. This work in some sense dates back to Earley (1970), in which the use of items in parsers is introduced. More recent work (Pereira and Warren 1983; Pereira and Shieber 1987) demonstrates how to use deduction engines for parsing. 
Finally, both Shieber, Schabes, and Pereira (1995) and Sikkel (1993) have shown how to specify parsers in a simple, interpretable, item-based format.</Paragraph> <Paragraph position="2"> This format is roughly the format we have used here, although there are differences due to the fact that their work was strictly in the Boolean semiring.</Paragraph> <Paragraph position="3"> Work in statistical parsing has also greatly influenced this work. We can trace this work back to research in HMMs by Baum and his colleagues (Baum and Eagon 1967; Baum 1972). In particular, the work of Baum developed the concept of backward probabilities (in the inside semiring), as well as many of the techniques for computing in the inside semiring. Viterbi (1967) developed corresponding algorithms for computing in the Viterbi semiring. Baker (1979) extended the work of Baum and his colleagues to PCFGs, including the computation of the outside values (or reverse inside values in our terminology). Baker's work is described by Lari and Young (1990). Baker's work was only for PCFGs in CNF, avoiding the need to compute infinite summations. Jelinek and Lafferty (1991) showed how to compute some of the infinite summations in the inside semiring, those needed to compute the prefix probabilities of PCFGs in CNF.</Paragraph> <Paragraph position="4"> Stolcke (1993) showed how to use the same techniques to compute inside probabilities for Earley parsing, dealing with the difficult problems of unary transitions, and the more difficult problems of epsilon transitions. He thus solved all of the important problems encountered in using an item-based parser to compute the inside and outside values (forward and reverse inside values); he also showed how to compute the forward Viterbi values.</Paragraph> <Paragraph position="5"> The final area of work is in formal language theory and algebra. 
Although it is not widely known, there has been quite a bit of work showing how to use formal power series to elegantly derive results in formal language theory, dating back to Chomsky and Schützenberger (1963). The major classic results can be derived in this framework, but with the added benefit that they apply to all commutative ω-continuous semirings. The most accessible introduction to this literature we have found is by Kuich (1997). There are also books by Salomaa and Soittola (1978) and Kuich and Salomaa (1986).</Paragraph> <Paragraph position="6"> One piece of work deserves special mention. Teitelbaum (1973) showed that any semiring could be used in the CKY algorithm, laying the foundation for much of the work that followed.</Paragraph> <Paragraph position="7"> In summary, this paper synthesizes work from several different related fields, including deductive parsing, statistical parsing, and formal language theory; we emulate and expand on the earlier synthesis of Teitelbaum. The synthesis here is powerful: by generalizing and integrating many results, we make the computation of a much wider variety of values possible.</Paragraph> </Section> <Section position="2" start_page="601" end_page="601" type="sub_section"> <SectionTitle> 7.2 Recent Similar Work </SectionTitle> <Paragraph position="0"> There has also been recent similar work by Tendeau (1997b, 1997a). Tendeau (1997b) gives an Earley-like algorithm that can be adapted to work with complete semirings satisfying certain conditions. Unlike our version of Earley's algorithm, Tendeau's version requires time O(n^{L+1}) where L is the length of the longest right-hand side, as opposed to O(n^3) for the classic version, and for our description. While one could split right-hand sides of rules to make them binary branching, speeding Tendeau's version up, this would then change values in the derivation semirings. 
Tendeau (1997b, 1997a) introduces a parse forest semiring, similar to our derivation forest semiring, in that it encodes a parse forest succinctly. To implement this semiring, Tendeau's rule value functions take as their input not only a nonterminal, but also the span that it covers; this is somewhat less elegant than our version. Tendeau (1997a) gives a generic description for dynamic programming algorithms. His description is very similar to our item-based descriptions, except that it does not include side conditions. Thus, algorithms such as Earley's algorithm cannot be described in Tendeau's formalism in a way that captures their efficiency.</Paragraph> <Paragraph position="1"> There are some similarities between our work and the work of Koller, McAllester, and Pfeffer (1997), who create a general formalism for handling stochastic programs that makes it easy to compute inside and outside probabilities. While their formalism is more general than item-based descriptions, in that it is a good way to express any stochastic program, it is also less compact than ours for expressing most dynamic programming algorithms. Our formalism also has advantages for approximating infinite sums, which we can do efficiently, and in some cases exactly. It would be interesting to try to extend item-based descriptions to capture some of the formalisms covered by Koller, McAllester, and Pfeffer, including Bayes' nets.</Paragraph> </Section> </Section> </Paper>