<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1022">
  <Title>Practical Issues in Compiling Typed Unification Grammars for Speech Recognition</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Typed Unification Grammars
</SectionTitle>
    <Paragraph position="0"> Typed Unification Grammars (TUGs), like HPSG (Pollard and Sag 1994) and Gemini (Dowding et al. 1993), are a more expressive formalism in which to write formal grammars. (This paper specifically concerns grammars written in the Gemini formalism; however, the basic issues involved in compiling typed unification grammars to context-free grammars remain the same across formalisms.) As opposed to the atomic nonterminal symbols of a CFG, each nonterminal in a TUG is a complex feature structure (Shieber 1986) to which features with values can be attached. For example, the rule: s[] → np:[num=N] vp:[num=N] can be considered a shorthand for two context-free rules (assuming just two values for number): s → np_singular vp_singular and s → np_plural vp_plural.</Paragraph>
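The expansion from one feature-annotated rule to its context-free instantiations can be made concrete with a small sketch. This is illustrative Python, not part of the original paper; the string encoding of rules (format placeholders for variables) and the feature table are assumptions:

from itertools import product

# Possible values for each finitely-valued feature (illustrative assumption).
FEATURE_VALUES = {"num": ["singular", "plural"]}

def expand_rule(mother, daughters, variables):
    """Instantiate every shared variable with each of its possible values."""
    names = sorted(variables)                     # e.g. ["N"]
    domains = [FEATURE_VALUES[variables[v]] for v in names]
    rules = []
    for combo in product(*domains):
        binding = dict(zip(names, combo))
        subst = lambda sym: sym.format(**binding)
        rules.append((subst(mother), [subst(d) for d in daughters]))
    return rules

# s -> np:[num=N] vp:[num=N], with N a shared num variable:
print(expand_rule("s", ["np_{N}", "vp_{N}"], {"N": "num"}))
# [('s', ['np_singular', 'vp_singular']), ('s', ['np_plural', 'vp_plural'])]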
    <Paragraph position="1"> This expressiveness allows us to write grammars with a small number of rules (from dozens to a few hundred) that correspond to grammars with large numbers of CF rules. Note that the approximation need not incorporate all of the features from the original grammar in order to provide a sound approximation. In particular, in order to derive a finite CF grammar, we will need to consider only those features that have a finite number of possible values, or at least consider only finitely many of the possible values for infinitely valued features. We can use the technique of restriction (Shieber 1985) to remove these features from our feature structures. Removing these features may give us a more permissive language model, but it will still be a sound approximation.</Paragraph>
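Restriction itself is simple to picture. Here is a minimal sketch, assuming feature structures are nested Python dicts and a restrictor is a set of feature names to keep; this is not the Gemini representation, just an illustration:

RESTRICTOR = {"cat", "num"}        # keep only finitely-valued features (assumed)

def restrict(fs):
    """Copy a feature structure, dropping features outside the restrictor."""
    return {feat: restrict(val) if isinstance(val, dict) else val
            for feat, val in fs.items() if feat in RESTRICTOR}

np = {"cat": "np", "num": "plural", "sem": {"pred": "dog", "args": []}}
print(restrict(np))                # {'cat': 'np', 'num': 'plural'} -- 'sem' gone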
    <Paragraph position="2"> The experimental results reported in this paper are based on a grammar under development at RIACS for a spoken dialogue interface to a semi-autonomous robot, the Personal Satellite Assistant (PSA). We consider this grammar to be medium-sized, with 61 grammar rules and 424 lexical entries. While this may sound small, if the grammar were expanded by instantiating variables in all legal permutations, it would contain an astronomically large number of context-free rules.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Compilation Process
</SectionTitle>
    <Paragraph position="0"> We will be studying the compilation process that converts typed unification grammars expressed in Gemini notation into language models for use with the Nuance speech recognizer (Nuance, 2001). We are using Nuance in part because it supports context-free language models, a capability that is not yet industry standard. Figure 1 illustrates the stages of processing: a typed unification grammar is first compiled to a context-free grammar. This is in turn converted into a grammar in Nuance's Grammar Specification Language (GSL), a form of context-free grammar in BNF-like notation, with one rule defining each nonterminal, and allowing alternation and Kleene closure on the right-hand side. Critically, the GSL must not contain any left-recursion, which must therefore be eliminated before the GSL representation is produced. The GSL representation is then compiled into a Nuance package with the Nuance compiler.</Paragraph>
    <Paragraph position="1"> This package is the input to the speech recognizer. In our experience, each of the compilation stages, as well as speech recognition itself, has the potential to lead to a combinatorial explosion that exceeds practical memory or time bounds.</Paragraph>
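Because GSL rules must be left-recursion free, a compiler along these lines needs a left-recursion check on the intermediate CFG before emitting GSL. A minimal sketch, assuming a grammar is a dict from nonterminal to a list of right-hand sides (this encoding is an assumption, not Gemini or Nuance code):

def has_left_recursion(grammar):
    """Detect a cycle in the leftmost-daughter graph via depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {nt: WHITE for nt in grammar}

    def visit(nt):
        color[nt] = GRAY
        for rhs in grammar.get(nt, []):
            first = rhs[0] if rhs else None
            if first in color:                   # leftmost symbol is a nonterminal
                if color[first] == GRAY:         # back edge: left-recursive cycle
                    return True
                if color[first] == WHITE and visit(first):
                    return True
        color[nt] = BLACK
        return False

    return any(color[nt] == WHITE and visit(nt) for nt in grammar)

g = {"np": [["np", "pp"], ["det", "n"]], "pp": [["p", "np"]]}
print(has_left_recursion(g))                     # True: np -> np pp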
    <Paragraph position="2"> We will now describe implementations of the first stage, generating a context-free grammar from a typed unification grammar, using two different algorithms: one defined by Kiefer and Krieger (2000), and one by Moore and Gawron, described in Moore (1998). The critical difficulty for both approaches is how to select the set of derived nonterminals that will appear in the final CFG.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Kiefer&amp;Krieger's Algorithm
</SectionTitle>
      <Paragraph position="0"> The algorithm of Kiefer&amp;Krieger (K&amp;K) divides this compilation step into two phases: first, the set of context-free nonterminals is determined by iterating a bottom-up search until a least fixed-point is reached; second, this least fixed-point is used to instantiate the set of context-free productions.</Paragraph>
      <Paragraph position="3"> The computation of the fixed-point F, described in Table 1, proceeds as follows. First, F_0 is constructed by finding the most-general set of feature structures that occur in the lexicon (lines 1-4). Each feature structure has the lexical restrictor L applied to it before being added to F_0 (line 3) with the ⊔ operator. This operator maintains F_0 as a set of most-general feature structures: a new feature structure is added to the set only when it is not subsumed by any current member, and any current members that are subsumed by the new element are removed as it is added. The computation of F then proceeds with the call to Iterate (line 6), which adds new feature structures that can be derived bottom-up.</Paragraph>
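The behavior of the ⊔ operator can be sketched directly. In this toy version (an assumption for illustration), feature structures are flat Python dicts, and one structure subsumes another when all of its feature/value pairs also hold there; real feature-structure subsumption, with paths and reentrancies, is more involved:

def subsumes(general, specific):
    return all(specific.get(f) == v for f, v in general.items())

def add_most_general(fs_set, new):
    """The ⊔ operator: add `new` while keeping only most-general members."""
    if any(subsumes(old, new) for old in fs_set):
        return fs_set                            # new is subsumed: no change
    survivors = [old for old in fs_set if not subsumes(new, old)]
    return survivors + [new]                     # new displaces more specific members

S = []
S = add_most_general(S, {"cat": "np", "num": "sg"})
S = add_most_general(S, {"cat": "np"})           # more general: displaces the first
print(S)                                         # [{'cat': 'np'}]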
      <Paragraph position="4"> Each call to Iterate generates a new set F_{i+1}, including F_i as its base (line 8). It then adds new feature structures to F_{i+1} by instantiating every grammar rule r in R, the set of grammar rules.</Paragraph>
      <Paragraph position="5"> The first step in the instantiation is to unify every combination of daughters with all possible feature structures from F_i (FillDaughters, line 10). The rule restrictor is applied to each resulting feature structure (line 11) before it is added to F_{i+1} using the ⊔ operator (line 12), as in the lexical case. If, after checking all rule applications bottom-up, no new feature structures have been added to F_{i+1} (line 13), then the least fixed-point has been found and the process terminates. Otherwise, Iterate is called recursively. See Kiefer and Krieger (2000) for a proof that this terminates and finds the appropriate fixed-point.</Paragraph>
      <Paragraph position="6"> Having computed the least fixed-point F, the next step is to compute the set of corresponding CF productions. For each r in R, of the form M → D1 ... Dn, the daughters are instantiated with all combinations of unifiable feature structures from F. Context-free productions M' → D1 ... Dn will be added, where M' is in F and M' subsumes the instantiated mother M. (Footnote: one might expect the instantiated mother M always to be in F, so that M → D1 ... Dn would itself be a CF production in the approximation, but this may not be true if M was removed from F by the ⊔ operator. Instead, the subsuming nonterminal M' should be the new mother.)</Paragraph>
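Both phases can be sketched end-to-end. This is a toy rendering under stated assumptions, not K&amp;K's implementation: feature structures are flat dicts, restrictors are omitted, and a "rule" is a function from daughter structures to the instantiated mother, or None when the daughters do not unify with the rule:

from itertools import product

def subsumes(general, specific):
    return all(specific.get(f) == v for f, v in general.items())

def add_most_general(fs_set, new):
    if any(subsumes(old, new) for old in fs_set):
        return fs_set
    return [o for o in fs_set if not subsumes(new, o)] + [new]

def s_rule(np, vp):
    # s -> np vp, enforcing num agreement when both daughters specify num
    if np.get("cat") != "np" or vp.get("cat") != "vp":
        return None
    if "num" in np and "num" in vp and np["num"] != vp["num"]:
        return None
    return {"cat": "s"}

def fixed_point(lexical_fs, rules):
    # Phase 1: seed F from the lexicon, then iterate bottom-up to stability
    F = []
    for fs in lexical_fs:
        F = add_most_general(F, fs)
    while True:
        F_next = list(F)
        for rule in rules:
            for d1, d2 in product(F, repeat=2):      # FillDaughters
                mother = rule(d1, d2)
                if mother is not None:
                    F_next = add_most_general(F_next, mother)
        if F_next == F:                              # least fixed-point reached
            return F
        F = F_next

def productions(F, rules):
    # Phase 2: one CF production per unifiable daughter combination, with the
    # member of F that subsumes the instantiated mother as the left-hand side
    freeze = lambda fs: tuple(sorted(fs.items()))
    prods = set()
    for rule in rules:
        for d1, d2 in product(F, repeat=2):
            mother = rule(d1, d2)
            if mother is None:
                continue
            for m in F:
                if subsumes(m, mother):
                    prods.add((freeze(m), (freeze(d1), freeze(d2))))
    return prods

lex = [{"cat": "np", "num": "sg"}, {"cat": "np", "num": "pl"},
       {"cat": "vp", "num": "sg"}, {"cat": "vp", "num": "pl"}]
F = fixed_point(lex, [s_rule])
print(len(F), len(productions(F, [s_rule])))         # 5 nonterminals, 2 productions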
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Moore and Gawron's Algorithm
</SectionTitle>
      <Paragraph position="0"> While K&amp;K uses subsumption to generate the set of most-general nonterminals, the algorithm of Moore and Gawron (M&amp;G), described in Moore (1998), attempts to propagate feature values both bottom-up and top-down through the grammar to generate a set of nonterminals that contains no variables. Also unlike K&amp;K, the production of the CF rules and associated nonterminals is interleaved. The process consists of a preprocessing stage to eliminate singleton variables, a bottom-up propagation stage, and a top-down propagation stage.</Paragraph>
      <Paragraph position="1"> The preprocessing stage rewrites the grammar to eliminate singleton variables, effectively replacing each singleton variable with a new unique atomic symbol 'ANY'. The feature structure for each lexical item and grammar rule is rewritten such that singleton variables are unified with the special value 'ANY', and every non-singleton variable expression is embedded in a val() term. After this transformation, singleton variables will not unify with non-singleton variable expressions, only with other singletons. Additional rules are then introduced to deal with the singleton variable cases: for each daughter in a grammar rule in which a singleton variable appears, new lexical items and grammar rules are introduced which unify with that daughter in the original grammar. As an example, consider a rule of the form: vp:[num=N] → v:[num=N] np:[num=M].</Paragraph>
      <Paragraph position="3"> Here, the np object of vp is underspecified for num (as English does not generally require number agreement between the verb and its object), so its num variable will be a singleton. The following rules will then be generated: vp:[num=val(N)] → v:[num=val(N)] np:[num=ANY], together with new np rules and lexical entries, copied from the original grammar, whose mothers unify with np:[num=ANY].</Paragraph>
      <Paragraph position="5"> After preprocessing, any variables remaining in the bodies of grammar rules will be shared variables. Singleton variable elimination by itself is very effective at shrinking the size of the CF grammar space, reducing the size of the rule space for the PSA grammar by many orders of magnitude.</Paragraph>
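The singleton rewrite can be sketched as follows, under an assumed string encoding in which a symbol is "cat:feature=value" and variables are marked with a leading "?"; none of this is the Gemini notation:

from collections import Counter

def eliminate_singletons(rule):
    """Replace variables occurring once with ANY; wrap shared ones in val()."""
    mother, daughters = rule
    symbols = [mother] + daughters
    is_var = lambda tok: tok.startswith("?")
    counts = Counter(tok for sym in symbols
                     for tok in sym.split("=") if is_var(tok))
    def rewrite(sym):
        feat, _, val = sym.partition("=")
        if not is_var(val):
            return sym
        return f"{feat}=ANY" if counts[val] == 1 else f"{feat}=val({val})"
    return rewrite(mother), [rewrite(d) for d in daughters]

# vp:[num=N] -> v:[num=N] np:[num=M], where M is a singleton:
print(eliminate_singletons(("vp:num=?N", ["v:num=?N", "np:num=?M"])))
# ('vp:num=val(?N)', ['v:num=val(?N)', 'np:num=ANY'])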
      <Paragraph position="8"> The bottom-up stage starts from this grammar, and derives a new grammar by propagating feature values up from the lexicon. The process acts like a chart parser, except that indices are not kept. When a rule transitions from an active edge to an inactive edge, a new rule with those feature instantiations is recorded. As a side-effect of this compilation, ε-productions are eliminated.</Paragraph>
      <Paragraph position="9"> Top-down processing fires last, and performs a recursive-descent walk of the grammar starting at the start symbol S, generating a new grammar that propagates features downward through the grammar. A side-effect of this computation is that useless productions (rules not reachable from S) are removed. Even after top-down propagation, variables may still be present in the grammar. For example, if the grammar allows sentences like &amp;quot;the deer walked&amp;quot;, which are ambiguous for number, then there will be a rule in the grammar that contains a shared variable for the number feature. To address this, as top-down propagation progresses, all remaining variables are identified and unified with a special value 'ALL'. Since each nonterminal is now ground, it is trivial to assign each nonterminal a unique atomic symbol and rewrite the grammar as a CFG.</Paragraph>
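The useless-production removal that falls out of the top-down walk amounts to a reachability computation from the start symbol. A minimal sketch, again assuming a dict-of-productions encoding:

def reachable_grammar(grammar, start):
    """Keep only the rules for nonterminals reachable from `start`."""
    keep, agenda = set(), [start]
    while agenda:
        nt = agenda.pop()
        if nt in keep or nt not in grammar:
            continue
        keep.add(nt)
        for rhs in grammar[nt]:
            agenda.extend(sym for sym in rhs if sym in grammar)
    return {nt: prods for nt, prods in grammar.items() if nt in keep}

g = {"s": [["np", "vp"]], "np": [["det", "n"]], "vp": [["v", "np"]],
     "adjp": [["adj", "adjp"]]}                  # adjp is unreachable from s
print(sorted(reachable_grammar(g, "s")))         # ['np', 's', 'vp']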
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Comparison
</SectionTitle>
      <Paragraph position="0"> Table 2 contains a summary of some key statistics generated using both techniques. The recognition results were obtained on a test set of 250 utterances. Recognition accuracy is measured in word error rate, and recognition speed is measured in multiples of real time (RT), the ratio of the CPU time required for recognition to the duration of the utterance. The size of the resulting language model is measured in terms of the number of nonterminals in the grammar and the size of the Nuance node array, a binary representation of the recursive transition network it uses to search the grammar. Ambiguity counts the average number of parses per sentence allowed by the CF grammar. As can be readily seen, the compilation time for the K&amp;K algorithm is dramatically lower than for the M&amp;G algorithm, but its recognition performance is correspondingly worse, measured in both word error rate and recognition speed.</Paragraph>
      <Paragraph position="1"> Given that the two techniques generate grammars of roughly similar sizes, the difference in performance is striking. We believe that the use of the ⊔ operator in K&amp;K is partially responsible. Consider a grammar that contains a lexical item like &amp;quot;deer&amp;quot; that is underspecified for number, and so will contain a singleton variable. This will lead to a nonterminal feature structure for noun phrase that is also underspecified for number, and which will be more general than any noun phrase feature structures that are marked for number. The ⊔ operator will remove those noun phrases as less general, effectively removing the number agreement constraint between subject and verb from the context-free approximation. The use of ⊔ thus allows a single grammar rule or lexical item to have non-local effects on the approximation. As seen in Table 2, the grammar derived from the K&amp;K algorithm is much more ambiguous than the grammar derived from the M&amp;G algorithm, and, as is further elaborated in Section 4, we believe that the amount of ambiguity can be a significant factor in recognition performance.</Paragraph>
      <Paragraph position="2"> On the other hand, attention must be paid to the amount of time and memory required by the Moore algorithm. On a medium-sized grammar, this compilation step took over 3 hours, and came close to exceeding the memory capacity of our computer, with a process size of over 1GB. The approximation is only valuable if we can succeed in computing it. Finally, it should also be noted that M&amp;G's algorithm removes ε-productions and useless productions, while we had to add a separate postprocessing stage to K&amp;K's algorithm to get comparable results.</Paragraph>
      <Paragraph position="3"> For future work we plan to explore possible integrations of these two algorithms. One possibility is to include the singleton-elimination process as an early stage in the K&amp;K algorithm.</Paragraph>
      <Paragraph position="4"> This is a relatively fast step, but may lead to a significant increase in the size of the grammar.</Paragraph>
      <Paragraph position="5"> Another possibility is to embed a variant of the K&amp;K algorithm, and its clean separation of generating nonterminals from generating CF productions, in place of the bottom-up processing stage in M&amp;G's algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Reducing Structural Ambiguity
</SectionTitle>
    <Paragraph position="0"> It has been observed (Bratt and Stolcke 1999) that a potential difficulty with using linguistically-motivated grammars as language models is that ambiguity in the grammar will lead to multiple paths in the language model for the same recognition hypothesis. In a standard beam-search architecture, depending on the level of ambiguity, this may tend to fill the beam with multiple hypotheses for the same word sequence and force other good hypotheses out of the beam, potentially increasing word error rate. This observation appears to be supported in practice. The original form of the PSA grammar allows an average of 1.4 parses per sentence, and while both the K&amp;K and M&amp;G algorithms increase the level of ambiguity, the K&amp;K algorithm increases it much more dramatically.</Paragraph>
    <Paragraph position="1"> We are investigating techniques to transform a CFG into a weakly equivalent one with less ambiguity. While it is not possible in general to remove all ambiguity (Hopcroft and Ullman 1979), we hope that reducing the amount of ambiguity in the resulting grammar will result in improved recognition performance.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Grammar Compactor
</SectionTitle>
      <Paragraph position="0"> The first technique is actually a combination of three related transformations:
- Duplicate Nonterminal Elimination - If two nonterminals A and B have exactly the same set of productions, then remove the productions for B, and rewrite B as A everywhere it occurs in the grammar.</Paragraph>
      <Paragraph position="1">
- Unit Rule Elimination - If there is only one production for a nonterminal A, and it has a single daughter B on its right-hand side, then remove the production for A, and rewrite A as B everywhere it occurs in the grammar.
- Duplicate Production Elimination - If the same production occurs more than once for a nonterminal A, then remove the duplicate.
These transformations are applied repeatedly until they can no longer be applied. Each of these transformations may introduce opportunities for the others to apply, so the process needs to be order insensitive. This technique can be applied after the traditional reduction techniques of ε-elimination, cycle elimination, and left-recursion elimination, since they do not introduce any new ε-productions or any new left-recursion. Although these transformations seem rather specialized, they were surprisingly effective at reducing the size of the grammar. For the K&amp;K algorithm, the number of grammar rules was reduced from 3,246 to 2,893, a reduction of 9.2%, and for the M&amp;G algorithm, the number of rules was reduced from 4,758 to 1,837, a reduction of 61%. While these transforms do reduce the size of the grammar, and modestly reduce the level of ambiguity from 1.96 to 1.92, they did not initially appear to improve recognition performance. However, that was with the Nuance node array optimization parameter set to the default value FULL. When set to the value MIN, the compacted grammar was approximately 60% faster, with about a 9% reduction in word error rate, suggesting that the Nuance compiler performs a similar form of compaction during node array optimization.</Paragraph>
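A minimal sketch of the compactor loop, under an assumed encoding (nonterminal -> frozenset of right-hand-side tuples, which makes duplicate production elimination automatic, since a set cannot hold the same production twice):

def rename(grammar, old, new):
    # Rewrite every occurrence of `old` as `new` on all right-hand sides
    return {nt: frozenset(tuple(new if s == old else s for s in rhs)
                          for rhs in prods)
            for nt, prods in grammar.items()}

def compact(grammar):
    changed = True
    while changed:                                   # iterate to a fixed point
        changed = False
        # Duplicate nonterminal elimination: merge nonterminals whose
        # production sets are identical
        by_prods = {}
        for nt, prods in grammar.items():
            by_prods.setdefault(prods, []).append(nt)
        for group in by_prods.values():
            keep, *dups = group
            for dup in dups:
                del grammar[dup]
                grammar = rename(grammar, dup, keep)
                changed = True
        # Unit rule elimination: a nonterminal whose only production is A -> B
        for nt, prods in list(grammar.items()):
            if nt not in grammar:
                continue
            if len(prods) == 1:
                (rhs,) = prods
                if len(rhs) == 1 and rhs[0] in grammar and rhs[0] != nt:
                    del grammar[nt]
                    grammar = rename(grammar, nt, rhs[0])
                    changed = True
    return grammar

g = {"s": frozenset({("np", "vp")}),
     "np1": frozenset({("det", "n")}),
     "np": frozenset({("np1",)})}                    # np -> np1 is a unit rule
print(compact(g))                                    # np is merged into np1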
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Immediate Recursion Detection
</SectionTitle>
      <Paragraph position="0"> Another technique to reduce ambiguity was motivated by a desire to reduce the amount of prepositional phrase attachment ambiguity in our grammar. This technique detects when a Kleene closure will be introduced into the final form of the grammar, and takes advantage of this to remove ambiguity. Consider this grammar fragment:
np → np pp
vp → v np pp</Paragraph>
      <Paragraph position="2"> The first rule tells us that an NP can be followed by an arbitrary number of PPs, and that the PP following the NP in the second rule will be ambiguous. In addition, any nonterminal that has an NP as its rightmost daughter can also be followed by an arbitrary number of PPs, so we can detect ambiguity following those nonterminals as well.</Paragraph>
      <Paragraph position="3"> We define a predicate follows as: A follows B iff B → B A, or B → α C and A follows C. Now, the follows relation can be used to reduce ambiguity by modifying other productions in which a B is followed by an A: whenever A follows B, a production of the form X → α B A can be rewritten as X → α B, since the Kleene closure on B already allows an A to follow it. There is an exactly analogous transformation involving immediate right-recursion and a similar predicate precedes. These transformations produce almost the same language, but can modify it by allowing constructions that were not allowed in the original grammar. In our case, the initial grammar fragment above would require that at least one PP be generated within the scope of the VP, but after the transformation that is no longer required. So, while these transformations are not exact, they are still sound approximations, as the resulting language is a superset of the original language.</Paragraph>
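The follows computation and the rewrite it licenses can be sketched as follows; the rule encoding is an assumption, and the rewrite deliberately skips the immediate-recursion rule itself, since rewriting it would introduce exactly the kind of cycle discussed below:

def compute_follows(rules):
    # follows contains (A, B) iff B -> B A, or B -> ... C and (A, C) in follows
    follows = {(rhs[1], lhs) for lhs, rhs in rules
               if len(rhs) == 2 and rhs[0] == lhs}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if not rhs:
                continue
            last = rhs[-1]
            for a, c in list(follows):
                if c == last and (a, lhs) not in follows:
                    follows.add((a, lhs))
                    changed = True
    return follows

def rewrite(rules, follows):
    # Drop a trailing A after B when the closure on B already allows it;
    # the immediate-recursion rule B -> B A itself is left untouched
    out = []
    for lhs, rhs in rules:
        if len(rhs) == 2 and rhs[0] == lhs:
            out.append((lhs, rhs))
            continue
        while len(rhs) >= 2 and (rhs[-1], rhs[-2]) in follows:
            rhs = rhs[:-1]
        out.append((lhs, rhs))
    return out

rules = [("np", ["np", "pp"]), ("np", ["det", "n"]), ("vp", ["v", "np", "pp"])]
f = compute_follows(rules)
print(("pp", "np") in f)       # True: pp follows np
print(rewrite(rules, f))       # vp -> v np : the ambiguous trailing pp is gone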
      <Paragraph position="4"> Unfortunately, we have had mixed results with applying these transformations. In earlier versions of our implementation, applying them improved recognition speed by up to 20%, with modest improvements in word error rate. But as we improved other aspects of the compilation process, notably the grammar compaction techniques and the left-recursion elimination technique, those improvements disappeared, and the transformations actually made things worse. The problem appears to be that both transformations can introduce cycles, and the right-recursive case can introduce left-recursion even in cases where cycles are not introduced. When the introduced cycles and left-recursions are later removed, the size of the grammar increases, which can lead to poorer recognition performance. In the earlier implementations, cycles were fortuitously avoided, probably because there were more unique nonterminals overall. We expect that these transformations may be effective for some grammars, but not others. We plan to continue to explore refinements to these techniques to prevent them from applying in cases where cycles or left-recursion may be introduced.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Left Recursion Elimination
</SectionTitle>
    <Paragraph position="0"> We have used two left-recursion elimination techniques: the traditional one based on Paull's algorithm, as reported by Hopcroft and Ullman (1979), and one described by Moore (2000), based on a technique described by Johnson (1998). Our experience concurs with Moore's: the left-corner transform he describes produces a more compact left-recursion-free grammar than Paull's algorithm does. For the K&amp;K approximation, we were unable to get any grammar to compile through to a working language model using Paull's algorithm (the models built with it caused the recognizer to exceed memory bounds), and only succeeded with Moore's left-recursion elimination technique.</Paragraph>
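For reference, a minimal sketch of Paull's algorithm under an assumed dict encoding; Moore's left-corner transform is not reproduced here:

def paull(grammar):
    """Left-recursion elimination; grammar: dict nt -> list of right-hand sides."""
    nts = list(grammar)
    for i, ai in enumerate(nts):
        # Substitute earlier nonterminals appearing leftmost in Ai's productions
        for aj in nts[:i]:
            new = []
            for rhs in grammar[ai]:
                if rhs and rhs[0] == aj:
                    new.extend(gamma + rhs[1:] for gamma in grammar[aj])
                else:
                    new.append(rhs)
            grammar[ai] = new
        # Remove immediate left recursion:
        # Ai -> Ai alpha | beta  becomes  Ai -> beta Ai'  and  Ai' -> alpha Ai' | eps
        rec = [rhs[1:] for rhs in grammar[ai] if rhs and rhs[0] == ai]
        if rec:
            prime = ai + "'"
            base = [rhs for rhs in grammar[ai] if not (rhs and rhs[0] == ai)]
            grammar[ai] = [rhs + [prime] for rhs in base]
            grammar[prime] = [alpha + [prime] for alpha in rec] + [[]]
    return grammar

g = {"np": [["np", "pp"], ["det", "n"]], "pp": [["p", "np"]]}
print(paull(g))
# {'np': [['det', 'n', "np'"]], 'pp': [['p', 'np']], "np'": [['pp', "np'"], []]}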
  </Section>
</Paper>