<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2098">
  <Title>A Context-Sensitive Model for Probabilistic LR Parsing of Spoken Language with Transformation-Based Postprocessing</Title>
  <Section position="4" start_page="0" end_page="677" type="metho">
    <SectionTitle>
2 Spontaneous Speech Parsing
</SectionTitle>
    <Paragraph position="0"> The Integrated Processing unit uses the acoustic scores of the word hypotheses in the word graph and a statistical trigram model to guide all connected parsers through the lattice using an A*-search algorithm. This is similar to the work presented by (Schmkl, 1994) and (Kompe et al., 1997). This A*-search algorithm is used by the probabilistic shift-reduce parser (see section 3) to find the best-scored path through the word graph according to acoustic and language model information. If the parser runs into a syntactic "dead end" in the word graph (that is, a path that cannot be analyzed by the context-free grammar of the shift-reduce parser), the parser searches for the best-scored alternative path in the word graph that can be parsed using the context-free grammar.</Paragraph>
    <Paragraph position="1"> We extracted context-free grammars for German, English and Japanese from the Verbmobil treebank (German: 25,881 trees; English: 23,140 trees; Japanese: 4,534 trees) to be able to parse spontaneous utterances. The treebanks consist of annotated transliterations of face-to-face dialogs in the Verbmobil domains and contain utterances like * and then well you you you have hotel information no I am not how about what about Tuesday the sixteenth actually it yeah so seven hour flight The grammar of the parser covers only spontaneous speech phenomena that are contained in the treebanks.</Paragraph>
    <Paragraph position="2"> During the development of the parser we encountered severe problems with the size of the context-free grammar extracted from the treebanks. The German grammar extracted from a treebank containing 20,000 trees resulted in an LALR parsing table with more than 3,000,000 entries, which cannot be trained on only 20,000 utterances. The reason was that there are many rules in the treebank which occur only once or twice but inflate the context-free grammar and thus the size of the parsing table. For this reason we eliminate trees from our training material containing rules that occur infrequently in the treebank and use only rules achieving a minimal rule count. This threshold is determined experimentally in our training process.</Paragraph>
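    <Paragraph> A minimal sketch of this filtering step, assuming the Cat:Children term representation of trees introduced in section 4.2; all predicate names here are illustrative, not the system's own, and MinCount plays the role of the experimentally determined threshold:
% tree_rules(+Tree, -Rules): the grammar rules rule(Mother, Daughters)
% instantiated in Tree, assuming trees of the form Cat:Children.
tree_rules(Word, []) :-
    atomic(Word), !.                      % lexical leaf
tree_rules(Cat:Children, [rule(Cat, Cats)|Rules]) :-
    maplist(node_cat, Children, Cats),
    maplist(tree_rules, Children, RuleLists),
    append(RuleLists, Rules).

node_cat(Word, Word) :- atomic(Word), !.
node_cat(Cat:_, Cat).

% filter_treebank(+Trees, +MinCount, -Kept): keep only those trees all of
% whose rules occur at least MinCount times in the whole treebank.
filter_treebank(Trees, MinCount, Kept) :-
    findall(R, (member(T, Trees), tree_rules(T, Rs), member(R, Rs)), All),
    include(frequent_enough(All, MinCount), Trees, Kept).

frequent_enough(All, MinCount, Tree) :-
    tree_rules(Tree, Rules),
    forall(member(R, Rules),
           ( aggregate_all(count, member(R, All), N), N >= MinCount )).
    </Paragraph>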
  </Section>
  <Section position="5" start_page="677" end_page="678" type="metho">
    <SectionTitle>
3 A new context-sensitive approach to probabilistic shift-reduce parsing
</SectionTitle>
    <Paragraph position="0"> The work of Siemens in Verbmobil phase 1 showed that a combination of shift-reduce and unification-based parsing of word graphs works well on spontaneous speech but is not very robust on low word-accuracy input (the word error rate of the Verbmobil speech recognizers is about 25% today).</Paragraph>
    <Paragraph position="1"> One way to gain a higher degree of robustness is to use a context-free grammar instead of a unification-based grammar, hence we decided to implement and test a context-free probabilistic LALR parser in Verbmobil phase 2.</Paragraph>
    <Section position="1" start_page="677" end_page="678" type="sub_section">
      <SectionTitle>
3.1. Previous approaches
</SectionTitle>
      <Paragraph position="0"> There are several approaches (see for example (Wright &amp; Wrigley, 1991), (Briscoe &amp; Carroll, 1993/1996), (Lavie, 1996) or (Inui et al., 1997)) to probabilistic shift-reduce parsing, but only Lavie's parser, whose probabilistic model is very similar to (Briscoe &amp; Carroll, 1993), has been tested on spontaneously spoken utterances.</Paragraph>
      <Paragraph position="1"> While the model presented by (Wright &amp; Wrigley, 1991) was equivalent to the standard PCFG (probabilistic context-free grammar, see (Charniak, 1993)) model, which is not context-sensitive and thus has certain limitations in the precision that it can achieve, later work tried to implement slight context-sensitivity (e.g. the probability of a shift/reduce action in Briscoe and Carroll's model depends on the current and succeeding LR parser state and the look-ahead symbol).</Paragraph>
      <Paragraph position="2"> 3.2. Bringing context to probabilistic shift-reduce parsing Like other work on probabilistic parsing, our model is based on the equation</Paragraph>
      <Paragraph position="4"> where l_i is the part-of-speech tag for word w_i in analysis T.</Paragraph>
      <Paragraph position="5"> Finding a realistic approximation for P(T) is very difficult but important to achieve high parsing accuracy. Suppose we approximate P(W|T) by equation (3). Then P(W|T) is nothing more than P(W|L), where L is the part-of-speech tag sequence for a given utterance W. If our goal is to select the best analysis T for a given tag sequence L, we do not necessarily depend on a good approximation of P(T), but simply select the best analysis for a given L by finding a T that maximizes P(T|L) (and not P(T)). Hence, in our model we use P(T|L) instead of P(T).</Paragraph>
      <Paragraph position="7"> where T_k ranges over the set of possible analyses for L. Let D be the set of all complete shift-reduce parser action sequences for L, i.e. d_k is the sequence of shift and reduce actions that generates analysis T_k. Then we approximate P(T_k|L) by the product of the probabilities of the individual parser actions in d_k, each conditioned on its context,</Paragraph>
      <Paragraph position="9"> where |d| is the number of parser actions in d, a_{d,j} is the j-th parser action in d and c_{d,j} is the context of the parser while executing a_{d,j}.</Paragraph>
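      <Paragraph> In outline, and in standard notation, the model sketched above can be written as follows (a hedged reconstruction from the surrounding definitions; the paper's own displayed equations and their exact numbering may differ in detail):
\hat{T} = \arg\max_{T} P(T)\, P(W \mid T)
P(W \mid T) \approx \prod_{i} P(w_i \mid l_i)    (3)
\hat{T} = \arg\max_{T_k} P(T_k \mid L)    (4)
P(T_k \mid L) \approx \prod_{j=1}^{|d_k|} P(a_{d_k,j} \mid c_{d_k,j})    (5)
      </Paragraph>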
    </Section>
    <Section position="2" start_page="678" end_page="678" type="sub_section">
      <SectionTitle>
3.3. Choosing a context
</SectionTitle>
      <Paragraph position="0"> "Context" in equation (5) might be everything. It can be the classical (CurrentParserState, LookAheadSymbol) tuple; it may also contain information about the following (look-ahead) word(s), elements on the parser stack or the most probable dialogue act of the utterance, even semantic information about the roles of the syntactic head of the phrase on top of the parser stack.</Paragraph>
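      <Paragraph> For concreteness, such a context record and its projections onto subcontexts might be represented as Prolog terms as sketched below; the feature set and all names are illustrative, and K3-K5 refer to the contexts defined later in this section:
% A full training context, recorded for every parser action:
%   context(ParserState, LookAheadTag, LookAheadWord, StackTopHead, DialogueAct)
% Subcontexts are projections of that record.
subcontext(k3, context(State, Tag, _, _, _),       k3(State, Tag)).
subcontext(k4, context(State, Tag, _, Head, _),    k4(State, Tag, Head)).
subcontext(k5, context(State, Tag, Word, Head, _), k5(State, Tag, Head, Word)).
      </Paragraph>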
      <Paragraph position="1"> The training procedure of our probabilistic parser is straightforward: 1. Construct complete parser action sequences for each tree in the training set. Save all information (on every action) about the whole "context" we have chosen to use.</Paragraph>
      <Paragraph position="2">  2. Count the occurrences of all actions in different subcontexts. A subcontext may be the whole context or a (possibly empty) selection of features of the whole context. Compute the probability of a parser action with respect to a subcontext as the relative frequency of the action within this subcontext.</Paragraph>
      <Paragraph position="3"> The reason why we build subcontexts is that there is a relevant sparse-data problem in Verbmobil. A treebank containing between 20,000 and 30,000 trees is too small to give reliable values for larger contexts in a parsing table containing 500,000 entries or more. Hence we use the smoothing technique that is known as backing-off in statistical language modelling (Charniak, 1993) and approximate the probability of an action a with context k using its subcontexts c_i: P(a|k) = Σ_i α_i P(a|c_i) (6) with the α_i summing up to 1. The values for the α_i are determined experimentally. We have chosen three contexts for evaluation (K1 and K2 also exist in our model but are irrelevant for this evaluation): * K3: LR parser state and look-ahead symbol, * K4: K3 plus phrase head of the top element of the LR parsing stack, * K5: K4 plus look-ahead word.</Paragraph>
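      <Paragraph> A minimal sketch of this interpolation, assuming count/3 and total/2 facts collected in step 2 of the training procedure; the predicate names are illustrative and the weights are placeholders (the paper determines them experimentally):
:- dynamic count/3, total/2.    % count(Action, SubContext, N), total(SubContext, N)

% alpha(I, Weight): interpolation weight of the I-th subcontext (summing to 1).
alpha(1, 0.6).
alpha(2, 0.3).
alpha(3, 0.1).

% action_probability(+Action, +SubContexts, -P): backed-off estimate of P(a|k)
% as the weighted sum of relative frequencies in the subcontexts of k.
action_probability(Action, SubContexts, P) :-
    findall(W*F,
            ( nth1(I, SubContexts, C),
              alpha(I, W),
              relative_frequency(Action, C, F) ),
            Terms),
    sum_products(Terms, P).

relative_frequency(Action, Context, F) :-
    (   count(Action, Context, N), total(Context, T), T > 0
    ->  F is N / T
    ;   F = 0.0
    ).

sum_products([], 0.0).
sum_products([W*F|Rest], P) :-
    sum_products(Rest, P0),
    P is P0 + W*F.
      </Paragraph>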
      <Paragraph position="4"> Please see section 5.1 for the detailed results of this evaluation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="678" end_page="680" type="metho">
    <SectionTitle>
4 Transformation-based error correction
</SectionTitle>
    <Paragraph position="0"> Parsing spontaneous speech - even in a limited domain - is a quite ambitious task for a context-free grammar parser. We have a large set of non-terminals in our grammar that, besides phrase structure information, also encode functional information like Head or Modifier and grammatical information like accusative-complement or verb-prefix. Our current grammars contain 240 non-terminals for German, 178 for English and 200 for Japanese, and the lexicon is derived automatically from the treebank and external resources (there were only minor efforts in improving the lexicon manually).</Paragraph>
    <Paragraph position="1"> During the development of the parser we observed a constantly declining Exact Match rate of the parser, from over 80% in the early stages (with just a few hundred trees of training data) to under 50% today. The reason was that the first training samples were simple utterances on "appointment scheduling" only, while the treebank nowadays contains spontaneous utterances from two domains, and that there was a growing number of inconsistencies in the treebank due to annotation errors and a growing number of annotators. Hence we had to develop a technique to improve the exact match rate, particularly with regard to the following semantics construction process, which depends on correct syntactic analyses to produce a correct semantic representation of the utterance.</Paragraph>
    <Paragraph position="2"> (Brill, 1993) applied transformation-based learning methods to natural language processing, especially to part-of-speech tagging. He showed that it can be effective to let a system make a first guess that may be improved or corrected by subsequent transformation-based steps. We observed many systematic errors in the output of the probabilistic parser, hence we adopted this idea, took the probabilistic shift-reduce parser as the guesser, and tried to learn tree transformations from our training data to improve this first guess. We integrated the learned transformations into Verbmobil as shown in figure 2.</Paragraph>
    <Paragraph position="3"> The transformations map a tree to another tree, changing parts that have been identified as incorrect in the learning process. The output of the learning process is a set of simple Prolog clauses, each mapping a source tree to a destination tree, that are sorted by the number of matches on the training corpus (an illustrative example of their shape is sketched below).</Paragraph>
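    <Paragraph> A hypothetical example of the shape such a clause might take, using the Cat:Children tree representation of section 4.2; the category names and the predicate name transform/2 are invented for illustration and are not taken from the system:
% One learned transformation, written as a Prolog fact mapping a parser tree
% to a corrected tree. The shared logical variables A, B and C stand for
% arbitrary subtrees that are carried over unchanged.
transform(utt:[px:[A, B], C],
          utt:[A, px:[B], C]).
    </Paragraph>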
    <Section position="1" start_page="679" end_page="679" type="sub_section">
      <SectionTitle>
4.1 The Problem
</SectionTitle>
      <Paragraph position="0"> The task of learning transformations that are suitable to post-process the output of a probabilistic parser can be implemented as shown in figure 2:  1. train the probabilistic parser on a training set O (containing utterances and their human-annotated analyses).</Paragraph>
      <Paragraph position="1"> 2. parse all utterances of O and save the corresponding parser outputs P.</Paragraph>
      <Paragraph position="2">  3. find the set of as-general-as-possible transformations T that map all incorrect trees of P into corresponding correct trees in O and select the "optimal" transformation from this set.</Paragraph>
      <Paragraph position="3"> The first point has been described in section 3.3. and the second point is trivial. The as-general-as-possible transformation is the mapping of a tree of P into a tree for the same utterance in O that achieves a high degree of generalization and fulfils certain conditions, which are explained in section 4.2.</Paragraph>
      <Paragraph position="4">  1. find the set G of all common subtrees of Φ and θ.</Paragraph>
      <Paragraph position="5"> 2. find the set T of all potential transformations. A transformation t is formed by a substitution (σ) replacing one or more elements of G by logical variables in both Φ and θ (i.e. t: σ(Φ) → σ(θ)). 3. choose the "optimal" transformation from T.  Syntactic trees are represented as Prolog terms in our learning process. Since the transformation should be able to map large correct structures in Φ to their (correct) counterparts in θ, the first point of the algorithm is done by setting G equal to the set of all (Prolog) subterms that are common to Φ and θ (i.e. G = subterms(Φ) ∩ subterms(θ)). It is crucial here to attach a unique identifier to each word (like "1-hi", "2-Mr.", "3-Smith") because one word (like the article "the") could occur several times in one sentence and it is important to keep those occurrences separated for the second step of the learning algorithm.</Paragraph>
      <Paragraph position="6"> The second step computes all potential tree transformations by substituting one or more elements of G in Φ and θ by identical (Prolog) variables. In this regard, "substitution" is an operation that is inverse to the substitution known from predicate logic.</Paragraph>
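      <Paragraph> A minimal sketch of this substitution step, assuming ground trees in the Cat:Children representation and a list of common subtrees already computed in step 1; only a single common subtree is abstracted here, and all predicate names are illustrative:
% replace_subtree(+Sub, ?Var, +Term, -Gen): replace every occurrence of the
% ground subtree Sub in Term by the logical variable Var.
replace_subtree(Sub, Var, Term, Var) :-
    Term == Sub, !.
replace_subtree(_Sub, _Var, Term, Term) :-
    atomic(Term), !.
replace_subtree(Sub, Var, Cat:Children, Cat:NewChildren) :-
    replace_in_list(Sub, Var, Children, NewChildren).

replace_in_list(_, _, [], []).
replace_in_list(Sub, Var, [C|Cs], [N|Ns]) :-
    replace_subtree(Sub, Var, C, N),
    replace_in_list(Sub, Var, Cs, Ns).

% candidate_transformation(+CommonSubs, +Source, +Dest, -Transformation):
% abstract one common subtree to the same fresh variable in both trees;
% backtracking enumerates one candidate per element of CommonSubs.
candidate_transformation(CommonSubs, Source, Dest, t(GenSource, GenDest)) :-
    member(Sub, CommonSubs),
    replace_subtree(Sub, Var, Source, GenSource),
    replace_subtree(Sub, Var, Dest, GenDest).
      </Paragraph>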
    </Section>
    <Section position="2" start_page="679" end_page="680" type="sub_section">
      <SectionTitle>
4.2. The Learning Algorithm
</SectionTitle>
      <Paragraph position="0"> The learning algorithm to derive the most general tree transformations for incorrect trees in P is straightforward. To find the most general transformation for a source tree Φ ∈ P to be mapped into a destination tree θ ∈ O, the three steps listed above are carried out. The predicate subtrees(+Tree, -SubTrees) could simply be defined (in Prolog) as sketched below.</Paragraph>
      <Paragraph position="2"> Trees are represented as terms like a:[b,c], for example.</Paragraph>
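      <Paragraph> A minimal sketch of such a predicate over this representation (an illustration, not necessarily the paper's original definition):
% subtrees(+Tree, -SubTrees): all subtrees of Tree, including Tree itself.
subtrees(Tree, SubTrees) :-
    findall(Sub, subtree_of(Tree, Sub), SubTrees).

subtree_of(Tree, Tree).
subtree_of(_Cat:Children, Sub) :-
    member(Child, Children),
    subtree_of(Child, Sub).
      </Paragraph>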
      <Paragraph position="4"> Choosing the "optimal" transformation from the space of all transformations in the third step is a multi-dimensional problem. The dimensions are: * fault tolerance, * coverage of the training corpus, * degree of generalization. Fault tolerance is a parameter that indicates how many correction errors on the training corpus the human supervisor is willing to tolerate, i.e. how many of the correct parser trees may be transformed into incorrect ones. Accepting transformation errors may improve the degree of generalization of the transformation, but for Verbmobil we decided not to be fault tolerant. A correct analysis should be kept correct in our point of view.</Paragraph>
      <Paragraph position="5"> Coverage of the training corpus means that if step 2 of the learning algorithm has found several possible transformations for a Φ-θ pair, the transformation t ∈ T that covers the most examples in P/O should be preferred, because this transformation is likely to occur more often in the running system or test situation.</Paragraph>
      <Paragraph position="6"> Besides the heuristic generalization criterion of coverage of the training corpus we also introduced a formal one. If there are several transformations that do not generate errors on the training corpus and have exactly the same maximum coverage, we select the transformation which has the smallest mean distance of its logical variables to the root of the tree, because we expect the most general transformation to have its variable parts "near the root" of the trees. Distance is measured in levels from the root. For example, the transformation in figure 3 has a mean root distance of the variables of ((1+2)+(1+3))/4 = 1.75.</Paragraph>
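      <Paragraph> A small sketch of this criterion for a single tree over the Cat:Children representation (predicate names are illustrative):
% mean_variable_depth(+Tree, -Mean): mean distance of the logical variables
% in Tree from the root, counted in levels (the root itself is at level 0).
mean_variable_depth(Tree, Mean) :-
    variable_depths(Tree, 0, Depths),
    length(Depths, N),
    N > 0,
    sum_list(Depths, Sum),
    Mean is Sum / N.

variable_depths(Term, Depth, [Depth]) :-
    var(Term), !.                         % a variable: record its depth
variable_depths(Term, _Depth, []) :-
    atomic(Term), !.                      % a word or category: no variable
variable_depths(_Cat:Children, Depth, Depths) :-
    Depth1 is Depth + 1,
    variable_depths_list(Children, Depth1, Depths).

variable_depths_list([], _, []).
variable_depths_list([Child|Rest], Depth, Depths) :-
    variable_depths(Child, Depth, D1),
    variable_depths_list(Rest, Depth, D2),
    append(D1, D2, Depths).
      </Paragraph>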
      <Paragraph position="7"> jptt ..... utt-.</Paragraph>
      <Paragraph position="8"> IA\] eX o? @ BI _:px_ auf Figure 3 Using this learning algorithm we generate a set of optimal transformations for many errors the parser produced on the set of training utterances. There are still some utterances for which no valid transforlnation can be found because all potential transforlnations would generate errors on the training corpus, what we are not willing to accept.</Paragraph>
    </Section>
  </Section>
</Paper>