<?xml version="1.0" standalone="yes"?>
<Paper uid="W94-0109">
  <Title>Integrating Symbolic and Statistical Approaches in Speech and Natural Language Applications</Title>
  <Section position="3" start_page="69" end_page="73" type="metho">
    <SectionTitle>
2. APPLICATION OF TECHNIQUES
</SectionTitle>
    <Paragraph position="0"> The bulk of our work in integrating symbolic and statistical approaches has been in the development of the &amp;quot;Gister&amp;quot; system (Rohlicek, et. al 1992), which is designed to extract information from voice communications. We developed and tested the algorithms using off-the-air  commercial air traffic control recordings, where the goal was to identify the flights present and determine the scenario (e.g. takeoff, landing). We have also extended the system to extract more specific information from the ATC commands, such as direction orders and tower clearances. Figure One shows the overall boxology of the Gisting system.</Paragraph>
    <Paragraph position="1"> There are several characteristics of the domain that make it amenable to the unique combination of techniques we employed. First, the language is stereotypical, that is there are few variations in the way the information can be expressed, and we have available expertise on how the information is expressed: it is regulated by the FAA and described in FAA manuals. Second, the signal is very noisy, so that traditional techniques don't work very well (recognition results show a 25-30% word error rate). Our goal was to leverage our a priori knowledge of the domain to reduce the uncertainty inherent in the problem. The most obvious place to start was to improve speech recognition by introducing domain specific information in the language model.</Paragraph>
    <Section position="1" start_page="70" end_page="72" type="sub_section">
      <SectionTitle>
2.1 Language Modeling
</SectionTitle>
      <Paragraph position="0"> The role of the language model in the speech recognition component is to constrain the possibilities of what word can come next and to mark each possibility with its probability: the likelihood that it will occur in a particular context. A common approach to language modeling is to use statistically based Markov-chain language models (ngram models). While this approach has been shown to be effective for speech recognition, there is, in general, more structure present in natural language than n-gram models can capture. In particular n-grams do not explicitly capture long distance dependencies. For example, a private plane identifier consists of the name of a plane type, some digits, and one or two letter words (e.g. &amp;quot;Sessna six one two one kilo&amp;quot;). Because of the frequency of digits in this domain, an n-gram will find that the most likely thing to follow a digit is another digit; the relationship between the first elements of the phrase (the plane type) and the last (a letter word) is lost.</Paragraph>
      <Paragraph position="1"> In our approach we integrated phrase grammars (which were already being used to extract information from the results of recognition) with n-grams, thereby introducing as much linguistic structure and prior statistical information as is available while maintaining a robust full-coverage statistical language model for recognition.</Paragraph>
      <Paragraph position="2"> As shown in Figure Two, there are two main inputs to the model construction portion of the system: a transcribed speech training set and a phrase-structure grammar. The phrase-structure grammar is used to partially parse the training text. The output of this is: (1) a top-level version of the original text with subsequences of words replaced by the non-terminals that accept those subsequences; and (2) a set of parse trees for the instances of those nonterminals.</Paragraph>
      <Paragraph position="3"> We first describe the parser and grammar and then discuss how we use them for language modeling.</Paragraph>
      <Paragraph position="4"> For both the language modeling and information extraction (the shaded boxes in Figure 2), we are using the partial parser Sparser (McDonald 1992). Sparser is a bottom-up chart parser which uses a semantic phrase structure grammar (i.e. the nonterminals are semantic categories, such as HEADING or FLIGHT-ID, rather than traditional syntactic categories, such as CLAUSE or NOUN-PHRASE).</Paragraph>
      <Paragraph position="5"> Sparser makes no assumption that the chart will be complete, i.e. that a top level category will cover all of the input, or even that all terminals will be covered by categories, effectively allowing unknown words to be ignored. Rather it simply builds constituent structure for those phrases that are in its grammar.</Paragraph>
      <Paragraph position="6"> Our approach to creating the rules was typical of symbolic approaches: we wrote rules using our knowledge of the ATC domain gained from experts and manuals, ran them on a portion of our data, inspected the results, rewrote the rules, and iterated. In the case of flight IDs, we could apply more extensive evaluation techniques since each utterance in our corpus was already annotated with this  information. However, for other kinds of statements, such as controller orders or pilot replies, there was no master &amp;quot;answer&amp;quot; list against which to evaluate. We had only two measures to use to evaluate our grammar:l the overall coverage (what percentage of the words was covered by some category in the grammar), and the specific coverage, which can only be determined by inspecting the results by hand and noticing when some command occurred that was not picked up by the parser. Note that since in this domain we know that there is relatively little variation, so that sampling the data can be assumed to be sufficient to determine coverage, which is not the case in less constrained domains. Figure 3 shows a small set of examples of the rules:  clrd/land &gt; (&amp;quot;cleared&amp;quot; &amp;quot;to&amp;quot; land-action) clrd/takeoff &gt; (&amp;quot;cleared&amp;quot; &amp;quot;to&amp;quot; takeoff-action)) clrd/takeoff &gt; (&amp;quot;cleared&amp;quot; &amp;quot;for&amp;quot; takeoff-action ))) tower-clearance &gt; (runway clrd/land) tower-clearance &gt; (runway clrd/takeoff ))  3: Phrase structure rules for tower clearance The n-gram model was trained not with the original transcripts, but rather with transcripts where the targeted phrases defined in our grammar were replaced by their nonterminal categories. Note that in this case, where goal is to model aircraft identifiers and a small set of air traffic control commands, other phrases like the identification of the controller, traffic information, etc., are left as words to be modeled by the n-gram. Examples of the original transcripts and the n-gram training are shown below:  For the specific phrases we are interested in, we use the parse trees are used to obtain statistics for the estimation of production probabilities for the rules in the grammar.</Paragraph>
      <Paragraph position="7"> Since we assume that the production probabilities depend on their context, a simple count is insufficient. Smoothed maximum likelihood production probabilities are estimated based on context dependent counts. The context is defined as the sequence of rules and positions on the right-hand sides of the rules leading from the root of the parse tree to the non-terminal at the leaf. The probability of a parse therefore takes into account that the expansion of a category may depend on its parents.</Paragraph>
      <Paragraph position="8"> For example, in the above grammar (Figure 3), the expansion of TAKEOFF-ACTION may be different depending on whether it is part of rule 5 or rule 6. Therefore, the &amp;quot;context&amp;quot; of a production is a sequence of rules and positions that have been used up to that point, where the &amp;quot;position&amp;quot; is where in the RHS of the rule the nonterminal is. For example, in the parse shown below (Figure 4), the context of R2 (TAKEOFF-ACTION &gt; &amp;quot;takeoff&amp;quot;) is rule 6/position 3, rule 8/position 2. (See Meteer &amp; Rohlicek 1993 for a more detailed discussion of the probabilities required evaluate the probabifity of a parse.)</Paragraph>
      <Paragraph position="10"> In order to use a phrase-structure grammar directly in a time-synchronous recognition algorithm, it is necessary to construct a finite-state network representation. 2 If there is no recursion in the grammar, then this procedure is straightforward: for each rule, each possible context corresponds to a separate subnetwork. The subnetworks for different rules are nested. Figure 6 shows the expansion of the rules in Figure 3.</Paragraph>
      <Paragraph position="11"> I Note that given the narrowness of the domain, the issue in processing transcripts is rarely correctness, but rather coverage: do the rules capture all of the alternative ways the information carl be expressed.</Paragraph>
      <Paragraph position="12"> 2 The phrase grammar formalism is context free; however, in practice, we limited the grammar to finite state so that it can be more easily integrated into the recognizer. We are considering various means of finite state approximations in order to use a more powerful grammar, but haven't found sufficient need in this domain to press the issue.</Paragraph>
      <Paragraph position="13">  There have been several attempts to use probability estimates with context free grammars. The most common technique is using the Inside-Outside algorithm (e.g.</Paragraph>
      <Paragraph position="14"> Pereira &amp; Schabes 1992, Mark, et al. 1992) to infer a grammar over bracketed texts or to obtain Maximum-Likelihood estimates for a highly ambiguous grammar. However, most require a full coverage grammar, whereas we assume that only a selective portion of the text will be covered by the grammar. A second difference is that they use a syntactic grammar, which results in the parse being highly ambiguous (thus requiring the use of the Inside-Outside algorithm). We use a semantic grammar, with which there are rarely multiple interpretations for a single utterance in this domain.</Paragraph>
    </Section>
    <Section position="2" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
2.2 Information extraction
</SectionTitle>
      <Paragraph position="0"> The information extraction component of the system employs purely symbolic techniques, using same grammar defined for language modeling (as in Figure 3) with associated routines for creating referents as a side affect of firing a rule. Since the uncertainty of the problem lies in the fact that the recognition is errorful, once a grammar has been developed on one set of transcripts one can achieve nearly perfect extraction of flight IDs and commands, since they are the most regular (and regulated) portions of the utterances. In fact, because of this, we were able to use the results of the parser on the transcripts to provide an answer key for evaluation. While this is not a completely accurate test, since there may be cases where a command is expressed in a way that is outside the competence of the grammar, it does make evaluation tractable, since the time it would take to mark the transcripts by hand would be prohibitive. (See Meteer &amp; Rohlicek 1994 for a more detailed description of the information extraction portion of the system and the precision and recall results.)</Paragraph>
    </Section>
    <Section position="3" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
2.3 Scenario classification
</SectionTitle>
      <Paragraph position="0"> Another component of the Gisting system is scenario classification: given a dialog between pilot and controller, determine the overall scenario being followed. An important aspect of the problem is that classification is performed on the output of the speech recognizer. We used a standard statistical technique for classification, a decision tree constructed using the CART methodology (Breiman, et.al). Decision trees have the advantage that they simultaneously select what the most discriminating features are (from some given feature set, which in the case of text classification is generally the words), and build the model.</Paragraph>
      <Paragraph position="1"> Decision trees are interesting predictors, in that they often find features that are telling, but that an expert would not necessarily have thought of. For example if one scenario is more likely to include a radio frequency, then the word &amp;quot;point&amp;quot; may turn out to be very discriminating. When applying classification to the output of recognition, one must choose not only features that are distinguishing, but also ones that are easily recognized, so that they will be reliably in the output. One must be also careful to cross validate results on a test set to avoid overtraining: finding features that are peculiar to the training. For example if in the collected data, one airline had many more takeoffs than landings, then that airline may be picked as a discriminator, even though it is not a good discriminator in general (all the planes that take off eventually land.</Paragraph>
      <Paragraph position="2"> We used integrated symbolic methods into classification by using the parser and grammar to augment the input to the classifier with semantic features, as shown in the example below. 3 This is the same process as that which created the 3 Note that for clarity this example is from the transcriptions. not from the output of re, cognition. In the Gisting system, the  N-gram training, only in this case, nonterminals are inserted rather than replacing phrases. Note that some of the categories merely emphasize something already available from a lexical item, such as &amp;quot;takeoff' and &amp;quot;takeoff-action&amp;quot;, whereas others capture information that is only implicit, such as the fact that &amp;quot;two two right&amp;quot; is a runway.</Paragraph>
      <Paragraph position="3"> COMMERCIAL-AIRPLANE nera thirty four ninety eight TURN-ORDER turn right heading two seven zero CLRD/TAKEOFF RUNWAY two two fight CLRDITAKEOFF cleared for TAKEOFF-ACTION takeoff In a sense, the parser provided equivalence classes for phrases, since, for example, the nonterminal RUNWAY was added when any one of the several runways were mentioned.</Paragraph>
      <Paragraph position="4"> As in the information extraction component, we used the parser to determine the correct scenario based on the transcripts, which in this case provided training for the model. Remember, the uncertainty in this problem is introduced by the poor recognition results; the domain is sufficiently narrow that scenarios can be classified deterministically. For each dialog (which we determine using the speaker and hearer fields in the transcript), the system parsed transmissions until an unambiguous command is found (for example &amp;quot;cleared to land&amp;quot; and &amp;quot;contact ground&amp;quot; are only given when a plane is landing), then marked all the transmissions in that dialog as to the scenario. There will be some cases that are uncertain, for example, if only part of the transcription is available, and these cases are marked &amp;quot;unknown&amp;quot; and presented to the user who may be able to find some more subtle clue to the scenario.</Paragraph>
    </Section>
    <Section position="4" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
2.4 Event Spotting
</SectionTitle>
      <Paragraph position="0"> We are also applying these techniques in other applications. In particular, we have recently performed experiments in Event Spotting, which is an extension of word spotting where the goal is to determine the location 'of phrases, rather than single keywords. We used the parser/extraction portion of the system to find examples of phrase types in the corpus and to evaluate the results, as well as in the language model of the recognizer. In an experiment detecting time and date phrases in the Switchboard corpus (which is conversational telephone quality data), we saw an increase in detection rate over strictly bi-gram or phoneme loop language models (Jeanrenaud, et al. 1994).</Paragraph>
      <Paragraph position="1"> from the recognizer and the annotated transcription; testing is just on the Ist best recognition output.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>