File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-1006_intro.xml

Size: 12,758 bytes

Last Modified: 2025-10-06 14:02:20

<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1006">
  <Title>Attention Shifting for Parsing Speech</Title>
  <Section position="2" start_page="0" end_page="8" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Language modeling has been dominated by the linear n-gram for the past few decades. A number of syntactic language models have proven to be competitive with the n-gram and better than the most popular n-gram, the trigram (Roark, 2001; Xu et al., 2002; Charniak, 2001; Hall and Johnson, 2003). Language modeling for speech could well be the first real problem for which syntactic techniques are useful.</Paragraph>
    <Paragraph position="1"> [Figure 1 example sentence:] John ate the pizza on a plate with a fork.</Paragraph>
    <Paragraph position="2">  One reason that we expect syntactic models to perform well is that they are capable of modeling long-distance dependencies that simple n-gram models cannot. [Footnote: This research was supported in part by NSF grants 9870676 and 0085940.]</Paragraph>
    <Paragraph position="3"> For example, the model presented by Chelba and Jelinek (Chelba and Jelinek, 1998; Xu et al., 2002) uses syntactic structure to identify lexical items in the left-context, which are then modeled as an n-gram process. The model presented by Charniak (Charniak, 2001) identifies both syntactic structural and lexical dependencies that aid in language modeling. While there are n-gram models that attempt to extend the left-context window through the use of caching and skip models (Goodman, 2001), we believe that linguistically motivated models, such as these lexical-syntactic models, are more robust.</Paragraph>
    <Paragraph position="4"> Figure 1 presents a simple example to illustrate the nature of long-distance dependencies. Using a syntactic model such as the Structured Language Model (Chelba and Jelinek, 1998), we predict the word fork given the context {ate, with}, where a trigram model uses the context {with, a}. Consider the problem of disambiguating between . . . plate with a fork and . . . plate with effort. The syntactic model captures the semantic relationship between the words ate and fork. The syntactic structure allows us to find lexical contexts for which there is some semantic relationship (e.g., predicate-argument). Unfortunately, syntactic language modeling techniques have proven to be extremely expensive in terms of computational effort. Many employ the use of string parsers; in order to utilize such techniques for language modeling one must preselect a set of strings from the word-lattice and parse each of them separately, an inherently inefficient procedure. Of the techniques that can process word-lattices directly, it takes significant computation to achieve the same levels of accuracy as the n-best reranking method. This computational cost is the result of increasing the search space evaluated with the syntactic model (parser); the larger space results from combining the search for syntactic structure with the search for paths in the word-lattice.</Paragraph>
    <Paragraph position="5"> In this paper we propose a variation of a probabilistic word-lattice parsing technique that increases [...]. The model has two stages: a first-stage PCFG parser generates a set of candidate parses over strings in a word-lattice, while the second stage rescores these candidate edges using a lexicalized syntactic language model (Charniak, 2001). Under this paradigm, the first stage is not only responsible for selecting candidate parses, but also for selecting paths in the word-lattice. Due to the computational and memory requirements of the lexicalized model, the second-stage parser is capable of rescoring only a small subset of all parser analyses. For this reason, the PCFG prunes the set of parser analyses, thereby indirectly pruning paths in the word-lattice.</Paragraph>
    <Paragraph position="6"> We propose adding a meta-process to the first-stage that effectively shifts the selection of word-lattice paths to the second stage (where lexical information is available). We achieve this by ensuring that for each path in the word-lattice the first-stage parser posits at least one parse.</Paragraph>
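A minimal Python sketch of such a meta-process, under the assumption that it can be realized by tracking which word arcs are already covered by complete parses; the Parser interface (initialize, step, is_complete_parse, boost_agenda) and the arc-coverage bookkeeping are illustrative names, not the authors' implementation:

    # Illustrative sketch only: the Parser interface and the arc-coverage strategy
    # are assumptions, not the authors' algorithm.
    def first_stage_with_attention_shifting(parser, lattice):
        """Keep the first-stage parser running until every word arc in the lattice
        is covered by at least one complete parse, so that no lattice path is
        pruned before the second stage (where lexical information is available)."""
        parser.initialize(lattice)
        uncovered = set(lattice.arcs)               # word arcs not yet in any complete parse
        while uncovered:
            edge = parser.step()                    # one best-first expansion from the agenda
            if edge is None:                        # agenda exhausted; nothing left to derive
                break
            if parser.is_complete_parse(edge):
                uncovered -= set(edge.word_arcs())  # arcs spanned by this complete parse
                # Shift attention: prefer agenda edges over still-uncovered arcs instead
                # of re-deriving parses for the already-favored sublattice.
                parser.boost_agenda(uncovered)
        return parser.complete_parses()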
    <Paragraph position="8"> The noisy channel model for speech is presented in Equation 1, where A represents the acoustic data extracted from a speech signal, and W represents a word string. The acoustic model P(A|W) assigns probability mass to the acoustic data given a word string and the language model P(W) defines a distribution over word strings. Typically the acoustic model is broken into a series of distributions conditioned on individual words (though these are based on false independence assumptions).</Paragraph>
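A LaTeX sketch of the decomposition described above; the exact forms and numbering of Equations 1 and 2 are assumed from the surrounding prose rather than copied from the original:

    \begin{align}
    % Noisy channel model for speech: acoustic data A, word string W (assumed form of Eq. 1).
    P(A, W) &= P(A \mid W)\, P(W) \tag{1}\\
    % Per-word decomposition of the acoustic model, with W = w_1 \dots w_n and a_i the
    % acoustic evidence aligned to w_i; this is the (false) independence assumption noted above.
    P(A \mid W) &\approx \prod_{i=1}^{n} P(a_i \mid w_i) \tag{2}
    \end{align}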
    <Paragraph position="10"> The result of the acoustic modeling process is a set of string hypotheses; each word of each hypothesis is assigned a probability by the acoustic model.</Paragraph>
    <Paragraph position="11"> Word-lattices are a compact representation of the output of the acoustic recognizer; an example is presented in Figure 2. The word-lattice is a weighted directed acyclic graph where a path in the graph corresponds to a string predicted by the acoustic recognizer. The product of the weights along a path (equivalently, the sum of the log weights), which are the acoustic probabilities, is the probability of the acoustic data given the string. Typically we want to know the most likely string given the acoustic data.</Paragraph>
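A small Python sketch of a word-lattice as a weighted directed acyclic graph, where arc weights are negative log acoustic probabilities and a path's score is their sum; the data layout and names are assumptions for illustration, not the recognizer's actual output format:

    from collections import defaultdict

    class WordLattice:
        """Weighted DAG: states are positions in the lattice, arcs carry a word
        and its negative log acoustic probability (illustrative layout)."""
        def __init__(self):
            self.arcs = defaultdict(list)   # start state -> [(end state, word, -log P(a|w))]

        def add_arc(self, start, end, word, neg_log_prob):
            self.arcs[start].append((end, word, neg_log_prob))

        def path_neg_log_prob(self, path):
            """Sum of -log acoustic probabilities along a path of (start, end, word)
            triples, i.e. -log P(A | W) for the string W that the path spells out."""
            total = 0.0
            for start, end, word in path:
                total += next(w for (e, wd, w) in self.arcs[start]
                              if e == end and wd == word)
            return total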
    <Paragraph position="13"> In Equation 3 we use Bayes' rule to find the optimal string given P(A|W), the acoustic model, and P(W), the language model. Although the language model can be used to rescore the word-lattice, it is typically used to select a single hypothesis. We focus our attention in this paper on syntactic language modeling techniques that perform complete parsing, meaning that parse trees are built upon the strings in the word-lattice.</Paragraph>
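A hedged LaTeX reconstruction of Equation 3 as described (Bayes' rule applied to select the optimal string; P(A) is constant over W and so drops out of the argmax):

    \begin{align}
    \hat{W} \;=\; \operatorname*{argmax}_{W} P(W \mid A)
            \;=\; \operatorname*{argmax}_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
            \;=\; \operatorname*{argmax}_{W} P(A \mid W)\, P(W) \tag{3}
    \end{align}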
    <Section position="1" start_page="1" end_page="8" type="sub_section">
      <SectionTitle>
2.1 n-best list reranking
</SectionTitle>
      <Paragraph position="0"> Much effort has been put forth in developing efficient probabilistic models for parsing strings (Caraballo and Charniak, 1998; Goldwater et al., 1998; Blaheta and Charniak, 1999; Charniak, 2000; Charniak, 2001); an obvious solution to parsing word-lattices is to use n-best list reranking. The n-best list reranking procedure, depicted in Figure 3, utilizes an external language model that selects a set of strings from the word-lattice. These strings are analyzed by the parser, which computes a language model probability. This probability is combined with the acoustic model probability to rerank the strings according to the joint probability P(A,W). [Figure caption: To rescore a word-lattice, each arc is assigned a new score (probability) defined by a new model (in combination with the acoustic model).]</Paragraph>
      <Paragraph position="3"> There are two significant disadvantages to this approach. First, we are limited by the performance of the language model used to select the n-best lists. Usually, the trigram model is used to select n paths through the lattice generating at most n unique strings. The maximum performance that can be achieved is limited by the performance of this extractor model. Second, of the strings that are analyzed by the parser, many will share common substrings. Much of the work performed by the parser is duplicated for these substrings. This second point is the primary motivation behind parsing word-lattices (Hall and Johnson, 2003).</Paragraph>
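A short Python sketch of the reranking step described above; the candidate list, the toy scores, and the parser_logprob callable are illustrative assumptions, not the paper's code:

    def rerank_nbest(candidates, parser_logprob):
        """Rerank n-best hypotheses by the joint score
        log P(A, W) = log P(A | W) + log P(W).

        candidates:      (word string, acoustic log-probability) pairs, i.e. the n
                         paths pulled from the word-lattice by the extractor model
                         (usually a trigram).
        parser_logprob:  callable giving the syntactic language model's log P(W).
        """
        scored = [(acoustic + parser_logprob(words), words)
                  for words, acoustic in candidates]
        scored.sort(reverse=True)          # highest joint log-probability first
        return scored

    # Toy usage with made-up scores for the Figure 1 disambiguation example:
    hyps = [("john ate the pizza on a plate with a fork", -42.0),
            ("john ate the pizza on a plate with effort", -41.5)]
    best = rerank_nbest(hyps, lambda w: -10.0 if "fork" in w else -14.0)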
    </Section>
    <Section position="2" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
2.2 Multi-stage parsing
</SectionTitle>
      <Paragraph position="0"> In Figure 4 we present the general overview of a multi-stage parsing technique (Goodman, 1997; Charniak, 2000; Charniak, 2001). This process is known as coarse-to-fine modeling, where coarse models are more efficient but less accurate than fine models, which are robust but computationally expensive. In this particular parsing model a PCFG best-first parser (Bobrow, 1990; Caraballo and Charniak, 1998) is used to search the unconstrained space of parses P over a string. This first stage performs overparsing, which effectively allows it to generate a set of high-probability candidate parses P'. These parses are then rescored using a lexicalized syntactic model (Charniak, 2001). Although the coarse-to-fine model may include any number of intermediary stages, in this paper we consider this two-stage model. [Table 1: First-stage parsing. 1. Parse word-lattice with PCFG parser. 2. Overparse, generating additional candidates. 3. Compute inside-outside probabilities. 4. Prune candidates with probability threshold.]</Paragraph>
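A Python sketch of the two-stage, coarse-to-fine pipeline outlined above; the parser and model interfaces (overparse, complete_parses, inside_outside_probability, score) are assumed for illustration and do not correspond to a specific implementation:

    def two_stage_parse(lattice, pcfg_parser, lexicalized_model, prune_threshold=1e-4):
        """Coarse stage: best-first PCFG parsing with overparsing, then
        inside-outside pruning. Fine stage: rescore the survivors with the
        lexicalized syntactic model (Charniak, 2001)."""
        # Coarse stage: parse the word-lattice with the PCFG and overparse to
        # generate additional candidate analyses.
        chart = pcfg_parser.overparse(lattice)

        # Compute inside-outside probabilities and prune candidates whose
        # probability falls below the threshold.
        candidates = [p for p in chart.complete_parses()
                      if chart.inside_outside_probability(p) >= prune_threshold]

        # Fine stage: rescore the remaining candidates with the expensive model.
        rescored = [(lexicalized_model.score(p), p) for p in candidates]
        return max(rescored, key=lambda sp: sp[0])   # highest-scoring analysis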
      <Paragraph position="1"> There is no guarantee that parses favored by the second stage will be generated by the first stage. In other words, because the first stage model prunes the space of parses from which the second stage rescores, the first stage model may remove solutions that the second stage would have assigned a high probability.</Paragraph>
      <Paragraph position="2"> In (Hall and Johnson, 2003), we extended the multi-stage parsing model to work on word-lattices. The first-stage parser, Table 1, is responsible for positing a set of candidate parses over the word-lattice. Were we to run the parser to completion it would generate all parses for all strings described by the word-lattice. As with string parsing, we stop the first-stage parser early, generating a subset of all parses. Only the strings covered by complete parses are passed on to the second-stage parser. This indirectly prunes the word-lattice of all word-arcs that were not covered by complete parses in the first stage.</Paragraph>
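A brief Python sketch of the indirect lattice pruning just described, under the assumption that each complete parse can report the word arcs its yield covers (illustrative interface, not the actual parser):

    def prune_uncovered_arcs(lattice_arcs, complete_parses):
        """Keep only word arcs covered by at least one complete first-stage parse;
        arcs outside every complete parse are dropped, which indirectly prunes
        paths in the word-lattice."""
        covered = set()
        for parse in complete_parses:
            covered.update(parse.word_arcs())
        return [arc for arc in lattice_arcs if arc in covered]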
      <Paragraph position="3"> We use a first-stage PCFG parser that performs a best-first search over the space of parses, which means that it depends on a heuristic "figure-of-merit" (FOM) (Caraballo and Charniak, 1998). A good FOM attempts to model the true probability of a chart edge, P(N^i_{j,k}). Generally, this probability is impossible to compute during the parsing process as it requires knowing both the inside and outside probabilities (Charniak, 1993; Manning and Schütze, 1999). The FOM we describe is an approximation to the edge probability and is computed using an estimate of the inside probability times an approximation to the outside probability. The inside estimate is computed incrementally during bottom-up parsing. The normalized acoustic probabilities from the acoustic recognizer are included in this calculation.</Paragraph>
      <Paragraph position="9"> The outside probability is approximated with a bitag model and the standard tag/category boundary model (Caraballo and Charniak, 1998; Hall and Johnson, 2003). Equation 4 presents the approximation to the outside probability. The part-of-speech tags T are the candidate tags immediately to the left and right of the constituent N^i_{j,k}. The fwd() and bkwd() functions are the HMM forward and backward probabilities calculated over a lattice containing the part-of-speech tag, the word, and the acoustic scores from the word-lattice to the left and right of the constituent, respectively. The terms p(N^i|T) and p(T|N^i) are the boundary statistics, which are estimated from training data (details of this model can be found in (Hall and Johnson, 2003)).</Paragraph>
      <Paragraph position="17"> The best-first search employed by the first-stage parser uses the FOM defined in Equation 5, where e is a normalization factor based on path length C(j, k). The normalization factor prevents small constituents from consistently being assigned a higher probability than larger constituents (Goldwater et al., 1998). [Footnote: An alternative to the inside and outside probabilities are the Viterbi inside and outside probabilities (Goldwater et al., 1998; Hall and Johnson, 2003).]</Paragraph>
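A hedged LaTeX reconstruction of Equations 4 and 5 as described in the text; the indexing, the factorization, and the normalization term (written "e" in the extracted prose, rendered here as \eta^{C(j,k)}) follow the bitag/boundary figure-of-merit of Caraballo and Charniak (1998) and are assumptions rather than a verbatim copy of the paper's equations:

    \begin{align}
    % Eq. 4 (assumed form): bitag + boundary approximation to the outside probability of
    % constituent N^i spanning positions j..k; T_j and T_{k+1} range over candidate
    % part-of-speech tags immediately to the left and right of the constituent.
    \hat{\alpha}(N^i_{j,k}) \;\approx\; \sum_{T_j,\,T_{k+1}}
        \mathrm{fwd}(T_j)\; p(N^i \mid T_j)\; p(T_{k+1} \mid N^i)\; \mathrm{bkwd}(T_{k+1})
        \tag{4}\\
    % Eq. 5 (assumed form): FOM = outside approximation times the incrementally computed
    % inside estimate, with a normalization term based on the path length C(j,k).
    \mathrm{FOM}(N^i_{j,k}) \;=\; \hat{\alpha}(N^i_{j,k})\;\hat{\beta}(N^i_{j,k})\;
        \eta^{\,C(j,k)} \tag{5}
    \end{align}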
      <Paragraph position="19"> Although this heuristic works well for directing the parser towards likely parses over a string, it is not an ideal model for pruning the word-lattice. First, the outside approximation of this FOM is based on a linear part-of-speech tag model (the bitag). Such a simple syntactic model is unlikely to provide realistic information when choosing a word-lattice path to consider. Second, the model is prone to favoring subsets of the word-lattice causing it to posit additional parse trees for the favored sublattice rather than exploring the remainder of the word-lattice. This second point is the primary motivation for the attention shifting technique presented in the next section.</Paragraph>
    </Section>
  </Section>
</Paper>