<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1006"> <Title>Attention Shifting for Parsing Speech</Title> <Section position="3" start_page="8" end_page="8" type="metho"> <SectionTitle> 3 Attention shifting </SectionTitle> <Paragraph position="0"> We explore a modification to the multi-stage parsing algorithm that ensures the first stage parser posits at least one parse for each path in the word-lattice.</Paragraph> <Paragraph position="1"> The idea behind this is to intermittently shift the attention of the parser to unexplored parts of the word-lattice. The notion of attention shifting is motivated by the work on parser FOM compensation presented in (Blaheta and Charniak, 1999).</Paragraph> <Paragraph position="2"> A used edge is a parse edge that has non-zero outside probability. By definition of the outside probability, used edges are constituents that are part of a complete parse; a parse is complete if there is a root category label (e.g., S for sentence) that spans the entire word-lattice. In order to identify used edges, we compute the outside probabilities for each parse edge (efficiently computing the outside probability of an edge requires that the inside probabilities have already been computed). In the third step of the algorithm, after the used edges have been identified, we clear the agenda, removing all partial analyses evaluated by the parser. This forces the parser to abandon analyses of parts of the word-lattice for which complete parses exist. Following this, the agenda is populated with edges corresponding to the unused words, priming the parser to consider these words. To ensure the parser builds upon at least one of these unused edges, we further modify the parsing algorithm:
* Only unused edges are added to the agenda.
* When building parses from the bottom up, a parse is considered complete if it connects to a used edge.</Paragraph> <Paragraph position="3"> These modifications ensure that the parser focuses on edges built upon the unused words. The second modification ensures the parser is able to determine when it has connected an unused word to a previously completed parse. The application of these constraints directs the attention of the parser towards new edges that contribute to parse analyses covering unused words. Each iteration of the attention shifting algorithm is guaranteed to add a parse for at least one unused word, so it takes at most |A| iterations to cover the entire lattice, where A is the set of word-lattice arcs; this guarantee follows trivially from the constraints just described. The attention-shifting parser continues until there are no unused words remaining, and each parsing iteration runs until it has found a complete parse using at least one of the unused words.</Paragraph>
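<Paragraph position="4"> To make the control flow of the algorithm concrete, the following Python sketch restates the attention-shifting loop described above. It is a minimal illustration under assumed interfaces, not the implementation used in the paper: the parser, chart, and edge operations (compute_outside_probs, covered_arcs, add_to_agenda, continue_parsing, and so on) are hypothetical names standing in for whatever a real agenda-based chart parser provides.</Paragraph>

```python
# Hypothetical sketch of the attention-shifting loop; the parser/chart
# interfaces below are assumed for illustration only.

def attention_shifting_parse(lattice, parser, initial_factor, iter_factor):
    # Initial best-first parse, with overparsing controlled by initial_factor.
    chart = parser.parse(lattice, overparse_factor=initial_factor)

    while True:
        # Inside probabilities must be computed before outside probabilities.
        chart.compute_inside_probs()
        chart.compute_outside_probs()

        # A used edge has non-zero outside probability, i.e. it is part of
        # a complete parse spanning the entire word-lattice.
        used_arcs = set()
        for edge in chart.edges():
            if edge.outside_prob > 0.0:
                used_arcs.update(edge.covered_arcs())

        unused_arcs = set(lattice.arcs()) - used_arcs
        if not unused_arcs:
            return chart  # every lattice arc is covered by a complete parse

        # Clear the agenda, abandoning analyses of already-covered regions,
        # then prime it with edges for the unused words only.
        parser.clear_agenda()
        for arc in unused_arcs:
            parser.add_to_agenda(chart.lexical_edge(arc))

        # Parse until a complete analysis builds on at least one unused word;
        # a parse counts as complete once it connects to a used edge. Each
        # iteration covers at least one new arc, so the loop runs at most
        # |A| times, where A is the set of word-lattice arcs.
        parser.continue_parsing(chart, overparse_factor=iter_factor,
                                stop_on_unused_arc_covered=True)
```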
<Paragraph position="5"> As with multi-stage parsing, an adjustable parameter determines how much overparsing to perform on the initial parse. In the attention shifting algorithm, an additional parameter specifies the amount of overparsing for each iteration after the first. The new parameter allows for independent control of the attention shifting iterations.</Paragraph> <Paragraph position="6"> After the attention shifting parser populates a parse chart with parses covering all paths in the lattice, the multi-stage parsing algorithm performs additional pruning based on the probability of the parse edges (the product of the inside and outside probabilities). This is necessary in order to constrain the size of the hypothesis set passed on to the second stage parsing model.</Paragraph> <Paragraph position="7"> The Charniak lexicalized syntactic language model effectively splits the number of parse states (edges in a PCFG parser) by the number of unique contexts in which each state is found. These contexts include syntactic structure such as parent and grandparent category labels as well as lexical items such as the head of the parent or the head of a sibling constituent (Charniak, 2001). State splitting at this level causes the memory requirements of the lexicalized parser to grow rapidly.</Paragraph> <Paragraph position="8"> Ideally, we would pass all edges on to the second stage, but due to memory limitations, pruning is necessary. It is likely that edges recently discovered by the attention shifting procedure are pruned. However, the true PCFG probability model is used to prune these edges rather than the approximation used in the FOM. We believe that by considering parses which have a relatively high probability according to the combined PCFG and acoustic models, we will include most of the analyses to which the lexicalized parser assigns a high probability.</Paragraph> </Section> <Section position="4" start_page="8" end_page="8" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> The purpose of attention shifting is to reduce the amount of work exerted by the first stage PCFG parser while maintaining the same quality of language modeling (in the multi-stage system). We have performed a set of experiments on the NIST '93 HUB-1 word-lattices. HUB-1 is a collection of 213 word-lattices resulting from an acoustic recognizer's analysis of speech utterances; the utterances were generated by professional readers reading Wall Street Journal articles.</Paragraph> <Paragraph position="1"> The first stage parser is a best-first PCFG parser trained on sections 2 through 22 and 24 of the Penn WSJ treebank (Marcus et al., 1993). Prior to training, the treebank is transformed into speech-like text, removing punctuation, expanding numerals, etc. (Brian Roark of AT&T provided a tool to perform the speech normalization.)</Paragraph> <Paragraph position="2"> Overparsing is performed using an edge pop multiplicative factor, where an edge pop is the process of the parser removing an edge from the agenda and placing it in the parse chart. The parser records the number of edge pops required to reach the first complete parse, and it continues to parse until a multiple of that number of edge pops have been popped off the agenda.</Paragraph>
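<Paragraph position="3"> Read concretely, the edge-pop control just described amounts to the following short Python sketch. It is a hypothetical rendering under an assumed agenda-parser interface (agenda_empty, pop_best_edge, and has_complete_parse are illustrative names, not the actual implementation).</Paragraph>

```python
# Hypothetical sketch of overparsing via an edge-pop multiplicative factor;
# the agenda/chart interface is assumed for illustration only.

def parse_with_overparsing(parser, factor):
    pops = 0
    pops_to_first_parse = None

    while not parser.agenda_empty():
        # An edge pop: remove the best edge from the agenda and add it
        # to the parse chart.
        parser.pop_best_edge()
        pops += 1

        if pops_to_first_parse is None and parser.has_complete_parse():
            pops_to_first_parse = pops  # work required for the first parse

        # Stop once `factor` times that much work has been performed,
        # e.g. factor=4 for the 4-times overparsing used below.
        if pops_to_first_parse is not None and pops >= factor * pops_to_first_parse:
            break

    return parser.chart()
```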
<Paragraph position="4"> The second stage parser used is a modified version of the Charniak language modeling parser described in (Charniak, 2001). We trained this parser on the BLLIP99 corpus (Charniak et al., 1999), a corpus of 30 million words automatically parsed using the Charniak parser (Charniak, 2000).</Paragraph> <Paragraph position="5"> In order to compare the work done by the n-best reranking technique to the word-lattice parser, we generated a set of n-best lattices. 50-best lists were extracted using the Chelba A* decoder; the n-best lists were provided by Brian Roark (Roark, 2001). A 50-best lattice is a sublattice of the acoustic lattice that generates only the strings found in the 50-best list. Additionally, we provide the results for parsing the full acoustic lattices (although these work measurements should not be compared to those of n-best reranking).</Paragraph> <Paragraph position="6"> We report the amount of work, shown as the cumulative number of edge pops, the oracle WER for the word-lattices after first stage pruning, and the WER of the complete multi-stage parser. In all of the word-lattice parsing experiments, we pruned the set of posited hypotheses so that no more than 30,000 local trees are generated; we chose this threshold due to the memory requirements of the second stage parser. Performing pruning at the end of the first stage prevents the attention shifting parser from reaching the minimum oracle WER (most notably in the full acoustic word-lattice experiments). While the attention-shifting algorithm ensures that all word-lattice arcs are included in complete parses, forward-backward pruning, as used here, will eliminate some of these parses, indirectly eliminating some of the word-lattice arcs.</Paragraph> <Paragraph position="7"> To illustrate the need for pruning, we computed the number of states used by the Charniak lexicalized syntactic language model for 30,000 local trees. An average of 215 lexicalized states were generated for each of the 30,000 local trees. This means that the lexicalized language model, on average, computes probabilities for over 6.5 million states when provided with 30,000 local trees.</Paragraph> <Paragraph position="8"> We recreated the results of the Charniak language model parser used for reranking in order to measure the amount of work required. We ran the first stage parser with 4-times overparsing for each string in the n-best list. The LatParse result represents running the word-lattice parser on the n-best lattices, performing 100-times overparsing in the first stage. The AttShift model is the attention shifting parser described in this paper; we used 10-times overparsing for both the initial parse and each of the attention shifting iterations. When run on the n-best lattices, this model achieves a comparable WER while reducing the amount of parser work sixfold (as compared to the regular word-lattice parser).</Paragraph> <Paragraph position="9"> In Table 3 we present the results of the word-lattice parser and the attention shifting parser when run on the full acoustic lattices. While the oracle WER is reduced, we are considering almost half as many edges as the standard word-lattice parser. The increased size of the acoustic lattices suggests that it may not be computationally efficient to consider the entire lattice and that an additional pruning phase is necessary.</Paragraph> <Paragraph position="10"> The most significant constraint of this multi-stage lattice parsing technique is that the second stage process has a large memory requirement. While the attention shifting technique does allow the parser to propose constituents for every path in the lattice, we prune some of these constituents prior to analysis by the second stage parser. Currently, pruning is accomplished using the PCFG model. One solution is to incorporate an intermediate pruning stage (e.g., a lexicalized PCFG) between the PCFG parser and the full lexicalized model. Doing so will relax the requirement for aggressive PCFG pruning and allow a lexicalized model to influence the selection of word-lattice paths.</Paragraph> </Section> </Paper>