<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2237"> <Title>Using Chunk Based Partial Parsing of Spontaneous Speech in Unrestricted Domains for Reducing Word Error Rate in Speech Recognition</Title> <Section position="5" start_page="1453" end_page="1453" type="metho"> <SectionTitle> 3 Reranking of Speech Recognizer Nbest Lists </SectionTitle> <Paragraph position="0"> State-of-the-art speech recognizers, such as the JANUS recognizer (Waibel et al., 1996) whose output we used for our system, typically generate lattices of word hypotheses. From these lattices, Nbest lists can be computed automatically, such that the ordering of hypotheses in these lists corresponds to the internal ranking of the speech recognizer.</Paragraph> <Paragraph position="1"> As an example, we present a reference utterance (i.e., "what was actually said") and two hypotheses from the Nbest list, given with their rank:
REF: YOU WEREN'T BORN JUST TO SOAK UP SUN
1: YOU WEREN'T BORN JUSTICE SO CUPS ON
190: YOU WEREN'T BORN JUST TO SOAK UP SUN
This is a typical example, in that hypotheses which are ranked further down the list are frequently closer to the true (reference) utterance (i.e., their WER would be lower).5 So, if we had an oracle that could tell the speech recognizer to always pick the hypothesis with the lowest WER from the Nbest list (instead of the top ranked hypothesis), the global performance could be improved significantly.6</Paragraph> <Paragraph position="2"> In the speech recognizer architecture, the search module is guided mostly by very local phenomena, both in the acoustic models (a context of several phones) and in the language models (a context of several words). Also, the recognizer does not make use of any syntactic (or constituent-based) knowledge.</Paragraph> <Paragraph position="3"> Thus, the intuitive idea is to generate representations that allow for a discriminative judgment between different hypotheses in the Nbest list, so that eventually a more plausible candidate can be identified if, as is the case in the following example, the resulting chunk structure is more likely to be well-formed than that of the first ranked hypothesis:
1: [np YOU] [vc WEREN'T BORN] [np JUSTICE] [advp SO] [np CUPS] [advp ON]
190: [np YOU] [vc WEREN'T BORN] [advp JUST] [vc TO SOAK UP] [np SUN]
We use two main scores to assess this plausibility: (i) a chunk coverage score (the percentage of the input string which gets parsed), and (ii) a chunk language model score, which uses a standard n-gram model over the chunk sequences. The latter should give worse scores in cases like hypothesis (1) in our example, where we encounter the vc-np-advp-np-advp sequence, as opposed to hypothesis (190) with the more natural vc-advp-vc-np sequence.</Paragraph> <Paragraph position="4"> 5 ... find the correct hypothesis in the lattice.</Paragraph> </Section> </Section>
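To make the two scores concrete, here is a minimal sketch (ours, not the authors' code; the chunk inventory, probabilities, and function names are invented) that computes a coverage score and a chunk-sequence n-gram score for the two chunked hypotheses above, with a bigram model standing in for the 3-gram and 5-gram backoff models used later in the paper:

    import math

    def chunk_coverage(chunks):
        """Fraction of tokens parsed into a real chunk ("skip" marks
        words the parser could not fit into any chunk)."""
        total = sum(len(tokens) for _, tokens in chunks)
        covered = sum(len(tokens) for label, tokens in chunks if label != "skip")
        return covered / total if total else 0.0

    def chunk_lm_logprob(labels, bigram_logp, unigram_logp):
        """Standard n-gram score over the chunk-label sequence (a bigram
        model here for brevity)."""
        score = 0.0
        for prev, cur in zip(["<s>"] + labels, labels + ["</s>"]):
            score += bigram_logp.get((prev, cur), unigram_logp.get(cur, -10.0))
        return score

    # Chunked hypotheses 1 and 190 from the example above.
    hyp1 = [("np", ["YOU"]), ("vc", ["WEREN'T", "BORN"]), ("np", ["JUSTICE"]),
            ("advp", ["SO"]), ("np", ["CUPS"]), ("advp", ["ON"])]
    hyp190 = [("np", ["YOU"]), ("vc", ["WEREN'T", "BORN"]), ("advp", ["JUST"]),
              ("vc", ["TO", "SOAK", "UP"]), ("np", ["SUN"])]

    # Toy probabilities: the np-advp-np-advp zigzag of hypothesis 1 keeps
    # falling back to cheap unigrams, so hypothesis 190 scores higher.
    # Both hypotheses parse fully here, so coverage ties at 1.0 and the
    # chunk LM score decides.
    bigram = {("np", "vc"): math.log(0.4), ("vc", "advp"): math.log(0.2),
              ("advp", "vc"): math.log(0.2), ("vc", "np"): math.log(0.3)}
    unigram = {"np": math.log(0.3), "vc": math.log(0.2), "advp": math.log(0.1)}

    for name, hyp in (("1", hyp1), ("190", hyp190)):
        labels = [label for label, _ in hyp]
        print(name, chunk_coverage(hyp), chunk_lm_logprob(labels, bigram, unigram))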
<Section position="6" start_page="1453" end_page="1454" type="metho"> <SectionTitle> 4 System Architecture </SectionTitle> <Section position="1" start_page="1453" end_page="1454" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> Figure 1 shows the global system architecture.</Paragraph> <Paragraph position="1"> The Nbest lists are generated from lattices that are produced by the JANUS speech recognizer (Waibel et al., 1996). First, hypothesis duplicates with respect to silence and noise words are removed from the Nbest lists;7 next, the word stream is tagged with Brill's part of speech (POS) tagger (Brill, 1994), Version 1.14, adapted to the SWITCHBOARD corpus. Then, the token stream is "cleaned up" in the preprocessing pipe, whose output serves as the input of the POS based chunk parser. Finally, the chunk representations generated by the parser are used to compute scores which are the basis of the rescoring component that eventually generates new, reranked Nbest lists.</Paragraph> <Paragraph position="2"> In the following, we describe the major components of the system in more detail.</Paragraph> </Section> <Section position="2" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 4.2 Preprocessing Pipe </SectionTitle> <Paragraph position="0"> The preprocessing pipe consists of a number of filter components that simplify the input for subsequent components without loss of essential information. Multiple word repetitions and non-content interjections or adverbs (e.g., "actually") are removed from the input, some short forms are expanded (e.g., "we'll" → "we will"), and frequent word sequences are combined into a single token (e.g., "a lot of" → "a_lot_of"). Longer turns are segmented into short clauses, which are defined as consisting of at least a subject and an inflected verbal form.</Paragraph> </Section> <Section position="3" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 4.3 Chunk Parser </SectionTitle> <Paragraph position="0"> The chunk parser is a chart based context free parser, originally developed for the purpose of semantic frame parsing (Ward, 1991). For our purposes, we define the chunks to be the relevant concepts in the underlying grammar. We use 20 different chunks that consist of part of speech sequences (there are 40 different POS tags in the version of Brill's tagger that we are using). Since the grammar is non-recursive, no attachments of constituents are made, and, also due to its small size, parsing is extremely fast (more than 2000 tokens per second).8 The parser takes the POS sequence from the tagged input, parses it into chunks, and finally these POS chunks are combined again with the words from the input stream, as sketched below.</Paragraph> </Section>
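The following is a greatly simplified, hypothetical illustration of this chunking step, not Ward's chart parser; the POS patterns are invented stand-ins for the 20 chunk types. A tiny non-recursive grammar is matched greedily (longest pattern first) against the tag sequence, unparsable tokens are marked as skipped, and the resulting POS chunks are recombined with the words:

    RULES = {  # chunk label -> POS-tag patterns (invented examples)
        "np":   [("PRP",), ("DT", "NN"), ("NN",)],
        "vc":   [("VBD", "VBN"), ("TO", "VB", "RP"), ("VBD",)],
        "advp": [("RB",)],
    }

    def chunk(words, tags):
        """Greedy longest-match chunking of a POS-tagged token stream."""
        out, i = [], 0
        while i < len(tags):
            best = None
            for label, patterns in RULES.items():
                for pat in patterns:
                    if tuple(tags[i:i + len(pat)]) == pat and \
                            (best is None or len(pat) > len(best[1])):
                        best = (label, pat)
            if best:
                label, pat = best
                out.append((label, words[i:i + len(pat)]))
                i += len(pat)
            else:                      # unparsable token -> skipped word
                out.append(("skip", [words[i]]))
                i += 1
        return out

    words = "you weren't born just to soak up sun".split()
    tags = ["PRP", "VBD", "VBN", "RB", "TO", "VB", "RP", "NN"]
    print(chunk(words, tags))
    # [('np', ['you']), ('vc', ["weren't", 'born']), ('advp', ['just']),
    #  ('vc', ['to', 'soak', 'up']), ('np', ['sun'])]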
<Section position="4" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 4.4 Nbest Rescorer </SectionTitle> <Paragraph position="0"> The rescorer's task is to take an Nbest list generated by the speech recognizer and to label each element of this list (= hypothesis) with a new score which should correspond to the true WER of the respective hypothesis; these new scores are then used for the reranking of the Nbest list. Thus, in the optimal case, the hypothesis with the lowest WER would move to the top of the reranked Nbest list.</Paragraph> <Paragraph position="1"> The three main components of the rescorer are the following (a sketch of the rescorer follows after item 3):</Paragraph> </Section> </Section> <Section position="7" start_page="1454" end_page="1454" type="metho"> <SectionTitle> 1. Score Calculation </SectionTitle> <Paragraph position="0"> There are three types of scores used: (a) the normalized score from the recognizer (with respect to the acoustic and language models used internally): highest score = lowest rank number in the original Nbest list; (b) chunk coverage scores, derived from the relative coverage of the chunk parser for each hypothesis: highest score = complete coverage, no skipped words in the hypothesis; (c) a chunk language model score: this is a standard n-gram score, derived from the sequence of chunks in each hypothesis (as opposed to the sequence of words in the recognizer): high score = high probability of the chunk sequence. The chunk language model was computed on the chunk parses of the LDC9 SWITCHBOARD transcripts (about 3 million words total); we computed standard 3-gram and 5-gram backoff models.</Paragraph> <Paragraph position="1"> 2. Reranking Neural Network: We are using a standard three layer backpropagation neural network. The input units are the scores described above; the output unit should be a good predictor of the true WER of the hypothesis. For the training of the neural net, the data was split randomly into a training and a test set.</Paragraph> </Section> <Section position="8" start_page="1454" end_page="1454" type="metho"> <SectionTitle> 3. Cutoff Filter </SectionTitle> <Paragraph position="0"> Initial experiments and data analysis showed clearly that for short utterances (less than 5-10 words) the potential reduction in WER is usually low: many of these utterances are (almost) correctly recognized in the first place. For this reason, this filter prevents the application of reranking to these short utterances.</Paragraph> </Section>
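A minimal sketch of how these three components could fit together, with invented weights, data, and function names (the paper uses a standard backprop net with one hidden unit, cf. section 5.5, but this is not its implementation): the net maps the three scores to a predicted WER, and the cutoff filter leaves short utterances in their original order.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(1, 3)), np.zeros(1)   # one hidden unit (cf. 5.5)
    W2, b2 = rng.normal(size=(1, 1)), np.zeros(1)

    def forward(x):
        """Predict the WER of a hypothesis from its 3-score vector."""
        h = np.tanh(W1 @ x + b1)
        return (W2 @ h + b2)[0]

    def train(data, lr=0.05, epochs=10):
        """Plain backpropagation with squared error against the true WER."""
        global W1, b1, W2, b2
        for _ in range(epochs):
            for x, wer in data:
                h = np.tanh(W1 @ x + b1)
                dy = (W2 @ h + b2)[0] - wer          # output error
                dh = dy * W2[0] * (1.0 - h ** 2)     # backprop through tanh
                W2 -= lr * dy * h[None, :]; b2 -= lr * dy
                W1 -= lr * np.outer(dh, x); b1 -= lr * dh

    def rerank(nbest, min_len=8):
        """nbest: list of (words, score_vector). The cutoff filter keeps
        short utterances in their original recognizer order."""
        if np.mean([len(words) for words, _ in nbest]) < min_len:
            return nbest
        return sorted(nbest, key=lambda hyp: forward(hyp[1]))

    # Invented toy data: (recognizer, coverage, chunk-LM) scores + true WER.
    data = [(np.array([0.9, 1.0, -16.5]), 25.0),
            (np.array([1.0, 0.8, -19.1]), 62.5)]
    train(data)
    nbest = [("you weren't born justice so cups on".split(),
              np.array([1.0, 0.8, -19.1])),
             ("you weren't born just to soak up sun".split(),
              np.array([0.9, 1.0, -16.5]))]
    # min_len lowered so the short toy utterance passes the cutoff filter.
    print([words for words, _ in rerank(nbest, min_len=5)])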
<Section position="9" start_page="1454" end_page="1456" type="metho"> <SectionTitle> 5 Experiment: System Performance </SectionTitle> <Section position="1" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 5.1 Data </SectionTitle> <Paragraph position="0"> The data we used for system training, testing, and evaluation were drawn from the SWITCHBOARD and CALLHOME LVCSR10 evaluation in spring 1996 (Finke and Zeppenfeld, 1996). In total, 374 utterances were used, randomly split into a training and a test set. For these utterances, Nbest lists of length 300 were created from speech recognizer lattices.11 The word error rates (WER) of these sets are given in Table 1. While the true WER corresponds to the WER of the first (= top ranked) hypothesis, the optimal WER is computed under the assumption that an oracle would always pick the hypothesis with the lowest WER in every Nbest list. The difference between the average true WER and the optimal WER is 13.1%; this is the maximum margin of improvement that reranking can possibly achieve on this data set. Another interesting figure is the expected WER gain if a random process reranked the Nbest lists and just picked an arbitrary hypothesis to be the (new) top one. For the test set, this expected WER gain is -4.9% (i.e., the WER would increase by 4.9%).</Paragraph> </Section> <Section position="2" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 5.2 Global System Speed </SectionTitle> <Paragraph position="0"> The system runtime, starting from the POS tagger through all components up to the final evaluation of WER gain for the 103 utterances of the test set (ca. 8400 hypotheses, 145000 tokens), is less than 10 minutes on a DEC Alpha workstation (200 MHz, 192 MB RAM), i.e., the throughput is more than 10 utterances per minute (or 840 hypotheses per minute).</Paragraph> </Section> <Section position="3" start_page="1454" end_page="1454" type="sub_section"> <SectionTitle> 5.3 Part Of Speech Tagger </SectionTitle> <Paragraph position="0"> We are using Brill's part of speech tagger as an important preprocessing component of our system (Brill, 1994). As our evaluations show, the performance of this component is quite crucial to the whole system's performance, in particular to the segmentation module and to the POS based chunk parser. Since the original tagger was trained on written corpora (Wall Street Journal, Brown corpus), we had to adapt it and retrain it on SWITCHBOARD data. The tagset was slightly modified and adapted to accommodate phenomena of spoken language (e.g., hesitation words, fillers) and to facilitate the task of the segmentation module (e.g., by tagging clausal and non-clausal coordinators differently). After the adaptive training, the POS accuracy is 91.2% on general SWITCHBOARD12 and 88.3% on a manually tagged subset of the training data we used for our experiments.13 Fortunately, some of these tagging errors are irrelevant with respect to the POS based chunk grammar: the tagger's performance with respect to this grammar is 92.8% on general SWITCHBOARD, and 90.6% for the manually tagged subset from our training set.</Paragraph> </Section> <Section position="4" start_page="1454" end_page="1456" type="sub_section"> <SectionTitle> 5.4 Chunk Parser </SectionTitle> <Paragraph position="0"> The evaluation of the chunk parser's accuracy was done on the following data sets: (i) 20 utterances (5 references and 15 speech recognizer hypotheses) (20utts); (ii) the same data, but with manual corrections of POS tags and short clause segment boundaries (20utts-corr).</Paragraph> <Paragraph position="1"> For each word appearing in the chunk parser's output (including the skipped words14), it was determined whether it belonged to the correct chunk or whether it had to be classified as an error. The results of this evaluation are given in Table 2. We see that an optimally preprocessed input is indeed crucial for the accuracy of the parser: it increases from 87.4% to 97.0%.15</Paragraph> <Paragraph position="2"> 13 These numbers are significantly lower than those achievable by taggers for written language; we conjecture that one reason for this lower performance is the more refined tagset we use, which causes a higher amount of ambiguity for some frequent words.</Paragraph> <Paragraph position="3"> 14 Skipped words are words that could not be parsed into any chunks.</Paragraph> </Section>
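The per-word evaluation just described could look like the following sketch (our reading of the procedure, not the authors' evaluation script; the labels and data are invented):

    def per_word_accuracy(predicted, gold):
        """predicted/gold: lists of (chunk_label, word) pairs, one per word;
        a skipped word counts as correct only if gold also leaves it out."""
        assert len(predicted) == len(gold)
        correct = sum(p == g for p, g in zip(predicted, gold))
        return correct / len(gold)

    pred = [("np", "you"), ("vc", "weren't"), ("vc", "born"), ("skip", "just")]
    gold = [("np", "you"), ("vc", "weren't"), ("vc", "born"), ("advp", "just")]
    print(per_word_accuracy(pred, gold))  # 0.75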
<Section position="5" start_page="1456" end_page="1456" type="sub_section"> <SectionTitle> 5.5 Nbest Rescorer </SectionTitle> <Paragraph position="0"> The task of the Nbest list rescorer is performed by a neural net, trained on chunk coverage, chunk language model, and speech recognizer scores, with the true WER as the target value. We ran experiments to test various combinations of the following parameters: type of chunk language model (3-gram vs. 5-gram); chunk score parameters (e.g., penalty factors for skipped words, length normalization parameters); hypothesis length cutoffs (for the cutoff filter); number of hidden units; number of training epochs.</Paragraph> <Paragraph position="1"> The net with the best performance on the test set has one hidden unit and is trained for 10 epochs. A length cutoff of 8 words is used, i.e., only hypotheses whose average length is >= 8 are actually considered as reranking candidates. A 3-gram chunk language model proved to be slightly better than a 5-gram model.</Paragraph> <Paragraph position="2"> Table 3 gives the results for the entire test set and for a subset of 21 hypotheses (eval21) which had a potential gain of at least three word errors (when comparing the first ranked hypothesis with the hypothesis which has the fewest errors).16 We also calculated the cumulative average WER before and after reranking, over the size of the Nbest list for various hypotheses.17 Figure 2 shows the plots of these two graphs for the example utterance in section 3 ("you weren't born just to soak up sun"). We see very clearly that in this example not only does the new first hypothesis have a significant WER gain compared to the old one, but in general hypotheses with lower WER moved towards the top of the Nbest list.</Paragraph> <Paragraph position="3"> Table 4: eight recognizer hypotheses from the example utterance (hypothesis 4 exactly corresponds to the reference): you weren't born justice so cups on / you weren't born just to sew cups on / you weren't born justice vocal song / you weren't born just to soak up sun / you weren't foreign just to sew cups on / you weren't born justice so courts on / you weren't born just to sew carp song / you weren't boring just to soak up son.</Paragraph> <Paragraph position="4"> A more detailed account of 8 hypotheses from the same example utterance is given in Tables 4 (which lists the recognizer hypotheses) and 5 (where the various scores, the WER, and the ranks before and after the reranking procedure are provided). It can be seen that while the new first best hypothesis is not the one with the lowest WER, it does have a lower WER than the originally first ranked hypothesis (25.0% vs. 62.5%).</Paragraph> <Paragraph position="5"> 15 (Abney, 1996) reports a comparable per word accuracy for his CASS2 chunk parser (92.1%).</Paragraph> <Paragraph position="6"> 16 While the latter set was obtained post hoc (using the known WER), it is conceivable to approximate this biased selection when fairly reliable confidence annotations from the speech recognizer are available (Chase, 1997).</Paragraph> <Paragraph position="7"> 17 Average of the WER of hypotheses 1 to k in the Nbest list.</Paragraph> </Section> </Section>
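The cumulative average WER of footnote 17 is straightforward to compute. In the sketch below (ours), the first WER value of each list is taken from the example in Table 5 (62.5% for the old and 25.0% for the new top hypothesis); the remaining values are invented:

    def cumulative_avg(wers):
        """Average WER of hypotheses 1..k, for every k (cf. footnote 17)."""
        out, total = [], 0.0
        for k, w in enumerate(wers, start=1):
            total += w
            out.append(total / k)
        return out

    before = [62.5, 50.0, 62.5, 0.0, 50.0]    # WER (%) in original Nbest order
    after  = [25.0, 0.0, 50.0, 62.5, 50.0]    # WER (%) after reranking
    print(cumulative_avg(before))
    print(cumulative_avg(after))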
<Section position="10" start_page="1456" end_page="1457" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Using the neural net with the characteristics described in the previous section, we were able to get a positive effect in WER reduction on a non-biased test set. While this effect is quite small, one has to keep in mind that the (constituent-like) chunk representations were the only source of information for our reranking system, in addition to the internal scores of the speech recognizer. It can be expected that including more sources of knowledge, such as the plausibility of correct verb-argument structures (the correct match of subcategorization frames) and the likelihood of selectional restrictions between the verbal heads and their head noun arguments, would further improve these results.</Paragraph> <Paragraph position="1"> The second observation we make when looking at the markedly positive results of the eval21 set concerns the potential benefit of selecting good candidates for reranking in the first place.</Paragraph> </Section> <Section position="11" start_page="1457" end_page="1457" type="metho"> <SectionTitle> 7 Comparison: Human Study </SectionTitle> <Paragraph position="0"> One of our motivations for using syntactic representations for the task of Nbest list reranking was the intuition that frequently, by just reading through the list of hypotheses, one can eliminate highly implausible candidates or favor more plausible ones.</Paragraph> <Paragraph position="1"> To put this intuition to the test, we conducted a small experiment where human subjects were asked to look at pairs of speech recognizer hypotheses drawn from the Nbest lists and to decide which of these they considered to be "more well-formed". Well-formedness was judged in terms of (i) structure (syntax) and (ii) meaning (semantics). 128 hypothesis pairs were extracted from the training set (the top ranked hypothesis and the hypothesis with the lowest WER) and presented in random order to the subjects.</Paragraph> <Paragraph position="2"> Four subjects participated in the study, and Table 6 gives the results of its evaluation: WER gain is measured in the same way as in our system evaluation -- here, it corresponds to the average reduction in WER if the well-formedness judgements of the human subjects were used to rerank the respective hypothesis pairs.</Paragraph> <Paragraph position="3"> While the maximum WER gain for these 128 hypothesis pairs is 15.2%, the expected WER gain (i.e., the WER gain of a random process) is 7.6%. Whereas the difference between both methods and a random choice is highly significant (syntax: α = ...),18 the difference between the two methods is not (α = 0.05, t = -1.273, df = 6).19 The latter is most likely due to the fact that only few hypotheses were judged differently in terms of syntactic or semantic well-formedness: on average, only 6% of the hypothesis pairs received a different judgement by one subject.</Paragraph> <Paragraph position="4"> 18 These results were obtained using the one-sided t-test.</Paragraph> <Paragraph position="5"> 19 Two-sided t-test.</Paragraph> </Section>
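A sketch of the reported significance test, assuming (as df = 6 suggests) that per-subject average WER gains of the two conditions are compared with an unpaired two-sided t-test; the per-subject numbers below are placeholders, not the study's data:

    from scipy.stats import ttest_ind

    syntax_gain    = [5.1, 4.8, 5.3, 4.9]   # % WER gain per subject (invented)
    semantics_gain = [5.4, 5.0, 5.2, 5.1]   # % WER gain per subject (invented)

    # Unpaired two-sided t-test; df = 4 + 4 - 2 = 6, matching the paper.
    t, p = ttest_ind(syntax_gain, semantics_gain)
    print(f"t = {t:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")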
<Section position="12" start_page="1457" end_page="1458" type="metho"> <SectionTitle> 8 Future Work </SectionTitle> <Paragraph position="0"> From our results and experiments, we conclude that there are several promising directions for future work: * improvement of the POS tagger: Since the performance of this component was shown to be of essential importance for the later stages of the system, we expect to see benefits from putting effort into further training.</Paragraph> <Paragraph position="1"> * alternative language models: An idea for improvement here is to integrate skipped words into the LM (similar to the modeling of noise in speech). In this way, we get rid of the skipping penalties we have used so far, which blur the statistical nature of the model.</Paragraph> <Paragraph position="2"> * identifying good reranking candidates: So far, the only heuristic we use for determining when to rerank and when not to is the length-cutoff filter, which excludes short utterances from the final reranking procedure. (Chase, 1997) showed that there are a number of potentially useful "features" from various sources within the recognizer which can predict, at least to a certain extent, the "confidence" that the recognizer has in a particular hypothesis. Hypotheses which have a higher WER on average also exhibit a higher word gain potential, and therefore these predictions appear to be promising indeed.</Paragraph> <Paragraph position="3"> * adding argument structure representations: The chunk representation in our system only gives an idea of which constituents there are in a clause and what their ordering is. A richer model also has to include the dependencies between these chunks. Exploiting statistics about subcategorization frames of verbs and selectional restrictions would be a way to enhance the available representations.</Paragraph> </Section> </Paper>