<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1030"> <Title>Head-Driven Parsing for Word Lattices</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> The word lattice parser was evaluated with several metrics -- WER, labelled precision and recall, crossing brackets, and time and space resource usage. Following Roark (2001), we conducted evaluations using two experimental sets -- strings and word lattices. We optimized settings (thresholds, variable beam function, base beam value) for parsing using development test data consisting of strings for which we have annotated parse trees.</Paragraph> <Paragraph position="1"> Parsing accuracy on word lattices was not evaluated directly, as we do not have annotated parse trees for comparison. Furthermore, standard parsing measures such as labelled precision and recall are not directly applicable when the number of words differs between the proposed parse tree and the gold standard. Our scores for parsing strings are lower than those of the original implementation of Collins (1999). The WER scores for this, the first application of the Collins (1999) model to parsing word lattices, are comparable to other recent work in syntactic language modelling, and better than a simple trigram model trained on the same data.</Paragraph> <Paragraph position="2"> [Footnote 3] Parse trees are commonly scored with the PARSEVAL set of metrics (Black et al., 1991).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Parsing Strings </SectionTitle> <Paragraph position="0"> The lattice parser can parse strings by creating a single-path lattice from the input (all word transitions are assigned an input score of 1.0). The lattice parser was trained on sections 02-21 of the Wall Street Journal portion of the Penn Treebank (Taylor et al., 2003). Development testing was carried out on section 23 in order to select model thresholds and variable beam functions. Final testing was carried out on section 00, and the PARSEVAL measures (Black et al., 1991) were used to evaluate performance.</Paragraph> <Paragraph position="1"> The scores for our experiments are lower than the scores of the original implementation of model II (Collins, 1999). This difference is likely due in part to differences in POS tagging. Tag accuracy for our model was 93.2%, whereas the original implementation of Collins (1999) model II achieved a tag accuracy of 96.75%. In addition to the different tagging strategy for unknown words, mentioned above, we restrict the tag-set considered by the parser for each word to those suggested by a simple first-stage tagger.[4] By reducing the tag-set considered by the parsing model, we reduce the search space and increase the speed. However, the simple tagger used to narrow the search also introduces tagging error.</Paragraph> <Paragraph position="2"> The utility of the overparsing extension can be seen in Table 1. Each of the PARSEVAL measures improves when overparsing is used. [Table 1 caption: LP/LR = labelled precision/recall. CB is the average number of crossing brackets per sentence. 0 CB and 2 CB are the percentages of sentences with zero crossing brackets and with at most two crossing brackets, respectively. Ref is model II of (Collins, 1999).]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Parsing Lattices </SectionTitle> <Paragraph position="0"> The success of the parsing model as a language model for speech recognition was measured both by parsing accuracy (parsing strings with annotated reference parses) and by WER. WER is measured by parsing word lattices and comparing the sentence yield of the highest-scoring parse tree to the reference transcription (using NIST SCLITE for alignment and error calculation).[5] We assume the parsing performance achieved by parsing strings carries over approximately to parsing word lattices.</Paragraph> [Footnote 5: An alternative would be to sum the parse tree scores for each sentence hypothesis and choose the sentence with the best sum of parse tree scores; we instead choose the yield of the single highest-scoring parse tree. Summation is too computationally expensive given the model -- we do not even generate all possible parse trees, but instead restrict generation using dynamic programming.]
<Paragraph position="1"> Two different corpora were used in training the parsing model on word lattices: (1) sections 02-21 of the WSJ Penn Treebank (the same sections used to train the model for parsing strings) [1 million words], and (2) section &quot;1987&quot; of the BLLIP corpus (Charniak et al., 1999) [20 million words].</Paragraph> <Paragraph position="2"> The BLLIP corpus is a collection of Penn Treebank-style parses of the three-year (1987-1989) Wall Street Journal collection from the ACL/DCI corpus (approximately 30 million words).[6] The parses were automatically produced by the parser of Charniak (2001). As the memory usage of our model corresponds directly to the amount of training data used, we were restricted by available memory to using only one section (1987) of the full corpus. Using the BLLIP corpus, we expected lower-quality parse results due to the higher parse error of the corpus compared to the manually annotated Penn Treebank. The WER was expected to improve, as the BLLIP corpus has much greater lexical coverage.</Paragraph> <Paragraph position="3"> The training corpora were modified using a utility by Brian Roark to convert newspaper text to speech-like text before being used as training input to the model. Specifically, all numbers were converted to words (60 -> sixty) and all punctuation was removed. We tested the performance of our parser on the word lattices from the NIST HUB-1 evaluation task of 1993. The lattices are derived from a set of utterances produced from Wall Street Journal text -- the same domain as the Penn Treebank and the BLLIP training data. The word lattices were previously pruned to the 50 best paths by Brian Roark, using the A* decoding of Chelba (2000). The word lattices of the HUB-1 corpus are directed acyclic graphs in the HTK Standard Lattice Format (SLF), consisting of a set of vertices and a set of edges.</Paragraph> <Paragraph position="4"> Vertices, or nodes, are defined by a time-stamp and labelled with a word. The set of labelled, weighted edges represents the word utterances. A word w is hypothesized over edge e if e ends at a vertex v labelled w. Edges are associated with transition probabilities and are labelled with an acoustic score and a language model score. The lattices of the HUB-1 corpus were generated with a language model trained using a 20 thousand word vocabulary and a 40 million word training sample. The word lattices have a unique start and end point, and each complete path through a lattice represents an utterance hypothesis.</Paragraph>
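To make the lattice representation concrete, the following sketch shows one way to hold an SLF-style lattice in memory: time-stamped nodes, and edges labelled with a word, an acoustic score, and a language model score. It also shows the degenerate single-path lattice used for string parsing (Section 5.1). The class and field names are illustrative assumptions of ours, not the HTK SLF specification or the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    node_id: int
    time: float      # time-stamp labelling the vertex

@dataclass
class Edge:
    start: int       # source node id
    end: int         # target node id; the word is hypothesized at this vertex
    word: str
    acoustic: float  # acoustic score (a) attached to the edge
    lm: float        # trigram language model score (lm) attached to the edge

@dataclass
class Lattice:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

def single_path_lattice(sentence: str) -> Lattice:
    """Degenerate lattice for parsing a plain string: one node per word
    boundary and a single path of edges. The paper fixes each transition's
    input score at 1.0; neutral placeholder scores are used here because
    plain text has no acoustic or trigram scores."""
    words = sentence.split()
    lat = Lattice(nodes=[Node(i, float(i)) for i in range(len(words) + 1)])
    for i, w in enumerate(words):
        lat.edges.append(Edge(start=i, end=i + 1, word=w, acoustic=0.0, lm=0.0))
    return lat
```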
<Paragraph position="5"> As the parser operates in a left-to-right manner, and closure is performed at each node, the input lattice edges must be processed in topological order; input lattices were therefore sorted before parsing. This corpus has been used in other work on syntactic language modelling (Chelba, 2000; Roark, 2001; Hall and Johnson, 2003).</Paragraph> <Paragraph position="6"> The word lattices of the HUB-1 corpus are annotated with an acoustic score, a, and a trigram probability, lm, for each edge. The input edge score stored in the word lattice is computed from a, the acoustic score, and lm, the trigram score stored in the lattice. The total edge weight in the parser is a scaled combination of these scores with the parser score derived from the model parameters: w = alpha * a + beta * lm + s, where w is the edge weight and s is the score assigned by the parameters of the parsing model. We optimized performance on a development subset of the test data, yielding alpha = 1/16 and beta = 1.</Paragraph> <Paragraph position="11"> There is an important difference between the tokenization of the HUB-1 corpus and the Penn Treebank format. Clitics (e.g., he's, wasn't) are split from their hosts in the Penn Treebank (he 's, was n't), but not in the word lattices. The Treebank tokenization cannot easily be converted into the lattice tokenization, as the two parts often fall into different parse constituents. We used the lattices modified by Chelba (2000) to deal with this problem -- contracted words are split into two parts and the edge scores are redistributed. Following Hall and Johnson (2003), we used the Treebank tokenization for measuring WER. The model was tested with and without overparsing.</Paragraph> <Paragraph position="12"> We see from Table 2 that overparsing has little effect on the WER. The word sequence most easily parsed by the model (i.e., the one generating the first complete parse tree) is likely also the word sequence found by overparsing. Although overparsing may have little effect on WER, we know from the experiments on strings that overparsing increases parse accuracy. This introduces a speed-accuracy tradeoff: depending on what type of output is required from the model (parse trees or strings), the additional time and resource requirements of overparsing may or may not be warranted.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Parsing N-Best Lattices vs. N-Best Lists </SectionTitle> <Paragraph position="0"> The application of the model to 50-best word lattices was compared to rescoring the 50-best paths individually (50-best list parsing). The results are presented in Table 2.</Paragraph> <Paragraph position="1"> The cumulative number of edges added to the chart per word is, in all cases, an order of magnitude larger for n-best lists than for the corresponding n-best lattices. As the WERs are similar, we conclude that parsing n-best lists requires more work than parsing n-best lattices for the same result; parsing lattices is therefore more efficient. This is because common substrings are considered only once per lattice. The amount of computational savings depends on the density of the lattices -- for very dense lattices, the equivalent n-best list parsing will parse common substrings up to n times. In the limit of lowest density, a lattice may have paths with no overlap, and the number of edges per word would be the same for the lattice and the lists.</Paragraph>
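As a rough illustration of why the lattice saves work, the sketch below counts the word tokens a parser must introduce when rescoring an n-best list (every word of every hypothesis) versus the number of distinct edges when hypotheses sharing a prefix share edges. The prefix-sharing construction is a crude stand-in for a real word lattice (which also merges suffixes), and the function names are ours.

```python
def nbest_list_tokens(hypotheses):
    """Word tokens processed when each hypothesis is parsed separately."""
    return sum(len(h.split()) for h in hypotheses)

def shared_prefix_edges(hypotheses):
    """Distinct edges when hypotheses are merged by shared prefixes:
    a common substring at the start of several hypotheses is counted once."""
    edges = set()
    for h in hypotheses:
        words = h.split()
        for i, w in enumerate(words):
            edges.add((tuple(words[:i]), w))  # edge identified by its left context
    return len(edges)

hyps = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cats sat on the mat",
]
print(nbest_list_tokens(hyps))    # 18 -- every word of every hypothesis
print(shared_prefix_edges(hyps))  # 13 -- shared prefixes contribute edges once
```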
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Time and Space Requirements </SectionTitle> <Paragraph position="0"> The algorithms and data structures were designed to minimize parameter lookup times and the memory usage of the chart and parameter set (Collins, 2004).</Paragraph> <Paragraph position="1"> To increase parameter lookup speed, all parameter values are precomputed for all levels of back-off at training time. By contrast, Collins (1999) calculates parameter values by looking up event counts at run-time. The implementation was then optimized using a memory and processor profiler and debugger. Parsing the complete set of HUB-1 lattices (213 sentences, a total of 3,446 words) takes approximately 8 hours on average, on an Intel Pentium 4 (1.6 GHz) Linux system using 1 GB of memory.</Paragraph> <Paragraph position="2"> Memory requirements for parsing lattices are vastly greater than for equivalent parsing of a single sentence, as chart size increases with the number of divergent paths in a lattice. Additional analysis of resource issues can be found in Collins (2004).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.5 Comparison to Previous Work </SectionTitle> <Paragraph position="0"> The results of our best experiments for lattice- and list-parsing are compared with previous results in Table 3. The oracle WER[7] for the HUB-1 lattices is 3.4%; for the pruned 50-best lattices, the oracle WER is 7.8%. We see that by pruning the lattices using the trigram model, we already introduce additional error. Because of the memory usage and time required for parsing word lattices, we were unable to test our model on the original &quot;acoustic&quot; HUB-1 lattices, and are thus limited by the oracle WER of the 50-best lattices and by the bias introduced by pruning with a trigram model. Where available, we also present comparative scores of the sentence error rate (SER) -- the percentage of sentences in the test set for which there was at least one recognition error.</Paragraph> <Paragraph position="1"> Note that due to the small size of the HUB-1 corpus (213 samples), the differences seen in SER may not be significant.</Paragraph> <Paragraph position="2"> We see an improvement in WER for our parsing model alone (alpha = beta = 0) trained on 1 million words of the Penn Treebank, compared to a trigram model trained on the same data -- the &quot;Treebank Trigram&quot; noted in Table 3. This indicates that the larger context considered by our model allows for performance improvements over the trigram model alone. Further improvement is seen with the combination of acoustic, parsing, and trigram scores (alpha = 1/16, beta = 1). However, the combination of the parsing model (trained on 1M words) with the lattice trigram (trained on 40M words) resulted in a higher WER than the lattice trigram alone. This indicates that our 1M word training set is not sufficient to permit effective combination with the lattice trigram. When the training of the head-driven parsing model was extended to the BLLIP 1987 corpus (20M words), the combination of models (alpha = 1/16, beta = 1) achieved additional improvement in WER over the lattice trigram alone.</Paragraph>
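To make the weighting scheme concrete, the following sketch combines the three scores attached to a lattice edge in the scaled form given in Section 5.2 (itself reconstructed from the surrounding text). It assumes a, lm, and s are log-domain scores; the function name and the example numbers are illustrative only.

```python
def edge_weight(a: float, lm: float, s: float,
                alpha: float = 1.0 / 16, beta: float = 1.0) -> float:
    """Scaled combination of acoustic (a), lattice trigram (lm), and
    parser (s) scores, assumed here to be log-domain scores."""
    return alpha * a + beta * lm + s

# Parsing model alone: acoustic and trigram scores are switched off.
w_parser_only = edge_weight(a=-1200.0, lm=-35.0, s=-80.0, alpha=0.0, beta=0.0)
# Combined model with the settings tuned on development data (alpha = 1/16, beta = 1).
w_combined = edge_weight(a=-1200.0, lm=-35.0, s=-80.0)
print(w_parser_only, w_combined)  # -80.0 and -190.0
```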
<Paragraph position="3"> The current best-performing models for the HUB-1 corpus, in terms of WER, are those of Roark (2001), Charniak (2001) (applied to n-best lists by Hall and Johnson (2003)), and the SLM of Chelba and Jelinek (2000) (applied to n-best lists by Xu et al. (2002)). However, n-best list parsing, as seen in our evaluation, requires repeated analysis of common subsequences, a less efficient process than directly parsing the word lattice.</Paragraph> <Paragraph position="4"> The reported results of Roark (2001) and Chelba (2000) are for parsing models interpolated with the lattice trigram probabilities. Hall and Johnson (2003) do not use the lattice trigram scores directly; however, as in the other work, the lattice trigram is used to prune the acoustic lattice to the 50 best paths. The difference in WER between our parser and those of Charniak (2001) and Roark (2001) applied to word lists may be due in part to the lower PARSEVAL scores of our system. Xu et al. (2002) report an inverse correlation between labelled precision/recall and WER. We achieve 73.2/76.5% LP/LR on section 23 of the Penn Treebank, compared to 82.9/82.4% LP/LR for Roark (2001) and 90.1/90.1% LP/LR for Charniak (2000). Another factor contributing to the accuracy of Charniak (2001) is the size of the training set -- 20M words larger than that used in this work. The low WER of Roark (2001), a top-down probabilistic parsing model, was achieved by training the model on 1 million words of the Penn Treebank, then performing a single pass of Expectation Maximization (EM) on a further 1.2 million words.</Paragraph> [Footnote 7] The WER of the hypothesis which best matches the true utterance, i.e., the lowest WER possible given the hypothesis set. [Table 3 caption: SER = sentence error rate. WER = word error rate. &quot;Speech-like&quot; transformations were applied to all training corpora. Xu (2002) is an implementation of the model of Chelba (2000) for n-best list parsing. Hall (2003) is a lattice parser related to Charniak (2001).] </Section> </Section> </Paper>