<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1030">
<Title>Head-Driven Parsing for Word Lattices</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Word Lattice Parser </SectionTitle>
<Paragraph position="0"> Parsing models based on headword dependency relationships have been reported, such as the structured language model of Chelba and Jelinek (2000). </Paragraph>
<Paragraph position="1"> These models use much less conditioning information than the parsing models of Collins (1999), and do not provide Penn Treebank format parse trees as output. In this section we outline the adaptation of the Collins (1999) parsing model to word lattices. </Paragraph>
<Paragraph position="2"> The intended action of the parser is illustrated in Figure 1, which shows parse trees built directly upon a word lattice. Different paths through the lattice are parsed simultaneously; the example in the figure shows two final parses, one of low probability (S') and one of high probability (S). </Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Parameterization </SectionTitle>
<Paragraph position="0"> The parameterization of model II of Collins (1999) is used in our word lattice parser. Parameters are maximum likelihood estimates of conditional probabilities -- the probability of some event of interest (e.g., a left-modifier attachment) given a context (e.g., parent non-terminal, distance, headword). </Paragraph>
<Paragraph position="1"> One notable difference between the word lattice parser and the original implementation of Collins (1999) is the handling of part-of-speech (POS) tagging of unknown words (words seen fewer than 5 times in training). The conditioning context of the parsing model parameters includes POS tags. </Paragraph>
<Paragraph position="2"> Collins (1999) falls back to the POS tagger of Ratnaparkhi (1996) for words seen fewer than 5 times in the training corpus. As the tagger of Ratnaparkhi (1996) cannot tag a word lattice, we cannot back off to this tagging; we rely on the tag assigned by the parsing model in all cases. </Paragraph>
<Paragraph position="3"> Edges created by the bottom-up parsing are assigned a score which is the product of the inside and outside probabilities of the Collins (1999) model. </Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Parsing Algorithm </SectionTitle>
<Paragraph position="0"> The algorithm is a variation of probabilistic online, bottom-up, left-to-right Cocke-Kasami-Younger parsing, similar to Chappelier and Rajman (1998). </Paragraph>
<Paragraph position="1"> Our parser produces trees (bottom-up) in a right-branching manner, using unary extension and binary adjunction. Starting with a proposed headword, left modifiers are added first using right-branching, then right modifiers using left-branching. </Paragraph>
<Paragraph position="2"> Word lattice edges are iteratively added to the agenda. Complete closure is carried out, and the next word edge is added to the agenda. This process is repeated until all word edges have been read from the lattice and at least one complete parse is found. </Paragraph>
<Paragraph position="3"> Edges are each assigned a score, used to rank parse candidates. For parsing of strings, the score for a chart edge is the product of the scores of any child edges and the score for the creation of the new edge, as given by the model parameters. This score, defined solely by the parsing model, will be referred to as the parser score. The total score for chart edges in the lattice parsing task is a combination of the parser score, an acoustic model score, and a trigram model score. Scaling factors follow those of Chelba and Jelinek (2000) and Roark (2001). </Paragraph>
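<Paragraph> To make the edge scoring concrete, the following minimal Python sketch combines the parser score with the acoustic and trigram model scores as a weighted sum of log probabilities. This is only one plausible reading of the combination described above: the function name, argument names, and default scale values are hypothetical, and the actual scaling factors follow Chelba and Jelinek (2000) and Roark (2001) rather than the placeholders shown here.

def edge_total_score(parser_logprob, acoustic_logprob, trigram_logprob,
                     acoustic_scale=1.0, trigram_scale=1.0):
    # Hypothetical log-domain combination of the parser score with the
    # acoustic model and trigram language model scores.  The scale values
    # are placeholders, not the settings used in the paper.
    return (parser_logprob
            + acoustic_scale * acoustic_logprob
            + trigram_scale * trigram_logprob)

# Example: a chart edge whose parser, acoustic, and trigram log-probabilities
# are -12.3, -45.0, and -9.7 receives a single combined ranking score.
# edge_total_score(-12.3, -45.0, -9.7)
</Paragraph>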
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Smoothing and Pruning </SectionTitle>
<Paragraph position="0"> The parameter estimation techniques (smoothing and back-off) of Collins (1999) are reimplemented. </Paragraph>
<Paragraph position="1"> Additional techniques are required to prune the search space of possible parses, due to the complexity of the parsing algorithm and the size of the word lattices. The main technique we employ is a variation of the beam search of Collins (1999), which restricts the chart size by excluding low-probability edges. The total score (combining acoustic and language model scores) of each candidate edge is compared against those of edges with the same span and category; proposed edges whose scores fall outside the beam are not added to the chart. The drawback to this pruning is that we can no longer guarantee that a model-optimal solution will be found. In practice, these heuristics have a negative effect on parse accuracy, but the amount of pruning can be tuned to balance relative time and space savings against precision and recall degradation (Collins, 1999). Collins (1999) uses a fixed-size beam (10,000). We experiment with several variable beam sizes, where the variable beam (b-hat) is some function of a base beam (b) and the edge width (the number of terminals dominated by an edge). The base beam starts at a low beam size and increases iteratively by a specified increment if no parse is found. This allows parsing to operate quickly (with a minimal number of edges added to the chart). However, if many iterations are required to obtain a parse, the utility of starting with a low beam and iterating becomes questionable (Goodman, 1997). The base beam is limited to control the increase in the chart size. The selection of the base beam, beam increment, and variable beam function is governed by the familiar speed/accuracy trade-off.1 The variable beam function found to allow fast convergence with minimal loss of accuracy is: </Paragraph>
<Paragraph position="3"> 1 Details of the optimization can be found in Collins (2004). </Paragraph>
<Paragraph position="4"> Charniak et al. (1998) introduce overparsing as a technique to improve parse accuracy by continuing parsing after the first complete parse tree is found. </Paragraph>
<Paragraph position="5"> The technique is employed by Hall and Johnson (2003) to ensure that early stages of parsing do not strongly bias later stages. We adapt this idea to a single-stage process. Due to the restrictions of beam search and thresholds, the first parse found by the model may not be the model-optimal parse (i.e., we cannot guarantee best-first search). We therefore employ a form of overparsing -- once a complete parse tree is found, we extend the base beam by the beam increment and parse again. We continue this process as long as extending the beam results in an improved best parse score. </Paragraph>
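<Paragraph> The interaction of the iteratively widened base beam with overparsing can be summarized in the following minimal Python sketch. Here parse_once(lattice, beam) is a hypothetical stand-in for one beam-pruned chart-parsing pass that returns (best_tree, best_score), or None when pruning leaves no complete parse; all names, and the use of a hard beam_limit as the stopping cap, are illustrative assumptions rather than details of the actual implementation.

def parse_with_overparsing(lattice, parse_once, base_beam, beam_increment, beam_limit):
    beam = base_beam

    # Widen the base beam until a first complete parse is found
    # (the beam is capped to control growth of the chart).
    result = parse_once(lattice, beam)
    while result is None and beam_limit > beam:
        beam += beam_increment
        result = parse_once(lattice, beam)
    if result is None:
        return None
    best_tree, best_score = result

    # Overparsing: extend the beam and reparse as long as doing so
    # improves the best parse score.
    while beam_limit > beam:
        beam += beam_increment
        candidate = parse_once(lattice, beam)
        if candidate is None or not candidate[1] > best_score:
            break
        best_tree, best_score = candidate
    return best_tree, best_score
</Paragraph>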
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Expanding the Measures of Success </SectionTitle>
<Paragraph position="0"> Given the task of simply generating a transcription of speech, WER is a useful and direct way to measure language model quality for ASR. WER is the count of incorrect words in the hypothesis W-hat per word in the true string W. For measurement, we must assume prior knowledge of W and the best alignment of the reference and hypothesis strings.2 Errors are categorized as insertions, deletions, or substitutions. </Paragraph>
<Paragraph position="1"> It is important to note that most models -- Mangu et al. (2000) is an innovative exception -- minimize sentence error. Sentence error rate is the percentage of sentences for which the proposed utterance has at least one error. Models (such as ours) which optimize prediction of test sentences Wt, generated by the source, minimize the sentence error. Thus even though WER is useful practically, it is formally not the appropriate measure for the commonly used language models. Unfortunately, as a practical measure, sentence error rate is not as useful -- it is not as fine-grained as WER. </Paragraph>
<Paragraph position="2"> Perplexity is another measure of language model quality, measurable independently of ASR performance (Jelinek, 1997). Perplexity is related to the entropy of the source model which the language model attempts to estimate. </Paragraph>
<Paragraph position="3"> These measures, while informative, do not capture the success of extracting high-level information from speech. Task-specific measures should be used in tandem with extensional measures such as perplexity and WER. </Paragraph>
<Paragraph position="5"> 2 ... tools/) by NIST is the most commonly used alignment tool. </Paragraph>
<Paragraph position="6"> Roark (2002), when reviewing parsing for speech recognition, discusses a modelling trade-off between producing parse trees and producing strings. Most models are evaluated either with measures of success for parsing or for word recognition, but rarely both. Parsing models are difficult to implement as word-predictive language models due to their complexity. Generative random sampling is equally challenging, so the parsing correlate of perplexity is not easy to measure. Traditional (i.e., n-gram) language models do not produce parse trees, so parsing metrics are not useful for them. However, Roark (2001) argues for using parsing metrics, such as labelled precision and recall,3 along with WER, for parsing applications in ASR. Weighted WER (Weber et al., 1997) is also a useful measurement, as the most often ill-recognized words are short, closed-class words, which are not as important to speech understanding as phrasal head words. We adopt the testing strategy of Roark (2001), but find that measurement of parse accuracy and WER on the same data set is not possible given currently available corpora. Use of weighted WER and development of methods to simultaneously measure WER and parse accuracy remain topics for future research. </Paragraph>
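<Paragraph> As a concrete companion to the WER definition given at the start of this section, the following minimal Python sketch computes WER by dynamic-programming alignment of a reference and a hypothesis string, counting substitutions, insertions, and deletions. It is an illustration of the definition only, not the NIST alignment tool cited in footnote 2.

def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()

    # dist[i][j] = minimum edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + sub)   # match or substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Example: one insertion against a three-word reference gives WER = 1/3.
# word_error_rate("the cat sat", "the cat sat down")
</Paragraph>
</Section>
</Paper>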