<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1005"> <Title>A TAG-based noisy channel model of speech repairs</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A noisy channel model of repairs </SectionTitle> <Paragraph position="0"> We follow Shriberg (1994) and most other work on speech repairs by dividing a repair into three parts: the reparandum (the material repaired), the interregnum, which is typically either empty or consists of a filler, and the repair. Figure 1 shows these three parts for a typical repair.</Paragraph> <Paragraph position="1"> Most current probabilistic language models are based on HMMs or PCFGs, which induce linear or tree-structured dependencies between words. The relationship between reparandum and repair seems to be quite different: the repair is a &quot;rough copy&quot; of the reparandum, often incorporating the same or very similar words in roughly the same word order. That is, they seem to involve &quot;crossed&quot; dependencies between the reparandum and the repair, as shown in Figure 1. Languages with an unbounded number of crossed dependencies cannot be described by a context-free or finite-state grammar, and crossed dependencies like these have been used to argue that natural languages are not context-free (Shieber, 1985).</Paragraph> <Paragraph position="2"> Mildly context-sensitive grammars, such as Tree Adjoining Grammars (TAGs) and Combinatory Categorial Grammars, can describe such crossing dependencies, and that is why TAGs are used here.</Paragraph> <Paragraph position="3"> Figure 2 shows the combined model's dependency structure for the repair of Figure 1. Interestingly, if we trace the temporal word string through this dependency structure, aligning words next to the words they are dependent on, we obtain a &quot;helical&quot; type of structure familiar from genome models, and in fact TAGs are being used to model genomes for very similar reasons.</Paragraph> <Paragraph position="4"> The noisy channel model described here involves two components. A language model defines a probability distribution P(X) over the source sentences X, which do not contain repairs. The channel model defines a conditional probability distribution P(Y | X) of surface sentences Y, which may contain repairs, given source sentences. In the work reported here, X is a word string and Y is a speech transcription not containing punctuation or partial words. We use two language models here: a bigram language model, which is used in the search process, and a syntactic parser-based language model (Charniak, 2001), which is used to rescore a set of the most likely analyses obtained using the bigram model. Because the language model is responsible for generating the well-formed sentence X, it is reasonable to expect that a language model that can model more global properties of sentences will lead to better performance, and the results presented here show that this is the case.</Paragraph>
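To make the two-component decomposition concrete, the following sketch selects the best source string for an observed transcription by combining a language model score P(X) with a channel score P(Y | X). This is a minimal illustration, not the authors' implementation; the two scoring functions are assumed to be supplied elsewhere.

import math

def best_source(y, candidates, lm_logprob, channel_logprob):
    # candidates: n-best source strings X proposed by the bigram-based search
    # lm_logprob(x): log P(X) under the (parser-based) language model
    # channel_logprob(y, x): log P(Y | X) under the TAG channel model
    best_x, best_score = None, -math.inf
    for x in candidates:
        score = lm_logprob(x) + channel_logprob(y, x)  # log P(X) + log P(Y | X)
        if score > best_score:
            best_x, best_score = x, score
    return best_x

This is the rescoring arrangement described above: the bigram model proposes candidate analyses cheaply, and the stronger parser-based language model decides among them.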
<Paragraph position="5"> The channel model is a stochastic TAG-based transducer; it is responsible for generating the repairs in the transcript Y, and it exploits the ability of TAGs to model crossed dependencies straightforwardly.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Informal description </SectionTitle> <Paragraph position="0"> Given an observed sentence Y we wish to find the most likely source sentence X̂, where:</Paragraph> <Paragraph position="1"> X̂ = argmax_X P(X | Y) = argmax_X P(X) P(Y | X).</Paragraph> <Paragraph position="2"> This is the same general setup that is used in statistical speech recognition and machine translation, and in these applications syntax-based language models P(X) yield state-of-the-art performance, so we use one such model here. The channel model P(Y | X) generates sentences Y given a source X. A repair can potentially begin before any word of X. When a repair has begun, the channel model incrementally processes the succeeding words from the start of the repair. Before each succeeding word either the repair can end or else a sequence of words can be inserted in the reparandum. At the end of each repair, a (possibly null) interregnum is appended to the reparandum.</Paragraph> <Paragraph position="3"> The intuition motivating the channel model design is that the words inserted into the reparandum are very closely related to those in the repair. Indeed, in our training data over 60% of the words in the reparandum are exact copies of words in the repair; this similarity is strong evidence of a repair. The channel model is designed so that exact-copy reparandum words will have high probability.</Paragraph> <Paragraph position="4"> We assume that X is a substring of Y, i.e., that the source sentence can be obtained by deleting words from Y, so for a fixed observed sentence there are only a finite number of possible source sentences. However, the number of source sentences grows exponentially with the length of Y, so exhaustive search is probably infeasible.</Paragraph> <Paragraph position="5"> TAGs provide a systematic way of formalizing the channel model, and their polynomial-time dynamic programming parsing algorithms can be used to search for likely repairs, at least when used with simple language models like a bigram language model. In this paper we first identify the 20 most likely analyses of each sentence using the TAG channel model together with a bigram language model. Then each of these analyses is rescored using the TAG channel model and a syntactic parser-based language model.</Paragraph> <Paragraph position="6"> The TAG channel model's analyses do not reflect the syntactic structure of the sentence being analyzed; instead they encode the crossed dependencies of the speech repairs. If we want to use TAG dynamic programming algorithms to efficiently search for repairs, it is necessary that the intersection (in language terms) of the TAG channel model and the language model itself be describable by a TAG. One way to guarantee this is to use a finite-state language model; this motivates our use of a bigram language model.</Paragraph> <Paragraph position="7"> On the other hand, it seems desirable to use a language model that is sensitive to more global properties of the sentence, and we do this by reranking the initial analyses, replacing the bigram language model with a syntactic parser-based model. We do not need to intersect this parser-based language model with our TAG channel model since we evaluate each analysis separately.</Paragraph> </Section>
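The generative story of the channel can be caricatured in a few lines of code. The sketch below is a deliberate simplification, not the weighted TAG itself: the probabilities are fixed illustrative placeholders, substituted words are replaced by a dummy token rather than drawn from lexically conditioned distributions, and no crossed-dependency bookkeeping is done. It is only meant to show how reparandum, interregnum and repair are laid out in the observed string.

import random

def toy_channel(source_words, p_repair=0.02, p_copy=0.6, fillers=("uh", "I mean")):
    # Map a source string X (no repairs) to an observed string Y (possibly with repairs).
    observed, i = [], 0
    while i < len(source_words):
        if random.random() < p_repair:
            k = random.randint(1, 3)                 # length of the repair span
            repair = source_words[i:i + k]
            # The reparandum is a rough copy of the repair: mostly exact copies.
            reparandum = [w if random.random() < p_copy else "<other>" for w in repair]
            interregnum = [random.choice(fillers)] if random.random() < 0.5 else []
            observed.extend(reparandum + interregnum + repair)
            i += k
        else:
            observed.append(source_words[i])
            i += 1
    return observed

# toy_channel("a flight to Denver on Friday".split()) occasionally yields strings like
# ['a', 'flight', 'to', '<other>', 'uh', 'to', 'Denver', 'on', 'Friday']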
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The TAG channel model </SectionTitle> <Paragraph position="0"> The TAG channel model defines a stochastic mapping of source sentences X into observed sentences Y. There are several ways to define transducers using TAGs, such as that of Shieber and Schabes (1990), but the following simple method, inspired by finite-state transducers, suffices for the application here. The TAG defines a language whose vocabulary is the set of pairs (Σ ∪ {∅}) × (Σ ∪ {∅}), where Σ is the vocabulary of the observed sentences Y. A string Z in this language can be interpreted as a pair of strings (Y, X), where Y is the concatenation of the projection of the first components of Z and X is the concatenation of the projection of the second components. For example, the string Z = a:a flight:flight to:∅ Boston:∅ uh:∅ I:∅ mean:∅ to:to Denver:Denver on:on Friday:Friday corresponds to the observed string Y = a flight to Boston uh I mean to Denver on Friday and the source string X = a flight to Denver on Friday.</Paragraph> <Paragraph position="1"> Figure 3 shows the TAG rules used to generate this example. The nonterminals in this grammar are of the form N_{w_x}, R_{w_y:w_x} and I, where w_x is a word appearing in the source string and w_y is a word appearing in the observed string. Informally, the N_{w_x} nonterminals indicate that the preceding word w_x was analyzed as not being part of a repair, while the R_{w_y:w_x} nonterminals indicate that the preceding words w_y and w_x were part of a repair. The nonterminal I generates words in the interregnum of a repair. Encoding the preceding words in the TAG's nonterminals permits the channel model to be sensitive to lexical properties of the preceding words. The start symbol is N_$, where '$' is a distinguished symbol used to indicate the beginning and end of sentences.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Estimating the repair channel model from data </SectionTitle> <Paragraph position="0"> The model is trained from the disfluency and POS-tagged Switchboard corpus on the LDC Penn Treebank III CD-ROM (specifically, the files under dysfl/dps/swbd). This version of the corpus annotates the beginning and ending positions of repairs as well as fillers, editing terms, asides, etc., which might serve as the interregnum in a repair. The corpus also includes punctuation and partial words, which are ignored in both training and evaluation here since we felt that in realistic applications these would not be available in speech recognizer output. The transcript of the example of Figure 1 would look something like the following:</Paragraph> <Paragraph position="1"> a flight [ to Boston, + {F uh, } {E I mean, } to Denver ] on Friday</Paragraph> <Paragraph position="2"> In this transcription the reparandum is the string from the opening bracket '[' to the interruption point '+', the interregnum consists of the braced expressions immediately following the interruption point, and the repair is the string that begins at the end of the interregnum and ends at the closing bracket ']'.</Paragraph> <Paragraph position="4"> We used the disfluency-tagged version of the corpus for training rather than the parsed version because the parsed version does not mark the interregnum, but we need this information for training our repair channel model. Testing was performed using data from the parsed version since this data is cleaner, and it enables a direct comparison with earlier work.</Paragraph>
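As an illustration, the bracket notation just described can be turned into (reparandum, interregnum, repair) triples with a small amount of string processing. The sketch below handles only the simple, non-nested pattern shown above, ignores POS tags and partial words, and is not the preprocessing pipeline actually used for the corpus.

import re

def parse_repair(annotated):
    # Extract (reparandum, interregnum, repair) from a non-nested
    # "[ reparandum + {..interregnum..} repair ]" annotation.
    m = re.search(r"\[(.*?)\+(.*?)\]", annotated)
    if m is None:
        return None
    reparandum = m.group(1).strip()
    after = m.group(2)
    interregnum = [s.strip() for s in re.findall(r"\{\w\s+(.*?)\}", after)]
    repair = re.sub(r"\{.*?\}", "", after).strip()
    return reparandum, interregnum, repair

# parse_repair("a flight [ to Boston, + {F uh, } {E I mean, } to Denver ] on Friday")
# -> ('to Boston,', ['uh,', 'I mean,'], 'to Denver')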
<Paragraph position="5"> We followed Charniak and Johnson (2001) and split the corpus into main training data, held-out training data and test data as follows: main training consisted of all sw[23]*.dps files, held-out training consisted of all sw4[5-9]*.dps files and test consisted of all sw4[0-1]*.mrg files.</Paragraph> <Paragraph position="6"> We now describe how the weights on the TAG productions described in subsection 2.2 are estimated from this training data. In order to estimate these weights we need to know the TAG derivation of each sentence in the training data.</Paragraph> <Paragraph position="7"> In order to uniquely determine this we need not just the locations of each reparandum, interregnum and repair (which are annotated in the corpus) but also the crossing dependencies between the reparandum and repair words, as indicated in Figure 1.</Paragraph> <Paragraph position="8"> We obtain these by aligning the reparandum and repair strings of each repair using a minimum-edit-distance string aligner with the following alignment costs: aligning identical words costs 0, aligning words with the same POS tag costs 2, an insertion or a deletion costs 4, aligning words with POS tags that begin with the same letter costs 5, and an arbitrary substitution costs 7. These costs were chosen so that a substitution will be selected over an insertion followed by a deletion, and the lower cost for substitutions involving POS tags beginning with the same letter is a rough and easy way of establishing a preference for aligning words whose POS tags come from the same broad class; e.g., it results in aligning singular and plural nouns, present and past participles, etc. While we did not evaluate the quality of the alignments, since they are not in themselves the object of this exercise, they seem to be fairly good.</Paragraph> <Paragraph position="9"> From our training data we estimate a number of conditional probability distributions. These estimated probability distributions are the linear interpolation of the corresponding empirical distributions from the main sub-corpus using various subsets of conditioning variables (e.g., bigram models are mixed with unigram models, etc.), using Chen's bucketing scheme (Chen and Goodman, 1998). As is commonly done in language modelling, the interpolation coefficients are determined by maximizing the likelihood of the held-out data counts using EM. Special care was taken to ensure that all distributions over words ranged over (and assigned non-zero probability to) every word that occurred in the training corpora; this turns out to be important, as the size of the training data for the different distributions varies greatly.</Paragraph> <Paragraph position="10"> The first distribution is defined over the words in source sentences (i.e., sentences that do not contain reparandums or interregnums).</Paragraph> <Paragraph position="11"> P_n(repair | W) is the probability of a repair beginning after a word W in the source sentence X; it is estimated from the training sentences with reparandums and interregnums removed.</Paragraph> <Paragraph position="12"> Here and in what follows, W ranges over Σ ∪ {$}, where '$' is a distinguished beginning-of-sentence marker. For example, P_n(repair | flight) is the probability of a repair beginning after the word flight. Note that repairs are relatively rare; in our training data P_n(repair) ≈ 0.02, which is a fairly strong bias against repairs.</Paragraph>
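The reparandum/repair alignments from which the remaining distributions are estimated come out of the weighted minimum-edit-distance computation described above. A minimal dynamic-programming sketch with the stated costs (0, 2, 5 and 7 for the four kinds of substitution, 4 for an insertion or deletion) is given below; the token representation, a (word, POS) pair, and the backtrace labels are our own illustrative choices, not the authors' code.

def sub_cost(a, b):
    # a, b are (word, POS) pairs for a reparandum token and a repair token.
    (wa, pa), (wb, pb) = a, b
    if wa == wb:
        return 0      # identical words
    if pa == pb:
        return 2      # same POS tag
    if pa[:1] == pb[:1]:
        return 5      # POS tags from the same broad class (same first letter)
    return 7          # arbitrary substitution

def align(reparandum, repair, indel=4):
    # Returns a list of aligned pairs; None marks a token with no counterpart.
    n, m = len(reparandum), len(repair)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * indel, "up"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * indel, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j], back[i][j] = min(
                (cost[i - 1][j - 1] + sub_cost(reparandum[i - 1], repair[j - 1]), "diag"),
                (cost[i - 1][j] + indel, "up"),       # reparandum token left unaligned
                (cost[i][j - 1] + indel, "left"),     # repair token left unaligned
            )
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            pairs.append((reparandum[i - 1], repair[j - 1])); i -= 1; j -= 1
        elif move == "up":
            pairs.append((reparandum[i - 1], None)); i -= 1
        else:
            pairs.append((None, repair[j - 1])); j -= 1
    return list(reversed(pairs))

# align([("Boston", "NNP")], [("Denver", "NNP")]) -> [(("Boston", "NNP"), ("Denver", "NNP"))]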
<Paragraph position="13"> The other distributions are defined over aligned reparandum/repair strings, and are estimated from the aligned repairs extracted from the training data. In training we ignored all overlapping repairs (i.e., cases where the reparandum of one repair is the repair of another). (Naturally, in testing we have no such freedom.) We analyze each repair as consisting of n aligned word pairs (we describe the interregnum model later). M_i is the ith reparandum word and R_i is the corresponding repair word, so both of these range over Σ ∪ {∅}.</Paragraph> <Paragraph position="14"> We define M_0 and R_0 to be the source sentence word that preceded the repair (which is '$' if the repair begins at the beginning of a sentence). We define M'_i and R'_i to be the last non-∅ reparandum and repair words respectively, i.e., M'_i = M_i if M_i ≠ ∅ and M'_i = M'_{i-1} otherwise, and similarly R'_i = R_i if R_i ≠ ∅ and R'_i = R'_{i-1} otherwise. Finally, T_i, i = 1 ... n+1, which indicates the type of repair that occurs at position i, ranges over {copy, subst, ins, del, nonrep}, where T_{n+1} = nonrep (indicating that the repair has ended), and for i = 1 ... n, T_i = copy if M_i = R_i, T_i = ins if R_i = ∅, T_i = del if M_i = ∅, and T_i = subst otherwise.</Paragraph> <Paragraph position="18"> The distributions we estimate from the aligned repair data are the following.</Paragraph> <Paragraph position="19"> P_r(T_i | M'_{i-1}, R'_{i-1}) is the probability of seeing repair type T_i following the reparandum word M'_{i-1} and repair word R'_{i-1}; e.g., P_r(nonrep | Boston, Denver) is the probability of the repair ending when Boston is the last reparandum word and Denver is the last repair word.</Paragraph> <Paragraph position="21"> P_r(M_i | ins, M'_{i-1}, R'_i) is the probability that M_i is the word that is inserted into the reparandum (i.e., R_i = ∅) given that some word is inserted, and that the preceding reparandum and repair words are M'_{i-1} and R'_i. For example, P_r(tomorrow | ins, Boston, Denver) is the probability that the word tomorrow is inserted into the reparandum after the words Boston and Denver, given that some word is inserted.</Paragraph> <Paragraph position="23"> P_r(M_i | subst, M'_{i-1}, R'_i) is the probability that M_i is the word that is substituted in the reparandum for R'_i, given that some word is substituted. For example, P_r(Boston | subst, to, Denver) is the probability that Boston is substituted for Denver, given that some word is substituted.</Paragraph> <Paragraph position="24"> Finally, we also estimated a probability distribution P_i(W) over interregnum strings as follows. Our training corpus annotates what we call interregnum expressions, such as uh and I mean. We estimated a simple unigram distribution over all of the interregnum expressions observed in our training corpus, and also extracted the empirical distribution of the number of interregnum expressions in each repair. Interregnums are generated as follows. First, the number k of interregnum expressions is chosen using the empirical distribution. Then k interregnum expressions are independently generated from the unigram distribution of interregnum expressions, and appended to yield the interregnum string W.</Paragraph> <Paragraph position="25"> The weighted TAG that constitutes the channel model is straightforward to define using these conditional probability distributions. Note that the language model generates the source string X. Thus the weights of the TAG rules condition on the words in X, but do not generate them.</Paragraph>
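Putting the estimated distributions together, the channel weight assigned to one aligned repair is, apart from the repair-start and interregnum terms, a product of per-position factors of exactly the kind defined above. The sketch below is an illustration of that product under the stated definitions, with the estimated distributions passed in as lookup functions; it is not the TAG machinery itself, and the function names are our own.

import math

def repair_channel_logweight(pairs, p_type, p_ins, p_subst, m0="$", r0="$"):
    # pairs: aligned (reparandum word or None, repair word or None) pairs, e.g. from align().
    # p_type(t, m, r)  ~ P_r(T_i = t | M'_{i-1} = m, R'_{i-1} = r)
    # p_ins(w, m, r)   ~ P_r(M_i = w | ins, M'_{i-1} = m, R'_i = r)
    # p_subst(w, m, r) ~ P_r(M_i = w | subst, M'_{i-1} = m, R'_i = r)
    # m0, r0: the source word preceding the repair ('$' at the start of a sentence).
    logw, m_prev, r_prev = 0.0, m0, r0
    for mi, ri in pairs:
        r_cur = ri if ri is not None else r_prev          # R'_i
        if mi is not None and mi == ri:
            t = "copy"
        elif ri is None:
            t = "ins"
        elif mi is None:
            t = "del"
        else:
            t = "subst"
        logw += math.log(p_type(t, m_prev, r_prev))
        if t == "ins":
            logw += math.log(p_ins(mi, m_prev, r_cur))
        elif t == "subst":
            logw += math.log(p_subst(mi, m_prev, r_cur))
        m_prev = mi if mi is not None else m_prev         # M'_i
        r_prev = r_cur
    logw += math.log(p_type("nonrep", m_prev, r_prev))    # the repair ends
    return logw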
<Paragraph position="26"> There are three different schemata defining the initial trees of the TAG. These correspond to analyzing a source word as not beginning a repair (e.g., α1 and α3 in Figure 3), analyzing a source word as beginning a repair (e.g., α2), and generating an interregnum (e.g., α5).</Paragraph> <Paragraph position="27"> Auxiliary trees generate the paired reparandum/repair words of a repair. There are five different schemata defining the auxiliary trees, corresponding to the five different values that T_i can take. Note that the nonterminal R_{m,r} expanded by the auxiliary trees is annotated with the last reparandum and repair words M'_{i-1} and R'_{i-1} respectively, which makes it possible to condition the rule's weight on these words.</Paragraph> <Paragraph position="28"> Auxiliary trees of the form (β1) generate reparandum words that are copies of the corresponding repair words; the weight on such trees is P_r(copy | M'_{i-1}, R'_{i-1}). Trees of the form (β2) substitute a reparandum word for a repair word; their weight is P_r(subst | M'_{i-1}, R'_{i-1}) P_r(M_i | subst, M'_{i-1}, R'_i).</Paragraph> <Paragraph position="29"> Trees of the form (β3) end a repair; their weight is P_r(nonrep | M'_{i-1}, R'_{i-1}). Auxiliary trees of the form (β4) permit the repair word R'_{i-1} to be deleted in the reparandum; the weight of such a tree is P_r(del | M'_{i-1}, R'_{i-1}). Finally, auxiliary trees of the form (β5) generate a reparandum word with no corresponding repair word; they are weighted P_r(ins | M'_{i-1}, R'_{i-1}) P_r(M_i | ins, M'_{i-1}, R'_i). The TAG just described is not probabilistic; informally, it does not include the probability costs for generating the source words. However, it is easy to modify the TAG so it does include a bigram model that does generate the source words, since each nonterminal encodes the preceding source word. That is, we multiply the weight of each TAG production given earlier that introduces a source word R_i by P_n(R_i | R_{i-1}). The resulting stochastic TAG is in fact exactly the intersection of the channel model TAG with a bigram language model.</Paragraph> <Paragraph position="30"> The standard n^5 bottom-up dynamic programming parsing algorithm can be used with this stochastic TAG. Each different parse of the observed string Y with this grammar corresponds to a way of analyzing Y in terms of a hypothetical underlying sentence X and a number of different repairs. In our experiments below we extract the 20 most likely parses for each sentence. Since the weighted grammar just given does not generate the source string X, the score of the parse using the weighted TAG is P(Y | X).</Paragraph> <Paragraph position="31"> This score, multiplied by the probability P(X) of the source string under the syntactic parser-based language model, is our best estimate of the probability of an analysis.</Paragraph> <Paragraph position="32"> However, there is one additional complication that makes a marked improvement to the model's performance. Recall that we use the standard bottom-up dynamic programming TAG parsing algorithm to search for candidate parses. This algorithm has n^5 running time, where n is the length of the string. Even though our sentences are often long, it is extremely unlikely that any repair will be longer than, say, 12 words. So to increase processing speed we only compute analyses for strings of length 12 or less. For every such substring that can be analyzed as a repair we calculate the repair odds, i.e., the probability of generating this substring as a repair divided by the probability of generating this substring via the non-repair rules, or equivalently, the odds that this substring constitutes a repair. The substrings with high repair odds are likely to be repairs.</Paragraph>
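In outline, the length-bounded search for repair candidates looks like the sketch below. The two span-scoring functions stand in for the dynamic-programming TAG scores, which this sketch does not implement, and the odds threshold is an illustrative parameter rather than a value taken from the paper.

import math

def candidate_repairs(words, repair_logprob, nonrepair_logprob, max_len=12, min_log_odds=0.0):
    # repair_logprob(span):    log probability of generating the span via the repair rules
    # nonrepair_logprob(span): log probability of generating the span via the non-repair rules
    # A span's repair odds are the ratio of these two probabilities.
    candidates = []
    for start in range(len(words)):
        for end in range(start + 1, min(start + max_len, len(words)) + 1):
            span = words[start:end]
            log_odds = repair_logprob(span) - nonrepair_logprob(span)
            if log_odds > min_log_odds:
                candidates.append((start, end, math.exp(log_odds)))
    return sorted(candidates, key=lambda c: -c[2])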
<Paragraph position="33"> This more local approach has a number of advantages over computing a global analysis.</Paragraph> <Paragraph position="34"> First, as just noted, it is much more efficient to compute these partial analyses than to compute global analyses of the entire sentence. Second, there are rare cases in which the same substring functions as both repair and reparandum (i.e., the repair string is itself repaired again). A single global analysis would not be able to capture this (since the TAG channel model does not permit the same substring to be both a reparandum and a repair), but we combine these overlapping repair substring analyses in a post-processing operation to yield an analysis of the whole sentence. (We do insist that the reparandum and interregnum of a repair do not overlap with those of any other repairs in the same analysis.)</Paragraph> </Section> </Section> </Paper>