<?xml version="1.0" standalone="yes"?> <Paper uid="N01-1016"> <Title>Edit Detection and Parsing for Transcribed Speech</Title> <Section position="3" start_page="1" end_page="4" type="metho"> <SectionTitle> 2 Identifying EDITED words </SectionTitle> <Paragraph position="0"> The Switchboard corpus annotates disfluencies such as restarts and repairs using the terminology of Shriberg [15]. The disfluencies include repetitions and substitutions, italicized in (1a) and (1b) respectively.</Paragraph> <Paragraph position="1"> (1) a. Ireally,I really like pizza.</Paragraph> <Paragraph position="2"> b. Whydidn'the,why didn't she stay home? Restarts and repairs are indicated by disfluency tags '[', '+' and ']' in the disfluency POS-tagged Switchboard corpus, and by EDITED nodes in the tree-tagged corpus. This section describes a procedure for automatically identifying words corrected by a restart or repair, i.e., words that Indeed, [17] suggests that lled pauses tend to indicate clause boundaries, and thus may be a help in parsing. null are dominated by an EDITED node in the tree-tagged corpus.</Paragraph> <Paragraph position="3"> This method treats the problem of identifying EDITED nodes as a word-token classi cation problem, where each word token is classi ed as either edited or not. The classi er applies to words only; punctuation inherits the classi cation of the preceding word. A linear classi er trained by a greedy boosting algorithm [16] is used to predict whether a word token is edited. Our boosting classi er is directly based on the greedy boosting algorithm described by Collins [7]. This paper contains important implementation details that are not repeated here. We chose Collins' algorithm because it o ers good performance and scales to hundreds of thousands of possible feature combinations.</Paragraph> <Section position="1" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 2.1 Boosting estimates of linear </SectionTitle> <Paragraph position="0"> classi ers This section describes the kinds of linear classi ers that the boosting algorithm infers. Abstractly, we regard each word token as an event characterized by a nite tuple of random vari- null is the orthographic form of the word and X is the set of all words observed in the training section of the corpus. Our classi ers use m = 18 conditioning variables. The following subsection describes the conditioning variables in more detail; they include variables indicating the POS tag of the preceding word, the tag of the following word, whether or not the word token appears in a \rough copy&quot; as explained below, etc. The goal of the classi er is to predict the value of Y given values for X</Paragraph> <Paragraph position="2"> classi er makes its predictions based on the occurence of combinations of conditioning variable/value pairs called features. A feature F is a set of variable-value pairs hX</Paragraph> <Paragraph position="4"> It turns out that many pairs of features are extensionally equivalent, i.e., take the same values on each nes an associated random boolean variable</Paragraph> <Paragraph position="6"> where (X=x) takes the value 1 if X = x and 0 otherwise. That is, F</Paragraph> <Paragraph position="8"> The prediction made by the classi er is sign(Z)=Z=jZj, i.e., [?]1 or +1 depending on the sign of Z.</Paragraph> <Paragraph position="9"> Intuitively, our goal is to adjust the vector of feature weights ~ =(</Paragraph> <Paragraph position="11"> ) to minimize the expected misclassi cation rate E[(sign(Z) 6= Y )]. 
<Paragraph position="12"> Our classifier estimates the Boost loss as E_t[exp(-YZ)], where E_t[·] is the expectation under the empirical training corpus distribution.</Paragraph> <Paragraph position="15"> The feature weights are adjusted iteratively; one weight is changed per iteration. The feature whose weight is to be changed is selected greedily to minimize the Boost loss using the algorithm described in [7]. Training continues for 25,000 iterations. After each iteration the misclassification rate on the development corpus, E_d[(sign(Z) ≠ Y)], is estimated, where E_d[·] is the expectation under the empirical development corpus distribution. While each iteration lowers the Boost loss on the training corpus, a graph of the misclassification rate on the development corpus versus iteration number is a noisy U-shaped curve, rising at later iterations due to overlearning. The value of λ returned by the estimator is the one that minimizes the misclassification rate on the development corpus; typically the minimum is obtained after about 12,000 iterations, and the feature weight vector λ contains around 8,000 nonzero feature weights (since some weights are adjusted more than once).</Paragraph>
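The training regime just described can be sketched as follows, reusing misclassification_rate from the previous sketch and assuming a helper best_update that returns the single feature-weight update minimizing the empirical Boost loss (the closed-form update of [7] is not reproduced here); this is an outline, not the authors' implementation.

    def train_boosting(features, train, dev, iterations=25000):
        """Greedy boosting: one feature weight is adjusted per iteration to
        reduce the Boost loss on the training data; the weight vector finally
        returned is the one with the lowest misclassification rate on the
        development data (early stopping against overlearning)."""
        weights = {f: 0.0 for f in features}
        best_weights, best_dev_error = dict(weights), float("inf")
        for _ in range(iterations):
            # best_update is assumed to return the (feature, delta) pair that
            # most reduces the empirical Boost loss E_t[exp(-YZ)]; see [7].
            feature, delta = best_update(weights, train)
            weights[feature] += delta
            dev_error = misclassification_rate(weights, dev)
            if dev_error < best_dev_error:
                best_dev_error, best_weights = dev_error, dict(weights)
        return best_weights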
</Section> <Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 2.2 Conditioning variables and features </SectionTitle> <Paragraph position="0"> This subsection describes the conditioning variables used in the EDITED classifier. Many of the variables are defined in terms of what we call a rough copy. Intuitively, a rough copy identifies repeated sequences of words that might be restarts or repairs. Punctuation is ignored for the purposes of defining a rough copy, although conditioning variables indicate whether the rough copy includes punctuation. A rough copy in a tagged string of words is a substring of the form α1 β γ α2, where: 1. α1 (the source) and α2 (the copy) both begin with non-punctuation, 2. the strings of non-punctuation POS tags of α1 and α2 are identical, 3. β (the free final) consists of zero or more sequences of a free-final word (see below) followed by optional punctuation, and 4. γ (the interregnum) consists of sequences of an interregnum string (see below) followed by optional punctuation.</Paragraph> <Paragraph position="1"> The set of free-final words includes all partial words (i.e., ending in a hyphen) and a small set of conjunctions, adverbs and miscellanea, such as and, or, actually, so, etc. The set of interregnum strings consists of a small set of expressions such as uh, you know, I guess, I mean, etc. We search for rough copies in each sentence from left to right, searching for longer copies first. After we find a rough copy, we restart the search for additional rough copies following the free-final string of the previous copy. We say that a word token is in a rough copy iff it appears in either the source or the free final; (2) is an example of a rough copy. (In fact, our definition of rough copy is more complex. For example, if a word token appears in an interregnum and the word immediately following the interregnum also appears in a (different) rough copy, then we say that the interregnum word token appears in a rough copy. This permits us to approximate the Switchboard annotation convention of annotating interregna as EDITED if they appear in iterated edits.)</Paragraph> <Paragraph position="2"> We used a smoothing parameter as described in [7], which we estimate by using a line-minimization routine to minimize the classifier's minimum misclassification rate on the development corpus.</Paragraph> <Paragraph position="3"> Table 1 lists the conditioning variables used in our classifier. In that table, subscript integers refer to the position of word tokens relative to the current word; e.g., T_1 is the POS tag of the following word. The subscript f refers to the tag of the first word of the free-final match. If a variable is not defined for a particular word it is given the special value 'NULL'; e.g., if a word is not in a rough copy then the rough-copy variables (such as the N variables and T_f) all take the value NULL. Flags are boolean-valued variables, while numeric-valued variables are bounded to a value between 0 and 4 (as well as NULL, if appropriate). Three of the variables are intended to help the classifier capture very short restarts or repairs that may not involve a rough copy: the C flags indicate whether the orthographic form and/or tag of the next word (ignoring punctuation) are the same as those of the current word, and a T variable is defined when the current word is followed by an interregnum string, in which case it is the POS tag of the word following that interregnum.</Paragraph> <Paragraph position="16"> As described above, the classifier's features are sets of variable-value pairs. Given a tuple of variables, we generate a feature for each tuple of values that the variable tuple assumes in the training data. In order to keep the feature set manageable, the tuples of variables we consider are restricted in various ways. The most important of these are constraints of the form 'if X_j is included among a feature's variables, then so is X_k' for certain pairs of variables X_j, X_k.</Paragraph>
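As an illustration of the rough-copy search described above, here is a deliberately simplified sketch: it treats the free final and interregnum as runs of single words, ignores punctuation and multi-word interregnum expressions, and caps the source length; the names, word lists, and constants are illustrative only, not the authors' implementation.

    FREE_FINAL = {"and", "or", "actually", "so"}   # plus partial words ending in "-"
    INTERREGNUM = {"uh", "um"}                     # multi-word interregna omitted here

    def is_free_final(word):
        return word.lower() in FREE_FINAL or word.endswith("-")

    def find_rough_copies(words, tags):
        """Greedy left-to-right, longest-first search for rough copies.
        Returns the set of token indices lying in a source or free final."""
        in_copy, i, n = set(), 0, len(words)
        while i < n:
            found = None
            for length in range(min(8, n - i), 0, -1):   # try longer sources first
                j = i + length                            # end of candidate source
                k = j
                while k < n and is_free_final(words[k]):  # beta: free-final words
                    k += 1
                free_final_end = k
                while k < n and words[k].lower() in INTERREGNUM:  # gamma
                    k += 1
                if k + length <= n and tags[i:j] == tags[k:k + length]:
                    found = (i, j, free_final_end)        # copy repeats source POS tags
                    break
            if found:
                src_start, _, ff_end = found
                in_copy.update(range(src_start, ff_end))  # source + free final
                i = ff_end                                # restart after the free final
            else:
                i += 1
        return in_copy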
</Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.3 Empirical evaluation </SectionTitle> <Paragraph position="0"> For the purposes of this research the Switchboard corpus, as distributed by the Linguistic Data Consortium, was divided into four sections (or subcorpora). The training subcorpus consists of all files in directories 2 and 3 of the parsed/merged Switchboard corpus. Directory 4 is split into three approximately equal-size sections. (Note that the files are not consecutively numbered.) The first of these (files sw4004.mrg to sw4153.mrg) is the testing corpus. All edit detection and parsing results reported herein are from this subcorpus. The files sw4154.mrg to sw4483.mrg are reserved for future use. The files sw4519.mrg to sw4936.mrg are the development corpus. In the complete corpus three parse trees were sufficiently ill-formed that our tree-reader failed to read them. These trees received trivial modifications to allow them to be read, e.g., adding the missing extra set of parentheses around the complete tree.</Paragraph> <Paragraph position="2"> We trained our classifier on the parsed data files in the training and development sections, and evaluated the classifier on the test section.</Paragraph> <Paragraph position="3"> Section 3 evaluates the parser's output in conjunction with this classifier; this section focuses on the classifier's performance at the individual word-token level. In our complete application, the classifier uses a bitag tagger to assign each word a POS tag. Like all such taggers, our tagger has a non-negligible error rate, and these tagging errors could conceivably affect the performance of the classifier. To determine if this is the case, we report classifier performance when trained both on "Gold Tags" (the tags assigned by the human annotators of the Switchboard corpus) and on "Machine Tags" (the tags assigned by our bitag tagger). We compare these results to a baseline "null" classifier, which never identifies a word as EDITED. Our basic measure of performance is the word misclassification rate (see Section 2.1). However, we also report precision and recall scores for EDITED words alone.</Paragraph> <Paragraph position="4"> All words are assigned one of the two possible labels, EDITED or not. However, in our evaluation we report the accuracy only of words other than punctuation and filled pauses. Our logic here is much the same as that of the statistical parsing community, which ignores the location of punctuation for purposes of evaluation [3,5,6] on the grounds that its placement is entirely conventional. The same can be said for filled pauses in the Switchboard corpus.</Paragraph> <Paragraph position="5"> Our results are given in Table 2. They show that our classifier makes only approximately 1/3 of the misclassification errors made by the null classifier (0.022 vs. 0.059), and that using the POS tags produced by the bitag tagger does not have much effect on the classifier's performance (e.g., EDITED recall decreases from 0.678 to 0.668).</Paragraph>
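To make the word-level evaluation explicit, the following sketch computes the misclassification rate and EDITED precision/recall while excluding punctuation and filled pauses; the token representation and the filled-pause test are simplifying assumptions for illustration, not the corpus conventions or the authors' evaluation code.

    FILLED_PAUSES = {"uh", "um"}  # illustrative; the corpus marks these explicitly

    def is_scored(token):
        """Punctuation and filled pauses are excluded from the evaluation."""
        word = token["word"]
        return any(c.isalnum() for c in word) and word.lower() not in FILLED_PAUSES

    def evaluate(tokens, predicted, gold):
        """tokens: list of dicts; predicted/gold: parallel lists of booleans
        (True = EDITED). Returns misclassification rate and EDITED P/R."""
        scored = [i for i, t in enumerate(tokens) if is_scored(t)]
        errors = sum(1 for i in scored if predicted[i] != gold[i])
        tp = sum(1 for i in scored if predicted[i] and gold[i])
        fp = sum(1 for i in scored if predicted[i] and not gold[i])
        fn = sum(1 for i in scored if not predicted[i] and gold[i])
        return {"misclassification": errors / len(scored),
                "edited_precision": tp / (tp + fp) if tp + fp else 0.0,
                "edited_recall": tp / (tp + fn) if tp + fn else 0.0}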
</Section> </Section> <Section position="4" start_page="4" end_page="6" type="metho"> <SectionTitle> 3 Parsing transcribed speech </SectionTitle> <Paragraph position="0"> We now turn to the second pass of our two-pass architecture, using an "off-the-shelf" statistical parser to parse the transcribed speech after having removed the words identified as edited by the first pass. We first define the evaluation metric we use and then describe the results of our experiments.</Paragraph> <Section position="1" start_page="4" end_page="6" type="sub_section"> <SectionTitle> 3.1 Parsing metrics </SectionTitle> <Paragraph position="0"> In this section we describe the metric we use to grade the parser output. As a first desideratum we want a metric that is a logical extension of that used to grade previous statistical parsing work. We have taken as our starting point what we call the "relaxed labeled precision/recall" metric from previous research (e.g. [3,5]). This metric is characterized as follows. For a particular test corpus let N be the total number of nonterminal (and non-preterminal) constituents in the gold standard parses. Let M be the number of such constituents returned by the parser, and let C be the number of these that are correct (as defined below). Then precision is C/M and recall is C/N. A constituent produced by the parser is correct if the gold standard contains a constituent such that 1. the labels of the two constituents are the same, 2. their beginning string positions are equivalent, and 3. their ending string positions are equivalent.</Paragraph> <Paragraph position="1"> In 2 and 3 above we introduce an equivalence relation ≡_r between string positions. We define ≡_r to be the smallest equivalence relation satisfying a ≡_r b for all pairs of string positions a and b separated solely by punctuation symbols. The parsing literature uses ≡_r rather than = because it is felt that two constituents should be considered equal if they disagree only in the placement of, say, a comma (or any other sequence of punctuation), where one constituent includes the punctuation and the other excludes it.</Paragraph> <Paragraph position="2"> Our new metric, "relaxed edited labeled precision/recall", is identical to relaxed labeled precision/recall except for two modifications. First, in the gold standard all non-terminal subconstituents of an EDITED node are removed and the terminal constituents are made immediate children of a single EDITED node. Furthermore, two or more EDITED nodes with no separating non-edited material between them are merged into a single EDITED node. We call this version a "simplified gold standard parse." All precision and recall measurements are taken with respect to the simplified gold standard.</Paragraph> <Paragraph position="3"> Second, we replace ≡_r with a new equivalence relation ≡_e, which we define as the smallest equivalence relation containing ≡_r and satisfying a ≡_e b whenever a and b are the beginning and ending positions of an EDITED node in the gold standard parse. (For label equality, the labels ADVP and PRT are considered to be identical as well. We considered but ultimately rejected defining ≡_e in terms of the parser's output rather than the simplified gold standard; we rejected this because the ≡_e relation would then itself be dependent on the parser's output, a state of affairs that might allow complicated schemes to improve the parser's performance as measured by the metric.)</Paragraph> <Paragraph position="7"> We give a concrete example in Figure 1. The first row indicates string position (as usual in parsing work, position indicators are between words). The second row gives the words of the sentence. Words that are edited out have an "E" above them. The third row indicates the equivalence relation by labeling each string position with the smallest such position with which it is equivalent.</Paragraph> <Paragraph position="8"> There are two basic ideas behind this definition. First, we do not care where the EDITED nodes appear in the tree structure produced by the parser. Second, we are not interested in the fine structure of EDITED sections of the string, just the fact that they are EDITED. That we do care which words are EDITED comes into our figure of merit in two ways. First, (noncontiguous) EDITED nodes remain, even though their substructure does not, and thus they are counted in the precision and recall numbers. Secondly (and probably more importantly), failure to decide on the correct positions of edited nodes can cause collateral damage to neighboring constituents by causing them to start or stop in the wrong place. This is particularly relevant because, according to our definition, while the positions at the beginning and ending of an edit node are equivalent, the interior positions are not (unless related by the punctuation rule). See Figure 1.</Paragraph>
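The metric can be made concrete with a short sketch, assuming constituents are given as (label, begin, end) triples, the gold side has already been simplified as described above, and eq_class maps each string position to its canonical representative under the ≡_e relation; these representations and names are assumptions for illustration, not the authors' scoring code.

    from collections import Counter

    def canonical(pos, eq_class):
        """Map a string position to its smallest equivalent position under
        the equivalence relation (punctuation rule plus EDITED endpoints)."""
        return eq_class.get(pos, pos)

    def relaxed_edited_pr(gold, parsed, eq_class):
        """gold, parsed: lists of (label, begin, end) constituents, with the
        gold side already simplified (EDITED substructure removed, adjacent
        EDITED nodes merged). Labels ADVP and PRT are treated as identical."""
        def norm(c):
            label, b, e = c
            label = "ADVP" if label == "PRT" else label
            return (label, canonical(b, eq_class), canonical(e, eq_class))
        gold_counts = Counter(norm(c) for c in gold)
        parsed_counts = Counter(norm(c) for c in parsed)
        correct = sum(min(gold_counts[c], parsed_counts[c]) for c in parsed_counts)
        precision = correct / sum(parsed_counts.values())
        recall = correct / sum(gold_counts.values())
        return precision, recall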
</Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 3.2 Parsing experiments </SectionTitle> <Paragraph position="0"> The parser described in [3] was trained on the Switchboard training corpus as specified in Section 2.1. The input to the training algorithm was the gold standard parses minus all EDITED nodes and their children.</Paragraph> <Paragraph position="1"> We tested on the Switchboard testing subcorpus (again as specified in Section 2.1). All parsing results reported herein are from all sentences of length less than or equal to 100 words and punctuation. When parsing the test corpus we carried out the following operations: 1. create the simplified gold standard parse by removing non-terminal children of EDITED nodes and merging consecutive EDITED nodes.</Paragraph> <Paragraph position="2"> 2. remove from the sentence to be fed to the parser all words marked as edited by an edit detector (see below).</Paragraph> <Paragraph position="3"> 3. parse the resulting sentence.</Paragraph> <Paragraph position="4"> 4. add to the resulting parse EDITED nodes containing the words removed in step 2. The nodes are added as high as possible (though the definition of equivalence from Section 3.1 should make the placement of these nodes largely irrelevant).</Paragraph> <Paragraph position="5"> 5. evaluate the parse from step 4 against the simplified gold standard parse from step 1.</Paragraph> <Paragraph position="6"> We ran the parser in three experimental situations, each using a different edit detector in step 2. In the first of the experiments (labeled "Gold Edits") the "edit detector" was simply the simplified gold standard itself. This was to see how well the parser would do if it had perfect information about the edit locations.</Paragraph> <Paragraph position="7"> In the second experiment (labeled "Gold Tags"), the edit detector was the one described in Section 2, trained and tested on the part-of-speech tags specified in the gold standard trees. Note that the parser was not given the gold standard part-of-speech tags. We were interested in contrasting the results of this experiment with those of the third experiment to gauge what improvement one could expect from using a more sophisticated tagger as input to the edit detector.</Paragraph> <Paragraph position="8"> In the third experiment ("Machine Tags") we used the edit detector based upon the machine-generated tags. The results of the experiments are given in Table 3, together with the performance of this parser when trained and tested on Wall Street Journal text [3]. It is the "Machine Tags" results that we consider the "true" capability of the detector/parser combination: 85.3% precision and 86.5% recall.</Paragraph>
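Step 1 of the procedure above (constructing the simplified gold standard of Section 3.1) can be illustrated as follows, assuming trees are represented as nested lists of the form [label, child, ...] with words as strings; this representation is an assumption for illustration, not the treebank's actual format.

    def words_of(tree):
        """Collect the terminal words of a (sub)tree given as [label, children...]."""
        if isinstance(tree, str):
            return [tree]
        return [w for child in tree[1:] for w in words_of(child)]

    def simplify_gold(tree):
        """Flatten every EDITED node to its words and merge EDITED siblings
        that have no non-edited material between them."""
        if isinstance(tree, str):
            return tree
        label, children = tree[0], [simplify_gold(c) for c in tree[1:]]
        merged = []
        for child in children:
            if isinstance(child, list) and child[0] == "EDITED":
                flat = ["EDITED"] + words_of(child)
                if merged and isinstance(merged[-1], list) and merged[-1][0] == "EDITED":
                    merged[-1] = merged[-1] + flat[1:]   # merge consecutive EDITED nodes
                else:
                    merged.append(flat)
            else:
                merged.append(child)
        return [label] + merged

    # e.g. simplify_gold(["S", ["EDITED", ["NP", "I"], ["ADVP", "really"]],
    #                     ["NP", "I"], ["VP", "really", "like", ["NP", "pizza"]]])
    # -> ["S", ["EDITED", "I", "really"], ["NP", "I"],
    #     ["VP", "really", "like", ["NP", "pizza"]]]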
</Section> <Section position="3" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 3.3 Discussion </SectionTitle> <Paragraph position="0"> The general trends of Table 3 are much as one might expect. Parsing the Switchboard data is much easier given the correct positions of the EDITED nodes than without this information.</Paragraph> <Paragraph position="1"> The difference between the Gold-Tags and the Machine-Tags parses is small, as would be expected from the relatively small difference in the performance of the edit detector reported in Section 2. This suggests that putting significant effort into a tagger for use by the edit detector is unlikely to produce much improvement.</Paragraph> <Paragraph position="2"> Also, as one might expect, parsing conversational speech is harder than Wall Street Journal text, even given the gold-standard EDITED nodes.</Paragraph> <Paragraph position="3"> Probably the only aspect of the above numbers likely to raise any comment in the parsing community is the degree to which precision numbers are lower than recall. With the exception of the single pair reported in [3] and repeated above, no precision values in the recent statistical-parsing literature [2,3,4,5,14] have ever been lower than recall values. Even this one exception is by only 0.1% and is not statistically significant.</Paragraph> <Paragraph position="4"> We attribute the dominance of recall over precision primarily to the influence of edit-detector mistakes. First, note that when given the gold standard edits the difference is quite small (0.3%). When using the edit detector's edits the difference increases to 1.2%. Our best guess is that because the edit detector has high precision and lower recall, many more words are left in the sentence to be parsed. Thus one finds more nonterminal constituents in the machine parses than in the gold parses, and the precision is lower than the recall.</Paragraph> </Section> </Section> <Section position="5" start_page="6" end_page="6" type="metho"> <SectionTitle> 4 Previous research </SectionTitle> <Paragraph position="0"> While there is a significant body of work on finding edit positions [1,9,10,13,17,18], it is difficult to make meaningful comparisons between the various research efforts, as they differ in (a) the corpora used for training and testing, (b) the information available to the edit detector, and (c) the evaluation metrics used. For example, [13] uses a subsection of the ATIS corpus, takes as input the actual speech signal (and thus has access to silence duration but not to words), and uses as its evaluation metric the percentage of time the program identifies the start of the interregnum (see Section 2.2). On the other hand, [9,10] use an internally developed corpus of sentences, work from a transcript enhanced with information from the speech signal (and thus use words), but do use a metric that seems to be similar to ours. Undoubtedly the work closest to ours is that of Stolcke et al. [18], which also uses the transcribed Switchboard corpus. (However, they use information on pause length, etc., that goes beyond the transcript.) They categorize the transitions between words into more categories than we do. At first glance there might be a mapping between their six categories and our two, with three of theirs corresponding to EDITED words and three to not edited. If one accepts this mapping they achieve an error rate of 2.6%, down from their NULL rate of 4.5%, as contrasted with our error rate of 2.2%, down from our NULL rate of 5.9%. The difference in NULL rates, however, raises some doubts that the numbers are truly measuring the same thing.</Paragraph> <Paragraph position="1"> There is also a small body of work on parsing disfluent sentences [8,11]. Hindle's early work [11] does not give a formal evaluation of the parser's accuracy. The recent work of Schubert and Core [8] does give such an evaluation, but on a different corpus (from the Rochester Trains project). Also, their parser is not statistical and returns parses on only 62% of the strings, and on 32% of the strings that constitute sentences. Our statistical parser naturally parses all of our corpus. Thus it does not seem possible to make a meaningful comparison between the two systems.</Paragraph> </Section> </Paper>