<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1018">
  <Title>Detecting Structural Metadata with Decision Trees and Transformation-Based Learning</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Structural Metadata
</SectionTitle>
    <Paragraph position="0"> We consider three main types of structural metadata: sentence-like units, conversational llers and edit dis uencies. These structures were chosen primarily because of the availability of annotated conversational speech data from the Linguistic Data Consortium (Strassel, 2003) and standard scoring tools (NIST, 2003).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Sentence Units
</SectionTitle>
      <Paragraph position="0"> Conversational speech lacks the clear sentence boundaries of written text. Instead, we detect SUs (variously referred to as sentence, semantic, and slash units), which are linguistic units maximally equivalent to sentences that are used to mark segmentation boundaries in conversational speech where utterances often end without forming grammatical sentences in the sense expected in written text. SUs can be sub-categorized according to their discourse role. In our data, annotations distinguish statement, question, backchannel, incomplete SU and SU-internal clause boundaries. Here, we ignore the SU-internal boundaries, and merge all but the incomplete SU categories in characterizing SU events.</Paragraph>
      <Paragraph position="1">  tected by our system.</Paragraph>
      <Paragraph position="2"> Filled Pauses ah, eh, er, uh, um Discourse Markers actually, anyway, basically, I mean, let's see, like, now, see, so, well, you know, you see</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Conversational Fillers
</SectionTitle>
      <Paragraph position="0"> Conversational llers include lled pauses (hesitation sounds such as uh , um and er ), discourse markers (e.g. well , you know ), and explicit editing terms. De ning an all-inclusive set of English lled pauses and discourse markers is a problematic task. Our system detects only a limited set of lled pauses and discourse markers, listed in Table 1, which cover a large majority of cases (Strassel, 2003). An explicit editing term is a ller occurring within an edit dis uency, described further below. For example, the discourse marker I mean serves as an explicit editing term in the following edit dis uency: I didn't tell her that, I mean, I couldn't tell her that he was already gone.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Edit Disfluencies
</SectionTitle>
      <Paragraph position="0"> Edit dis uencies largely encompass three separate phenomena: repetition, repair and restart (Shriberg, 1994).</Paragraph>
      <Paragraph position="1"> A repetition occurs when a speaker repeats the most recently spoken portion of an utterance to hold off the ow of speech. A repair happens when the speaker attempts to correct a mistake that he or she just made. Finally, in a restart, the speaker abandons a current utterance completely and starts a new one.</Paragraph>
      <Paragraph position="2"> Previous studies characterize edit dis uencies using a structure with different segments (Shriberg, 1994; Nakatani and Hirschberg, 1994). The rst part of this structure is called the reparandum, a string of words that gets repeated or corrected. The reparandum is immediately followed by a non-lexical boundary event termed the interruption point (IP). The IP marks the point where the speaker interrupts a uent utterance. Optionally, there may be a lled pause or explicit editing term. The nal part of the edit dis uency structure is called the alteration, which is a repetition or revised copy of the reparandum. In the case of a restart, the alteration is empty. In Table 2, reparanda are enclosed in parentheses, IPs are represented by + , optional llers are in braces, and alterations are in boldface.</Paragraph>
      <Paragraph position="3"> Annotation of complex edit dis uencies, where a disuency occurs within an alteration, can be dif cult. The data used here is annotated with a attened structure that treats these cases as simple dis uencies with multiple IPs (Strassel, 2003). IPs within a complex dis uency are detected separately, and contiguous sequences of edit words associated with these IPs are referred to as a deletable region.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Previous Work
</SectionTitle>
    <Paragraph position="0"> In an early study on automatic dis uency detection a deterministic parser and correction rules were used to clean up edit dis uencies (Hindle, 1983). However theirs was not a truly automatic system as it relied on hand-annotated edit signals to locate IPs. Bear et al. (1992) explored pattern matching, parsing and acoustic cues and concluded that multiple sources of information would be needed to detect edit dis uencies. A decision-tree-based system that took advantage of various acoustic and lexical features to detect IPs was developed in (Nakatani and Hirschberg, 1994).</Paragraph>
    <Paragraph position="1"> Shriberg et al. (1997) applied machine prediction of IPs with decision trees to the broader Switchboard corpus by generating decision trees with a variety of prosodic features. Stolcke et al. (1998) then expanded the prosodic tree model with a hidden event language model (LM) to identify sentence boundaries, lled pauses and IPs in different types of edit dis uencies. The hidden event LM used in their work adapted Hidden Markov Model (HMM) algorithms to an n-gram LM paradigm to represent non-lexical events such as IPs and sentence boundaries as hidden states. Liu et al. (2003) built on this framework and extended prosodic features and the hidden event LM to predict edit IPs on both human transcripts and STT system output. Their system also detected the onset of the reparandum by employing rule-based pattern matching once edit IPs have been detected.</Paragraph>
    <Paragraph position="2"> Edit dis uency detection systems that rely exclusively on word-based information have been presented by Heeman et al. (Heeman et al., 1996) and Charniak and Johnson (Charniak and Johnson, 2001). Common to both of these approaches is a focus on repeated or similar sequences of words and information about the words themselves and the length and similarity of the sequences.</Paragraph>
    <Paragraph position="3"> Our approach is most similar to (Liu et al., 2003), since we also detect boundary events such as IPs rst and use them as signals when identifying the reparandum in a later stage. The motivation to detect IPs rst is that  speech before an IP is uent and is likely to be free of any prosodic or lexical irregularities that can indicate the occurrence of an edit dis uency. Like Liu et al., we use a decision tree trained with prosodic features and a hidden event language model for the IP detection task. However, we incorporate SU detection in those models as well. We use part-of-speech (POS) tags and pattern match features in decision tree training whereas Liu et al. (2003) developed language models for them. We explore three different methods of combining the hidden event language model and the decision tree model, namely linear interpolation, joint tree-based modeling and an HMM-based approach. Moreover, our system uses the transformation-based learning algorithm rather than hand-crafted rules for the second stage of edit region detection.</Paragraph>
    <Paragraph position="4"> Another key difference between our system and most previous work is the prediction target. Our system incorporates detecting word boundary events such as SUs and IPs, locating onsets of edit regions, and identifying lled pauses, discourse markers and explicit editing terms. We believe that such a comprehensive detection scheme allows our system to better model dependencies between these events, which will lead to an improvement in the overall detection performance.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Overall Architecture
</SectionTitle>
      <Paragraph position="0"> As shown in Figure 1, our system detects dis uencies in a two-step process. First, for each word boundary in the given transcription, a decision tree predicts one of the four boundary events IP, SU, ISU (incomplete SU), and the null event. Then in the second stage, rules learned via the transformation-based learning (TBL) algorithm are applied to the data containing predicted boundary events and other lexical information to identify edits and llers. Following edit region and ller prediction, the system output was post-processed to eliminate edit region predictions not associated with IP predictions as well as IP predictions for which no edit region or ller was detected. An analysis of post-processing alternatives conrmed that this strategy reduced insertion errors.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Detecting Boundary Events
</SectionTitle>
      <Paragraph position="0"> In order to detect boundary events, we trained a CARTstyle decision tree (Breiman et al., 1984) with various prosodic and lexical features. Decision trees are well-suited for this task because they provide a convenient way to integrate both symbolic and numerical features in prediction. Furthermore, a trained decision tree is highly explainable by its nature, which allows us to gain additional insight into the utilities of and the interactions between multiple information sources.</Paragraph>
      <Paragraph position="1"> Prosodic features generated for decision tree training included the following: Word and rhyme1 durations.</Paragraph>
      <Paragraph position="2"> Rhyme duration differences between two neighboring words.</Paragraph>
      <Paragraph position="3"> F0 statistics (minimum, mean, maximum, slope) over a word.</Paragraph>
      <Paragraph position="4"> Differences in F0 statistics between two neighboring words.</Paragraph>
      <Paragraph position="5"> Energy statistics over a word and its rhyme.</Paragraph>
      <Paragraph position="6"> Silence duration following a word.</Paragraph>
      <Paragraph position="7"> A ag indicating start and end of a speaker turn and speaker overlap.</Paragraph>
      <Paragraph position="8"> Ordinal position of a word in a turn.</Paragraph>
      <Paragraph position="9"> Energy and F0 features were generated with the Entropic System ESPS/Waves package and the F0 stylization tool developed in (Scurrency1onmez et al., 1998). Word and rhyme duration were normalized by phone duration statistics (mean and variance) calculated over all available training data. F0 and energy features were normalized for each individual speaker's baseline. A turn boundary was hypothesized for word boundaries with silences longer than four seconds.</Paragraph>
      <Paragraph position="10"> Since inclusion of features that do not contribute to the classi cation of data can degrade the performance of a decision tree, we selected only the prosodic features whose exclusion from the training process led to a decrease in boundary event detection accuracy on the development data by utilizing the leave-one-out method. Lexical features consisted of POS tag groups, word and POS tag pattern matches, and a ag indicating existence 1In our work, a rhyme was de ned to contain the nal vowel of a word and any consonants following the nal vowel. of ller words to the right of the current word boundary. The POS tag features were produced by rst predicting the tags with Ratnaparkhi's Maximum Entropy Tagger (Ratnaparkhi, 1996) and then clustered by hand into a smaller number of groups based on their syntactic role. The clustering was performed to speed up decision tree training as well as to reduce the impact of tagger errors. Word pattern match features were generated by comparing words over the range of up to four words across the word boundary in consideration. Grouped POS tags were compared in a similar way, but the range was limited to at most two tags across the boundary since a wider comparison range would have resulted in far more matches than would be useful due to the low number of available POS tag groups. When words known to be identi ed frequently as llers existed after the boundary, they were skipped and the range of pattern matching was extended accordingly.</Paragraph>
      <Paragraph position="11"> Another useful cue for boundary event detection is the existence of word fragments. Since word fragments occur when the speaker cuts short the word being spoken, they are highly indicative of IPs. However currently available STT systems do not recognize word fragments. As our goal is to build an automatic detection system, our system was not designed to use any features related to word fragments. However, for a control case, we conducted an experiment with reference transcripts using a single frag word token to show the potential for improved performance of a system capable of recognizing fragments. In addition to the decision tree model, we also employed a hidden event language model to predict boundary events. A hidden event LM is the same as a typical n-gram LM except that it models non-lexical events in the n-gram context by counting special non-word tokens representing such events. The hidden event LM estimates the joint distribution P(W;E) of words W and events E.</Paragraph>
      <Paragraph position="12"> Once the model has been trained, a forward-backward algorithm can be used to calculate P(EjW), or the posterior probability of an event given the preceding word sequence (Stolcke et al., 1998; Stolcke and Shriberg, 1996). The SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002) was used to train a trigram open-vocabulary language model with Kneser-Ney discounting (Kneser and Ney, 1995) on data that had boundary events (SU, ISU, and IP) inserted in the word stream. Posterior probabilities of boundary events for every word boundary were then estimated with SRILM's capability for computing hidden event posteriors.</Paragraph>
      <Paragraph position="13"> While the hidden event LM alone can be used to detect boundary events, prior work has shown that it bene ts from also using prosodic cues, so we combined the language model and the decision tree model in three different ways. In the rst approach, which we call the joint tree model, the boundary event posterior probability from the hidden event LM is jointly modeled with other features in the decision tree to make predictions about the boundary events. In the second approach, referred to as the linearly interpolated model, a decision is made based on the combined posterior probability Ptree(EjA;W) + (1 )PLM(EjW); where A corresponds to the acoustic-prosodic features and the weighting factor can be chosen empirically to maximize target performance, i.e. bias the prediction toward the more accurate model. In the third approach, the decision tree features, words and boundary events are jointly modeled via an integrated HMM (Shriberg et al., 2000). This approach augments the hidden event LM by modeling decision tree features as emissions from the HMM states represented by the word and boundary event. Under this framework, the forward-backward algorithm can again be used to determine posterior probabilities of boundary events. Similar to the linearly interpolated model, a weighting factor can be used to introduce the desired bias to the combination model. The joint tree model has the advantage that the (possibly) complex interaction between lexical and prosodic cues can be captured. However, since the tree is trained on reference transcriptions, it favors lexical cues, which are less reliable in STT output. In the linearly interpolated and joint HMM approaches, the relative weighting of the two knowledge sources is estimated on the development test set for STT output, so it is possible for prosodic cues to be given a higher weight.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Edit and Filler Detection
</SectionTitle>
      <Paragraph position="0"> After SUs and IPs have been marked, we use transformation-based learning (TBL) to learn rules to detect edit dis uencies and conversational llers. TBL is an automatic rule learning technique that has been successfully applied to a variety of problems in natural language processing, including part-of-speech tagging (Brill, 1995), spelling correction (Mangu and Brill, 1997), error correction in automatic speech recognition (Mangu and Padmanabhan, 2001), and named entity detection (Kim and Woodland, 2000). We selected TBL for our tagging-like metadata detection task since it has been used successfully for these other tagging tasks.</Paragraph>
      <Paragraph position="1"> TBL is an iterative technique for inducing rules from training data. A TBL system consists of a baseline predictor, a set of rule templates, and an objective function for scoring potential rules. After tagging the training data using the baseline predictor, the system learns a list of rules to correct errors in these predictions. At each iteration, the system uses the rule templates to generate all possible rules that correct at least one error in the training data and selects the best rule according to the objective function, commonly token error rate. The best rule is  POS Match the dog IP the cat recorded and applied to the training data in preparation for the next iteration. The standard stopping criterion for rule learning is to stop when the score of the best rule falls below a threshold value; statistical signi cance measures have also been used (Mangu and Padmanabhan, 2001).</Paragraph>
      <Paragraph position="2"> To tag new data, the rules are applied in the order in which they were learned. This allows rules which are learned later in the process to ne tune the effects of the earlier rules. TBL produces concise, comprehensible rules, and uses the entire corpus to train all of the rules. We used Florian and Ngai's Fast TBL system (fnTBL) (Ngai and Florian, 2001) to train rules using dis uency annotated conversational speech data.</Paragraph>
      <Paragraph position="3"> The input to our TBL system consists of text divided into utterances, with IPs and SUs inserted as if they were extra words. (For simplicity, these special words are also assigned IP and SU as part of speech tags.) Our TBL system used the following types of features: Identity of the word.</Paragraph>
      <Paragraph position="4"> Part of speech (POS) and grouped part of speech (GPOS) of the word (same as the decision tree).</Paragraph>
      <Paragraph position="5"> Is the word commonly used as: lled pause (FP), backchannel (BC), explicit editing term (EET), discourse marker (DM)? Does this word/ POS/ GPOS match the word/ POS/ GPOS that is 1/2/3 positions to its right? Is this word at the beginning of a turn or utterance? Tag to be learned.</Paragraph>
      <Paragraph position="6"> The tag feature is the one we want the system to learn. It is also used in templates that consider features of neighboring words. The baseline predictor sets the tag to its most common value, no dis uency, for all words.</Paragraph>
      <Paragraph position="7"> Other values of the tag are the three types of llers (FP, EET, DM) and edit. The objective function for our learner is token error rate, and rule learning is stopped at a threshold score of 5.</Paragraph>
      <Paragraph position="8"> We generated a set of rule templates using these features. The rule templates account for individual features of the current word and/or its neighbors, the proximity of potential FP/EET/DM terms, and matches between the current word and nearby words, especially when in close proximity to a boundary event or potential ller. Example word and POS matches are shown in Table 3.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>