<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0204">
  <Title>Modeling Conversational Speech for Speech Recognition</Title>
  <Section position="4" start_page="37" end_page="39" type="metho">
    <SectionTitle>
3 Linguistic Segmentations
</SectionTitle>
    <Paragraph position="0"> As explained in the previous section, one of the important annotations of the Switchboard corpus involved the issue of sentence boundaries, or segment boundaries. Sentence boundaries are easy to detect in the case of read speech where there is a distinctive pause at the end of the sentence and the sentence is usually grammatically complete (the second also holds true in case of written speech, where in addition a period marks the end of a sentence). However, this is not so in the case of conversational speech as is clear from the examples above. In conversational speech, it is possible to have incomplete sentences, sentences across turns and complex sentences involving restarts and other dysfluencies.</Paragraph>
    <Paragraph position="1"> Prior to having annotated data, the segment boundaries for conversational text data were provided in the form of acoustic segmentations. These segmentations were based on pauses, silences, non-speech elements (e.g. laughs and coughs) and turn taking. The differences between the two forms of segmentations can be observed with the example given below1: Acoustic segmentations I'm not sure how many active volcanoes there are now and and what the amount of material that they do &lt;s&gt; uh &lt;s&gt; put into the atmosphere &lt;s&gt; ! think probably the greatest cause is uh &lt;s&gt; vehicles &lt;s&gt; especially around cities &lt;s&gt; Linguistic segmentations I'm not sure how many active volcanoes there are now and and what the amount of material that they do uh put into the atmosphere &lt;s&gt; I think probably the greatest cause is uh vehicles especially around cities &lt;s&gt; In the n-gram approach to statistical language modeling, the segment boundary is treated as one of the symbols in the lexical dictionary and modeled similar to other words in the data stream. The segment boundaries provide an additional source of information to the language model and hence it appears intuitively correct to use linguistic segmentations for training language models. The notion of segmentation is also an important issue if we use higher level language models such as phrase structure models or sentence-level mixture models (Iyer, et al. 1994). However, given only a speech signal during recognition with no text cues available for segmentation, there will be an inherent mismatch between the linguistically segmented training data and the acoustically segmented test data.</Paragraph>
    <Paragraph position="2"> Thus the segmentation experiments tried to answer three important issues:  &lt;s&gt; represents the sentence/segment boundary.</Paragraph>
    <Paragraph position="3"> deg Does a mismatch in training/testing segmentation hurt language model performance (perplexity and word error rate)? * Is there any information in segment boundaries? * If no boundary information is available during testing,  can we hypothesize this information using a language model trained with segmented training data?</Paragraph>
    <Section position="1" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
3.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> In order to analyze the above issues, we first obtained our baseline training and testing data. Since the linguistic segmentations are available for only two thirds of the Switchboard data, we decided to use the corresponding two thirds of the acoustically segmented training data for our comparative experiments. The test set was obtained from the Switchboard lattices which served as the baseline for the 1995 Language Modeling Workshop at Johns Hopkins. The test set was acoustically segmented. A corresponding linguistically segmented test set was also made available 2.</Paragraph>
      <Paragraph position="1">  We used the N-best rescoring formalism for recognition experiments with the test data (Ostendorf, et al. 1991). The HTK toolkit was used to generate the top 2500 hypotheses for each segment. A weighted combination of scores from different knowledge sources (HTK acoustic model scores, number of words, different language model scores, etc.) was then used to re-rank the hypotheses. The top ranking hypothesis was then considered as the recognized output.</Paragraph>
    </Section>
    <Section position="2" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
3.2 Mismatch in Training and Test Data
Segmentations
</SectionTitle>
      <Paragraph position="0"> We trained three trigram language models: two using acoustic segmentations and linguistic segmentations respectively and a third model trained on data with no segment boundaries. The models used the Good-Turing back-off for smoothing unseen n-gram estimates (Katz 1987). These models were then used to compute perplexity on the different versions of the test data. The trigram perplexity numbers are shown in Table 1.</Paragraph>
      <Paragraph position="1">  2 Unfortunately, the two test sets did not match completely in terms of the number of words since the lattice test set had been hand corrected after the initial transcription to account for some transcription errors. Hence, there is a difference of about two hundred words between the acoustically and linguistically segmented test sets.</Paragraph>
      <Paragraph position="2">  As indicated in Table 1, mismatch between training and testing segmentation hurts perplexity. The best perplexity numbers are obtained under matched conditions. Though the results for the linguistically segmented test set (78) are significantly better than the corresponding matched case for the acoustic segmentations (105), we cannot conclusively state that this is due to better segmentation since we have not controlled for the length of the different segments.</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
3.3 Hypothesizing Segment Boundaries
</SectionTitle>
      <Paragraph position="0"> A second perplexity experiment that we conducted tried to test whether we can hypothesize segmentations, given that we have no boundaries in the test set. Our segmenthypothesizing algorithm 3 assumed that at any word, we have two paths possible,  * A transition to the next word.</Paragraph>
      <Paragraph position="1"> * A transition to the next word through a segment  boundary.</Paragraph>
      <Paragraph position="2"> The algorithm was approximate in that we did not keep track of all possible segmentations. Instead at every point, we picked the most likely path as the history for the next word.</Paragraph>
      <Paragraph position="3"> As in the first experiment, we trained two language models on linguistic segmentations and acoustic segmentations respectively. Henceforth, these models are referred to as the ling-seg and acoustie-seg models. Both models try to hypothesize the segment boundaries while computing the likelihood of the no-segmentation test set.</Paragraph>
      <Paragraph position="4">  The perplexity results of Table 2 indicate that the ling-seg model does better than the acoustic-seg model for hypothesizing segment boundaries. Thus, we can gain a significant amount of boundary information by this simple scheme of hypothesizing segmentations.</Paragraph>
    </Section>
    <Section position="4" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
3.4 Recognition Experiments
</SectionTitle>
      <Paragraph position="0"> There were a couple of experimental constraints to analyze the aforementioned issues in terms of recognition word error rate.</Paragraph>
      <Paragraph position="1"> We were constrained to use the lattices that had been provided to the workshop. Since these lattices were built on acoustic segments, the models had to deal with implicit acoustic segment boundaries. The context from the previous lattice was not provided for the current lattice.</Paragraph>
      <Paragraph position="2"> We tried to alleviate this problem by trying to provide the context for the current lattice by selecting the most likely pair of words from the previous lattice using pair occurrence frequency. One problem with this approch is that since the standard Switchboard WER is about 50%, about 73% of the time we were providing incorrect context using these lattices.</Paragraph>
      <Paragraph position="3"> We used our segment hypothesizing scheme for scoring an N-best list corresponding to these lattices (N=2500). While the initial context was provided for the N-best lists, we had to throw away the final segment boundary. This led to a degradation in performance.</Paragraph>
      <Paragraph position="4">  measurements on LM95 SWBD dev. test set As shown in Table 3, the mismatch between the training and test segmentations degrades performance by half a point, from 50.46% to 50.88%. Throwing out the end segment boundary from the N-best lists degrades performance by slightly more than an absolute 1%. Also, the ling-seg model does slightly better at hypothesizing segment boundaries than the acoustic-seg model.</Paragraph>
    </Section>
    <Section position="5" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
3.5 Conclusions
</SectionTitle>
      <Paragraph position="0"> Our experiments indicated the following:  * Mismatch in segmentation hurts language model performance, both in terms of perplexity as well as in terms of recognition word error rate.</Paragraph>
      <Paragraph position="1"> * There is information in the knowledge of segment boundaries that should be incorporated in our language models.</Paragraph>
      <Paragraph position="2"> * If no segment boundaries are known during testing, it is better to hypothesize segment boundaries using a model  trained on linguistic segments than one based on acoustic segments.</Paragraph>
      <Paragraph position="3"> The notion of linguistic segmentation is important in language modeling because it provides information that is used in many higher order language models, for example the &amp;quot;given-new&amp;quot; model described in the next section, phrase structure language models, or sentence-level mixture models. However, this information cannot easily be derived from the acoustic signal. In this section, we have described a simple technique of hypothesizing segmentations using an n-gram language model trained on annotated data. We plan to run some controlled perplexity and recognition experiments in the future to use this information in our recognition system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="39" end_page="39" type="metho">
    <SectionTitle>
4 Information Structure in Conversational Speech
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
</SectionTitle>
      <Paragraph position="0"> It is well known in the linguistic literature that sentences are not uniform from beginning to end in the kinds of words or structures used. Sentences have a &amp;quot;given/new&amp;quot; or &amp;quot;topic/comment&amp;quot; structure which is especially pronounced in conversational speech. According to discourse theories in linguistics, given information tends to occur in the beginning of a sentence where the topic is established, whereas new information tends to occur at the end, the comment on the topic. 4 We are looking at ways of taking advantage of this structure in the language model. The first stage of the work is to devise a method of dividing sentences into these two parts.</Paragraph>
      <Paragraph position="1"> Next, treating the before and after portions of the sentences as separate corpora, we look at the distribution of the vocabulary and the distribution of other phenomena, such as restarts. We also build language models using these two 4 This tendency is overridden in marked syntactic structures, such as cleft sentences (&amp;quot;It was Suzie who had the book last&amp;quot;). These structures are relatively rare in conversational speech.</Paragraph>
      <Paragraph position="2"> corpora and test perplexity both within and across corpora.</Paragraph>
      <Paragraph position="3"> The final step, which we are currently in the process of, is to find a way to integrate these models and use them within the speech recognition system to see if these more focused models can actually improve recognition performance.</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
4.1 Dividing the sentence
</SectionTitle>
      <Paragraph position="0"> In order to divide sentences into their given and new portions, we devised a simple heuristic which determines a pivot point for each sentence. The underlying assumption is that the dividing line is the verb, or more particularly, the first verb that carries content (disregarding &amp;quot;weak&amp;quot; verbs such as &amp;quot;is&amp;quot;, &amp;quot;have&amp;quot;, &amp;quot;seems&amp;quot;). The heuristic finds the first strong verb and if there is none, then the last weak verb, and places a pivot point either before the strong verb or after the weak one. Sentences that have no pivot (i.e. that have no verb) are put into one of two classes, those that are considered complete (such as &amp;quot;Yeah&amp;quot; and &amp;quot;OK&amp;quot;) and those that are incomplete, that is interrupted by either the speaker or the other conversant (i.e. &amp;quot;Well, I, I, uh.&amp;quot;). The no pivot complete set is very similar the &amp;quot;Back Channel&amp;quot; model developed by Mark Liberman at the LM95 Summer Language Modeling Workshop (Jelinek 1995). Liberman separated back channel responses from information-bearing utterances and created a separate language model. Initial experiments shows no overall improvement in word error rate, however, the model was able to identify both previously identified and new backchannel utterances in the test data.</Paragraph>
      <Paragraph position="1"> For the purposes of this paper, we will refer to the given and new parts as before and after (meaning before and after the pivot), and &amp;quot;NPC&amp;quot; for no pivot complete and &amp;quot;NPI&amp;quot; for no pivot incomplete sentences. The following shows an example dialog and Table 4 shows the corresponding division into the four categories:  A.I: Okay. I think the first thing they said, I have written this down so it would, is it p-, do you think it's possible to have honesty in government or an honest government? B.2: Okay. You're asking what my opinion about, A.3: #Yeah.# B.4: #whether it's# possible \[laughter\] to have honesty in government. Well, I suspect that it is possible. Uh, I think it probably is more likely if you have a small government unit where everybody knows everybody.</Paragraph>
      <Paragraph position="2"> A.5: Right. That's a good point.</Paragraph>
      <Paragraph position="3"> B.6: But, uh, other than that I think maybe it just depends on how you define honesty.</Paragraph>
      <Paragraph position="4"> A.7: That's an int-, you know, that's interesting.</Paragraph>
      <Paragraph position="5">  written this down honesty in government? asking what my opinion about : #whether it's# possible \[laughter\] to have honesty in government suspect that it is possible.</Paragraph>
      <Paragraph position="6"> everybody a good point.</Paragraph>
      <Paragraph position="7"> depends on how you define honesty interesting  Note that this heuristic doesn't always make the correct division. In sentence 8, &amp;quot;is&amp;quot; is the main verb, however, the algorithm prefers to find a strong verb, so it keeps going until it finds &amp;quot;know&amp;quot;, which is actually part of a relative clause. A more complex algorithm that finds the main verb group and uses the last verb in the verb group rather than the last verb in the sentence would remedy this. However, our goal here is to first determine whether in fact this division is useful in the language model. As long as errors such as this are in the minority, we can evaluate the method and then go back and refine it if it proves useful.</Paragraph>
      <Paragraph position="8"> In order to do the classification, we relied on three kinds of annotations that were available for the switchboard corpus: sentence boundaries, part of speech, and dysfluency annotation. The dysfluency markings are needed since the pivot point is restricted from being inside of a restart. The following shows the first two turns in the above discourse with both of these annotationsS:</Paragraph>
      <Paragraph position="10"> Table 5 shows the breakdown of the data into the four divisions, before the pivot (before), after the pivot (after), complete sentences with no pivot (&amp;quot; NPC&amp;quot;), incomplete sentences with no pivot (&amp;quot; NPI&amp;quot;).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="39" end_page="548" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> 5 In this version of the annotation, complete sentences are marked E_S and incomplete sentences are marked N_S, rather than / and -/ as described in Section 2. This is to avoid confusion with the /, which delimits words and their parts of speech.</Paragraph>
    <Paragraph position="1">  It is interesting to note that the size of the before and after corpora are very similar. Note that this is not necessarily because the algorithm is dividing the sentences into two equal portions, as we can see in the example above. Some sentences have a rather long introduction with restarts, as in sentence 4 and 11, whereas others have just a single word and a long after portion, as in sentence 6.</Paragraph>
    <Paragraph position="2"> since pronouns are used generally to refer to participants in the conversation or things already mentioned, whereas articles such as &amp;quot;the&amp;quot; and &amp;quot;a&amp;quot; (rows 3 and 8) are much more frequent in the after part of the sentence, since they are more frequently used in full noun phrases describing new entities.</Paragraph>
    <Paragraph position="3"> &amp;quot;Yeah&amp;quot; (row 13) occurs almost exclusively in the NPC set, which is comprised mainly of replies.</Paragraph>
    <Section position="1" start_page="548" end_page="548" type="sub_section">
      <SectionTitle>
4.2 Vocabulary distributions
</SectionTitle>
      <Paragraph position="0"> It is clear from the definitions of the given vs. new parts of the sentence, that the vocabularies in the corpora resulting from the division will have different distributions, given information will be expressed with a larger number of pronouns whereas the new portion will have more complex descriptive noun phrases, and thus a wider ranging vocabulary. Within the verb group, weak (and more common) verbs will appear in the given portion, whereas strong verbs that carry content will appear in the new portion. But rather than relying on these intuitions, we apply a more careful analysis of the data to determine more closely what the differences are.</Paragraph>
      <Paragraph position="1"> Table 7 plots the 50 most frequent words in the corpus, showing their before and after raw totals. Note that while the values cross, they are rarely the same for the same word. This reinforces our intuition that the use of function words (typically the most common words) in the two parts of the sentences are quite different 6.</Paragraph>
      <Paragraph position="2">  The most frequent words in the corpus divide rather sharply across the data sets. For example, in Table 6, which shows the counts for the top 15 words, pronouns such as 'T' and &amp;quot;it&amp;quot; (rows 1 and 6) are much more frequent in the before set,</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML