<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2025">
<Title>Minimizing Word Error Rate in Textual Summaries of Spoken Language</Title>
<Section position="5" start_page="186" end_page="186" type="metho">
<SectionTitle> 3 Summarization system </SectionTitle>
<Paragraph position="0"> Prior to summarizing, the input text is cleaned up for disfluencies such as hesitations, filled pauses, and repetitions.1 In the context of the multi-topical recordings we use for our experiments, summaries are generated for each topical segment separately.</Paragraph>
<Paragraph position="1"> The segment boundaries were determined to be at those places where the majority (at least half) of the human annotators agreed (see section 5). Intercoder agreement for topical boundaries is fairly good (and higher than the agreement on relevant words or passages).2</Paragraph>
<Paragraph position="2"> To determine the content of the summaries, we use a &quot;maximal marginal relevance&quot; (MMR) based summarizer with speaker turns as minimal units (cf. (Carbonell and Goldstein, 1998)).</Paragraph>
<Paragraph position="3"> The MMR formula is given in equation 1. It generates a list of turns ranked by their relevance and states that the next turn to be put into this ranked list will be taken from the turns which were not yet ranked (t_nr) and has the following properties: it is (a) maximally similar to a &quot;query&quot; and (b) maximally dissimilar to the turns which were already ranked (t_r). As &quot;query&quot; we use a frequency vector of all content words within a topical segment. The λ-parameter (0.0 ≤ λ ≤ 1.0) is used to trade off the influence of (a) vs. (b).</Paragraph>
<Paragraph position="4"> Both similarity metrics (sim1, sim2) are inner vector products of (stemmed) term frequencies (see equations 2 to 4); tf_t is a vector of stem frequencies in a turn, f_s are the in-segment frequencies of a stem, and f_max is the maximal segment frequency of any stem in the topical segment. sim1 can be normalized or not. The formulae for tf (equation 4) are inspired by Cornell's SMART system (Salton, 1971); we will call these parameters &quot;smax&quot;, &quot;log&quot;, and &quot;freq&quot;, respectively.</Paragraph>
<Paragraph position="6"> Using the MMR algorithm, we obtain a list of ranked turns for each topical segment. We compute this both for human and machine generated transcripts of the audio files (&quot;reference text&quot; vs. &quot;hypothesis text&quot;).3</Paragraph>
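Equations 1-4 are not reproduced in this extract. The following Python sketch therefore only illustrates the ranking procedure described above, assuming the standard MMR trade-off and plain inner-product similarities over per-turn term-frequency vectors; the names and the exact weighting are illustrative assumptions, not the authors' formulae.

```python
from collections import Counter

def mmr_rank_turns(turns, lam=0.95):
    """Rank the speaker turns of one topical segment by maximal marginal relevance.

    `turns` is a list of lists of (stemmed) content words.  The "query" is the
    content-word frequency vector of the whole segment; sim1/sim2 are taken to be
    plain inner products of term-frequency vectors (an assumption, since the
    paper's equations 2-4 are not shown in this extract).
    """
    query = Counter(w for turn in turns for w in turn)   # segment frequency vector
    tf = [Counter(turn) for turn in turns]               # per-turn frequency vectors

    def dot(u, v):
        return float(sum(u[w] * v[w] for w in u if w in v))

    unranked = list(range(len(turns)))
    ranked = []
    while unranked:
        def mmr_score(i):
            relevance = dot(tf[i], query)                 # (a) similarity to the query
            redundancy = max((dot(tf[i], tf[j]) for j in ranked), default=0.0)  # (b)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(unranked, key=mmr_score)               # next turn in the ranked list
        ranked.append(best)
        unranked.remove(best)
    return ranked

# Example: three short "turns" of stemmed content words.
print(mmr_rank_turns([["budget", "tax", "cut"], ["weather", "rain"], ["budget", "tax"]]))
```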
</Section>
<Section position="6" start_page="186" end_page="188" type="metho">
<SectionTitle> 4 Evaluation metrics </SectionTitle>
<Paragraph position="0"> The challenge of devising a meaningful evaluation metric for the task of audio summarization is that it has to be applicable both to the reference transcript (human transcript) and to the hypothesis transcripts (automatic speech recognizer (ASR) transcripts). We want to be able to assess the quality of the summary with respect to the relevance markings of the human annotators (see section 5), as well as to relate this &quot;summary accuracy&quot; to the word error rate present in the ASR transcripts.</Paragraph>
<Paragraph position="1"> The approach we take is to align the words in the summary with the words in the reference transcript (w_a). For ASR transcripts, word substitutions are aligned with their &quot;true original&quot; and word insertions are aligned with a NIL dummy. That way, we can determine for each individual word w_a in the summary (a) whether it occurs in a &quot;relevant phrase&quot; and (b) whether it is correctly recognized or a recognition error (for ASR transcripts). [Footnote 3: The human reference is considered to be an &quot;optimal&quot; or &quot;ideal&quot; rendering of the words which were actually said in a conversation. Human transcription errors do occur, but are marginal and hence ignored in the context of this paper.]</Paragraph>
<Paragraph position="2"> We define the word error rate as WER = (S + I + D) / (S + D + C) (I=insertion, D=deletion, S=substitution, C=correct).</Paragraph>
<Paragraph position="3"> Each word's relevance score r is the number of human annotators in whose relevant phrases it occurs, divided by the total number of annotators (0.0 ≤ r ≤ 1.0). Relevance scores for insertions and substitutions are always 0.0.</Paragraph>
<Paragraph position="4"> We define the summary accuracy sa (&quot;relevance&quot;) as the sum of the relevance scores of all n aligned words, sum_{i=1..n} r_i, divided by the maximum relevance score achievable with the same number of n words anywhere in the text (i.e., 0.0 ≤ sa ≤ 1.0). Word deletions obviously do not show up in the summary, but are accounted for as well, to make the WER computation sound.</Paragraph>
<Paragraph position="5"> To better illustrate how these metrics work, we demonstrate them on a simplified example of only two speaker turns (Figure 1). The first line represents the relevance score r for each word (the number of annotators who marked this word as part of a &quot;relevant phrase&quot; divided by the number of annotators for that text). In turn 1, &quot;this is to illustrate&quot; was only marked relevant by two annotators, whereas &quot;the idea&quot; was marked by 3 out of 4 annotators. The second line provides the reference transcript, the third line the ASR transcript. Line 4 gives the type of word error, and line 5 the confidence score of the speech recognizer (between 0.0 and 1.0, 1.0 meaning maximal confidence).</Paragraph>
[Figure 1 (recovered fragment of turn 1): reference &quot;this is to illustrate the idea&quot;; ASR hypothesis &quot;this is to ILLUMINATE *** idea&quot;.]
<Paragraph position="6"> Now let us assume that turn 2 shows up in the summary. The scores are computed as follows:
* When summarizing the reference: Here, the word error rate is trivially 0.0; the summary accuracy sa is the sum of all relevance scores (= 6.0) divided by the maximal achievable score with the same number of words (n = 7). Turn 2 has 6 words which were marked relevant by all coders (r = 1.0); turn 1's highest score is r = 0.75. Therefore: sa2 = 6.0/(6.0 + 0.75) = 0.89. This is higher than the summary accuracy for turn 1: sa1 = 3.5/6.0 = 0.58 (n = 6).
* When summarizing the ASR transcript (&quot;hypothesis&quot;): Selecting turn 2 will give sa2 = 0.0 (we can only count the relevance scores of the aligned words w_a which were correctly recognized; therefore the 1.0-scores in turn 2 cannot be used). Turn 2 has WER = 6/5 = 1.2, turn 1 has WER = 3/6 = 0.5.</Paragraph>
<Paragraph position="7"> Obviously, when summarizing the ASR output, we would rather have turn 1 showing up in the summary than turn 2, because turn 2 is completely off from the truth and turn 1 only partially. The fact that turn 2 was considered to be more relevant by the human coders cannot, in our opinion, be used to favor its inclusion in the summary. An exception would be a situation where the user has immediate access to the audio as well and is able to listen to selected passages from the summary (see section 1). In our case, where we focus on text-only summaries to be used stand-alone, we have to minimize their word error rate.</Paragraph>
<Paragraph position="8"> Given that, turn 1 has to be favored over turn 2, both because of its lower WER and because of its higher accuracy with respect to the relevance annotations.</Paragraph>
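A minimal sketch of how the two metrics defined above could be computed; the alignment format and the relevance-score lists are illustrative assumptions, and the numbers reproduce the reference-transcript side of the turn 1 / turn 2 example.

```python
def word_error_rate(alignment):
    """WER = (S + I + D) / (S + D + C), computed over aligned (ref, hyp) word pairs,
    where ref is None for an insertion and hyp is None for a deletion
    (an illustrative alignment format, not the authors' data structure)."""
    S = sum(1 for ref, hyp in alignment if ref and hyp and ref != hyp)
    I = sum(1 for ref, hyp in alignment if ref is None)
    D = sum(1 for ref, hyp in alignment if hyp is None)
    C = sum(1 for ref, hyp in alignment if ref and hyp and ref == hyp)
    return (S + I + D) / (S + D + C)

def summary_accuracy(summary_scores, all_scores):
    """sa = sum of the relevance scores of the n words in the summary, divided by
    the best score achievable with any n words of the text (0.0 <= sa <= 1.0)."""
    n = len(summary_scores)
    best = sorted(all_scores, reverse=True)[:n]
    return sum(summary_scores) / sum(best)

# Reference-transcript side of the worked example:
turn1 = [0.5, 0.5, 0.5, 0.5, 0.75, 0.75]   # "this is to illustrate the idea"
turn2 = [1.0] * 6 + [0.0]                   # six words marked relevant by all coders;
                                            # the remaining word is assumed unmarked
text_scores = turn1 + turn2
print(round(summary_accuracy(turn2, text_scores), 2))   # 6.0 / 6.75 = 0.89
print(round(summary_accuracy(turn1, text_scores), 2))   # 3.5 / 6.0  = 0.58
```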
<Paragraph position="9"> In order to increase the likelihood that turns with lower WER are selected over turns with higher WER, we make use of the speech recognizer's confidence scores, which are attached to every word hypothesis and can be viewed as probabilities: they are in [0.0, 1.0], with high values reflecting a high confidence in the correctness of the respective word.4 Following (Valenza et al., 1999) we conjecture that we can use these confidence scores to increase the probability that passages with lower WER show up in the summary. To test how far this assumption is justified, we correlated the WER with various metrics of confidence scores: (i) sum of scores, (ii) average of scores, (iii) number of scores above a threshold, (iv) the latter normalized by the number of all scores, and (v) the geometric mean of scores. Table 1 shows the correlation coefficients (Pearson r) for the four ASR transcripts we used in our experiments (see section 5). To prevent the influence of large differences in turn length, those computations were done for subsequent &quot;buckets&quot; of 50 words each.</Paragraph>
<Paragraph position="10"> Since in most cases we achieve the highest correlation coefficient (absolute value) for method (iv = avgth) (average number of words whose confidence score is greater than a threshold of 0.95), we apply this metric to the computation of the turn-query similarities (sim1 in equation 1). We use the two following formulae to adjust the similarity scores. (We shall call these adjustments MULT and EXP in the following.)</Paragraph>
<Paragraph position="11"> For both equations it holds that if α = 0.0, the scores don't change, whereas if α > 0.0, we enhance the weights of turns with many high confidence scores (&quot;boosting&quot;) and hence increase their likelihood of showing up earlier in the summary.5</Paragraph>
<Paragraph position="12"> Even though our evaluation method looks like it would &quot;guarantee&quot; an increase in summary accuracy when the word error rate is reduced, this is not necessarily the case. For example, it could turn out that while we can reduce WER by &quot;boosting&quot; passages with higher confidence scores, those passages might have (much) fewer words marked relevant than those present in the summary without boosting. This way, it would be conceivable to create low word error summaries that also contain very few relevant pieces of information. However, as we will see later, WER reduction goes hand in hand with an increase in summary accuracy.</Paragraph>
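Equations 5 and 6 are not reproduced in this extract, so the sketch below uses one plausible form for each adjustment, chosen only to match the description above (α = 0.0 leaves sim1 unchanged, α > 0.0 boosts turns with many high-confidence words); the exact formulae and all names are assumptions.

```python
import math

def avgth(confidences, threshold=0.95):
    """Metric (iv): fraction of a turn's words whose recognizer confidence
    exceeds the threshold (the metric that correlated best with WER)."""
    if not confidences:
        return 0.0
    return sum(1 for c in confidences if c > threshold) / len(confidences)

def boost_mult(sim1, conf_metric, alpha):
    """One plausible MULT adjustment: multiplicative in alpha * conf_metric."""
    return sim1 * (1.0 + alpha * conf_metric)

def boost_exp(sim1, conf_metric, alpha):
    """One plausible EXP adjustment: exponential in alpha * conf_metric."""
    return sim1 * math.exp(alpha * conf_metric)

# Two turns with equal query similarity but different recognizer confidence:
high_conf = [0.99, 0.97, 0.98, 0.60]
low_conf = [0.40, 0.55, 0.30, 0.70]
for conf in (high_conf, low_conf):
    print(boost_exp(1.0, avgth(conf), alpha=3.0))   # the high-confidence turn is boosted
```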
</Section>
<Section position="7" start_page="188" end_page="188" type="metho">
<SectionTitle> 5 Data characteristics and annotation </SectionTitle>
<Paragraph position="0"> Table 2 describes the main features of the corpus we used for our experiments: we selected four audio excerpts from four television shows, together with human generated textual transcripts. All these shows are conversations between multiple speakers. The audio was sampled at 16kHz and then also automatically transcribed using a gender independent, vocal tract length normalized, large vocabulary speech recognizer which was trained on about 80 hours of Broadcast News data (Yu et al., 1999). The average word error rates for our 4 recordings ranged from 25% to 50%.</Paragraph>
<Paragraph position="1"> The reference transcripts of the four recordings were given to six human annotators who had to segment them into topically coherent regions and to decide on the &quot;most relevant phrases&quot; to be included in a summary for each topical region. Those phrases usually do not coincide exactly with speaker turns, and the annotators were encouraged to mark sections of text freely such that they would form meaningful, concise, and informative phrases. Three annotators could listen to the audio while annotating the corpus; the other three only had the human generated transcripts available. Two of the six annotators only finished the NewsHour data, so we have the opinion of 4 annotators for the recordings BUCHANAN and GRAY and of 6 annotators for BACK and 19CENT.</Paragraph>
</Section>
<Section position="8" start_page="188" end_page="189" type="metho">
<SectionTitle> 6 Experiments on human generated transcripts </SectionTitle>
<Paragraph position="0"> We created summaries of the reference transcripts using different parameters for the MMR computation: for tf we used &quot;freq&quot;, &quot;log&quot;, and &quot;smax&quot;; further, we did or did not normalize these weights; finally, we varied the MMR λ from 0.85 to 1.0. Summarization accuracy was determined at 5%, 10%, 15%, 20%, and 25% of the text length of each summarized topical segment and then averaged over all sample points in all segments. Since these were word-based lengths, words were added incrementally to the summary in the order of the turns ranked via MMR; turns were cut off when the length limit was reached. As explained in the example in section 4, the accuracy score is defined as the fraction of the sum of all individual word relevance scores (as determined by human annotators) over the maximum possible score given the current number of words in the summary.</Paragraph>
<Paragraph position="1"> Table 3 shows the summary accuracy results for the best parameter setting (tf=log, no normalization).</Paragraph>
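A sketch of the word-based summary lengths and the averaged accuracy evaluation described above; the turn representation ((word, relevance) pairs in MMR rank order) and the helper names are assumptions for illustration.

```python
def summary_at_ratio(ranked_turns, ratio):
    """Build a word-length-limited summary from MMR-ranked turns: words are added
    in rank order and the last turn is cut off when the length limit is reached.
    Each turn is a list of (word, relevance_score) pairs (illustrative format)."""
    total = sum(len(turn) for turn in ranked_turns)
    limit = int(ratio * total)
    summary = []
    for turn in ranked_turns:
        room = limit - len(summary)
        if room <= 0:
            break
        summary.extend(turn[:room])
    return summary

def accuracy_over_lengths(ranked_turns, ratios=(0.05, 0.10, 0.15, 0.20, 0.25)):
    """Average the summary accuracy over the sample points used in section 6."""
    all_scores = [r for turn in ranked_turns for _, r in turn]
    results = []
    for ratio in ratios:
        chosen = [r for _, r in summary_at_ratio(ranked_turns, ratio)]
        best = sorted(all_scores, reverse=True)[:len(chosen)]
        results.append(sum(chosen) / sum(best) if best else 0.0)
    return sum(results) / len(results)
```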
</Section>
<Section position="9" start_page="189" end_page="189" type="metho">
<SectionTitle> 7 Experiments on automatically generated transcripts </SectionTitle>
<Paragraph position="0"> Using the same summarizer as before, we now created summaries from ASR transcripts. In addition to the summary accuracy, we now also evaluate the WER at each evaluation point. Again, we ran a series of experiments for different parameters of the MMR formula (tf=log, smax, freq; with/without normalization). As before, we achieved the best results for non-normalized scores and tf=log. We varied α from 0.0 to 10.0 to see how much of an effect we would get from the &quot;boosting&quot; of turns with many high confidence scores (see equations 5 and 6).</Paragraph>
<Paragraph position="1"> The EXP formula yielded better results than MULT (Table 4); the optimum for EXP was reached for α = 3.0 with a WER of 26.6%, an absolute improvement of over 8% over the average of WER=35.1% for the complete ASR transcripts (non-summarized). The summarization accuracy peaks at 0.47, a 9% absolute improvement over the α = 0.0 baseline and only about 5% absolute lower than for reference summaries (Table 4 and Figure 2).</Paragraph>
<Paragraph position="2"> When we compare the baseline of α = 0.0 (i.e., no &quot;boosting&quot; of high confidence turns) to the best result (α = 3.0), we see that the WER drops markedly by about 12% relative, from 30.1% to 26.6%. At the same time, the summarization accuracy increases by about 18% relative, from 0.401 to 0.472.</Paragraph>
[Footnote: If we use non-normalized scores, the value of the MMR λ does not have any measurable effect; we assigned it to be 0.95 for all subsequent experiments.]
[Figure 2: results with EXP boosting (0 ≤ α ≤ 7).]
<Paragraph position="3"> Results for the MULT formula confirm this trend, but the effect is considerably weaker: approximately 6% WER reduction and 14% accuracy improvement for α = 10.0 over the α = 0.0 baseline.</Paragraph>
<Paragraph position="4"> An appendix (section 11) provides an example of actual summaries generated by our system for the first topical segment of the BACK conversation. It illustrates how WER reduction and summary accuracy improvement can be achieved by using our confidence boosting method.</Paragraph>
</Section>
</Paper>