Class Model Adaptation for Speech Summarisation

2 Summarisation Method

The summarisation system used in this paper is essentially the same as that described in (Kikuchi et al., 2003), which involves a two-step summarisation process consisting of sentence extraction and sentence compaction. In practice, only the sentence extraction step was used in this paper, as preliminary experiments showed that compaction had little impact on results for the data used in this study.

Important sentences are first extracted according to the following score, computed for each sentence W = w_1, w_2, \ldots, w_N obtained from the automatic speech recognition output:

    S(W) = \frac{1}{N} \sum_{i=1}^{N} \left\{ \alpha_C C(w_i) + \alpha_I I(w_i) + \alpha_L L(w_i) \right\}    (1)

where N is the number of words in the sentence W, and C(w_i), I(w_i) and L(w_i) are the confidence score, the significance score and the linguistic score of word w_i, respectively. \alpha_C, \alpha_I and \alpha_L are the respective weighting factors of those scores, determined experimentally.

For each word in the automatic speech recognition transcription, the logarithm of its posterior probability, i.e. the ratio of the word hypothesis probability to that of all other hypotheses, is calculated from a word graph obtained from the speech recogniser and used as the confidence score.

For the significance score, the frequencies of occurrence of 115k words were estimated from the WSJ and Brown corpora.

In the experiments in this paper we modified the linguistic component to use combinations of different linguistic models. The linguistic component gives the linguistic likelihood of word strings in the sentence. Starting from a baseline LiM (LiM_B), we perform LiM adaptation by linearly interpolating the baseline model with other component models trained on different data. The probability of a given n-gram sequence then becomes:

    P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \sum_k \lambda_k P_k(w_i \mid w_{i-n+1} \ldots w_{i-1})    (2)

where \sum_k \lambda_k = 1, and \lambda_k and P_k are the weight and the probability assigned by model k.

In the case of a two-sided class-based model,

    P_k(w_i \mid w_{i-n+1} \ldots w_{i-1}) = P_k(w_i \mid C(w_i)) \, P_k(C(w_i) \mid C(w_{i-n+1}) \ldots C(w_{i-1}))    (3)

where P_k(w_i \mid C(w_i)) is the probability of the word w_i given its class C(w_i), and P_k(C(w_i) \mid C(w_{i-n+1}) \ldots C(w_{i-1})) is the probability of the word class C(w_i) appearing after the history of word classes C(w_{i-n+1}), \ldots, C(w_{i-1}).

Different types of component LiM are built from different sources of data, either as word models or as class models. The LiM_B and the component LiMs are then combined for adaptation using linear interpolation as in Equation (2). The linguistic score is then computed from this modified probability as in Equation (4):

    L(w_i) = \log P(w_i \mid w_{i-n+1} \ldots w_{i-1})    (4)
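As an illustration of the extraction step, the following sketch computes the sentence score of Equation (1). It assumes the per-word scores have already been produced by the recogniser, the significance model and the (adapted) linguistic model; all names here are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ScoredWord:
    """Per-word scores from Section 2 (field names are illustrative)."""
    word: str
    confidence: float    # C(w_i): log word posterior from the word graph
    significance: float  # I(w_i): corpus-based significance score
    linguistic: float    # L(w_i): log probability from the (adapted) LiM

def sentence_score(words, a_c, a_i, a_l):
    """Equation (1): length-normalised weighted sum of the word scores."""
    total = sum(a_c * w.confidence + a_i * w.significance + a_l * w.linguistic
                for w in words)
    return total / len(words)
```

Sentences are then ranked by this score and extracted until the target summarisation ratio is reached.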
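The adapted linguistic score of Equations (2)-(4) can be sketched in the same way. This is a minimal sketch under two assumptions: the class assignments and component probabilities are hypothetical dictionary-backed stand-ins for the trained models, and the linguistic score is taken as the log of the interpolated probability, following the reconstruction of Equation (4) above.

```python
import math

def class_ngram_prob(model, word, history):
    """Equation (3): P_k(w_i | h) = P_k(w_i | C(w_i)) * P_k(C(w_i) | C(h))."""
    word_class = model["class_of"][word]
    history_classes = tuple(model["class_of"][w] for w in history)
    return (model["p_word_given_class"][word, word_class]
            * model["p_class_given_history"][word_class, history_classes])

def interpolated_prob(models, lambdas, word, history):
    """Equation (2): linear interpolation of LiM_B with the component LiMs."""
    assert abs(sum(lambdas) - 1.0) < 1e-9  # the weights must sum to one
    return sum(lam * class_ngram_prob(m, word, history)
               for lam, m in zip(lambdas, models))

def linguistic_score(models, lambdas, word, history):
    """Equation (4): linguistic score as the log of the adapted probability."""
    return math.log(interpolated_prob(models, lambdas, word, history))
```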
3 Evaluation Criteria

3.1 Summarisation Accuracy

To automatically evaluate the summarised speeches, correctly transcribed talks were manually summarised and used as the correct targets for evaluation. Variations of the manual summarisation results are merged into a word network, which is considered to approximately express all possible correct summarisations, covering subjective variation. The word accuracy of an automatic summarisation against this network is calculated as the summarisation accuracy (SumACCY) (Hori et al., 2003):

    SumACCY = \frac{Len - Sub - Ins - Del}{Len} \times 100\%

where Sub is the number of substitution errors, Ins the number of insertion errors, Del the number of deletion errors, and Len the number of words in the most similar word string in the network.
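A minimal sketch of this computation follows, under the simplifying assumption that the word network has been expanded into an explicit list of candidate word strings (an actual implementation would align against the network directly).

```python
def word_accuracy(hyp, ref):
    """Accuracy of hypothesis `hyp` against one reference word string `ref`:
    (Len - Sub - Ins - Del) / Len, via a standard Levenshtein alignment."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions
    for j in range(n + 1):
        d[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return (len(ref) - d[m][n]) / len(ref)

def sum_accy(hyp, network_strings):
    """SumACCY: accuracy against the most similar word string in the network,
    here approximated by an explicit list of the network's paths."""
    return 100.0 * max(word_accuracy(hyp, ref) for ref in network_strings)
```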
3.2 ROUGE

Version 1.5.5 of the ROUGE scoring algorithm (Lin, 2004) is also used for evaluating results. ROUGE F-measure scores are given for ROUGE-2 (bigram), ROUGE-3 (trigram) and ROUGE-SU4 (skip bigram), using the model-average metric (the average score across all references).

4 Experimental Setup

Experiments were performed on spontaneous speech, using 9 talks taken from the Translanguage English Database (TED) corpus (Lamel et al., 1994; Wolfel and Burger, 2005), each transcribed and manually summarised by nine different humans at both 10% and 30% summarisation ratios. Speech recognition transcriptions (ASR) were obtained for each talk, with an average word error rate of 33.3%.

A corpus consisting of around ten years of conference proceedings (17.8M words) on speech and signal processing is used to generate the LiM_B and the word classes, using the clustering algorithm of (Ney et al., 1994).

Different types of component LiM are built and combined for adaptation as described in Section 2. The first type of component linguistic model is built on the small corpus of hand-made summaries described above, made at the same summarisation ratio as the one being generated. For each talk, the hand-made summaries of the other eight talks (i.e. 72 summaries) were used as the LiM training corpus. This type of LiM is expected to help generate automatic summaries in the same style as the manual ones.

The second type of component linguistic model is built from the paper in the conference proceedings associated with the talk to be summarised. This type of LiM, used for topic adaptation, is investigated because keywords and important sentences appearing in the associated paper are expected to have high information value and should be selected during the summarisation process.

Three sets of experiments were run. In the first (Word), the LiM_B and both component models are word models, as introduced in (Chatain et al., 2006). In the second (Class), both the LiM_B and the component models are class models built on exactly the same data as the word models. In the third (Mixed), the LiM_B is an interpolation of class and word models, while the component LiMs are class models.

To make the best use of the available data, a rotating form of cross-validation (Duda and Hart, 1973) is used: all talks but one are used for development, and the remaining talk is used for testing. Summaries of the development talks are generated automatically by the system using different sets of parameters with the LiM_B. These summaries are evaluated, and the set of parameters that maximises the development score for the LiM_B is selected for the remaining talk; the purpose of this development phase is to choose the most effective combination of the weights \alpha_C, \alpha_I and \alpha_L. The summary generated for each talk with its optimised parameters is then evaluated with the same metric, which gives the baseline for that talk.

Using the same parameters as those selected for the baseline, summaries are generated for the lectures in the development set with different LiM interpolation weights \lambda_k; values between 0 and 1, in steps of 0.1, were investigated, and an optimal set of \lambda_k selected. With these interpolation weights and the baseline parameters, a summary of the test talk is generated and evaluated with the same metric, giving the adapted result for that talk. Averaging these results over the test set (i.e. all talks) gives the final adapted result. This process is repeated for all evaluation metrics and all three experiments (Word, Class and Mixed).

Lower-bound results are given by random summarisation (Random), i.e. randomly extracting sentences and words, without using the scores of Equation (1), at the appropriate summarisation ratios.
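The tuning procedure described above can be sketched as follows, simplified to a baseline plus a single component LiM. The parameter grids and the `summarise`/`evaluate` callables are hypothetical stand-ins: the paper specifies the 0.1 step for the \lambda_k but not the \alpha search grid.

```python
# Hypothetical parameter grids; the paper only specifies the lambda step of 0.1.
ALPHA_GRID = [(ac, ai, al)
              for ac in (0.5, 1.0, 2.0)
              for ai in (0.5, 1.0, 2.0)
              for al in (0.5, 1.0, 2.0)]
LAMBDA_GRID = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

def rotating_cross_validation(talks, summarise, evaluate):
    """Leave-one-out tuning: develop on all talks but one, test on the held-out
    talk. `summarise(talk, alphas, lambdas)` and `evaluate(summary, talk)` are
    hypothetical stand-ins for the summariser and the chosen metric."""
    results = []
    for held_out in talks:
        dev = [t for t in talks if t is not held_out]
        # Step 1: choose (a_C, a_I, a_L) maximising the development score
        # with the baseline LiM_B alone (lambdas=None).
        best_alphas = max(
            ALPHA_GRID,
            key=lambda a: sum(evaluate(summarise(t, a, None), t) for t in dev))
        # Step 2: with the alphas fixed, choose the interpolation weight of the
        # component LiM; the weights are (1 - lam, lam) for LiM_B and the component.
        best_lam = max(
            LAMBDA_GRID,
            key=lambda lam: sum(
                evaluate(summarise(t, best_alphas, (1 - lam, lam)), t)
                for t in dev))
        # Step 3: summarise the held-out talk with the tuned parameters.
        summary = summarise(held_out, best_alphas, (1 - best_lam, best_lam))
        results.append(evaluate(summary, held_out))
    return sum(results) / len(results)  # final adapted result over all talks
```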