<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1110"> <Title>Semantic Similarity Applied to Spoken Dialogue Summarization</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Text, Speech and Dialogue Summarization </SectionTitle> <Paragraph position="0"> Most research on automatic summarization dealt with written text. This work was based either on corpus-based, statistical methods or on knowledge-based techniques (for an overview over both strands of research see Mani & Maybury (1999)). Recent advances in text summarization are mostly due to statistical techniques with some additional usage of linguistic knowledge, e.g. (Marcu, 2000; Teufel & Moens, 2002), which can be applied to unrestricted input.</Paragraph> <Paragraph position="1"> Research on speech summarization focused mainly on single-speaker, written-to-be-spoken text (e.g. spoken news, political speeches, etc.). The methods were mostly derived from work on text summarization, but extended it by exploiting particular characteristics of spoken language, e.g. acoustic confidence scores or intonation. Difficulties arise because speech recognition systems are not perfect.</Paragraph> <Paragraph position="2"> Therefore, spoken dialogue summarization systems have to deal with errors in the input. There are no sentence boundaries in spoken language either.</Paragraph> <Paragraph position="3"> Work on spoken dialogue summarization is still in its infancy (Reithinger et al., 2000; Zechner, 2002). Multiparty dialogue is much more difficult to process than written text. In addition to the difficulties speech summarization has to face, spoken dialogue contains a whole range of dialogue phenomena as disfluencies, hesitations, interruptions, etc. Also, the information to be summarized may be contributed by different speakers (e.g. in question-answer pairs). Finally, the language used in spoken dialogue differs from language used in texts. Because discourse participants are able to immediately clarify misunderstandings, the language used does not have to be that explicit.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Semantic Similarity </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Semantic Similarity Metrics </SectionTitle> <Paragraph position="0"> Experiments reported here employed Ted Pedersen's (2002) semantic similarity package. We applied five of the metrics, which rely on WordNet as a knowledge base and were developed in the context of work on word sense disambiguation. The first measure is Leacock and Chodorow's (1998) Normalized Path Length (we will refer to it as lch). Semantic similarity sim between words w1 and w2 is defined as given in Equation 1:</Paragraph> <Paragraph position="2"> len(c1;c2) is the length of the shortest path between them. D is the maximum depth of the taxonomy.</Paragraph> <Paragraph position="3"> The following measures incorporate an additional, qualitatively different knowledge source based on some kind of corpus analysis. The extended gloss overlaps measure introduced by Banerjee & Pedersen (2003) (referred to as lesk in the following) is based on the number of shared words (overlaps) in the WordNet definitions (glosses) of the respective concepts. It also extends the glosses to include the definitions of concepts related to the concept under consideration based on the WordNet hierarchy. 
Formally, semantic relatedness sim between words w1 and w2 is defined by the following equation:
\[ sim_{lesk}(w_1, w_2) = \sum_{r_1 \in R} \sum_{r_2 \in R} score\bigl(r_1(c_1), r_2(c_2)\bigr) \qquad (2) \]
where R is a set of semantic relations, r(c) denotes the gloss of the concept related to c by relation r, and score() is a function accepting two glosses as input, finding overlaps between them, and returning a corresponding relatedness score.</Paragraph>
<Paragraph position="2"> The remaining three methods require an additional knowledge source, an information content file (ICF). This file contains information content values for WordNet concepts, which are needed for computing the semantic similarity score for two concepts. Information content values are based on the frequency counts for the respective concepts. Resnik (1995) (res for short) calculates the information content of the concept that subsumes the given two concepts:
\[ sim_{res}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \bigl[ -\log p(c) \bigr] \qquad (3) \]
where S(c1, c2) is the set of concepts which subsume both c1 and c2, and -log p(c) is the negative log likelihood (information content). The probability p is computed as the relative frequency of the concept. Resnik's measure is based on the intuition that the semantic similarity between concepts may be quantified on the basis of the information shared between them. In this case the WordNet hierarchy is used to determine the closest super-ordinate of a pair of concepts.</Paragraph>
<Paragraph position="3"> Jiang & Conrath (1997) proposed to combine edge- and node-based techniques, counting the edges and enhancing the count with the node-based calculation of information content introduced by Resnik (1995) (the method is abbreviated as jcn). The distance between two concepts c1 and c2 is formalized as given in Equation 4:
\[ dist_{jcn}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(lso(c_1, c_2)) \qquad (4) \]
where IC is the information content value of a concept, and lso(c1, c2) is the closest subsumer of the two concepts.</Paragraph>
<Paragraph position="4"> The last method is that of Lin (1998) (we call this metric lin). He defined semantic similarity using a formula derived from information theory. This measure is sometimes called a universal semantic similarity measure, as it is supposed to be application-, domain-, and resource-independent. According to this method, the similarity is given in Equation 5:
\[ sim_{lin}(c_1, c_2) = \frac{2 \cdot \log p(lso(c_1, c_2))}{\log p(c_1) + \log p(c_2)} \qquad (5) \]
</Paragraph>
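<Paragraph position="5"> As an illustration of how such metrics can be computed in practice, the following sketch uses NLTK's WordNet interface and the Brown-corpus information content file; this is an assumption of the example only, since the experiments reported here were run with Ted Pedersen's semantic similarity package, and the gloss-overlap (lesk) measure has no direct NLTK counterpart. The word pair is purely illustrative.</Paragraph>
```python
# Sketch of the WordNet-based metrics above, computed with NLTK rather than
# the package used in the paper. Requires the 'wordnet' and 'wordnet_ic'
# NLTK data packages (nltk.download('wordnet'); nltk.download('wordnet_ic')).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information content file (the ICF needed by res, jcn and lin),
# derived here from the Brown corpus.
brown_ic = wordnet_ic.ic('ic-brown.dat')

c1 = wn.synset('home.n.01')   # first WordNet concept (illustrative)
c2 = wn.synset('house.n.01')  # second WordNet concept (illustrative)

print('lch:', c1.lch_similarity(c2))             # normalized path length (Eq. 1)
print('res:', c1.res_similarity(c2, brown_ic))   # IC of the most informative subsumer (Eq. 3)
print('jcn:', c1.jcn_similarity(c2, brown_ic))   # inverse of the jcn distance (Eq. 4)
print('lin:', c1.lin_similarity(c2, brown_ic))   # information-theoretic ratio (Eq. 5)
```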
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Semantic Similarity in Summarization </SectionTitle>
<Paragraph position="0"> The process of automatic dialogue summarization, as defined in the context of this work, means to extract the most relevant utterances from the dialogue.</Paragraph>
<Paragraph position="1"> We restate this as a classification problem, similar to the definition given by Kupiec et al. (1995): utterances are classified as relevant or irrelevant for the summary of a specific dialogue. By relevant utterances we mean those carrying the most essential parts of the dialogue's content. The summarization task, then, is to extract from the transcript the set of utterances which a human would use to make a dialogue summary.</Paragraph>
<Paragraph position="2"> The key idea behind the algorithm presented here is to quantify the degree of semantic similarity between a given utterance and the whole dialogue. We argue that semantic similarity between an utterance and the dialogue as a whole represents an appropriate criterion for the selection of relevant utterances.</Paragraph>
<Paragraph position="3"> We describe each of the processing steps, employing the example dialogue D from Table 1. This example consists of the set of utterances {Utt1, ..., Utt11}.</Paragraph>
<Paragraph position="4"> The semantic similarity algorithms introduced in Section 3.1 operate on the noun portion of WordNet. Our approach to dialogue summarization, as previously stated, is to compute semantic similarity for a given pair {Uttn, D}. In order to do that, we require a WordNet-based conceptual representation of both Uttn, i.e. CRUttn, and D, i.e. CRD, and compare them using the semantic similarity measures. Therefore, we map the nouns contained in the utterances to their respective WordNet senses and operate on these representations in the subsequent steps. The results of this operation are shown in Table 2; the number in the last column indicates the disambiguated WordNet sense.</Paragraph>
<Paragraph position="5"> The resulting dialogue representation CRD is the set of concepts obtained by adding up the individual utterance representations, i.e. CRD = {home, home, sixties, pier, beam, house, bedrooms, bath, area, Houston}.</Paragraph>
<Paragraph position="6"> For each utterance Uttn, we create a two-dimensional matrix C with the dimensions (#CRD × #CRUttn), where # denotes the number of elements in the set; each cell (i, j) pairs the i-th concept of CRD with the j-th concept of CRUttn. An example of such a matrix is given in Table 3. Then, we compute the semantic similarity SSscore(i, j) for each pair of concepts, employing any of the semantic similarity metrics described above. The semantic similarity score SSfinal for CRUttn and CRD is then defined as the average pairwise semantic similarity between all concepts in CRUttn and CRD:
\[ SS_{final}(Utt_n, D) = \frac{1}{\#CR_D \cdot \#CR_{Utt_n}} \sum_{i=1}^{\#CR_D} \sum_{j=1}^{\#CR_{Utt_n}} SS_{score}(i, j) \]
</Paragraph>
<Paragraph position="7"> Computing SSfinal results in a list of utterances with scores from the respective scoring methods. The scores are taken from the real data, i.e. they have been normalized with respect to the conceptual representation of the whole dialogue and not of the dialogue fragment given in Table 1; the rankings were produced for this specific example to make it more illustrative. In order to produce a summary of the dialogue, the utterances first have to be sorted numerically, i.e. ranked on the basis of their scores; see Table 4 for the results of the ranking procedure. Given a compression rate COMP in the range [1, 100], the number of utterances classified as relevant by an individual scoring method, PNr, is a function of the total number of utterances in the dialogue:
\[ P_{Nr} = \frac{COMP}{100} \cdot N_{Utt} \]
where N_{Utt} is the total number of utterances in the dialogue. Given a specific compression rate COMP, the top-ranked PNr utterances are automatically classified as relevant. Returning to the example in Table 1, we obtain the summaries given in Table 5.</Paragraph>
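<Paragraph position="8"> A minimal sketch of this selection procedure is given below, assuming that each utterance has already been mapped to its disambiguated WordNet noun senses (as in Table 2). The function names (ss_score, ss_final, summarize), the use of the lin metric for SSscore, and the rounding of PNr are illustrative choices, not specifications taken from this paper.</Paragraph>
```python
# Sketch of the utterance-selection procedure: score each utterance by its
# average pairwise concept similarity to the whole dialogue, rank, and keep
# the top COMP percent. Illustrative only; not the paper's implementation.
from math import ceil
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

def ss_score(c1, c2):
    """Pairwise concept similarity SSscore(i, j); lin is used here as one of the five metrics."""
    return c1.lin_similarity(c2, brown_ic) or 0.0

def ss_final(cr_utt, cr_dialogue):
    """Average pairwise similarity between the concepts of CR_Utt and CR_D."""
    if not cr_utt or not cr_dialogue:
        return 0.0
    total = sum(ss_score(ci, cj) for ci in cr_dialogue for cj in cr_utt)
    return total / (len(cr_dialogue) * len(cr_utt))

def summarize(utterance_reps, comp=10):
    """Rank utterances by SS_final and keep the top COMP percent as relevant."""
    cr_d = [c for cr in utterance_reps for c in cr]     # CR_D: concepts of all utterances
    ranked = sorted(range(len(utterance_reps)),
                    key=lambda i: ss_final(utterance_reps[i], cr_d),
                    reverse=True)
    p_nr = ceil(comp / 100 * len(utterance_reps))       # P_Nr relevant utterances
    return sorted(ranked[:p_nr])

# Illustrative input: three utterances represented by their noun senses.
utts = [
    [wn.synset('home.n.01'), wn.synset('bedroom.n.01')],
    [wn.synset('house.n.01'), wn.synset('bathroom.n.01'), wn.synset('area.n.01')],
    [wn.synset('topic.n.01')],
]
print(summarize(utts, comp=34))   # indices of utterances selected as relevant
```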
</Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Data </SectionTitle>
<Paragraph position="0"> The data used in the experiments are 20 randomly chosen Switchboard dialogues (Greenberg, 1996).</Paragraph>
<Paragraph position="1"> These dialogues contain two-sided telephone conversations among American speakers of at least 10 minutes' duration. The callers were given a certain topic for discussion. The recordings of spontaneous speech were then transcribed. Statistical data about the corpus, i.e. total numbers and averages for separate dialogues, are given in Table 6. Tokens are defined as running words and punctuation.</Paragraph>
<Paragraph position="2"> An utterance is a complete unit of speech spoken by a single speaker, while a turn is a joint sequence of utterances produced by one speaker.</Paragraph>
<Paragraph position="3"> In the annotation experiments, we tested whether humans could reliably determine the utterances conveying the overall meaning of the dialogue. Therefore, each utterance is assumed to be a markable, i.e. the expression to be annotated, resulting in a total of 3275 markables in the corpus. Three annotators were instructed to select the most important utterances. They were asked to first read the dialogue and then to mark about 10% of all utterances in the dialogue as being relevant. We then produced two kinds of Gold Standards from these data.</Paragraph>
<Paragraph position="4"> Gold Standard 1 included the utterances which were marked by all three annotators as being relevant.</Paragraph>
<Paragraph position="5"> Gold Standard 2 included the utterances which were selected by at least two annotators.</Paragraph>
<Paragraph position="6"> Table 7 shows the results of these experiments. We present the absolute number of markables selected as relevant by the separate annotators and in the two Gold Standards, as well as the corresponding percentage of the total number of markables (3275). As the table shows, Gold Standard 1 includes only 3.69% of all markables. Therefore, we used Gold Standard 2 in the evaluation reported in Section 5. The Kappa coefficient for inter-annotator agreement varied from 0.1808 to 0.6057 for individual dialogues.</Paragraph>
<Paragraph position="7"> An examination of the dialogue with the very low Kappa rate showed that it was one of the shortest ones and did not have a well-defined topical structure, resulting in a low agreement rate between annotators. For the whole corpus, the Kappa coefficient was 0.4309. While this is not a high agreement rate on a general scale, it is comparable to what has been reported for summarization tasks, and for dialogue summarization in particular.</Paragraph>
</Section> </Paper>