File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/h01-1001_metho.xml
Size: 13,611 bytes
Last Modified: 2025-10-06 14:07:33
<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1001"> <Title>Activity detection for information access to oral communication</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. ACTIVITY DETECTION </SectionTitle> <Paragraph position="0"> We are interested in the detection of activities that are described by action verbs and have annotated those in two databases: meetings have been collected at Interactive Systems Labs at CMU (Waibel et al., 1998) and a subset of 8 meetings has been annotated. Most of the meetings are by the data annotation group itself and are fairly informal in style. The participants are often well acquainted and meet each other a lot besides their meetings.</Paragraph> <Paragraph position="1"> Santa Barbara (SBC) is a corpus released by the LDC and 7 out of 12 rejoinders have been annotated.</Paragraph> <Paragraph position="2"> The annotator has been instructed to segment the rejoinders into units that are coherent with respect to their topic databases contain a lot of discussing, informing and story-telling activities however the meeting data contains a lot more planning and advising.</Paragraph> <Paragraph position="3"> and activity and annotate them with an activity which follows the intuitive de nition of the action-verb such as discussing, planning, etc. Additionally an activity annotation manual containing more speci c instructions has been available (Ries et al., 2000; Thym e-Gobbel et al., 2001) 3. The list of tags and the distribution can be seen in Tab. 1. The set of activities can be clustered into \interactive&quot; activities of equal contribution rights (discussion,planning), one per-son being active (advising, information giving, story-telling), interrogations and all others.</Paragraph> <Paragraph position="4"> meeting dialogues and Santa Barbara corpus have been annotated by a semi-naive coder and the rst author of the paper. The -coe cient is determined as in Carletta et al. (1997) and mutual information measures how much one label \informs&quot; the other (see Sec. 3). For CallHome Spanish 3 dialogues were coded for activities by two coders and the result seems to indicate that the task was easier.</Paragraph> <Paragraph position="5"> Both datasets have been annotated not only by a semi-naive annotator but also by the rst author of the paper. The results for -statistics (Carletta et al., 1997) and mutual information between the coders can be seen in Tab. 2. The intercoder agreement would be considered moderate but compares approximately to Carletta et al. (1997) agreement on transactions ( = 0:59), especially for the interactive activities and CallHome Spanish.</Paragraph> <Paragraph position="6"> For classi cation a neural network was trained that uses the softmax function as its output and KL-divergence as 3 In contrast to (Ries et al., 2000; Thym e-Gobbel et al., 2001) the \consoling&quot; activity has been eliminated and an \informing&quot; activity has been introduced for segments where one or more than one member of the rejoinder give information to the others. Additionally an \introducing&quot; activity was added to account for a introduction of people or topics at the beginning of meetings.</Paragraph> <Paragraph position="7"> on the Santa Barbara Corpus (SBC) and the meeting database (meet) either without clustering the activities (all) or clustering them according to their interactivity (interactive) (see Sec. 2 for details).</Paragraph> <Paragraph position="8"> the error function. 
<Paragraph position="9"> The features used for classification are the following. words: the 50 most frequent word / part-of-speech pairs are used directly; all other pairs are replaced by their part of speech.</Paragraph>
<Paragraph position="10"> stylistic features: adapted from Biber (1988); they contain mostly syntactic constructions and some word classes.</Paragraph>
<Paragraph position="11"> Wordnet: a total of 40 verb and noun classes (so-called lexicographer classes (Fellbaum, 1998)) are defined, and a word is replaced by the most frequent class over all possible meanings of the word.</Paragraph>
<Paragraph position="12"> dialogue acts: statements, questions, backchannels, etc. are detected using a language-model-based detector trained on Switchboard, similar to Stolcke et al. (2000); the following choices were made: (a) the dialogue model is context-independent and (b) only the parts of speech, plus the 50 most likely word/part-of-speech types, are taken as the input to the model.</Paragraph>
<Paragraph position="13"> dominance: described as the distribution of speaker dominance in a conversation. The distribution is represented as a histogram, and speaker dominance is measured as the average dominance of the dialogue acts (Linell et al., 1988) of each speaker. The dialogue acts are detected automatically, and dominance is a numeric value assigned to each dialogue act type. Dialogue act types that restrict the options of the conversation partners carry high dominance (questions); dialogue acts that signal understanding (backchannels) carry low dominance.</Paragraph>
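As an illustration of the dominance feature just described, the sketch below derives the histogram for one segment from detected dialogue acts. The numeric dominance values per dialogue-act type and the number of histogram bins are invented placeholders; the paper only states that option-restricting acts such as questions receive high dominance and backchannels low dominance.

from collections import defaultdict
import numpy as np

# Hypothetical dominance values per dialogue-act type (not taken from the paper):
# acts that restrict the partner's options score high, backchannels score low.
DOMINANCE = {"question": 1.0, "statement": 0.5, "backchannel": 0.0}

def dominance_histogram(dialogue_acts, n_bins=5):
    """dialogue_acts: list of (speaker_id, act_type) pairs for one segment.

    Returns a fixed-length histogram of per-speaker average dominance,
    which serves as the 'dominance' feature vector of the segment.
    """
    per_speaker = defaultdict(list)
    for speaker, act in dialogue_acts:
        per_speaker[speaker].append(DOMINANCE.get(act, 0.5))
    averages = [np.mean(vals) for vals in per_speaker.values()]
    hist, _ = np.histogram(averages, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)   # normalised distribution over speakers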
<Paragraph position="14"> (Table caption:) The activities used for classification are those of the semi-naive coder. The "first author" column describes the "accuracy" of the first author with respect to the semi-naive coder.</Paragraph>
<Paragraph position="15"> The detection of interactive activities works fairly well using the dominance feature on SBC, which is also natural since the relative dominance of the speakers should describe what kind of interaction is exhibited. The dialogue act distribution, on the other hand, works fairly well on the more homogeneous meeting database, where there is a better chance to see generalizations from more specific dialogue-based information. Overall, the combination of more than one feature is important: word-level, Wordnet and stylistic information, while sometimes successful, can improve the result in combination but do not provide good features by themselves. The meeting data is also more difficult, which might be due to its informal style.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. INFORMATION ACCESS ASSESSMENT </SectionTitle>
<Paragraph position="0"> Assuming a probabilistic information retrieval model, a query r (in our example an activity) predicts a document d with the probability q(d|r) = q(r|d) q(d) / q(r). Let p(d,r) be the true probability mass distribution of these quantities. The probability mass function q(r|d) is estimated on a separate training set by a neural network based classifier. The quantity we are interested in is the reduction in expected coding length of the document obtained by using the neural network based detector: E_p[log q(D|R)/q(D)] ≈ H(R) - E_p[log 1/q(R|D)]. The two expectations correspond exactly to the measures in Tab. 5: the first represents the baseline, the second the one for the respective classifier. In more standard information-theoretic notation this quantity may be written as the mutual information I(D;R) = H(D) - H(D|R). This equivalence is not extremely useful, though, since the quantities in parentheses cannot be estimated separately. For the small meeting database and SBC, however, no entropy reductions could be obtained. On the larger databases, on the other hand, entropy reductions could be obtained.</Paragraph>
<Paragraph position="3"> Another option is to assume that the labels of one coder are part of D. If the query by the other coder is R, we are interested in the reduction of the document entropy given the query. If we furthermore assume that H(R|D) = H(R|R'), where R' is the activity label embedded in D, the reduction becomes H(R) - H(R|D) = H(R) - H(R|R') = I(R;R'). Tab. 2 shows that the labels of the semi-naive coder and the first author only inform each other by 0.25 to 0.65 bits. However, since all available constraints are worth applying, it might still be useful to include manual annotations to be matched by a query or to show them in a graphical presentation of the output results.</Paragraph>
<Paragraph position="6"> Another interesting question to consider is whether the activity is correlated with the rejoinder or not. This question is important since a correlation of the activity with the rejoinder would mean that the indexing performance of activities needs to be compared to other indices that apply to rejoinders, such as attendance, time and place (for results on the correlation with rejoinders see Waibel et al. (2001)). The correlation can be measured using the mutual information between the activity and the meeting identity. The mutual information is moderate for SBC (about 0.67 bit) and much lower for the meetings (about 0.20 bit). This also corresponds to our intuition since some of the rejoinders in SBC belong to very distinct dialogue genres while the meeting database is homogeneous. The conclusion is that activities are useful for navigation within a rejoinder if the database is homogeneous, and they might be useful for finding conversations in a more heterogeneous database.</Paragraph> </Section>
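To make the assessment in Sec. 3 concrete, the following sketch (hypothetical variable names; logs taken in base 2 so the results are in bits) estimates the baseline entropy H(R) and the classifier cross-entropy E_p[log 1/q(R|D)] on held-out segments; their difference is the entropy reduction discussed above.

import numpy as np

def entropy_reduction(true_labels, predicted_probs):
    """true_labels: array of activity indices r for held-out segments d.
    predicted_probs: array (n_segments, n_activities) holding q(r|d) from the classifier.

    Returns (H(R), cross-entropy, reduction) in bits.
    """
    true_labels = np.asarray(true_labels)
    n = len(true_labels)
    counts = np.bincount(true_labels, minlength=predicted_probs.shape[1])
    p_r = counts / n
    h_r = -np.sum(p_r[p_r > 0] * np.log2(p_r[p_r > 0]))          # baseline H(R)
    q = np.clip(predicted_probs[np.arange(n), true_labels], 1e-12, 1.0)
    cross_ent = -np.mean(np.log2(q))                              # E_p[log 1/q(R|D)]
    return h_r, cross_ent, h_r - cross_ent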
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. DETECTION OF SUB-DATABASES </SectionTitle>
<Paragraph position="0"> We set up an environment for TV shows that records the subtitles with timestamps continuously from one TV channel; the channel was switched every other day. At the same time, the TV program guide was downloaded from http://tv.yahoo.com/ to obtain programming information, including the genre of the show. This yielded a large database of TV shows (1067 shows) recorded over a period of a couple of months until April 2000 in Pittsburgh, PA. Yahoo assigns primary and secondary show types; unless the combination of primary and secondary show type is frequent enough, only the primary show type is used (Tab. 4). The TV show database has the advantage that we were able to collect a large and varied database with little effort. The same classifier as in Sec. 2 has been used; however, dialogue acts have not been detected since the data contains a lot of noise, is not necessarily conversational, and speaker identities cannot be determined easily. Detection results for TV shows can be seen in Tab. 5. It may be noted that adding a lot of keywords does improve the detection result but not so much the entropy. It may therefore be assumed that there is a limited dependence between topic and genre, which is not really a surprise since there are many shows with weekly sequels and there may be some true repeats.</Paragraph>
<Paragraph position="1"> (Table caption:) Feature, accuracy and entropy: using the network described in Sec. 2, the show type was detected. If there is a number in the word column, the word feature is being used; the number indicates how many word/part-of-speech pairs are in the vocabulary in addition to the parts of speech.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. EMOTION AND DOMINANCE </SectionTitle>
<Paragraph position="0"> Emotions are displayed in a variety of gestures, some of which are oral and may be detected by automated methods from the audio channel (Polzin, 1999). Using only verbal information, the emotions happy, excited and neutral can be detected on the meeting database with 88.1% accuracy, while always picking neutral yields 83.6%. This result can be improved to 88.6% by adding pitch and power information.</Paragraph>
<Paragraph position="1"> While these experiments were conducted at the utterance level, emotions can be extended to topical segments. For that purpose, the emotions of the individual utterances are entered into a histogram over the segment, and the resulting vectors are clustered automatically. The resulting clusters roughly correspond to "neutral", "a little happy" and "somewhat excited" segments. Using the word-level classifier for emotions, the segments can be classified automatically into these categories with 83.3% accuracy, while the baseline is 68.9%.</Paragraph>
<Paragraph position="2"> The entropy reduction achieved by automatically detected emotional activities is 0.3 bit [8]. A similar attempt can be made for dominance (Linell et al., 1988) distributions: dominance is easy to understand for the user of an information access system, and it can be determined automatically with high accuracy.</Paragraph>
<Paragraph position="3"> [8] A similar classification result for emotions on the utterance level has been obtained by just using the laughter vs. non-laughter tokens of the transcript as the input. This may indicate that (a) the index should really be the amount of laughter in the conversational segment and that (b) emotions might not be displayed very overtly in meetings. These results, however, would require a wider sampling of meeting types to be generally acceptable.</Paragraph> </Section> </Paper>