<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1314">
  <Title>Automatically Detecting Action Items in Audio Meeting Recordings</Title>
  <Section position="5" start_page="96" end_page="97" type="metho">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> We use action item annotations produced by Gruenstein et al. (2005). This corpus provides topic hierarchy and action item annotations for the ICSI meeting corpus as well as other corpora of meetings; due to the ready availability of other types of annotations for the ICSI corpus, we focus solely on the annotations for these meetings. Figure 1 gives an example of the annotations.</Paragraph>
    <Paragraph position="1"> The corpus covers 54 ICSI meetings annotated by two human annotators, and several other meetings annotated by one annotator. Of the 54 meetings with dual annotations, 6 contain no action items. For this study we consider only those meetings which contain action items and which are annotated by both annotators.</Paragraph>
    <Paragraph position="2"> As the annotations were produced by a small number of untrained annotators, an immediate question is the degree of consistency and reliability. Inter-annotator agreement is typically measured by the kappa statistic (Carletta, 1996), de- null ment) across the 54 ICSI meetings tagged by two annotators. Of the two meetings with k = 1.0, one has only two action items and the other only four.</Paragraph>
    <Paragraph position="3"> fined as:</Paragraph>
    <Paragraph position="5"> where P(O) is the probability of the observed agreement, and P(E) the probability of the &amp;quot;expected agreement&amp;quot; (i.e., under the assumption the two sets of annotations are independent). The kappa statistic ranges from [?]1 to 1, indicating perfect disagreement and perfect agreement, respectively. null Overall inter-annotator agreement as measured by k on the action item corpus is poor, as noted in Purver et al. (2006), with an overall k of 0.364 and values for individual meetings ranging from 1.0 to less than zero. Figure 2 shows the distribution of k across all 54 annotated ICSI meetings.</Paragraph>
    <Paragraph position="6"> To reduce the effect of poor inter-annotator agreement, we focus on the top 15 meetings as ranked by k; the minimum k in this set is 0.435.</Paragraph>
    <Paragraph position="7"> Although this reduces the total amount of data available, our intention is that this subset of the most consistent annotations will form a higher-quality corpus.</Paragraph>
    <Paragraph position="8"> While the corpus classifies related action item utterances into action item &amp;quot;groups,&amp;quot; in this study we wish to treat the annotations as simply binary attributes. Visual analysis of annotations for several meetings outside the set of chosen 15 suggests that the union of the two sets of annotations yields the most consistent resulting annotation; thus, for this study, we consider an utterance to be an action item if at least one of the annotators marked it as such.</Paragraph>
    <Paragraph position="9"> The 15-meeting subset contains 24,250 utter-</Paragraph>
    <Paragraph position="11"> - X Even if it's really crude.</Paragraph>
    <Paragraph position="12"> - - OK? So, you know, - - here a- here are-- X So we're supposed to @@ about features and whatnot, and- null &amp;quot;@@&amp;quot; signifies an unintelligible word. This transcript is from an ICSI meeting recording and has k = 0.373, ranking it 16th out of 54 meetings in annotator agreement.  ances across the 15 selected meetings. There are 24,250 utterances total, 590 of which (2.4%) are action item utterances.</Paragraph>
    <Paragraph position="13"> ances total; under the union strategy above, 590 of these are action item utterances. Figure 3 shows the number of action item utterances and the number of total utterances in the 15 selected meetings. One noteworthy feature of the ICSI corpus underlying the action item annotations is the &amp;quot;digit reading task,&amp;quot; in which the participants of meetings take turns reading aloud strings of digits. This task was designed to provide a constrainedvocabulary training set of speech recognition developers interested in multi-party speech. In this study we did not remove these sections; the net effect is that some portions of the data consist of these fairly atypical utterances.</Paragraph>
  </Section>
  <Section position="6" start_page="97" end_page="98" type="metho">
    <SectionTitle>
4 Experimental methodology
</SectionTitle>
    <Paragraph position="0"> We formulate the action item detection task as one of binary classification of utterances. We apply a maximum entropy (maxent) model (Berger et al., 1996) to this task.</Paragraph>
    <Paragraph position="1"> Maxent models seek to maximize the conditional probability of a class c given the observations X using the exponential form</Paragraph>
    <Paragraph position="3"> where fi,c(X) is the ith feature of the data X in class c, li,c is the corresponding weight, and Z(X) is a normalization term. Maxent models choose the weights li,c so as to maximize the entropy of the induced distribution while remaining consistent with the data and labels; the intuition is that such a distribution makes the fewest assumptions about the underlying data.</Paragraph>
    <Paragraph position="4"> Our maxent model is regularized by a quadratic prior and uses quasi-Newton parameter optimization. Due to the limited amount of training data (see Section 3) and to avoid overfitting, we employ 10-fold cross validation in each experiment.</Paragraph>
    <Paragraph position="5"> To evaluate system performance, we calculate the F measure (F) of precision (P) and recall (R), defined as:</Paragraph>
    <Paragraph position="7"> where A is the set of utterances marked as action items by the system, and C is the set of (all) correct action item utterances.</Paragraph>
    <Paragraph position="8">  The use of precision and recall is motivated by the fact that the large imbalance between positive and negative examples in the corpus (Section 3) means that simpler metrics like accuracy are insufficient--a system that simply classifies every utterance as negative will achieve an accuracy of 97.5%, which clearly is not a good reflection of desired behavior. Recall and F measure for such a system, however, will be zero.</Paragraph>
    <Paragraph position="9"> Likewise, a system that flips a coin weighted in proportion to the number of positive examples in the entire corpus will have an accuracy of 95.25%, but will only achieve P = R = F = 2.4%.</Paragraph>
  </Section>
  <Section position="7" start_page="98" end_page="99" type="metho">
    <SectionTitle>
5 Features
</SectionTitle>
    <Paragraph position="0"> As noted in Section 3, we treat the task of producing action item annotations as a binary classification task. To this end, we consider the following sets of features. (Note that all real-valued features were range-normalized so as to lie in [0,1] and that no binning was employed.)</Paragraph>
    <Section position="1" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
5.1 Immediate lexical features
</SectionTitle>
      <Paragraph position="0"> We extract word unigram and bigram features from the transcript for each utterance. We normalize for case and for certain contractions; for example, &amp;quot;I'll&amp;quot; is transformed into &amp;quot;I will&amp;quot;. Note that these are oracle features, as the transcripts are human-produced and not the product of automatic speech recognizer (ASR) system output. null</Paragraph>
    </Section>
    <Section position="2" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
5.2 Contextual lexical features
</SectionTitle>
      <Paragraph position="0"> We extract word unigram and bigram features from the transcript for the previous and next utterances across all speakers in the meeting.</Paragraph>
    </Section>
    <Section position="3" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
5.3 Syntactic features
</SectionTitle>
      <Paragraph position="0"> Under the hypothesis that action item utterances will exhibit particular syntactic patterns, we use a conditional Markov model part-of-speech (POS) tagger (Toutanova and Manning, 2000) trained on the Switchboard corpus (Godfrey et al., 1992) to tag utterance words for part of speech. We use the following binary POS features:  * Presence of UH tag, denoting the presence of an &amp;quot;interjection&amp;quot; (including filled pauses, unfilled pauses, and discourse markers).</Paragraph>
      <Paragraph position="1"> * Presence of MD tag, denoting presence of a modal verb.</Paragraph>
      <Paragraph position="2"> * Number of NN* tags, denoting the number of nouns.</Paragraph>
      <Paragraph position="3"> * Number of VB* tags, denoting the number of verbs.</Paragraph>
      <Paragraph position="4"> * Presence of VBD tag, denoting the presence of a past-tense verb.</Paragraph>
    </Section>
    <Section position="4" start_page="98" end_page="98" type="sub_section">
      <SectionTitle>
5.4 Prosodic features
</SectionTitle>
      <Paragraph position="0"> Under the hypothesis that action item utterances will exhibit particular prosodic behavior--for example, that they are emphasized, or are pitched a certain way--we performed pitch extraction using an auto-correlation method within the sound analysis package Praat (Boersma and Weenink, 2005).</Paragraph>
      <Paragraph position="1"> From the meeting audio files we extract the following prosodic features, on a per-utterance basis: (pitch measures are in Hz; intensity in energy; normalization in all cases is z-normalization)  * Pitch and intensity range, minimum, and maximum.</Paragraph>
      <Paragraph position="2"> * Pitch and intensity mean.</Paragraph>
      <Paragraph position="3"> * Pitch and intensity median (0.5 quantile).</Paragraph>
      <Paragraph position="4"> * Pitch and intensity standard deviation.</Paragraph>
      <Paragraph position="5"> * Pitch slope, processed to eliminate halving/doubling. null * Number of voiced frames.</Paragraph>
      <Paragraph position="6"> * Duration-normalized pitch and intensity ranges and voiced frame count.</Paragraph>
      <Paragraph position="7"> * Speaker-normalized pitch and intensity means.</Paragraph>
    </Section>
    <Section position="5" start_page="98" end_page="99" type="sub_section">
      <SectionTitle>
5.5 Temporal features
</SectionTitle>
      <Paragraph position="0"> Under the hypothesis that the length of an utterance or its location within the meeting as a whole will determine its likelihood of being an action item--for example, shorter statements near the end of the meeting might be more likely to be action items--we extract the duration of each utterance and the time from its occurrence until the end of the meeting. (Note that the use of this feature precludes operating in an online setting, where the end of the meeting may not be known in advance.) 5.6 General semantic features Under the hypothesis that action item utterances will frequently involve temporal expressions--e.g.</Paragraph>
      <Paragraph position="1"> &amp;quot;Let's have the paper written by next Tuesday&amp;quot;-we use Identifinder (Bikel et al., 1997) to mark temporal expressions (&amp;quot;TIMEX&amp;quot; tags) in utterance transcripts, and create a binary feature denoting  the existence of a temporal expression in each utterance. null Note that as Identifinder was trained on broadcast news corpora, applying it to the very different domain of multi-party meeting transcripts may not result in optimal behavior.</Paragraph>
    </Section>
    <Section position="6" start_page="99" end_page="99" type="sub_section">
      <SectionTitle>
5.7 Dialog-specific semantic features
</SectionTitle>
      <Paragraph position="0"> Under the hypothesis that action item utterances may be closely correlated with specific dialog act tags, we use the dialog act annotations from the ICSI Meeting Recorder Dialog Act Corpus.</Paragraph>
      <Paragraph position="1"> (Shriberg et al., 2004) As these DA annotations do not correspond one-to-one with utterances in the ICSI corpus, we align them in the most liberal way possible, i.e., if at least one word in an utterance is annotated for a particular DA, we mark the entirety of that utterance as exhibiting that DA.</Paragraph>
      <Paragraph position="2"> We consider both fine-grained and coarse-grained dialog acts.1 The former yields 56 features, indicating occurrence of DA tags such as &amp;quot;appreciation,&amp;quot; &amp;quot;rhetorical question,&amp;quot; and &amp;quot;task management&amp;quot;; the latter consists of only 7 classes--&amp;quot;disruption,&amp;quot; &amp;quot;backchannel,&amp;quot; &amp;quot;filler,&amp;quot; &amp;quot;statement,&amp;quot; &amp;quot;question,&amp;quot; &amp;quot;unlabeled,&amp;quot; and &amp;quot;unknown.&amp;quot; null</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="99" end_page="99" type="metho">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> The final performance for the maxent model across different feature sets is given in Table 1.</Paragraph>
    <Paragraph position="1"> F measures scores range from 13.81 to 31.92.</Paragraph>
    <Paragraph position="2"> Figure 4 shows the interpolated precision-recall curves for several of these feature sets; these graphs display the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.</Paragraph>
    <Paragraph position="3"> Although ideally, all combinations of features should be evaluated separately, the large number of features in this precludes this strategy. The combination of features explored here was chosen so as to start from simpler features and successively add more complex ones. We start with transcript features that are immediate and context-independent (&amp;quot;unigram&amp;quot;, &amp;quot;bigram&amp;quot;, &amp;quot;TIMEX&amp;quot;); then add transcript features that require context (&amp;quot;temporal&amp;quot;, &amp;quot;context&amp;quot;), then non-transcript (i.e. audio signal) features (&amp;quot;prosodic&amp;quot;), and finally add features that require both the transcript and the audio signal (&amp;quot;DA&amp;quot;).</Paragraph>
    <Paragraph position="4"> 1We use the map 01 grouping defined in the MRDA corpus to collapse the tags.</Paragraph>
    <Paragraph position="5">  gests the level of precision that can be achieved if one is willing to sacrifice some recall, and vice versa.</Paragraph>
    <Paragraph position="6"> In total, nine combinations of features were considered. In every case except that of syntactic and coarse-grained dialog act features, the additional features improved system performance and these features were used in succeeding experiments. Syntactic and coarse-grained DA features resulted in a drop in performance and were discarded from succeeding systems.</Paragraph>
  </Section>
  <Section position="9" start_page="99" end_page="101" type="metho">
    <SectionTitle>
7 Analysis
</SectionTitle>
    <Paragraph position="0"> The unigram and bigram features provide significant discriminative power. Tables 2 and 3 give the top features, as determined by weight, for the models trained only on these features. It is clear from Table 3 that the detailed end-of-utterance punctuation in the human-generated transcripts provide valuable discriminative power.</Paragraph>
    <Paragraph position="1"> The performance gain from adding TIMEX tagging features is small and likely not statistically significant. Post-hoc analysis of the TIMEX tagging (Section 5.6) suggests that Identifinder tagging accuracy is quite plausible in general, but exhibits an unfortunate tendency to mark the digit-reading (see Section 3) portion of the meetings as temporal expressions. It is plausible that removing these utterances from the meetings would allow this feature a higher accuracy.</Paragraph>
    <Paragraph position="2"> Based on the low feature weight assigned, utterance length appears to provide no significant value to the model. However, the time until the meeting is over ranks as the highest-weighted feature in the unigram+bigram+TIMEX+temporal feature set. This feature is thus responsible for the 39.25%  the preceding feature set, and the number of features, across all feature sets tried. Italicized lines denote the addition of features which do not improve performance; these are omitted from succeeding systems.  action item), and weight for the top ten features in the unigram-only model. &amp;quot;Nine&amp;quot; and &amp;quot;five&amp;quot; are common words in the digit-reading task (see Section 3).</Paragraph>
    <Paragraph position="3">  the top ten features in the unigram+bigram model.</Paragraph>
    <Paragraph position="4"> The symbol * denotes the beginning of an utterance and $ the end. All of the top ten features are bigrams except for the unigrams &amp;quot;email&amp;quot;.</Paragraph>
    <Paragraph position="5">  the top ten features on the best-performing model.</Paragraph>
    <Paragraph position="6"> Bigrams labeled &amp;quot;prev.&amp;quot; and &amp;quot;next&amp;quot; correspond to the lexemes from previous and next utterances, respectively. Prosodic features labeled as &amp;quot;norm.&amp;quot; have been normalized on a per-speaker basis.</Paragraph>
    <Paragraph position="7"> boost in F measure in row 3 of Table 1.</Paragraph>
    <Paragraph position="8"> The addition of part-of-speech tags actually decreases system performance. It is unclear why this is the case. It may be that the unigram and bi-gram features already adequately capture any distinctions these features make, or simply that these features are generally not useful for distinguishing action items.</Paragraph>
    <Paragraph position="9"> Contextual features, on the other hand, improve system performance significantly. A post-hoc analysis of the action item annotations makes clear why: action items are often split across multiple utterances (e.g. as in Figure 1), only a portion of which contain lexical cues sufficient to distinguish them as such. Contextual features thus allow utterances immediately surrounding these &amp;quot;obvious&amp;quot; action items to be tagged as well.  Prosodic features yield a 7.10% increase in F measure, and analysis shows that speakernormalized intensity and pitch, and the range in intensity of an utterance, are valuable discriminative features. The subsequent addition of coarse-grained dialog act tags does not further improve system performance. It is likely this is due to reasons similar to those for POS tags--either the categories are insufficient to distinguish action item utterances, or whatever usefulness they provide is subsumed by other features.</Paragraph>
    <Paragraph position="10"> Table 4 shows the feature weights for the top-ranked features on the best-scoring system. The addition of the fine-grained DA tags results in a significant increase in performance.The F measure of this best feature set is 31.92%.</Paragraph>
  </Section>
  <Section position="10" start_page="101" end_page="101" type="metho">
    <SectionTitle>
8 Conclusions
</SectionTitle>
    <Paragraph position="0"> We have shown that several classes of features are useful for the task of action item annotation from multi-party meeting corpora. Simple lexical features, their contextual versions, the time until the end of the meeting, prosodic features, and fine-grained dialog acts each contribute significant increases in system performance.</Paragraph>
    <Paragraph position="1"> While the raw system performance numbers of Table 1 are low relative to other, better-studied tasks on other, more mature corpora, we believe the relative usefulness of the features towards this task is indicative of their usefulness on more consistent annotations, as well as to related tasks.</Paragraph>
    <Paragraph position="2"> The Gruenstein et al. (2005) corpus provides a valuable and necessary resource for research in this area, but several factors raise the question of annotation quality. The low k scores in Section 3 are indicative of annotation problems. Post-hoc error analysis yields many examples of utterances which are somewhat difficult to imagine as possible, never mind desirable, to tag. The fact that the extremely useful oracular information present in the fine-grained DA annotation does not raise performance to the high levels that one might expect further suggests that the annotations are not ideal--or, at the least, that they are inconsistent with the DA annotations.2 This analysis is consistent with the findings of Purver et al. (2006), who achieve an F measure of 2Which is not to say they are devoid of significant value-training and testing our best system on the corpus with the 590 positive classifications randomly shuffled across all utterances yields an F measure of only 4.82.</Paragraph>
    <Paragraph position="3"> less than 25% when applying SVMs to the classification task to the same corpus, and motivate the development of a new corpus of action item annotations. null</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML