<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3405">
  <Title>Shallow Discourse Structure for Action Item Detection</Title>
  <Section position="6" start_page="32" end_page="33" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0">  Wetrainedindividualclassifiersforeachoftheutterance sub-classes, and cross-validated as before. For agreement utterances, we used a naive n-gram classifier similar to that of (Webb et al., 2005) for dialogue act detection, scoring utterances via a set of most predictive n-grams of length 1-3 and making a classification decision by comparing the maximum score to a threshold (where the n-grams, their scores and the threshold are automatically extracted from the training data). For owner, timeframe and task description utterances, we used SVMs as before, using word unigrams as features (2- and 3-grams gave no improvement - probably due to the small amount of training data). Performance varied greatly by sub-class (see Table 2), with some (e.g. agreement) achieving higher accuracy than the baseline flat classifications, but others being worse.</Paragraph>
    <Paragraph position="1"> As there is now significantly less training data available to each sub-class than there was for all utterances grouped together in the baseline experiment, worse performance might be expected; yet some sub-classes perform better. The worst performing class is owner. Examination of the data shows that owner utterances are more likely than other classes to be assigned to more than one category; they may therefore have more feature overlap with other classes, leading to less accurate classification.</Paragraph>
    <Paragraph position="2"> Use of relevant sub-strings for training (rather than full utterances) may help; as may part-of-speech information - while proper names may be useful features, the name tokens themselves are sparse and may be better substituted with a generic tag.</Paragraph>
    <Paragraph position="3">  Even with poor performance for some of the subclassifiers, we should still be able to combine them to get a benefit as long as their true positives correlate better than their false positives (intuitively, if they make mistakes in different places). So far we have only conducted an initial naive experiment, in whichwecombinetheindividualclassifierdecisions in a weighted sum over a window (currently set to 5 utterances). If the sum over the window reaches a given threshold, we hypothesize an action item, and take the highest-confidence utterance given by each sub-classifier in that window to provide the corresponding property. As shown in Table 3, this gives reasonable performance on most meetings, although it does badly on meeting 5 (apparently because no explicit agreement takes place, while our manual weights emphasized agreement).1 Most encouragingly, the correct examples provide some useful &amp;quot;best&amp;quot; sub-class utterances, from which the relevant properties could be extracted.</Paragraph>
    <Paragraph position="4"> These results can probably be significantly improved: rather than sum over the binary classification outputs of each classifier, we can use their confidence scores or posterior probabilities, and learn  the combination weights to give a more robust approach. There is still a long way to go to evaluate this approach over more data, including the accuracy and utility of the resulting sub-class utterance hypotheses.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML