<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1031">
  <Title>Towards Finding and Fixing Fragments: Using ML to Identify Non-Sentential Utterances and their Antecedents in Multi-Party Dialogue</Title>
  <Section position="8" start_page="250" end_page="252" type="evalu">
    <SectionTitle>
4.2 Results
</SectionTitle>
    <Paragraph position="0"> Tables 3-5 show the results of the experiments.</Paragraph>
    <Paragraph position="1"> The entries are roughly sorted by classifier performance; for most classifiers and data sets we show, for each task, the performance of the baseline, intermediate feature set(s), and the full feature set; for the rest we show only the best-performing setting. We also indicate whether a balanced or unbalanced data set was used. For example, the first three lines in Table 3 report MaxEnt results on a balanced data set for the fragment task, giving scores for the baseline, baseline+nrb+bft, and the full feature set.</Paragraph>
    <Paragraph position="2"> We begin by discussing the fragment task. As Table 3 shows, the three main classifiers perform roughly equivalently. Re-balancing the data, as expected, boosts recall at the cost of precision. For all settings (i.e., combinations of data set, feature set, and classifier) except re-balanced MaxEnt, the baseline (verb in b yes/no, and length of b) already has some success in identifying fragments, but adding the remaining features still boosts performance.</Paragraph>
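The re-balancing mentioned above can be illustrated as random undersampling of the majority class. The paper does not specify its exact re-balancing procedure, so the following is only a minimal sketch under that assumption:

```python
import random

def rebalance(data, seed=0):
    """Undersample the majority class to a 50/50 split.

    data: list of (features, label) pairs with binary labels 0/1.
    Illustrative only; the paper's actual re-balancing method
    is not described in this section.
    """
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced
```

Discarding majority-class examples in this way raises the classifier's recall on the minority class (fragments) while typically lowering precision, which matches the trade-off reported in Table 3.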
    <Paragraph position="3"> Interestingly, having the string available (condition s.s; SLIPPER with set-valued features) does not help SLIPPER much.</Paragraph>
    <Paragraph position="4"> Overall, the performance on this task is not great.</Paragraph>
    <Paragraph position="5"> Why is that? An error analysis reveals two problems. Among the false negatives, there is a high number of fragments like &quot;yeah&quot; and &quot;mhm&quot;, which in their particular contexts were answers to questions, but which occur much more often as backchannels (true negatives). The classifier, lacking information about the context, cannot distinguish between these cases and goes for the majority decision. Among the false positives, we find utterances that are indeed non-sentential, but for which no antecedent was marked (as in (3) above), i.e., which are not fragments in our narrow sense. It seems, then, that the required distinctions cannot be learnt reliably from looking at the fragments alone.</Paragraph>
    <Paragraph position="6"> The antecedent task was handled more satisfactorily, as Table 4 shows. For this task, a naïve baseline (&quot;always take the previous utterance&quot;) already performs relatively well; however, all classifiers were able to improve on this, with a slight advantage for the MaxEnt model (f(0.5) = 0.76). As the entry for MaxEnt shows, adding to the baseline features information about whether a is a question already boosts performance considerably. An analysis of the predictions of this model indeed shows that it captures question-answer pairs quite well. Adding the similarity feature god then gives the model information about semantic relatedness, which, as hypothesised, captures elaboration-type relations (as in (1-b) and (1-c) above). Structural information (iqu) further improves the model; the remaining features, however, only seem to add interfering information, since performance with the full feature set is worse.</Paragraph>
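The f(0.5) score reported above is, assuming the standard van Rijsbergen F-measure, the weighted harmonic mean of precision and recall with beta = 0.5, which weights precision twice as heavily as recall. A minimal sketch:

```python
def f_score(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall.

    beta < 1 favours precision; beta > 1 favours recall.
    Assumes the standard F-beta definition, since the paper
    does not define f(0.5) explicitly in this section.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

Under this reading, a model with high precision but modest recall can still attain the reported f(0.5) = 0.76, which fits a task where selecting the correct antecedent matters more than covering every candidate.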
    <Paragraph position="7"> If one of the problems of the fragment task was that contextual information is required to distinguish fragments from backchannels, the hope could be that in the combined task the classifier would be able to capture these cases. However, the performance of all classifiers on this task is not satisfactory, as Table 5 shows; in fact, it is even slightly worse than the performance on the fragment task alone. We speculate that instead of cancelling out mistakes in the other part of the task, the two goals (let b be a fragment, and a a typical antecedent) interfere during optimisation of the rules.</Paragraph>
    <Paragraph position="8"> To summarise, we have shown that the task of identifying the antecedent of a given fragment is learnable, using a feature set that combines structural and lexical features; in particular, the inclusion of a measure of semantic relatedness, computed via queries to an internet search engine, proved helpful. The task of identifying (resolution-via-identity) fragments, however, is hindered by the high number of non-sentential utterances that can be confused with the kinds of fragments we are interested in. Here it could help to have a method that identifies and filters out backchannels, presumably using a much more local mechanism (as proposed, for example, in (Traum, 1994)). Similarly, performance on the combined task is low, also due to a high number of confusions of backchannels and fragments. We discuss an alternative set-up below.</Paragraph>
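One common way to derive semantic relatedness from search-engine queries is pointwise mutual information over page counts. The exact measure behind the paper's god feature is not given in this section, so the following is only an illustrative stand-in; the hit counts are assumed inputs that would come from search-engine queries for each term and for their conjunction:

```python
import math

def web_pmi(hits_a, hits_b, hits_ab, total_pages):
    """Pointwise mutual information from web page counts.

    hits_a, hits_b: pages matching each term alone;
    hits_ab: pages matching both terms (e.g. an AND query);
    total_pages: assumed size of the search engine's index.
    Hypothetical stand-in for the paper's relatedness feature,
    not its actual formula.
    """
    if min(hits_a, hits_b, hits_ab) <= 0:
        return 0.0
    p_a = hits_a / total_pages
    p_b = hits_b / total_pages
    p_ab = hits_ab / total_pages
    # PMI > 0: terms co-occur more often than chance predicts.
    return math.log(p_ab / (p_a * p_b), 2)
```

A score well above zero would indicate that two utterances share vocabulary that co-occurs on the web more often than chance, which is the kind of signal the text credits with capturing elaboration-type relations.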
  </Section>
</Paper>