<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0613">
  <Title>Probabilistic Head-Driven Parsing for Discourse Structure</Title>
  <Section position="7" start_page="157" end_page="157" type="evalu">
    <SectionTitle>
6 Results
</SectionTitle>
    <Paragraph position="0"> For our experiments, we use a standard chart parsing algorithm with beam search that allows a maximum of 500 edges per cell. The gure of merit for the cut-off combines the probability of an edge with the prior probability of its label, head and head tag. Hypothesized trees that do not conform to some simple discourse tree constraints are also pruned.3 The parser is given the elementary discourse units as de ned in the corpus. These units correspond directly to the utterances already de ned in Redwoods and we can thus easily access their complete syntactic analyses directly from the treebank.</Paragraph>
    <Paragraph position="1"> The parser is also given the correct utterance moods to start with. This is akin to getting the correct part-of-speech tags in syntactic parsing. We do this since we are using the parser for semi-automated annotation. Tagging moods for a new discourse is a very quick and reliable task for the human. With them the parser can produce the more complex hierarchical structure more accurately than if it had to guess them with the potential to dramatically reduce the time to annotate the discourse 3E.g., nodes can have at most one child with a relation label. structures of further dialogues. Later, we will create a sentence mood tagger that presents an n-best list for the parser to start with, from the tag set ind, int, imp, irr, pause, and pls.</Paragraph>
    <Paragraph position="2"> Models are evaluated by using a leave-one-out strategy, in which each dialogue is parsed after training on all the others. We measure labelled and unlabelled performance with both the standard PARSEVAL metric for comparing spans in trees and a relation-based metric that compares the SDRS's produced by the trees. The latter gives a more direct indication of the accuracy of the actual discourse logical form, but we include the former to show performance using a more standard measure. Scores are globally determined rather than averaged over all individual dialogues.</Paragraph>
    <Paragraph position="3"> For the relations metric, the relations from the derived discourse tree for the test dialogue are extracted; then, the overlap with relations from the corresponding gold standard tree is measured. For labelled performance, the model is awarded a point for a span or relation which has the correct discourse relation label and both arguments are correct. For unlabelled, only the arguments need to be correct.4 Figure 5 provides the f-scores5 of the various models and compares them against those of a base-line model and annotators. All differences between models are signi cant, using a pair-wise t-test at 99.5% con dence, except that between the baseline and Model 2 for unlabelled relations.</Paragraph>
    <Paragraph position="4"> The baseline model is based on the most frequent way of attaching the current utterance to its dia4This is a much stricter measure than one which measures relations between a head and its dependents in syntax because it requires two segments rather than two heads to be related correctly. For example, Model 4's labelled and unlabelled relation f-scores using segments are 43.2% and 67.9%, respectively; on a head-to-head basis, they rise to 50.4% and 81.8%.</Paragraph>
    <Paragraph position="5">  standard utterance moods. For this corpus, this results in a baseline which is a right-branching structure, where the relation Plan-Elaboration is used if the utterance is indicative, Question-Elaboration if it is interrogative, and Request-Elaboration if it is imperative. The baseline also appropriately handles ignorable utterances (i.e, those with the mood labels irrelevant, pause, or pleasantry).</Paragraph>
    <Paragraph position="6"> The baseline performs poorly on labelled relations (7.4%), but is more competitive on unlabelled ones (53.3%). The main reason for this is that it takes no segmentation risks. It simply relates every non-ignorable utterance to the previous one, which is indeed a typical con guration with common content-level relations like Continuation. The generative models take risks that allow them to correctly identify more complex segments at the cost of missing some of these easier cases.</Paragraph>
    <Paragraph position="7"> Considering instead the PARSEVAL scores for the baseline, the labelled performance is much higher (14.7%) and the unlabelled is much lower (33.8%) than for relations. The difference in labelled performance is due to the fact that the intentional-level relations used in the baseline often have arguments that are multi-utterance segments in the gold standard. These are penalized in the relations comparison, but the spans used in PARSEVAL are blind to them. On the other hand, the unlabelled score drops considerably this is due to poor performance on dialogues whose gold standard analyses do not have a primarily right-branching structure.</Paragraph>
    <Paragraph position="8"> Model 1 performs most poorly of all the models.</Paragraph>
    <Paragraph position="9"> It is signi cantly better than the baseline on labelled relations, but signi cantly worse on unlabelled relations. All its features are derived from the structure of the trees, so it gets no clues from speaker turns or the semantic content of utterances.</Paragraph>
    <Paragraph position="10"> Model 2 brings turns and larger context via the ST and HCR features, respectively. This improves segmentation over Model 1 considerably, so that the model matches the baseline on unlabelled relations and beats it signi cantly on labelled relations.</Paragraph>
    <Paragraph position="11"> The inclusion of the TC feature in Model 3 brings large (and signi cant) improvements over Model 2.</Paragraph>
    <Paragraph position="12"> Essentially, this feature has the effect of penalizing hypothesized content-level segments that span several turns. This leads to better overall segmentation. Finally, Model 4 incorporates the domain-based TM feature that summarizes some of the semantic content of utterances. This extra information improves the determination of labelled relations. For example, it is especially useful in distinguishing a Plan-Correction from a Plan-Elaboration.</Paragraph>
    <Paragraph position="13"> The overall trend of differences between PARSEVAL and relations scoring show that PARSEVAL is tougher on overall segmentation and relations scoring is tougher on whether a model got the right arguments for each labelled relation. It is the latter that ultimately matters for the discourse structures produced by the parser to be useful; nonetheless, the PARSEVAL scores do show that each model progressively improves on capturing the trees themselves, and that even Model 1 as a syntactic model is far superior to the baseline for capturing the overall form of the trees.</Paragraph>
    <Paragraph position="14"> We also compare our best model against two upperbounds: (1) inter-annotator agreement on ten dialogues that were annotated independently and (2) the best annotator against the gold standard agreed upon after the independent annotation phase.</Paragraph>
    <Paragraph position="15"> For the rst, the labelled/unlabelled relations f-scores are 50.3%/73.0% and for the latter, they are 75.3%/84.0% this is similar to the performance on other discourse annotation projects, e.g., Carlson et al. (2001). On the same ten dialogues, Model 4 achieves 42.3%/64.9%.</Paragraph>
    <Paragraph position="16"> It is hard to compare these models with Marcu's (1999) rhetorical parsing model. Unlike Marcu, we did not use a variety of corpora, have a smaller training corpus, are analysing dialogues as opposed to monologues, have a larger class of rhetorical relations, and obtain the elementary discourse units  from the Redwoods annotations rather than estimating them. Even so, it is interesting that the scores reported in Marcu (1999) for labelled and unlabelled relations are similar to our scores for Model 4.</Paragraph>
  </Section>
class="xml-element"></Paper>