<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0314">
<Title>Automatically Extracting</Title>
<Section position="11" start_page="111" end_page="113" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> In order to determine whether the mapping we propose here results in accurate grounding annotation, we wrote a Perl script to perform the mapping on SGML-format files containing dialogues annotated with the BF tags. We used the script on a set of four TRAINS-93 dialogues containing a total of 325 utterances that had been previously tagged with BF tags (HA95; CA97).</Paragraph>
<Paragraph position="1"> The procedure for tagging the dialogues with BF tags was to have one annotator segment and annotate the dialogue, pass the segmented (but untagged) dialogue to a second annotator to tag independently, and finally have the two annotators meet and produce a reconciled version of the tagged dialogue.</Paragraph>
<Paragraph position="2"> To evaluate the quality of the tags output by the script, we had a human annotator tag the same four TRAINS-93 dialogues with grounding acts. Our grounding annotator is a computational linguist familiar with the concept of grounding but with no prior knowledge of Traum's coding scheme, the BF coding scheme, or the mapping scheme we were using. Before performing the annotation task, the annotator read Traum's descriptions of the grounding tags, tagged a preliminary dialogue (found in Traum's dissertation), and compared the tags he assigned to those assigned by Traum.</Paragraph>
<Paragraph position="5"> Tables 3 through 6 show the similarity of the human annotator's grounding tags to those derived automatically. The analysis is split into two parts to deal with the ability of annotators to give an utterance multiple labels. Tables 3 and 4 show a per-tag analysis. If both annotators (the human and the Perl script) gave a tag such as INIT to an utterance (possibly in addition to other tags), then it is counted as agreement with respect to the INIT tag. Table 3 shows the number of times each tag appeared and the number of times there was disagreement.</Paragraph>
<Paragraph position="6"> Table 4 shows PA (percent agreement), PE (percent expected agreement), and kappa for each tag. PA is simply the total agreement (either on the presence or absence of a tag in an utterance) divided by the total number of utterances. If N = the number of utterances, TotalInit = the number of utterances tagged as INIT, and TotalNone = the number of utterances not tagged as INIT, then PE = (TotalInit/2N)^2 + (TotalNone/2N)^2. In this case, there are 2N data points, the two sets of dialogues produced by the two annotators. Kappa is defined as K = (PA - PE)/(1 - PE). See (Car96; SC88) for more details on these measures and the significance levels listed. Table 5 presents the various combinations of grounding tags seen in the corpus. Disagreement is counted whenever two utterances do not have exactly the same set of tags. Since the groups of tags are mutually exclusive, we can calculate PA, PE, and kappa over all the tag groups. If agree = the number of utterances where the annotators assigned the same set of tags, then PA = agree/N. If Cj is the number of times a set of tags such as CANCEL or INIT+ACK was assigned, then PE = Σ_{j=1..15} (Cj/2N)^2. The definition of kappa remains the same. Given these definitions, PA = 0.7477, PE = 0.2876, and kappa = 0.6458.</Paragraph>
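To make these measures concrete, the following Python sketch computes the per-tag and all-or-nothing agreement statistics from two annotators' tag sets, following the definitions above. The function names and the toy tag sets are hypothetical illustrations; they are not part of the Perl script used in the experiment.

from collections import Counter

def per_tag_kappa(human, script, tag):
    """Per-tag analysis: agreement on the presence or absence of one tag."""
    n = len(human)                            # N = number of utterances
    agree = sum((tag in h) == (tag in s) for h, s in zip(human, script))
    pa = agree / n                            # PA = total agreement / N
    total_with = sum(tag in h for h in human) + sum(tag in s for s in script)
    total_without = 2 * n - total_with        # 2N data points over both annotators
    pe = (total_with / (2 * n)) ** 2 + (total_without / (2 * n)) ** 2
    return (pa - pe) / (1 - pe)               # K = (PA - PE) / (1 - PE)

def all_or_nothing_kappa(human, script):
    """All-or-nothing analysis: agreement only when the full tag sets match."""
    n = len(human)
    pa = sum(h == s for h, s in zip(human, script)) / n
    counts = Counter(frozenset(t) for t in human + script)   # the Cj, over 2N tag sets
    pe = sum((c / (2 * n)) ** 2 for c in counts.values())
    return (pa - pe) / (1 - pe)

# Hypothetical toy data: three utterances labeled by the two "annotators".
human = [{"INIT"}, {"INIT", "ACK"}, {"CANCEL"}]
script = [{"INIT"}, {"ACK"}, {"CANCEL"}]
print(per_tag_kappa(human, script, "INIT"))      # per-tag kappa for INIT
print(all_or_nothing_kappa(human, script))       # kappa over whole tag sets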
<Paragraph position="7"> To help determine where the disagreements occurred, a simple measure of PA was applied to the tag sets: if agreeOnTag = the number of cases where the annotators agreed on a certain tag and NTag = the number of occurrences of that tag, then Table 6 reports agreeOnTag/NTag for each tag.</Paragraph>
<Paragraph position="8"> The kappa of the &quot;All-or-nothing&quot; analysis is somewhat low compared with the 0.67 standard for tentative conclusions and the 0.8 standard for reliable results as reported in (Car96). The &quot;partial credit&quot; analysis is more favorable, as the grounding tags are somewhat independent of one another; an INIT always starts a new discourse unit whether or not it also acknowledges a previous discourse unit.</Paragraph>
<Paragraph position="9"> Thus, the partial credit analysis is likely to be closer to the actual reliability we want to measure. The remaining &quot;partial credit&quot; kappas have low significance levels, indicating that more examples are needed to calculate these measures reliably.</Paragraph>
<Paragraph position="10"> Another limitation of this study was that technical papers were used for annotator training rather than an annotation manual designed to explain how the tags apply in different situations. This was especially problematic when several tags seemed to apply at once. The BF tags themselves were not perfect, as explained in (CA97); kappas for those annotations ranged from 0.15 at the lowest to 0.77 at the highest.</Paragraph>
<Paragraph position="11"> Given these limitations, the results of this experiment are promising. An annotation manual needs to be developed for labeling grounding, and more dialogues need to be labeled. When these sources of confusion are addressed, analysis of the remaining differences will reveal any minor changes necessary to the mapping.</Paragraph>
</Section>
</Paper>