<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1670"> <Title>Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger[?]</Title> <Section position="10" start_page="599" end_page="600" type="evalu"> <SectionTitle> 5.4 Results </SectionTitle> <Paragraph position="0"> Table 4 summarizes overall performance6. The first line shows the accuracy of a baseline which assigns possible supersenses of identified words at random. The second line shows the performance of the first sense baseline (cf. Section 5.3), the marked difference between the two is a measure of the robustness of the first sense heuristic. On the Semcor data the tagger improves over the base-line by 10.71%, 31.19% error reduction, while on Senseval-3 the tagger improves over the base-line by 6.45%, 17.96% error reduction. We can put these results in context, although indirectly, by comparison with the results of the Senseval-3 all words task systems. There, with a base-line of 62.40%, only 4 out of 26 systems performed above the baseline, with the two best systems (Mihalcea and Faruque, 2004; Decadt et al., 2004) achieving an F-score of 65.2% (2.8% improvement, 7.45% error reduction). The system based on the HMM tagger (Molina et al., 2004), 6Scoring was performed with a re-implementation of the &quot;conlleval&quot; script .</Paragraph> <Paragraph position="1"> achieved an F-score of 60.9%. The supersense tagger improves mostly on precision, while also improving on recall. Overall the tagger achieves F-scores between 70.5 and 77.2%. If we compare these figures with the accuracy of NER taggers the results are very encouraging. Given the considerably larger - one order of magnitude - class space some loss has to be expected. Experiments with augmented tagsets in the biomedical domain alsoshowperformancelosswithrespecttosmaller tagsets; e.g., Kazama et al. (2002) report an F-score of 56.2% on a tagset of 25 Genia classes, compared to the 75.9% achieved on the simplest binary case. The sequence fragments from SEMv contribute about 1% F-score improvement.</Paragraph> <Paragraph position="2"> Table 5 focuses on subsets of the evaluation.</Paragraph> <Paragraph position="3"> The upper part summarizes the results on Semcor for the classes comparable to standard NER's: &quot;person&quot;, &quot;group&quot;, &quot;location&quot; and &quot;time&quot;. However, these categories here are composed of common nouns as well as proper names/named entities. On this four tags the tagger achieves an average 82.46% F-score, not too far from NER results.</Paragraph> <Paragraph position="4"> The lower portion of Table 5 summarizes the results on the five most frequent noun and verb supersense labels on the Senseval-3 data, providing more specific evidence for the supersense tagger's disambiguation accuracy. The tagger outperforms the first sense baseline on all categories, with the exception of &quot;verb.cognition&quot; and &quot;noun.person&quot;. The latter case has a straightforward explanation, named entities (e.g., &quot;Phil Haney&quot;, &quot;Chevron&quot; or &quot;Marina District&quot;) are not annotated in the Senseval data, while they are in Semcor. Hence the tagger learns a different model for nouns than the one used to annotate the Senseval data. Because of this discrepancy the tagger tends to return false positives for some categories. 
In fact, the other noun categories on which the tagger performs poorly in Senseval-3 are &quot;group&quot; and &quot;location&quot; (baseline 52.10% vs. tagger 44.72%, and baseline 47.62% vs. tagger 47.54% F-score, respectively). Naturally, the lower performance on Senseval is also explained by the fact that the evaluation data comes from different sources than the training data.</Paragraph>
[Table 5 caption fragment: ... Semcor (upper section), and 5 most frequent verb (middle) and noun (bottom) categories evaluated on Senseval.]
</Section> </Paper>
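
The relative error reduction figures quoted above (e.g., 31.19% on Semcor, or 7.45% for the best Senseval-3 all-words system) relate improvement to the baseline's remaining error. A minimal arithmetic sketch, assuming error reduction is the improvement divided by the baseline's error; the numbers are the Senseval-3 all-words figures quoted in the text:

    # Relative error reduction: the share of the baseline's errors removed by the system.
    def error_reduction(baseline, system):
        return 100.0 * (system - baseline) / (100.0 - baseline)

    # Senseval-3 all-words task: baseline 62.40, best system 65.2.
    print(round(error_reduction(62.40, 65.2), 2))  # 7.45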
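The footnote states that scoring used a re-implementation of the conlleval script, which credits a prediction only when a labelled span matches the gold span exactly. The following is a minimal sketch of conlleval-style span precision/recall/F-score, assuming BIO-encoded supersense tags; it is an illustration, not the authors' re-implementation:

    def spans(tags):
        # Collect (label, start, end) spans from a BIO tag sequence.
        out, start, label = [], None, None
        for i, t in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last open span
            starts_new = t.startswith("B-") or (t.startswith("I-") and t[2:] != label)
            if t == "O" or starts_new:
                if label is not None:
                    out.append((label, start, i))
                start, label = (i, t[2:]) if starts_new else (None, None)
        return set(out)

    def span_prf(gold_tags, pred_tags):
        # Exact-match precision, recall and F-score over labelled spans.
        gold, pred = spans(gold_tags), spans(pred_tags)
        tp = len(gold & pred)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f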
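The first sense baseline discussed above labels each noun or verb with the supersense of its most frequent WordNet sense; the supersense labels coincide with WordNet's lexicographer file names (e.g., noun.person, verb.cognition). A minimal sketch using NLTK's WordNet interface, as an illustration rather than the paper's implementation:

    from nltk.corpus import wordnet as wn

    def first_sense_supersense(lemma, pos):
        # Supersense (lexicographer file name) of the most frequent WordNet sense.
        synsets = wn.synsets(lemma, pos=pos)
        return synsets[0].lexname() if synsets else "O"

    print(first_sense_supersense("scientist", wn.NOUN))  # noun.person
    print(first_sense_supersense("think", wn.VERB))      # verb.cognition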