<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1105">
<Title>Comparison of Similarity Models for the Relation Discovery Task</Title>
<Section position="8" start_page="29" end_page="31" type="evalu">
<SectionTitle> 5 Experiment </SectionTitle>
<Section position="1" start_page="29" end_page="29" type="sub_section">
<SectionTitle> 5.1 Method </SectionTitle>
<Paragraph position="0"> This section describes the experimental setup, which uses relation extraction data from ACE 2005 to answer four questions concerning the effectiveness of similarity models based on term co-occurrence and dimensionality reduction for the relation discovery task:
1. Do term co-occurrence models provide a better representation of relation semantics than the standard term-by-document vector space?
2. Do textual dimensionality reduction techniques provide any further improvements?
3. How do probabilistic topic models perform with respect to SVD on the relation discovery task?
4. Does one similarity measure (for probability distributions) outperform the others on the relation discovery task?
System configurations are compared across six different data subsets (entity type pairs, i.e., organisation-geopolitical entity, organisation-organisation, person-facility, person-geopolitical entity, person-organisation, person-person) and evaluated following the suggestions of Demšar (2006) for statistical comparison of classifiers over multiple data sets.</Paragraph>
<Paragraph position="1"> The dependent variable is clustering performance as measured by the F-score. The F-score accounts both for the proportion of predictions that are true (precision) and for the proportion of true classes that are predicted (recall). We use the CLUTO implementation of this measure for evaluating hierarchical clustering. Based on (Larsen and Aone, 1999), this is the balanced F-score, F = 2PR / (P + R), where each class receives the maximum score over all possible alignments of gold standard classes with nodes in the hierarchical tree. The average F-score for the entire hierarchical tree is a micro-average over the class-specific scores, weighted according to the relative size of each class.</Paragraph>
</Section>
<Section position="2" start_page="29" end_page="31" type="sub_section">
<SectionTitle> 5.2 Results </SectionTitle>
<Paragraph position="0"> Table 3 contains F-score performance on the test set (ACE 2005). The columns contain results for the different system configurations: the column labels in the top row indicate the representation of relation similarity, those in the second row indicate the dimensionality reduction technique used, and those in the third row indicate the similarity measure used, i.e. cosine (Cos), KL (KL), symmetrised KL (Sym) and JS (JS) divergence. The rows contain results for the different data subsets. While we do not use them for analysis of statistical significance, we include micro and macro averages over the data subsets. [Footnote 11: Averages over data sets are unreliable where it is not clear whether the domains are commensurable (Webb, 2000). We present averages in our results but avoid drawing conclusions based on them.] We also include the average ranks, which show that the LDA system using KL divergence performed best.</Paragraph>
<Paragraph position="1"> Initial inspection of the table shows that all systems that use the term co-occurrence semantic space outperform the baseline system that uses the term-by-document semantic space.</Paragraph>
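<Paragraph position="2"> As a concrete reference for the similarity measures named in the third header row, the following minimal Python sketch (an illustration added for this presentation, not the paper's evaluation code) computes the four measures; it assumes p and q are smoothed probability distributions (e.g., LDA topic mixtures) and u and v are arbitrary term vectors:

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q); asymmetric in its arguments.
    return entropy(p, q)

def symmetrised_kl(p, q):
    # Symmetrised KL divergence: KL(p || q) + KL(q || p).
    return entropy(p, q) + entropy(q, p)

def js(p, q):
    # Jensen-Shannon divergence: mean KL divergence to the mixture m.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)

def cos(u, v):
    # Cosine similarity, as used for the unreduced vector-space systems.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

Note that scipy.spatial.distance.jensenshannon, where available, returns the square root of js(p, q).</Paragraph>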
<Paragraph position="3"> To test for statistical significance, we use the non-parametric tests proposed by Demšar (2006) for comparing classifiers across multiple data sets. Non-parametric tests are safer here as they do not assume normality and are less sensitive to outliers. The first test we perform is the Friedman test (Friedman, 1940), a multiple comparison technique which is the non-parametric equivalent of the repeated-measures ANOVA. The null hypothesis is that all models perform the same and the observed differences are random. With a Friedman statistic (χ²_F) of 21.238, we reject the null hypothesis at p < 0.01.</Paragraph>
<Paragraph position="4"> The first question we wanted to address is whether term co-occurrence models outperform the term-by-document representation of relation semantics. To address this question, we continue with post-hoc analysis. The objective here is to compare several conditions to a control (i.e., to compare the term co-occurrence systems to the term-by-document baseline), so we use a Bonferroni-Dunn test. At a significance level of p < 0.05, the critical difference for the Bonferroni-Dunn test comparing 6 systems across 6 data sets is 2.782. We conclude that the unreduced term co-occurrence system and the LDA systems with KL and JS divergence all perform significantly better than the baseline, while the SVD system and the LDA system with symmetrised KL divergence do not.</Paragraph>
<Paragraph position="5"> The second question asks whether the SVD and LDA dimensionality reduction techniques provide any further improvement. We observe that the systems using KL and JS divergence both outperform the unreduced term co-occurrence system, though the difference is not significant.</Paragraph>
<Paragraph position="6"> The third question asks how the probabilistic topic models perform with respect to the SVD models. Here, Holm-corrected Wilcoxon signed-ranks tests show that the KL divergence system performs significantly better than SVD, while the symmetrised KL divergence and JS divergence systems do not.</Paragraph>
<Paragraph position="7"> The final question is whether one of the divergence measures (KL, symmetrised KL or JS) outperforms the others. With a statistic of χ²_F = 9.336, we reject the null hypothesis that all systems are the same at p < 0.01. Post-hoc analysis with Holm-corrected Wilcoxon signed-ranks tests shows that the KL divergence system and the JS divergence system both perform significantly better than the symmetrised KL system at p < 0.05, while there is no significant difference between the KL and JS systems.</Paragraph>
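<Paragraph position="8"> The testing procedure above can be reproduced with standard tools. The sketch below is a minimal illustration under stated assumptions: the score matrix is a random placeholder (not the values from Table 3) and the column indices for the SVD and LDA systems are hypothetical.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# Placeholder F-scores: 6 data subsets (rows) by 6 system configurations
# (columns); random stand-ins for the values reported in Table 3.
scores = rng.uniform(0.5, 0.8, size=(6, 6))

# Friedman test: the null hypothesis is that all systems perform the same.
stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman statistic = {stat:.3f}, p = {p:.4f}")

# Bonferroni-Dunn critical difference for comparing k systems over N data
# sets: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)). With q_0.05 = 2.576 for
# k = 6 (Demsar, 2006), this reproduces the critical difference of 2.782.
k, N = scores.shape[1], scores.shape[0]
cd = 2.576 * np.sqrt(k * (k + 1) / (6 * N))
print(f"critical difference = {cd:.3f}")

# Holm correction: sort p-values in ascending order and compare the i-th
# smallest against alpha / (m - i), stopping at the first failure.
def holm_reject(pvals, alpha=0.05):
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(len(pvals), dtype=bool)
    for i, idx in enumerate(order):
        if pvals[idx] > alpha / (len(pvals) - i):
            break
        reject[idx] = True
    return reject

# Wilcoxon signed-ranks tests of a hypothetical SVD column (index 1)
# against three hypothetical LDA columns (indices 2-4), Holm-corrected.
pvals = [wilcoxon(scores[:, 1], scores[:, j]).pvalue for j in (2, 3, 4)]
print(holm_reject(pvals))
</Paragraph>
</Section> </Section> </Paper>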