<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1041"> <Title>Structuring Knowledge for Reference Generation: A Clustering Algorithm</Title> <Section position="7" start_page="325" end_page="326" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> The evaluation of the algorithm was based on a comparison of its output against the output of human beings in a similar task.</Paragraph> <Paragraph position="1"> Thirteen native or fluent speakers of English volunteered to participate in the study. The materials consisted of 8 domains, 4 of which were graphical representations of a 2D spatial layout containing 13 points. The pictures were generated by plotting numerical x/y coordinates (the same values are used as input to the algorithm). The other four domains consisted of a set of 13 arbitrarily chosen nouns. Participants were presented with an eight-page booklet with spatial and semantic domains on alternate pages. They were instructed to draw circles around the best clusters in the pictures, or write down the words in groups that were related according to their intuitions. Clusters could be of arbitrary size, but each element had to be placed in exactly one cluster.</Paragraph> <Section position="1" start_page="325" end_page="325" type="sub_section"> <SectionTitle> 5.1 Participant agreement </SectionTitle> <Paragraph position="0"> Participant agreement on each domain was measured using kappa. Since the task did not involve predefined clusters, the set of unique groups (denoted G) generated by participants in every domain was identified, representing the set of 'categories' available post hoc.</Paragraph> <Paragraph position="1"> For each domain element, the number of times it occurred in each group served as the basis to calculate the proportion of agreement among participants for the element.
The total agreement P(A) and the agreement expected by chance, P(E), were then used in the standard kappa formula, kappa = (P(A) - P(E)) / (1 - P(E)).</Paragraph> <Paragraph position="3"> Table 2 shows a remarkable difference between the two domain types, with very high agreement on spatial domains and lower values on the semantic task.</Paragraph> <Paragraph position="4"> The difference was significant (t = 2.54, p < 0.05).</Paragraph> <Paragraph position="5"> Disagreement on spatial domains was mostly due to the problem of reciprocal pairs, where participants disagreed on whether entities such as e8 and e9 in Figure 1 gave rise to a well-formed cluster or not. However, all the participants were consistent with the version of the Nearest Neighbour Principle given in (5). If an element was grouped, it was always grouped with its anchor.</Paragraph> <Paragraph position="6"> The disagreement in the semantic domains seemed to turn on two cases: 1. Sub-clusters Whereas some proposals included clusters such as {man, woman, boy, girl, infant, toddler, baby, child}, others chose to group {infant, toddler, baby, child} separately.</Paragraph> <Paragraph position="7"> 2. Polysemy For example, liver was in some cases clustered with {steak, pizza}, while others grouped it with items like {heart, lung}.</Paragraph> <Paragraph position="8"> Insofar as an algorithm should capture the whole range of phenomena observed, (1) above could be accounted for by making repeated calls to the Algorithm to sub-divide clusters. One problem is that, in case only one cluster is found in the original domain, the same cluster will be returned after further attempts at sub-clustering. A possible solution to this is to redefine the parameter k in Algorithm (1), making the condition for proximity more strict. As for the second observation, the desideratum expressed in (3) may be too strong in the semantic domain, since words can be polysemous.
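The agreement computation of §5.1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-element counts over the post-hoc groups G are tabulated as an items-by-groups matrix, and applies the standard (Fleiss-style) computation of P(A), P(E), and kappa.

```python
def fleiss_kappa(ratings):
    """Fleiss-style kappa for `ratings`, a list of per-item group counts.

    ratings[i][c] = number of participants who placed item i in
    post-hoc group c; every row must sum to the same number of raters n.
    (Illustrative sketch; the paper's exact tabulation is assumed.)
    """
    N = len(ratings)            # number of domain elements
    n = sum(ratings[0])         # raters per element
    # Per-item agreement: proportion of agreeing rater pairs.
    p_items = [(sum(c * c for c in row) - n) / (n * (n - 1))
               for row in ratings]
    p_a = sum(p_items) / N      # total observed agreement P(A)
    # Chance agreement P(E) from marginal group proportions.
    k = len(ratings[0])
    p_groups = [sum(row[c] for row in ratings) / (N * n) for c in range(k)]
    p_e = sum(p * p for p in p_groups)
    return (p_a - p_e) / (1 - p_e)
```

With full agreement (e.g. all three raters place item 1 in group A and item 2 in group B), the function returns 1; as raters scatter across groups, the value falls toward chance level and below.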
As suggested above, one way to resolve this would be to measure distance between word senses, as opposed to words.</Paragraph> </Section> <Section position="2" start_page="325" end_page="326" type="sub_section"> <SectionTitle> 5.2 Algorithm performance </SectionTitle> <Paragraph position="0"> The performance of the algorithm (hereafter the target) against the human output was compared to two baseline algorithms. In the spatial domains, we used an implementation of the Thorisson algorithm (Thorisson, 1994) described in §2. In our implementation, the procedure was called iteratively until all domain objects had been clustered in at least one group.</Paragraph> <Paragraph position="1"> For the semantic domains, the baseline was a simple procedure which calculated the powerset of each domain S. For each subset in pow(S) - {∅, S}, the procedure calculated the mean pairwise similarity between words, returning an ordered list of subsets. This partial order was then traversed, choosing subsets until all elements had been grouped. This seemed to be a reasonable baseline, because it corresponds to the intuition that the 'best cluster' from a semantic point of view is the one with the highest pairwise similarity among its elements.</Paragraph> <Paragraph position="2"> The output of the target and baseline algorithms was compared to human output in the following ways: 1. By item In each of the eight test domains, an agreement score was calculated for each domain element e (i.e. 13 scores in each domain). Let Us be the set of distinct groups containing e proposed by the experimental participants, and let Ua be the set of unique groups containing e proposed by the algorithm (|Ua| = 1 in the case of the target algorithm, but not necessarily for the baselines, since they do not impose a partition). For each pair (Uai, Usj) of algorithm-human clusters, the agreement score was defined as</Paragraph> <Paragraph position="4"> i.e.
the ratio of the number of elements on which the human and the algorithm agree, and the number of elements on which they do not agree. This returns a number in (0,1), with 1 indicating perfect agreement. The maximal such score for each entity was selected. This controlled for the possible advantage that the target algorithm might have, given that it, like the human participants, partitions the domain.</Paragraph> <Paragraph position="5"> 2. By participant An overall mean agreement score was computed for each participant using the above formula for the target and baseline algorithms in each domain.</Paragraph> <Paragraph position="6"> Results by item Table 3 shows the mean and modal agreement scores obtained for both target and baseline in each domain type. At a glance, the target algorithm performed better than the baseline on the spatial domains, with a modal score of 1, indicating perfect agreement on 60% of the objects. The situation is different in the semantic domains, where target and baseline performed roughly equally well; in fact, the modal score of 1 accounts for 75% of baseline scores.</Paragraph> <Paragraph position="7">

Table 3:
                  target      baseline
  spatial   mean  0.84        0.72
            mode  1 (60%)     0.67 (40%)
  semantic  mean  0.86        0.86
            mode  1 (65%)     1 (75%)

Unsurprisingly, the difference between target and baseline algorithms was reliable on the spatial domains (t = 2.865, p < .01), but not on the semantic domains (t < 1, ns). This was confirmed by an Analysis of Variance (ANOVA), testing the effects of algorithm (target/baseline) and domain type (spatial/semantic) on agreement results. There was a significant main effect of domain type (F = 6.399, p = .01), while the main effect of algorithm was marginally significant (F = 3.542, p = .06). However, there was a reliable type x algorithm interaction (F = 3.624, p = .05), confirming the finding that the agreement between target and human output differed between domain types.
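The by-item scoring just described can be sketched as follows. This is a hypothetical instantiation, not the paper's code: the pairwise score is taken to be set intersection over set union (one Jaccard-style reading of "elements agreed on" versus "not agreed on"), since the exact formula is not reproduced in this text, and all names are illustrative.

```python
def best_item_agreement(e, human_groups, algo_groups):
    """Maximal agreement score for domain element e.

    human_groups, algo_groups: lists of sets (the clusters proposed by
    participants and by an algorithm, respectively).
    The pairwise score is an assumed Jaccard-style overlap, standing in
    for the paper's own (omitted) formula.
    """
    us = [g for g in human_groups if e in g]  # human clusters containing e
    ua = [g for g in algo_groups if e in g]   # algorithm clusters containing e
    # Score every algorithm-human cluster pair and keep the maximum.
    return max(len(a & s) / len(a | s) for a in ua for s in us)
```

Taking the maximum over all cluster pairs for each element is what controls for the advantage the partitioning target algorithm would otherwise have over the non-partitioning baselines.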
Given the relative lack of agreement between participants in the semantic clustering task, this is unsurprising. Although the analysis focused on maximal scores obtained per entity, if participants do not agree on groupings, then the means which are statistically compared are likely to mask a significant amount of variance. We now turn to the analysis by participants. Results by participant The difference between target and baselines in agreement across participants was significant both for spatial (t = 16.6, p < .01) and semantic (t = 5.759, p < .01) domain types.</Paragraph> <Paragraph position="8"> This corroborates the earlier conclusion: once participant variation is controlled for by including it in the statistical model, the differences between target and baseline show up as reliable across the board. A univariate ANOVA confirms the results, showing no significant main effect of domain type (F < 1, ns), but a highly significant main effect of algorithm (F = 233.5, p < .01) and a significant interaction (F = 44.3, p < .01).</Paragraph> <Paragraph position="9"> Summary The results of the evaluation are encouraging, showing high agreement between the output of the algorithm and the output that was judged by humans as most appropriate. They also suggest that the framework of §4 corresponds to human intuitions better than the baselines tested here. However, these results should be interpreted with caution in the case of semantic clustering, where there was significant variability in human agreement. With respect to spatial clustering, one outstanding problem is that of reciprocal pairs which are too distant from each other to form a perceptually well-formed cluster. We are extending the empirical study to new domains involving such cases, in order to infer from the human data a threshold on pairwise distance between entities, beyond which they are not clustered.</Paragraph> </Section> </Section> </Paper>