<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1104">
<Title>Automatically creating datasets for measures of semantic relatedness</Title>
<Section position="4" start_page="16" end_page="16" type="intro">
<SectionTitle> 2 Evaluating SR measures </SectionTitle>
<Paragraph position="0"> Various approaches for computing the semantic relatedness of words or concepts have been proposed, e.g. dictionary-based (Lesk, 1986), ontology-based (Wu and Palmer, 1994; Leacock and Chodorow, 1998), information-based (Resnik, 1995; Jiang and Conrath, 1997) or distributional (Weeds and Weir, 2005). The knowledge sources used for computing relatedness can be as different as dictionaries, ontologies or large corpora.</Paragraph>
<Paragraph position="1"> According to Budanitsky and Hirst (2006), there are three prevalent approaches for evaluating SR measures: mathematical analysis, application-specific evaluation and comparison with human judgments.</Paragraph>
<Paragraph position="2"> Mathematical analysis can assess a measure with respect to some formal properties, e.g. whether a measure is a metric (Lin, 1998), i.e. whether it fulfills the criteria d(x,y) ≥ 0; d(x,y) = 0 iff x = y; d(x,y) = d(y,x); and d(x,z) ≤ d(x,y) + d(y,z). However, mathematical analysis cannot tell us whether a measure closely resembles human judgments or whether it performs best when used in a certain application.</Paragraph>
<Paragraph position="3"> The latter question is tackled by application-specific evaluation, where a measure is tested within the framework of a certain application, e.g. word sense disambiguation (Patwardhan et al., 2003) or malapropism detection (Budanitsky and Hirst, 2006). Lebart and Rajman (2000) argue for application-specific evaluation of similarity measures, because measures are always used for some task. But they also note that evaluating a measure as part of a usually complex application assesses its quality only indirectly. A certain measure may work well in one application, but not in another. Application-based evaluation can establish this fact, but gives little explanation of the reasons behind it.</Paragraph>
<Paragraph position="4"> The remaining approach, comparison with human judgments, is best suited for application-independent evaluation of relatedness measures.</Paragraph>
<Paragraph position="5"> Human annotators are asked to judge the relatedness of presented word pairs. The results of these experiments are used as a gold standard for evaluation. A further advantage of comparison with human judgments is the possibility to gain deeper insights into the nature of semantic relatedness.</Paragraph>
<Paragraph position="6"> However, creating datasets for evaluation has so far been limited in a number of respects. Only a small number of word pairs was manually selected, with semantic similarity rather than relatedness in mind. Word pairs consisted only of noun-noun combinations, and only general terms were included. Polysemous and homonymous words were not disambiguated to concepts, i.e. humans annotated the semantic relatedness of words rather than concepts.</Paragraph>
</Section>
</Paper>
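As a minimal illustration of the metric criteria cited above, the following Python sketch tests the four axioms on a finite sample of items. It is not a procedure taken from the paper; the names distance and items are hypothetical placeholders for a distance function and a set of words or concepts.

    from itertools import product

    def is_metric(distance, items, tol=1e-9):
        # Check the four metric axioms on every pair and triple of items.
        for x, y in product(items, repeat=2):
            d_xy = distance(x, y)
            if d_xy < -tol:                        # non-negativity: d(x,y) >= 0
                return False
            if (abs(d_xy) <= tol) != (x == y):     # identity: d(x,y) = 0 iff x = y
                return False
            if abs(d_xy - distance(y, x)) > tol:   # symmetry: d(x,y) = d(y,x)
                return False
        for x, y, z in product(items, repeat=3):   # triangle inequality
            if distance(x, z) > distance(x, y) + distance(y, z) + tol:
                return False
        return True

Such a check can only confirm formal well-behavedness on the sampled items; as the text notes, it says nothing about agreement with human judgments or usefulness in an application.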
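For the comparison with human judgments, one common way to score a measure against the gold standard is rank correlation. The sketch below is an assumed setup, not the paper's own evaluation code: relatedness stands for any measure under test, and gold is a hypothetical list of word pairs with averaged human scores.

    from scipy.stats import spearmanr

    def evaluate_against_humans(relatedness, gold):
        # gold: list of ((word1, word2), human_score) tuples forming the gold standard.
        system_scores = [relatedness(w1, w2) for (w1, w2), _ in gold]
        human_scores = [score for _, score in gold]
        rho, p_value = spearmanr(system_scores, human_scores)
        return rho, p_value

A higher correlation indicates that the measure ranks word pairs more similarly to the human annotators, which is exactly the application-independent criterion described in the section.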