<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0113">
<Title>Evaluation Techniques for Automatic Semantic Extraction: Comparing Syntactic and Window Based Approaches</Title>
<Section position="2" start_page="0" end_page="143" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">As more text becomes available electronically, it is tempting to imagine the development of automatic filters able to screen these tremendous flows of text, extracting useful bits of information. In order to filter properly, it is useful to know when two words are similar in a corpus. Knowing this would alleviate part of the term variability problem of natural language discussed in Furnas et al. (1987). Individuals will choose a variety of words to name the same object or operation, with little overlap between people's choices. This variability in naming was cited as the principal reason for large numbers of missed citations in a large-scale evaluation of an information retrieval system [Blair and Maron, 1985]. A proper filter must be able to access information in the text using any word of a set of similar words. A number of knowledge-rich [Jacobs and Rau, 1990, Calzolari and Bindi, 1990, Mauldin, 1991] and knowledge-poor [Brown et al., 1992, Hindle, 1990, Ruge, 1991, Grefenstette, 1992] methods have been proposed for recognizing when words are similar.</Paragraph>
<Paragraph position="1">The knowledge-rich approaches require either a conceptual dependency representation or semantic tagging of the words, while the knowledge-poor approaches require no previously encoded semantic information and depend on the frequency of co-occurrence of word contexts to determine similarity. Evaluations of the results produced by the above systems have often been limited to visual verification by a human subject, or left to the human reader.</Paragraph>
<Paragraph position="2">In this paper, we propose gold standard evaluation techniques that allow us to objectively evaluate and compare two knowledge-poor approaches for extracting word similarity relations from large text corpora. In order to evaluate the relations extracted, we measure the overlap of the results of each technique against existing hand-created repositories of semantic information, such as thesauri and dictionaries. We describe below how such resources can be used as evaluation tools, and apply them to two knowledge-poor approaches.</Paragraph>
<Paragraph position="3">One of the tested semantic extraction approaches uses selective natural language processing, in this case the lexical-syntactic relations that can be extracted for each word in a corpus by robust parsers [Hindle, 1983, Grefenstette, 1993]. The other approach uses a variation on a classic windowing technique around each word, such as was used in [Phillips, 1985]. Both techniques are applied to the same 4 megabyte corpus. We evaluate the results of both techniques using our gold standard evaluations over thesauri and dictionaries, and compare the results obtained by the syntax-based method to those obtained by the windowing method. The syntax-based method provides a better overlap with the manually defined thesaurus classes for the 600 most frequently appearing words in the corpus, while the windowing method performs slightly better for rare words.</Paragraph>
</Section>
</Paper>
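The following is a minimal Python sketch, not taken from the paper, of the two ingredients the introduction describes: a knowledge-poor, window-based co-occurrence similarity measure, and a gold-standard overlap evaluation against hand-built thesaurus classes. The toy corpus, the window size of 2, and the toy thesaurus are illustrative assumptions, not values from the paper.

```python
# Sketch of (1) window-based co-occurrence similarity and (2) overlap with
# hand-built thesaurus classes. Corpus, window size, and thesaurus are toy
# assumptions for illustration only.
from collections import Counter, defaultdict
from math import sqrt

def window_vectors(tokens, window=2):
    """Build a co-occurrence count vector for each word from a +/- `window` context."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def nearest_neighbours(vectors, word, k=3):
    """Return the k words whose context vectors are most similar to `word`'s."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return [w for w, _ in sorted(scores, key=lambda x: x[1], reverse=True)[:k]]

def thesaurus_overlap(neighbours, thesaurus_classes):
    """Fraction of extracted (word, neighbour) pairs that share a thesaurus class."""
    hits = total = 0
    for word, near in neighbours.items():
        for other in near:
            total += 1
            if thesaurus_classes.get(word, set()) & thesaurus_classes.get(other, set()):
                hits += 1
    return hits / total if total else 0.0

if __name__ == "__main__":
    tokens = ("the doctor treated the patient the physician examined the patient "
              "the nurse helped the doctor the surgeon treated the patient").split()
    vectors = window_vectors(tokens, window=2)
    neighbours = {w: nearest_neighbours(vectors, w, k=2)
                  for w in ("doctor", "physician", "surgeon")}
    # A toy stand-in for a hand-created thesaurus: words grouped into classes.
    thesaurus = {"doctor": {"medic"}, "physician": {"medic"}, "surgeon": {"medic"},
                 "nurse": {"medic"}, "patient": {"sick"}}
    print(neighbours)
    print("overlap with thesaurus classes:", thesaurus_overlap(neighbours, thesaurus))
```

The same overlap measure could in principle be applied to neighbours produced by a syntax-based extractor; only the construction of the context vectors would change.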