<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0111"> <Title>EXPERIMENTS IN SYNTACTIC AND SEMANTIC CLASSIFICATION AND DISAMBIGUATION USING BOOTSTRAPPING*</Title> <Section position="3" start_page="0" end_page="117" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle>
<Paragraph position="0"> Identifying the syntactic class and discovering semantic information for words not contained in any on-line dictionary or thesaurus is an important and challenging problem. Excellent methods have been developed for part-of-speech (POS) tagging using stochastic models trained on partially tagged corpora (Church, 1988; Cutting, Kupiec, Pedersen & Sibun, 1992). Semantic issues have been addressed, particularly for sense disambiguation, by using large contexts, e.g., 50 nearby words (Gale, Church & Yarowsky, 1992), or by reference to on-line dictionaries (Krovetz, 1991; Lesk, 1986; Liddy & Paik, 1992; Zernik, 1991). More recently, methods that work with entirely untagged corpora have been developed and show great promise (Brill & Marcus, 1992; Finch & Chater, 1992; Myaeng & Li, 1992; Schütze, 1992). They are particularly useful for text with specialized vocabularies and word use. * This material is based upon work supported by the National Science Foundation under Grant No. DIR-8814522.</Paragraph>
<Paragraph position="1"> These methods of unsupervised classification typically have clustering algorithms at their heart (Jain & Dubes, 1988). They use similarity of contexts (the distributional principle) as a measure of distance in the space of words and then cluster similar words into classes. This paper demonstrates a particular approach to these classification techniques.</Paragraph>
<Paragraph position="2"> In our approach, we take into account both the relative positions of the nearby context words and the mutual information (Church & Hanks, 1990) associated with the occurrence of a particular context word.
The similarities computed from these measures of context carry information about both syntactic and semantic relations. For example, high similarity values are obtained for the two semantically similar nouns, &quot;diameter&quot; and &quot;length&quot;, as well as for the two adjectives &quot;nonmotile&quot; and &quot;nonchemotactic&quot;. We demonstrate the technique on three problems, all using a 200,000-word corpus composed of 1,700 abstracts from a specialized field of biology:
#1: Generating the full classification tree for the 1,000 most frequent words (covering 80% of all word occurrences).</Paragraph>
<Paragraph position="3"> #2: The classification of 138 occurrences of the -ed forms &quot;cloned&quot; and &quot;deduced&quot; into four syntactic categories, including improvements obtained by using the expanded context information derived in #1.
#3: The classification of 100 words that occur only once in the entire corpus (hapax legomena), again using expanded contexts.</Paragraph>
<Paragraph position="4"> The results described below were obtained using no pretagging and no on-line dictionary, yet they compare favorably with those of methods that use such resources. The results are discussed in terms of the semantic fields they delineate, the accuracy of the classifications, and the nature of the errors that occur. They make it clear that this new technology is very promising and should be pursued vigorously.</Paragraph>
<Paragraph position="5"> The power of the approach appears to result from using a focused corpus, detailed positional information, mutual information measures, and a clustering method that updates the detailed context information each time a new cluster is formed.</Paragraph>
<Paragraph position="6"> Our approach was inspired by the fascinating results achieved by Finch and Chater at Edinburgh and the methods they used (Finch & Chater, 1992).</Paragraph> </Section></Paper>