File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/96/c96-1003_intro.xml
Size: 2,815 bytes
Last Modified: 2025-10-06 14:05:59
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1003"> <Title>Clustering Words with the MDL Principle</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Recently, various methods for automatically constructing a thesaurus (hierarchically clustering words) based on corpus data have been proposed (Hindle, 1990; Brown et al., 1992; Pereira et al., 1993; Tokunaga et al., 1995). The realization of such an automatic construction method would make it possible to a) save the cost of constructing a thesaurus by hand, b) do away with the subjectivity inherent in a hand-made thesaurus, and c) make it easier to adapt a natural language processing system to a new domain. In this paper, we propose a new method for automatic construction of thesauri. Specifically, we view the problem of automatically clustering words as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns (in general, any set of words) and a partition of a set of verbs (in general, any set of words), and propose an estimation</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> *Real World Computing Partnership </SectionTitle> <Paragraph position="0"> algorithm using simulated annealing with an energy function based on the Minimum Description Length (MDL) Principle. The MDL Principle is a well-motivated and theoretically sound principle for data compression and estimation in information theory and statistics. As a method of statistical estimation, MDL is guaranteed to be near optimal.</Paragraph> <Paragraph position="1"> We empirically evaluated the effectiveness of our method. In particular, we compared the performance of an MDL-based simulated annealing algorithm in hierarchical word clustering against that of one based on the Maximum Likelihood Estimator (MLE, for short). 
We found that the MDL-based method performs better than the MLE-based method. We also evaluated our method by conducting PP-attachment disambiguation experiments using a thesaurus automatically constructed by it, and found that disambiguation results can be improved.</Paragraph> <Paragraph position="2"> Since some words never occur in a corpus, and thus cannot be reliably classified by a method based solely on corpus data, we propose to combine the use of an automatically constructed thesaurus and a hand-made thesaurus in disambiguation. We conducted experiments to test the effectiveness of this strategy. Our experimental results indicate that combining an automatically constructed thesaurus and a hand-made thesaurus widens the coverage of our disambiguation method while maintaining high accuracy.</Paragraph> </Section> </Section> </Paper>