<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1003">
  <Title>Clustering Words with the MDL Principle</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently, various methods for automatically constructing a thesaurus (hierarchically clustering words) based on corpus data have been proposed (Hindle, 1990; Brown et al., 1992; Pereira et al., 1993; Tokunaga et al., 1995). The realization of such an automatic construction method would make it possible to a) save the cost of constructing a thesaurus by hand, b) do away with the subjectivity inherent in a hand-made thesaurus, and c) make it easier to adapt a natural language processing system to a new domain. In this paper, we propose a new method for automatic construction of thesauri. Specifically, we view the problem of automatically clustering words as that of estimating a joint distribution over the Cartesian product of a partition of a set of nouns (in general, any set of words) and a partition of a set of verbs (in general, any set of words), and propose an estimation</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
*Real World Computing Partnership
</SectionTitle>
      <Paragraph position="0"> algorithm using simulated annealing with an energy function based on the Minimum Description Length (MDL) Principle. The MDL Principle is a well-motivated and theoretically sound principle for data compression and estimation in information theory and statistics. As a method of statistical estimation, MDL is guaranteed to be near optimal.</Paragraph>
      <Paragraph position="1"> We empirically evaluated the effectiveness of our method. In particular, we compared the performance of an MDL-based simulated annealing algorithm in hierarchical word clustering against that of one based on the Maximum Likelihood Estimator (MLE, for short). We found that the MDL-based method performs better than the MLE-based method. We also evaluated our method by conducting pp-attachment disambiguation experiments using a thesaurus automatically constructed by it and found that disambiguation results can be improved.</Paragraph>
      <Paragraph position="2"> Since some words never occur in a corpus, and thus cannot be reliably classified by a method solely based on corpus data, we propose to combine the use of an automatically constructed thesaurus and a hand-made thesaurus in disambiguation. We conducted some experiments in order to test the effectiveness of this strategy. Our experimental results indicate that combining an automatically constructed thesaurus and a hand-made thesaurus widens the coverage of our disambiguation method, while maintaining high accuracy.</Paragraph>
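      <Paragraph position="3"> Illustration (not taken from the paper): the contrast between an MDL-based and an MLE-based criterion described above can be sketched in a few lines of Python. The sketch assumes a hard-clustering model P(n, v) = P(Cn, Cv) P(n | Cn) P(v | Cv) estimated by relative frequency, and a parameter cost of (k / 2) log2 N bits for k free parameters and N observations; both are assumptions made for illustration, and the function name description_length, the word lists, and the counts are hypothetical. It only scores two fixed partitions and does not perform the simulated annealing search.

```python
import math
from collections import Counter


def description_length(pairs, noun_class, verb_class):
    """Return (data_bits, model_bits) for (noun, verb) pairs under a hard clustering.

    Assumed model: P(n, v) = P(Cn, Cv) * P(n | Cn) * P(v | Cv),
    with every probability estimated by relative frequency.
    """
    n_total = len(pairs)
    joint = Counter((noun_class[n], verb_class[v]) for n, v in pairs)
    noun_freq = Counter(n for n, _ in pairs)
    verb_freq = Counter(v for _, v in pairs)
    noun_class_freq = Counter(noun_class[n] for n, _ in pairs)
    verb_class_freq = Counter(verb_class[v] for _, v in pairs)

    # Data description length: negative log-likelihood of the sample, in bits.
    data_bits = 0.0
    for n, v in pairs:
        p_classes = joint[(noun_class[n], verb_class[v])] / n_total
        p_noun = noun_freq[n] / noun_class_freq[noun_class[n]]
        p_verb = verb_freq[v] / verb_class_freq[verb_class[v]]
        data_bits -= math.log2(p_classes * p_noun * p_verb)

    # Model description length (assumed form): half of log2(N) bits per free parameter.
    k_n = len(set(noun_class.values()))
    k_v = len(set(verb_class.values()))
    free_params = (k_n * k_v - 1) + (len(noun_class) - k_n) + (len(verb_class) - k_v)
    model_bits = 0.5 * free_params * math.log2(n_total)
    return data_bits, model_bits


# Hypothetical co-occurrence counts, expanded into one pair per observation.
counts = {("wine", "drink"): 4, ("beer", "drink"): 3, ("beer", "eat"): 1,
          ("bread", "eat"): 4, ("rice", "eat"): 3, ("rice", "drink"): 1}
pairs = [pair for pair, c in counts.items() for _ in range(c)]

verbs = {"drink": "drink", "eat": "eat"}  # each verb is its own class
fine = {"wine": "wine", "beer": "beer", "bread": "bread", "rice": "rice"}
coarse = {"wine": "liquid", "beer": "liquid", "bread": "food", "rice": "food"}

fine_data, fine_model = description_length(pairs, fine, verbs)
coarse_data, coarse_model = description_length(pairs, coarse, verbs)

for name, data, model in [("fine", fine_data, fine_model),
                          ("coarse", coarse_data, coarse_model)]:
    print(f"{name}: data={data:.2f} model={model:.2f} total={data + model:.2f}")
```

On these toy counts the finer partition gives the shorter data code, which is all a likelihood-only criterion looks at, while the coarser partition gives the shorter total once the parameter cost is included.</Paragraph>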
    </Section>
  </Section>
</Paper>