<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1003">
  <Title>Clustering Words with the MDL Principle</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Problem Setting
</SectionTitle>
    <Paragraph position="0"> A method of constructing a thesaurus based on corpus data usually consists of the following three steps: (i) Extract co-occurrence data (e.g. case frame data, adjacency data) fl'om a corpus, (ii) Starting from a single class (or each word composing its own class), divide (or merge) word classes 1 ~Cover~tge' refers to the proportion (in percentage) of test data for which the disambiguat.ion method can make a decision.</Paragraph>
    <Paragraph position="1"> 2'Accuracy' refers to the success rate, given that the disambiguation method makes a decision.</Paragraph>
    <Paragraph position="2"> based Oll the co-occurrence data using 8Ollle Sill&gt; ilarity (distance) measure. (The former apl)roach is called 'divisive', the latter 'agglomerative'.) (iii) Repeat step (ii) until some stopping condition is met, to construct a thesaurus (tree). The method we propose here consists of the same three st.eps.</Paragraph>
    <Paragraph position="3"> Suppose available to us are frequency data (cooccurrence data.) between verbs and their case slot. values extracted from a corpus (step (i)). We then view the problem of clustering words as that of estimating a probabilistic model (representing a.</Paragraph>
    <Paragraph position="4"> probability distribution) tllat generates such data We assume that the target model can be defined in the following way. First, we define a noun partition &amp;quot;PA. ~ over a given set of nouns ..'V&amp;quot; and a verb partioll &amp;quot;Pv over a given set. of verbs 12. A noun partition is any set T'-~ satisfying &amp;quot;P,~ C 2 H, Wc~e'&amp;v('i = A/ and VCi, (..) E 7)A.', Ci 0 (/j = O.</Paragraph>
    <Paragraph position="5"> A verb partition 7)v is defined analogously. In this paper, we call a member of a noun partition 'a, llOUll cluster', and a nlenlbe, r of a verb partition a ~verb cluster'. We refer to a member of the Cartesian product of a noun partition and a verb partition ( C &amp;quot;P:v x &amp;quot;Pv ) simply as 'a cluster'. We then define a probabilistic model (a joint distribution), written I'(C,, (:v), where random variable C,, assumes a value fl'om a fizcd nouu partition ~PX, and C~. a va.lue from a fixed verb partition 7)v. Within a given cluster, we assume thai each element is generated with equal probability, i.e., P(c,,,c~,) v., E c,,,v,,, E c,,, P(,,,,,,) - IC. x &lt;,1 (t) In this paper, we assume that the observed data are generaied by a model belonging to the class of models just de.scribed, and select a model which best explains the data.. As a result of this, we obtain both noun clusters and verb clusters. This problem setting is based on the intuit.lye assumption that similar words occur in the sa.me context with roughly equal likelihood, as is made explicit in equation (l). Thus selecting a model which best explains the given data is equivalent to finding the most appropriate classification of words base(t on their co-occurrence.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Clustering with MDL
</SectionTitle>
    <Paragraph position="0"> We now turn to the question of what. strategy (or criterion) we should employ for estimating the best model. Our choice is the MDL (Minimum Description I,ength) principle (tlissanen, 1989), a well-known principle of data compression and statistical estimation from inforlnation theory. MDI, stipulates that the best probability model for given data is that model which requires the least cod(: length \['or encoding of the model itself, as well as the giwql data relative to it a. We refer to the code length for the model aWe refer /.he interested reader to eli aml Abe, 1!195) for explana.tion of ra.tionals behind using the as 'the model description h'ngth' and that for tile data 'the data description length.&amp;quot; We apply MDI, to the problem of estimating a model consisting of a pair of partitions as described above. In this context, a model with less clusters tends to be simpler (in t.erms of the number of parameters), but also tends to have a poorer fit. to the data. In contrast, a model with more clusters is more complex, but tends to have a better fit to the data. Thus, there is a trade-off relationship between the simplicity of a model and the goodness of fit to the data. The model description length quantifies the simplicity (complexity) of a model, and the data description length quantifies the tlt. to the data. According to MDL, the model which minimizes the sum total of the two types of description lengths should be selected.</Paragraph>
    <Paragraph position="1"> In what follows, we will describe in detail how the description length is to be calculated in our current context, as well as our silnulated annealing algorithm based on MI)L.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Calculating Description Length
</SectionTitle>
      <Paragraph position="0"> We will now describe how the description length for a model is calculated, lh'call that each model is specified by the Cartesian product of a partition of nouns and a partition of verbs, and a number of parameters for them. Here we let /,', denote the size of the noun partition, and /q, the size of the verb partition. Tiien, there are k,. k~,- 1 free parameters in a model.</Paragraph>
      <Paragraph position="1"> Given a model M and data k', its total description length L(J/) 4 is COlnputed as the suni of the model description length L .... d('lt), the description length of its parameters I;~,,,,.(M), and data description length Ld,~t(M). (We often refer to Lm.od(.'l.\]) q- Lpar (:'~l) as the model description length). Namely,</Paragraph>
      <Paragraph position="3"> We employ the %inary noun clustering method', in which k,, is fixed at IVt and we are to dechle whether k,~ -- 1 or k,,. = 2, which is then to be applied recursiw~ly to the clusters thus obtained.</Paragraph>
      <Paragraph position="4"> This is as if we view the noutls as entities a.nd the verbs as features and cluster the entities based on their feat.ures. Since there are 2Pv'I subsets of the set of llottns .~, and for each 'binary' noun partition we have two different subsets (a special case of which is when one subset is A 'r and the other the empty set 0), the number of possible binary noml partitions is 2tAq/2 = 21~'l-J. Thus for each I)inary noun partition we need log 21a&amp;quot;l-t = i3j- I _ 1 bit.s 5 to describe it. 6 Ilenee L ..... a(M) is calculated MI)L principle in natural language processing.</Paragraph>
      <Paragraph position="5"> ~L(M) depends on .';, but we will leave ,5' implicit.</Paragraph>
      <Paragraph position="7"> Lpar(k~/), often referred to as the parallleter description length, is calculated by,</Paragraph>
      <Paragraph position="9"> where ISl denotes the input data size, and/C/,. \]c,,1 is the nnnlber of (free) parauleters ill tlle nlodel.</Paragraph>
      <Paragraph position="10"> It is known that using log ~ = ~ bits to describe each of the parameters will (approximately) minimize the description length (1Rissanen, 1.989).</Paragraph>
      <Paragraph position="11"> FinMly, Ld,t(M) is calculated by</Paragraph>
      <Paragraph position="13"> where f(n,,v) denotes the observed frequency of the noun verb pair (n,v), and P(n,v) the estimated probability of (n, v), which is calculated as follows.</Paragraph>
      <Paragraph position="14"> v,,. c c,,,w, c Cv P(,~,,,~,) - f'((::,,,c'~,) (s)</Paragraph>
      <Paragraph position="16"> where f(C',~, C,,) denotes the obserw.d frequency of the noun verb pairs belonging to cluster (c,~, &lt;;'~ ).</Paragraph>
      <Paragraph position="17"> With tile description length of a model defined in the above manner, we wish to select a model having the minimum description length and output it as the result of clustering. Since the model description length Lmod is the same for each model, in practice we only need to calculate and</Paragraph>
      <Paragraph position="19"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 A Simulated Annealing-based Algorithm
</SectionTitle>
      <Paragraph position="0"> We could ill principle calculate the description length for each model and select, a model with the nfininmm description length, if COlnputation time were of no concern. However, since the number of probal)ilistic models under consideration is super exponential, this is not feasible in practice.</Paragraph>
      <Paragraph position="1"> We employ the 'simulated a.m~ealing technique' to deal with this problem. Figure 1 shows our (divisive) clustering algorithm s .</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Advantages of Our Method
</SectionTitle>
    <Paragraph position="0"> In this section, we elaborate on the merits of our method.</Paragraph>
    <Paragraph position="1"> In. statistical natural language processing, usually the number of parameters in a probabilistic  and it depends on the exact coding scheme used for the description of the models.</Paragraph>
    <Paragraph position="2"> SAs we noted earlier, an Mternative would be to employ an agglomerative Mgorithm.</Paragraph>
    <Paragraph position="3"> model to be estimated is very large, and therefore such a model is difficult to estimate with a reasonable data size that is available in practice. (This problem is usually referred to as the 'data sparseness problem'.) We could smooth the estimated probabilities using an existing smoothing technique (e.g., (Dagan el, al., 1992; Gale and Church, 1990)), then calculate some similarity measure using the smoothed probabilities, and then cluster words according to it. There is no guarantee, however, that the employed smoothing method is in any way consistent with the clustering method used subsequently. Our method based on MDL resolves this issue in a unified fashion. By employing models that embody the assumption that words belonging to a same class occur in the same context with equal likelihood, our method achieves the smoothing effect as a side effect of the clustering process, where the domains of smoothing coincide with the classes obtained by clustering.</Paragraph>
    <Paragraph position="4"> Thus, the coarseness or fineness of clustering also determines the degree of smoothing. All of these effects fall out naturally as a corollary of the imperatiw? of 'best possible estimation', the original motivation behind the MDL principle.</Paragraph>
    <Paragraph position="5"> in our simulated annealing algorithm, we could alternatively employ the Maxinmm Likelihood Estimator (MLE) as criterion for the best probabilistic model, instead of MDL. MLE, as its name suggests, selects a model which maximizes the likelihood of the data, that is, /5 = a.rg maxp I-\[~C/s P(x). This is equivalent to mininfizing the 'data description length' as defined in Section 3, i.e. i 5 = arg minp ~,~-~s - log P(x). We can see easily that MDL genet:al\[zes MLE, in that it also takes into account the complexity of the model itself. In the presence of models with varying complexity, MLE tends to overfit the data, and output; a model that is too complex and tailored to fit the specifics of the input data. If we employ MLE as criterion in our simulated annealing algorithm, it. will result in selecting a very fine model with many small clusters, most of which will have probabilities estimated as zero. Thus, in contrast to employing MDL, it will not have the effect of smoothing a.t all.</Paragraph>
    <Paragraph position="6"> Purely as a method of estimation as well, the superiority of MI)L over MLE is supported by convincing theoretical findings (c.f. (Barton and Cover, 1991; Yamanishi, 1992)). For instance, the speed of convergence of the models selected by MDL to the true model is known to be near optiinal. (The models selected by MDL converge to the true model approximately at the rate of 1/s where s is the nmnber of parameters in the true model, whereas for MLE the rate is l/t, where t is the size of the domain, or in our context, the total number of elements of N&amp;quot; x V.) 'Consistency' is another desirable property of MDL, which is not  shared by MLE. That is, the number of parame-Algorithm: Clustering 1. Divide the noun set N into two subs0ts. I)efine a probabilistic model consisting of the l)artition of nouns si)ecified by the two sul)sets and th(&amp;quot; entire set. of verbs. 2. do{ 2.1 Randomly select, one noun, rcmow&gt; it from t.h~; subset it. belongs to and add it. to the other. 2.2 C.alcuh~tc the description length for the two models (before and after the mow~') as L1 and Le, respectively.</Paragraph>
    <Paragraph position="7"> 2.3 Viewing the description length as the energy flmction for annealing, let AL = Le - L:. If AL &lt; 0, fix the mow~, otherwise ascertain the mowe with probability P = eXl)(-AL/T). } while (the description length has decreased during the past 10. INI trials.) Itere T is the a.nnealing t.enq.)crat.urc whose initial value, is 1 and updated to be 0.97' after 10. \]NI trials.</Paragraph>
    <Paragraph position="8"> 3. If one of the obtained subset is elul)t,y, t\]ll?ll return the IlOll-Olllpty subset, otherwise recursiw,ly apply Clustering on both of the two subsets.</Paragraph>
    <Paragraph position="9">  ters in l;he models selected by MDI~ ('otivorg~&amp;quot; to that of the true model (Rissanen, 1989). Both of these prol&gt;erties of MI)I, ar~ Oml&gt;irically w'ri/ied in our present (;Ollt(?x\[,, as will be show,: in t.ho t:(,xl section. In particular, we haw~ compared l,h(' p(u'forn:a.nc0 of employing an M1)L-based simula.ted annealing against that of one 1)ascd on M\[,I&amp;quot;, ill hierarchical woM clust.c'ring.</Paragraph>
  </Section>
class="xml-element"></Paper>