File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/p98-2124_metho.xml

Size: 13,898 bytes

Last Modified: 2025-10-06 14:14:56

<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2124">
  <Title>Word Clustering and Disambiguation Based on Co-occurrence Data</Title>
  <Section position="4" start_page="749" end_page="749" type="metho">
    <SectionTitle>
3 Parameter Estimation
</SectionTitle>
    <Paragraph position="0"> A particular choice of partitions for a hard clustering model is referred to as a 'discrete' hard-clustering model, with the probability parameters left to be estimated. The values of these parameters can be estimated based on the co-occurrence data by the Maximum Likelihood Estimation. For a given set of</Paragraph>
    <Paragraph position="2"> the maximum likelihood estimates of the parameters are defined as the values that maximize the following likelihood function with respect to the data:</Paragraph>
    <Paragraph position="4"> It is easy to see that this is possible by setting the parameters as #(Cn, Co) = f(Cn, C~)., rn w e u v, P( lC ) = f(x) f(C~).</Paragraph>
    <Paragraph position="5"> Here, m denotes the entire data size, f(Cn, Co) the frequency of word pairs in class pair (Cn, Co), f(x) the frequency of word x, and f(C~) the frequency of words in class C~.</Paragraph>
  </Section>
  <Section position="5" start_page="749" end_page="750" type="metho">
    <SectionTitle>
4 Model Selection Criterion
</SectionTitle>
    <Paragraph position="0"> The question now is what criterion should we employ to select the best model from among the possible models. Here we adopt the Minimum Description Length (MDL) principle. MDL (Rissanen, 1989) is a criterion for data compression and statistical estimation proposed in information theory.</Paragraph>
    <Paragraph position="1"> In applying MDL, we calculate the code length for encoding each model, referred to as the 'model description length' L(M), the code length for encoding  the given data through the model, referred to as the 'data description length' L(SIM ) and their sum:</Paragraph>
    <Paragraph position="3"> The MDL principle stipulates that, for both data compression and statistical estimation, the best probability model with respect to given data is that which requires the least total description length.</Paragraph>
    <Paragraph position="4"> The data description length is calculated as</Paragraph>
    <Paragraph position="6"> where/5 stands for the maximum likelihood estimate of P (as defined in Section 3).</Paragraph>
    <Paragraph position="7"> We then calculate the model description length as</Paragraph>
    <Paragraph position="9"> where k denotes the number of free parameters in the model, and m the entire data size3 In this paper, we ignore the code length for encoding a 'discrete model,' assuming implicitly that they are equal for all models and consider only the description length for encoding the parameters of a model as the model description length.</Paragraph>
    <Paragraph position="10"> If computation time were of no concern, we could in principle calculate the total description length for each model and select the optimal model in terms of MDL. Since the number of hard clustering models is of order O(N g * vV), where N and V denote the size of the noun set and the verb set, respectively, it would be infeasible to do so. We therefore need to devise an efficient algorithm that heuristically performs this task.</Paragraph>
  </Section>
  <Section position="6" start_page="750" end_page="751" type="metho">
    <SectionTitle>
5 Clustering Algorithm
</SectionTitle>
    <Paragraph position="0"> The proposed algorithm, which we call '2D-Clustering,' iteratively selects a suboptimal MDLmodel from among those hard clustering models which can be obtained from the current model by merging a noun (or verb) class pair. As it turns out, the minimum description length criterion can be reformalized in terms of (average) mutual information, and a greedy heuristic algorithm can be formulated to calculate, in each iteration, the reduction of mutual information which would result from merging any noun (or verb) class pair, and perform the merge 1 We note that there are alternative ways of calculating the parameter description length. For example, we can separately encode the different types of probability parameters; the joint probabilities P(Cn, Cv), and the conditional probabilities P(nlCn ) and P(vlCv ). Since these alternatives are approximations of one another asymptotically, here we use only the simplest formulation. In the full paper, we plan to compare the empirical behavior of the alternatives.</Paragraph>
    <Paragraph position="1"> having the least mutual information reduction, provided that the reduction is below a variable threshold.</Paragraph>
    <Paragraph position="2">  2D-Clustering(S, b,, b~) (S is the input co-occurrence data, and bn and by are positive integers.) 1. Initialize the set of noun classes Tn and the set of verb classes Tv as:</Paragraph>
    <Paragraph position="4"> where Af and V denote the noun set and the verb set, respectively.</Paragraph>
    <Paragraph position="5">  2. Repeat the following three steps: (a) execute Merge(S, Tn, Tv, bn) to update Tn, (b) execute Merge(S, Tv, Tn, b~) to update T,, (c) if T, and T~ are unchanged, go to Step 3. 3. Construct and output a thesaurus for nouns  based on the history of Tn, and one for verbs based on the history of Tv.</Paragraph>
    <Paragraph position="6"> Next, we describe the procedure of 'Merge,' as it is being applied to the set of noun classes with the set of verb classes fixed.</Paragraph>
    <Paragraph position="7">  Merge(S, Tn, Tv, bn) 1. For each class pair in Tn, calculate the reduc null tion of mutual information which would result from merging them. (The details will follow.) Discard those class pairs whose mutual information reduction (2) is not less than the threshold</Paragraph>
    <Paragraph position="9"> where m denotes the total data size, ks the number of free parameters in the model before the merge, and \]C/ A the number of free parameters in the model after the merge. Sort the remaining class pairs in ascending order with respect to mutual information reduction.</Paragraph>
    <Paragraph position="10">  2. Merge the first bn class pairs in the sorted list. 3. Output current Tn.</Paragraph>
    <Paragraph position="11">  We perform (maximum of) bn merges at step 2 for improving efficiency, which will result in outputting an at-most bn-ary tree. Note that, strictly speaking, once we perform one merge, the model will change and there will no longer be a guarantee that the remaining merges still remain justifiable from the viewpoint of MDL.</Paragraph>
    <Paragraph position="12"> Next, we explain why the criterion in terms of description length can be reformalized in terms of mutual information. We denote the model before a merge as Ms and the model after the merge as  MA. According to MDL, MA should have the least increase in data description length</Paragraph>
    <Paragraph position="14"> and at the same time satisfies (k B -- k A ) log m 6Ldat &lt;  This is due to the fact that the decrease in model description length equals</Paragraph>
    <Paragraph position="16"> and is identical for each merge.</Paragraph>
    <Paragraph position="17"> In addition, suppose that )VIA is obtained by merging two noun classes Ci and Cj in MB to a single noun class Cq. We in fact need only calculate the difference between description lengths with respect to these classes, i.e.,</Paragraph>
    <Paragraph position="19"> Thus, the quantity 6Laat is equivalent to the mutual information reduction times the data size. ~ We conelude therefore that in our present context, a clustering with the least data description length increase is equivalent to that with the least mutual information decrease.</Paragraph>
    <Paragraph position="20"> Canceling out P(Cv) and replacing the probabilities with their maximum likelihood estimates, we</Paragraph>
    <Paragraph position="22"> Therefore, we need calculate only this quantity for each possible merge at Step 1 of Merge.</Paragraph>
    <Paragraph position="23"> In our implementation of the algorithm, we first load the co-occurrence data into a matrix, with nouns corresponding to rows, verbs to columns.</Paragraph>
    <Paragraph position="24"> When merging a noun class in row i and that in row j (i &lt; j), for each Co we add f(Ci,Co) and f(Cj,Co) obtaining f(Cij, Co), write f(Cij,Co) on row i, move f(Czast,Co) to row j, and reduce the matrix by one row.</Paragraph>
    <Paragraph position="25"> By the above implementation, the worst case time complexity of the algorithm is O(N 3 * V + V 3 * N) where N denotes the size of the noun set, V that of the verb set. If we can merge bn and bo classes at each step, the algorithm will become slightly more V 3 . efficient with the time complexity of O( bN--\]-\]. V + ~j g).</Paragraph>
  </Section>
  <Section position="7" start_page="751" end_page="752" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="751" end_page="752" type="sub_section">
      <SectionTitle>
6.1 Models
</SectionTitle>
      <Paragraph position="0"> We can restrict the hard clustering model (1) by assuming that words within a same class are generated with an equal probability, obtaining</Paragraph>
      <Paragraph position="2"> which is equivalent to the model proposed by (Li and Abe, 1996). Employing this restricted model has the undesirable tendency to classify into different classes those words that have similar co-occurrence patterns but have different absolute frequencies.</Paragraph>
      <Paragraph position="3"> The hard clustering model defined in (1) can also be considered to be an extension of the model proposed by Brown et al. First, dividing (1) by P(v),</Paragraph>
      <Paragraph position="5"> In this way, the hard clustering model turns out to be a class-based bigram model and is similar to Brown et al's model. The difference is that the model of (3) assumes that the clustering for Ca and the clustering for C, can be different, while the model of Brown et al assumes that they are the same.</Paragraph>
      <Paragraph position="6"> A very general model of noun verb joint probabilities is a model of the following form:</Paragraph>
      <Paragraph position="8"> Here Fn denotes a set of noun classes satisfying Uc~r.Cn = Af, but not necessarily disjoint. Similarly F~ is a set of not necessarily disjoint verb classes. We can view the problem of clustering words in general as estimation of such a model. This type of clustering in which a word can belong to several different classes is generally referred to as 'soft clustering.' If we assume in the above model that each verb forms a verb class by itself, then (4) becomes</Paragraph>
      <Paragraph position="10"> which is equivalent to the model of Pereira et al. On the other hand, if we restrict the general model of (4) so that both noun classes and verb classes are disjoint, then we obtain the hard clustering model we propose here (1). All of these models, therefore, are some special cases of (4). Each specialization comes with its merit and demerit. For example, employing a model of soft clustering will make the clustering process more flexible but also make the learning process more computationally demanding. Our choice of hard clustering obviously has the merits and demerits of the soft clustering model reversed.</Paragraph>
    </Section>
    <Section position="2" start_page="752" end_page="752" type="sub_section">
      <SectionTitle>
6.2 Estimation criteria
</SectionTitle>
      <Paragraph position="0"> Our method is also an extension of that proposed by Brown et al from the viewpoint of estimation criterion. Their method merges word classes so that the reduction in mutual information, or equivalently the increase in data description length, is minimized.</Paragraph>
      <Paragraph position="1"> Their method has the tendency to overfit the training data, since it is based on MLE. Employing MDL can help solve this problem.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="752" end_page="752" type="metho">
    <SectionTitle>
7 Disambiguation Method
</SectionTitle>
    <Paragraph position="0"> We apply the acquired word classes, or more specifically the probability model of co-occurrence, to the problem of structural disambiguation. In particular, we consider the problem of resolving pp-attachment ambiguities in quadruples, like (see, girl, with, telescope) and that of resolving ambiguities in compound noun triples, like (data, base, system). In the former, we determine to which of 'see' or 'girl' the phrase 'with telescope' should be attached. In the latter, we judge to which of 'base' or 'system' the word 'data' should be attached.</Paragraph>
    <Paragraph position="1"> We can perform pp-attachment disambiguation by comparing the probabilities /5~ith (telescopelsee),/Swith (telescop elgirl). (5) If the former is larger, we attach 'with telescope' to 'see;' if the latter is larger we attach it to 'girl;' otherwise we make no decision. (Disambiguation on compound noun triples can be performed similarly.) Since the number of probabilities to be estimated is extremely large, estimating all of these probabilities accurately is generally infeasible (i.e., the data sparseness problem). Using our clustering model to calculate these conditional probabilities (by normalizing the joint probabilities with marginal probabilities) can solve this problem.</Paragraph>
    <Paragraph position="2"> We further enhance our disambiguation method by the following back-off procedure: We first estimate the two probabilities in question using hard clustering models constructed by our method. We also estimate the probabilities using an existing (hand-made) thesaurus with the 'tree cut' estimation method of (Li and Abe, 1995), and use these probability values when the probabilities estimated based on hard clustering models are both zero. Finally, if both of them are still zero, we make a default decision.</Paragraph>
  </Section>
class="xml-element"></Paper>