<?xml version="1.0" standalone="yes"?>
<Paper uid="N01-1010">
  <Title>Tree-cut and A Lexicon based on Systematic Polysemy</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 The Tree-cut Technique
</SectionTitle>
    <Paragraph position="0"> The tree-cut technique is an unsupervised learning technique which partitions data items organized in a tree structure into mutually-disjoint clusters. It was originally proposed in (Li and Abe, 1998), and then adopted in our previous method for automatically extracting systematic polysemy (Tomuro, 2000). In this section, we give a brief summary of this tree-cut technique using examples from (Li and Abe, 1998)'s original work.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="3" type="metho">
    <SectionTitle>
2.1 Tree-cut Models
</SectionTitle>
    <Paragraph position="0"> The tree-cut technique is applied to data items that are organized in a structure called a thesaurus tree.</Paragraph>
    <Paragraph position="1"> A thesaurus tree is a hierarchically organized lexicon where leaf nodes encode lexical data (i.e., words) and internal nodes represent abstract semantic classes.</Paragraph>
    <Paragraph position="2"> A tree-cut is a partition of a thesaurus tree. It is a list of internal/leaf nodes in the tree, and each node represents the set of all leaf nodes in the subtree rooted by the node. Such a set is also considered a cluster.</Paragraph>
    <Paragraph position="3">  Clusters in a tree-cut exhaustively cover all leaf nodes of the tree, and they are mutually disjoint. For instance, Figure 1 shows an example thesaurus tree and one possible tree-cut [AIRCRAFT, ball, kite, puzzle], which is indicated by a thick curve in the figure. There are also four other possible tree-cuts for this tree: [airplane, helicopter, ball, kite, puzzle], [airplane, helicopter, TOY], [AIRCRAFT, TOY] and [ARTIFACT].</Paragraph>
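The five tree-cuts above can be enumerated mechanically: a cut of a subtree is either its root taken as one cluster, or a combination of cuts of its children. The following Python sketch illustrates this; the tree encoding and function name are ours, not the paper's:

```python
def tree_cuts(node):
    """Enumerate all tree-cuts of a thesaurus tree.

    A node is either a leaf (a string) or a (label, children) pair.
    Each tree-cut is a list of node labels whose subtrees exhaustively
    and disjointly cover the leaves.
    """
    if isinstance(node, str):
        return [[node]]            # a leaf can only be cut at itself
    label, children = node
    cuts = [[label]]               # option 1: this node as one cluster
    combos = [[]]                  # option 2: combine cuts of the children
    for child in children:
        combos = [acc + c for acc in combos for c in tree_cuts(child)]
    return cuts + combos

# The example thesaurus tree from Figure 1.
tree = ("ARTIFACT",
        [("AIRCRAFT", ["airplane", "helicopter"]),
         ("TOY", ["ball", "kite", "puzzle"])])

for cut in tree_cuts(tree):
    print(cut)
```

Running this prints exactly the five tree-cuts listed in the text, from [ARTIFACT] down to the all-leaves cut.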
    <Paragraph position="4"> In (Li and Abe, 1998), the tree-cut technique was applied to the problem of acquiring generalized case frame patterns from a corpus. (A leaf node is also a cluster, whose cardinality is 1.) Thus, each node/word in the tree received as its value the number of instances where the word occurred as a case role (subject, object, etc.) of a given verb. Then the acquisition of a generalized case frame was viewed as a problem of selecting the best tree-cut model that estimates the true probability distribution, given sample corpus data.</Paragraph>
    <Paragraph position="5"> Formally, a tree-cut model M is a pair M = (Γ, Θ) consisting of a tree-cut Γ = [C1, ..., Ck] and a probability parameter vector Θ = [P(C1), ..., P(Ck)] of the same length,</Paragraph>
    <Paragraph position="7"> where P(Ci) is the probability of a cluster Ci (i.e., the sum of the probabilities of the members of Ci). For example, suppose a corpus contains 10 instances of the verb-object relation for the verb &amp;quot;fly&amp;quot;, and the frequencies of the object nouns n, denoted f(n), are as follows:</Paragraph>
    <Paragraph position="9"> f(airplane) = 5, f(helicopter) = 3, f(ball) = 0, f(kite) = 2, f(puzzle) = 0.</Paragraph>
    <Paragraph position="11"> Then, the set of tree-cut models for the example thesaurus tree shown in Figure 1 includes ([airplane, helicopter, TOY], [.5, .3, .2]) and ([AIRCRAFT, TOY], [.8, .2]).</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.2 The MDL Principle
</SectionTitle>
      <Paragraph position="0"> To select the best tree-cut model, (Li and Abe, 1998) uses the Minimum Description Length (MDL) principle. MDL is a principle of data compression in Information Theory which states that, for a given dataset, the best model is the one which requires the minimum code length (often measured in bits) to encode the model (the model description length) and the data (the data description length) (Rissanen, 1978).</Paragraph>
      <Paragraph position="1"> Thus, the MDL principle captures the trade-off between the simplicity of a model, which is measured by the number of clusters in a tree-cut, and the goodness of fit to the data, which is measured by the estimation accuracy of the probability distribution.</Paragraph>
      <Paragraph position="2"> The calculation of the description length for a tree-cut model is as follows. Given a thesaurus tree T and a sample S consisting of the case frame instances, the total description length L(M, S) for a tree-cut model M = (Γ, Θ) is</Paragraph>
      <Paragraph position="3"> L(M, S) = L(Γ) + L(Θ|Γ) + L(S|Γ, Θ)</Paragraph>
      <Paragraph position="4"> where L(Γ) is the model description length, L(Θ|Γ) is the parameter description length (explained shortly), and L(S|Γ, Θ) is the data description length. Note that L(Γ) + L(Θ|Γ) essentially corresponds to the usual notion of the model description length.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="5" type="metho">
    <SectionTitle>
(Figure 1: thesaurus tree with root ARTIFACT and internal nodes AIRCRAFT and TOY)
</SectionTitle>
    <Paragraph position="0"> (Figure 1 leaves: airplane, helicopter, ball, kite, puzzle.) Each length in L(M, S) is calculated as follows. The model description length L(Γ) is</Paragraph>
    <Paragraph position="2"> L(Γ) = log |G|, where G is the set of all tree-cuts in T, and |G| denotes the size of G. This value is a constant for all models, and is thus omitted in the calculation of the total length.</Paragraph>
    <Paragraph position="3"> The parameter description length L(Θ|Γ) indicates the complexity of the model. It is the length required to encode the probability distribution of the clusters in the tree-cut Γ. It is calculated as</Paragraph>
    <Paragraph position="5"> L(Θ|Γ) = (k / 2) × log |S|, where k is the length of Θ, and |S| is the size of S.</Paragraph>
    <Paragraph position="6"> Finally, the data description length L(S|Γ, Θ) is the length required to encode the whole sample data.</Paragraph>
    <Paragraph position="7"> It is calculated as L(S|Γ, Θ) = − Σ_{n∈S} log P(n)</Paragraph>
    <Paragraph position="9"> where, for each n ∈ C and each C ∈ Γ, P(n) = P(C) / |C|, with P(C) = f(C) / |S|   (7)</Paragraph>
    <Paragraph position="11"> Note that equation (7) essentially computes the Maximum Likelihood Estimate (MLE) for all n.</Paragraph>
    <Paragraph position="12">  A table in Figure 1 shows the MDL lengths for all five tree-cut models. The best model is the one with the tree-cut [AIRCRAFT, ball, kite, puzzle].</Paragraph>
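The description-length computation can be sketched directly from the formulas above. We assume base-2 logarithms and take k to be the number of clusters, as the text states; Li and Abe's exact accounting (e.g., counting free parameters rather than clusters) may differ, so the numbers illustrate the computation rather than reproduce the paper's table:

```python
import math

freq = {"airplane": 5, "helicopter": 3, "ball": 0, "kite": 2, "puzzle": 0}
members = {"AIRCRAFT": ["airplane", "helicopter"],
           "TOY": ["ball", "kite", "puzzle"],
           "ARTIFACT": ["airplane", "helicopter", "ball", "kite", "puzzle"]}

def description_length(cut):
    """L(Theta|Gamma) + L(S|Gamma,Theta) for one tree-cut, in bits.
    L(Gamma) is the same constant for every model and is omitted,
    as in the text."""
    S = sum(freq.values())
    param_len = (len(cut) / 2) * math.log2(S)       # (k/2) * log|S|
    data_len = 0.0
    for c in cut:
        ws = members.get(c, [c])
        p_cluster = sum(freq[w] for w in ws) / S    # MLE of P(C)
        for w in ws:
            if freq[w] > 0:                         # P(n) = P(C)/|C|
                data_len -= freq[w] * math.log2(p_cluster / len(ws))
    return param_len + data_len

for cut in (["airplane", "helicopter", "ball", "kite", "puzzle"],
            ["AIRCRAFT", "ball", "kite", "puzzle"],
            ["AIRCRAFT", "TOY"], ["ARTIFACT"]):
    print(cut, round(description_length(cut), 2))
```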
  </Section>
  <Section position="7" start_page="5" end_page="8" type="metho">
    <SectionTitle>
3 Clustering Systematic Polysemy
</SectionTitle>
    <Paragraph position="0"> Using the tree-cut technique described above, our previous work (Tomuro, 2000) extracted systematic polysemy from WordNet. In this section, we give a summary of this method, and describe the cluster pairs obtained by the method.</Paragraph>
    <Paragraph position="1">  For justification and a detailed explanation of these formulas, see (Li and Abe, 1998).</Paragraph>
    <Paragraph position="2">  In our previous work, we used entropy instead of MLE. That is because the lexicon represents the true population, not samples; thus there is no additional data to estimate.</Paragraph>
    <Section position="1" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
3.1 Extraction Method
</SectionTitle>
      <Paragraph position="0"> In our previous work, systematically related word senses are derived as binary cluster pairs, by applying the extraction procedure to a combination of two WordNet (sub)trees. This process is done in the following three steps. In the first step, all leaf nodes of the two trees are assigned a value of either 1, if a node/word appears in both trees, or 0 otherwise.</Paragraph>
      <Paragraph position="1">  In the second step, the tree-cut technique is applied to each tree separately, and two tree-cuts (or sets of clusters) are obtained. To search for the best tree-cut for a tree (i.e., the model which requires the minimum total description length), a greedy algorithm called Find-MDL described in (Li and Abe, 1998) is used to speed up the search. Finally, in the third step, clusters in those two tree-cuts are matched up, and the pairs which have substantial overlap (more than three overlapping words) are selected as systematic polysemies.</Paragraph>
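The third, cluster-matching step is straightforward to sketch; the names below are ours, and the threshold of three overlapping words follows the text:

```python
def match_clusters(cuts1, cuts2, threshold=3):
    """Pair clusters from two tree-cuts whose word overlap exceeds
    the threshold; such pairs are candidate systematic polysemies."""
    pairs = []
    for c1 in cuts1:
        for c2 in cuts2:
            overlap = set(c1) & set(c2)
            if len(overlap) > threshold:
                pairs.append((tuple(c1), tuple(c2), overlap))
    return pairs

# Toy example: two clusters sharing four polysemous words.
artifact_cut = [["board", "picket", "stock", "table", "chair"]]
group_cut = [["board", "picket", "stock", "table", "committee"]]
print(match_clusters(artifact_cut, group_cut))
```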
      <Paragraph position="2"> Figure 2 shows parts of the final tree-cuts for the ARTIFACT and MEASURE classes. Note that in the figure, bold letters indicate words which are polysemous in the two trees (i.e., assigned a value of 1).</Paragraph>
    </Section>
    <Section position="2" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
3.2 Modification
</SectionTitle>
      <Paragraph position="0"> In the current work, we made a minor modification to the extraction method described above, by removing nodes that are assigned a value 0 from the trees. (Prior to this, each WordNet (sub)tree is transformed into a thesaurus tree, since a WordNet tree is a graph rather than a tree, and internal nodes as well as leaf nodes carry data. In the transformation, all internal nodes in a WordNet tree are copied as leaf nodes, and shared subtrees are duplicated.) The purpose was to make the tree-cut technique less sensitive to the structure of a tree and produce more specific clusters defined at deeper levels. (Removing nodes with 0 is also warranted since we are not estimating values for those nodes, as explained in footnote 5.) The MDL principle inherently penalizes a complex tree-cut by assigning a long parameter length. Therefore, shorter tree-cuts partitioned at abstract levels are often preferred. This causes a problem when the tree is bushy, which is the case with WordNet trees. Indeed, many tree-cut clusters obtained in our previous work were from nodes at depth 1 (counting the root as depth 0): around 88% (122 out of a total of 138 clusters) obtained for 5 combinations of WordNet noun trees. Note that we did not allow a cluster at the root of a tree; thus, depth 1 is the highest level for any cluster. After the modification above, the proportion of depth-1 clusters decreased to 49% (169 out of a total of 343 clusters) for the same tree combinations.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="8" type="sub_section">
      <SectionTitle>
3.3 Extracted Cluster Pairs
</SectionTitle>
      <Paragraph position="0"> We applied the modified method described above to all nouns and verbs in WordNet. We first partitioned the words in the two categories into basic classes. A basic class is an abstract semantic concept, and it corresponds to a (sub)tree in the WordNet hierarchies. We chose 24 basic classes for nouns and 10 basic classes for verbs, from the WordNet Top categories for nouns and the lexicographers' file names for verbs, respectively. Those basic classes exhaustively cover all words in the two categories encoded in WordNet. For example, basic classes for nouns include ARTIFACT, SUBSTANCE and LOCATION, while basic classes for verbs include CHANGE, MOTION and STATE.</Paragraph>
      <Paragraph position="1"> For each part-of-speech category, we applied our extraction method to all combinations of two basic classes. Here, a combined class, for instance ARTIFACT-SUBSTANCE, represents an underspecified semantic class. We obtained 2,377 cluster pairs in 99 underspecified classes for nouns, and 1,710 cluster pairs in 59 underspecified classes for verbs. Table 1 shows a summary of the number of basic and underspecified classes and cluster pairs extracted by our method.</Paragraph>
      <Paragraph position="2"> Although the results vary among category combinations, the accuracy (precision) of the derived cluster pairs was rather low: 50 to 60% on average, based on our manual inspection using around 5% randomly chosen samples.</Paragraph>
      <Paragraph position="3">  This means our automatic method over-generates possible relations. We speculate that this is because, in general, there are many homonymous relations that are 'systematic' in the English language. For example, in the ARTIFACT-GROUP class, a pair [LUMBER, SOCIAL GROUP] was extracted.</Paragraph>
      <Paragraph position="4"> Words which are common to the two clusters are &amp;quot;picket&amp;quot;, &amp;quot;board&amp;quot; and &amp;quot;stock&amp;quot;. Since there are enough such words (for our purpose), our automatic method could not differentiate them from true systematic polysemy.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="8" end_page="9" type="metho">
    <SectionTitle>
4 Evaluation: Comparison with
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="8" end_page="9" type="sub_section">
      <SectionTitle>
WordNet Cousins
</SectionTitle>
      <Paragraph position="0"> To test our automatic extraction method, we compared the cluster pairs derived by our method to WordNet cousins. The cousin relation is relatively new in WordNet, and its coverage is still incomplete. Currently a total of 194 unique relations are encoded. A cousin relation in WordNet is defined between two synsets, and it indicates that the senses of a word that appear in both of the (sub)trees rooted by those synsets are related.</Paragraph>
      <Paragraph position="1"> The cousins were manually identified by the WordNet lexicographers.</Paragraph>
      <Paragraph position="2"> (Note that the relatedness between clusters was determined solely by our subjective judgement. That is because there is no existing large-scale lexicon which encodes related senses completely for all words in the lexicon; the WordNet cousin relation is encoded only for some words. Although the distinction between related vs. unrelated meanings is sometimes unclear, the systematicity of the related senses among words is quite intuitive and has been well studied in Lexical Semantics (for example, (Apresjan, 1973; Nunberg, 1995; Copestake and Briscoe, 1995)). A comparison with WordNet cousins is discussed in section 4.)</Paragraph>
      <Paragraph position="3"> (Actually, cousin is one of three relations which indicate the grouping of related senses of a word; the others are sister and twin. In this paper, we use cousin to refer to all relations listed in the &amp;quot;cousin.tps&amp;quot; file, available in a WordNet distribution.)</Paragraph>
      <Paragraph position="4"> To compare the automatically derived cluster pairs to WordNet cousins, we used the hypernym-hyponym relation in the trees, instead of the number or ratio of the overlapping words. This is because the levels at which the cousin relations are defined differ quite widely, from depth 0 to depth 6; thus the number of polysemous words covered in each cousin relation varies significantly. Therefore, it was difficult to decide on an appropriate threshold value for either criterion.</Paragraph>
      <Paragraph position="5"> Using the hypernym-hyponym relation, we checked, for each cousin relation, whether there was at least one cluster pair that subsumed or was subsumed by the cousin. More specifically, for a cousin relation defined between nodes c1 and c2 in trees T1 and T2 respectively, and a cluster pair defined between nodes r1 and r2 in the same trees, we decided on the correspondence if c1 is a hypernym or hyponym of r1, and c2 is a hypernym or hyponym of r2, at the same time.</Paragraph>
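A sketch of this correspondence test, with each tree encoded as a child-to-parent map (the encoding and names are ours):

```python
def ancestors(node, parent):
    """All hypernyms of a node, following child -> parent links."""
    out = set()
    while node in parent:
        node = parent[node]
        out.add(node)
    return out

def related(a, b, parent):
    """True if a is a hypernym or hyponym of b (or the same node)."""
    return a == b or a in ancestors(b, parent) or b in ancestors(a, parent)

def corresponds(cousin, cluster_pair, parent1, parent2):
    """A cousin (c1, c2) corresponds to a cluster pair (r1, r2) if
    both sides are in a hypernym-hyponym relation."""
    (c1, c2), (r1, r2) = cousin, cluster_pair
    return related(c1, r1, parent1) and related(c2, r2, parent2)

parent1 = {"airplane": "AIRCRAFT", "AIRCRAFT": "ARTIFACT"}
parent2 = {"ball": "TOY", "TOY": "ARTIFACT"}
print(corresponds(("AIRCRAFT", "TOY"), ("airplane", "ball"),
                  parent1, parent2))  # prints True
```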
      <Paragraph position="6"> Based on this criterion, we obtained a result indicating that 173 out of the 194 cousin relations had corresponding cluster pairs. This makes the recall ratio 89%, which we consider to be quite high.</Paragraph>
      <Paragraph position="7"> In addition to the WordNet cousins, our automatic extraction method discovered several interesting relations. Table 2 shows some examples.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="9" end_page="43" type="metho">
    <SectionTitle>
5 A Lexicon based on Systematic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="9" end_page="43" type="sub_section">
      <SectionTitle>
Relations
</SectionTitle>
      <Paragraph position="0"> Using the extracted cluster pairs, we partitioned the word senses for all nouns and verbs in WordNet, and produced a lexicon. Recall from the previous section that our cluster pairs are generated for all possible binary combinations of basic classes; thus one sense could appear in more than one cluster pair. For example, Table 3 shows the cluster pairs (and the set of senses covered by each pair, which we call a sense cover) extracted for the noun &amp;quot;table&amp;quot; (which has 6 senses in WordNet). Also, as we mentioned earlier in section 3.3, our cluster pairs contain many false positive pairs. For those reasons, we took a conservative approach, by disallowing transitivity of cluster pairs.</Paragraph>
      <Paragraph position="1"> To partition the senses of a word, we first assign each sense cover a value which we call connectedness. It is defined as follows. For a given word w which has n senses, let S be the set of all sense covers generated for w, and let c_ij denote the number of sense covers in which senses i and j co-occur. Here, c_ij represents the weight of a direct relation, and d_ij represents the weight of an indirect relation between any two senses i and j. The idea behind this connectedness measure is to favor sense covers that have strong intra-relations. This measure also effectively takes into account one-level indirect relations between senses. As an example, the connectedness of (2 3 4) is the summation of its pairwise c_ij and d_ij values; for instance, c_23 = 4 because senses 2 and 3 co-occur in four sense covers. Table 3 shows the connectedness values for all sense covers for &amp;quot;table&amp;quot;.</Paragraph>
      <Paragraph position="8"> Then, we partition the senses by selecting a set of non-overlapping sense covers which maximizes the total connectedness value. So in the example above, the set {(1 4), (2 3 5)} yields the maximum connectedness. Finally, senses that are not covered by any sense cover are taken as singletons and added to the final sense partition. So the sense partition for &amp;quot;table&amp;quot; becomes {(1 4), (2 3 5), (6)}. Table 4 shows the comparison between WordNet and our new lexicon. As the table shows, our lexicon contains much less ambiguity: the ratio of monosemous words increased from 84% (88,650/105,461) to 92% (96,964/105,461), and the average number of senses for polysemous words decreased from 2.73 to 2.52 for nouns, and from 3.57 to 2.82 for verbs.</Paragraph>
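The selection step can be sketched as an exhaustive search over disjoint subsets of sense covers. The connectedness values below are hypothetical placeholders (the real ones come from the paper's tables), chosen only to reproduce the partition described above:

```python
from itertools import combinations

def partition_senses(senses, covers, conn):
    """Choose pairwise-disjoint sense covers maximizing total
    connectedness (brute force), then add uncovered senses as
    singletons to form the final sense partition."""
    best, best_score = (), -1.0
    for r in range(len(covers) + 1):
        for subset in combinations(covers, r):
            chosen = [s for cover in subset for s in cover]
            if len(chosen) != len(set(chosen)):   # covers overlap
                continue
            score = sum(conn[cover] for cover in subset)
            if score > best_score:
                best, best_score = subset, score
    singles = [(s,) for s in senses if all(s not in c for c in best)]
    return list(best) + singles

# Hypothetical connectedness values for the 6 senses of "table".
covers = [(1, 4), (2, 3, 5), (2, 3, 4)]
conn = {(1, 4): 2.0, (2, 3, 5): 5.0, (2, 3, 4): 4.0}
print(partition_senses(range(1, 7), covers, conn))  # [(1, 4), (2, 3, 5), (6,)]
```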
      <Paragraph position="9"> As a note, our lexicon is similar to CORELEX (Buitelaar, 1998) (or CORELEX-II, presented in (Buitelaar, 2000)), in that both lexicons share the same motivation. However, our lexicon differs from CORELEX in that CORELEX looks at all senses of a word and groups words that have the same sense distribution pattern, whereas our lexicon groups word senses that have the same systematic relation. Thus, our lexicon represents systematic polysemy at a finer level than CORELEX, by pinpointing the related senses within each word.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="43" end_page="231" type="metho">
    <SectionTitle>
6 Evaluation: Inter-annotator
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="43" end_page="231" type="sub_section">
      <SectionTitle>
Disagreement
</SectionTitle>
      <Paragraph position="0"> To test if the sense partitions in our lexicon constitute an appropriate (or useful) level of granularity, we applied it to the inter-annotator disagreement observed in two semantically annotated corpora: WordNet Semcor (Landes et al., 1998) and DSO (Ng and Lee, 1996). The agreement between those corpora was previously studied in (Ng et al., 1999). In our current work, we first reproduced their agreement data, then used our sense partitions to see whether or not they yield a better agreement.</Paragraph>
      <Paragraph position="1"> In this experiment, we extracted 28,772 sentences/instances for 191 words (consisting of 121 nouns and 70 verbs) tagged in the intersection of the two corpora. This constitutes the base data set.</Paragraph>
      <Paragraph position="2"> Table 5 shows the breakdown of the number of instances where tags agreed and disagreed.</Paragraph>
      <Paragraph position="3">  (Note that the numbers reported in (Ng et al., 1999) are slightly more than the ones reported in this paper. For instance, the number of sentences in the intersected corpus reported in (Ng et al., 1999) is 30,315. We speculate the discrepancies are due to the different sentence alignment methods used.) This low agreement ratio is also reflected in a measure called the κ statistic (Carletta, 1996; Bruce and Wiebe, 1998; Ng et al., 1999). The κ measure takes into account chance agreement, thus better representing the state of disagreement. A κ value is calculated for each word, on a confusion matrix where rows represent the senses assigned by judge 1 (DSO) and columns represent the senses assigned by judge 2 (Semcor). Table 6 shows an example matrix for the noun &amp;quot;table&amp;quot;.</Paragraph>
      <Paragraph position="4"> The κ value for a word is calculated as follows. We use the notation and formula used in (Bruce and Wiebe, 1998). Let n_ij denote the number of instances where judge 1 assigned sense i and judge 2 assigned sense j to the same instance, and let n_i+ and n_+j denote the marginal totals of the rows and columns respectively. The formula is:</Paragraph>
      <Paragraph position="6"> κ = (Σ_i P_ii − Σ_i P_i+ P_+i) / (1 − Σ_i P_i+ P_+i)   (9)</Paragraph>
      <Paragraph position="8"> where P_ii = n_ii / n_++ (i.e., the proportion of n_ii, the number of instances where both judges agreed on sense i, to the total instances), P_i+ = n_i+ / n_++, and P_+i = n_+i / n_++.</Paragraph>
      <Paragraph position="11"> The κ value is 1.0 when the agreement is perfect (i.e., the values in the off-diagonal cells are all 0, that is, P_ij = 0 for all i ≠ j), and 0.0 when the agreement is exactly what is expected by chance (i.e., the values in each row (or column) are uniformly distributed across the rows (or columns), that is, P_ij = P_i+ / M for all 1 ≤ i, j ≤ M, where M is the number of rows/columns). κ also takes a negative value when there is a systematic disagreement between the two judges (e.g., some values in the diagonal cells are 0, that is, P_ii = 0 for some i). Normally, κ ≥ .8 is considered a good agreement (Carletta, 1996).</Paragraph>
      <Paragraph position="12"> ((Ng et al., 1999) reports a higher agreement of 57%. We speculate the discrepancy might be from the version of WordNet senses used in DSO, which was slightly different from the standard delivery version (as noted in (Ng et al., 1999)).)</Paragraph>
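The κ computation can be sketched directly from the formula (the function name and matrix encoding are ours):

```python
def kappa(m):
    """Kappa for a square confusion matrix m (a list of rows),
    following the Bruce & Wiebe formulation:
    (P_agree - P_chance) / (1 - P_chance)."""
    n = float(sum(sum(row) for row in m))
    M = len(m)
    p_agree = sum(m[i][i] for i in range(M)) / n
    p_chance = sum((sum(m[i]) / n) * (sum(row[i] for row in m) / n)
                   for i in range(M))
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa([[5, 0], [0, 5]]))  # perfect agreement: 1.0
print(kappa([[2, 2], [2, 2]]))  # chance-level agreement: 0.0
print(kappa([[0, 5], [5, 0]]))  # systematic disagreement: -1.0
```

The three calls exercise the three cases described in the text: perfect agreement, chance agreement, and systematic disagreement.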
      <Paragraph position="16"> By using the formula above, the average κ for the 191 words was .264, as shown in Table 5. This means the agreement between Semcor and DSO is quite low.</Paragraph>
      <Paragraph position="17"> We selected the same 191 words from our lexicon, and used their sense partitions to reduce the size of the confusion matrices. For each word, we computed the κ for the reduced matrix, and compared it with the κ for a random sense grouping of the same partition pattern (i.e., the same number and sizes of sense groups).</Paragraph>
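Reducing a confusion matrix by a sense partition just sums the grouped rows and columns; a minimal sketch (encoding ours):

```python
def reduce_matrix(m, partition):
    """Collapse confusion matrix m according to a sense partition:
    rows/columns of senses in the same group are summed."""
    group_of = {s: g for g, group in enumerate(partition) for s in group}
    k = len(partition)
    out = [[0] * k for _ in range(k)]
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            out[group_of[i]][group_of[j]] += v
    return out

# Group senses 0 and 1 together, keep sense 2 on its own.
m = [[5, 3, 0],
     [2, 4, 1],
     [0, 0, 6]]
print(reduce_matrix(m, [(0, 1), (2,)]))  # [[14, 1], [0, 6]]
```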
      <Paragraph position="18">  The κ value for a random grouping is obtained by generating 5,000 random partitions which have the same pattern as the corresponding sense partition in our lexicon, then taking the mean of their κ's. (For this comparison, we excluded 23 words whose sense partitions consisted of only 1 sense cover. This is reflected in the total number of instances in Table 8.) Then we measured the possible increase in κ by our lexicon by taking the difference between the paired κ values for all words (i.e., the κ of word w under our sense partition minus the κ of w under the random partition), and performed a significance test, with the null hypothesis that there was no significant increase. The result showed that the P-values were 4.17 and 2.65 for nouns and verbs respectively, which were both statistically significant. Therefore, the null hypothesis was rejected, and we concluded that there was a significant increase in κ by using our lexicon. ((Ng et al., 1999)'s result is slightly higher: κ = .317.)</Paragraph>
      <Paragraph position="19"> As a note, the average κ's for the 191 words from our lexicon and their corresponding random partitions were .260 and .233 respectively. Those values are in fact lower than that for the original WordNet lexicon. There are two major reasons for this. First, in general, combining arbitrary senses does not always increase κ. In the formula (9) above, κ actually decreases when the increase in Σ_i P_ii (i.e., the diagonal sum) in the reduced matrix is less than the increase in Σ_i P_i+ P_+i (i.e., the marginal product sum) by some factor. (This is because Σ_i P_i+ P_+i is subtracted in both the numerator and the denominator of the formula.)</Paragraph>
      <Paragraph position="24"> This situation typically happens when the senses combined are well distinguished in the original matrix, in the sense that, for senses i and j, n_ij and n_ji are 0 or very small (relative to the total frequency). Second, some systematic relations are in fact easily distinguishable. Senses in such relations often denote different objects in a context, for instance the ANIMAL and MEAT senses of &amp;quot;chicken&amp;quot;. Since our lexicon groups those senses together, the κ's for the reduced matrices decrease for the reason we mentioned above. Table 8 shows the breakdown of the average κ for our lexicon and random groupings.</Paragraph>
    </Section>
  </Section>
</Paper>