<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1101"> <Title>Semantic Taxonomy Induction from Heterogenous Evidence</Title> <Section position="4" start_page="801" end_page="803" type="metho"> <SectionTitle> 2 A Probabilistic Framework for Taxonomy Induction </SectionTitle> <Paragraph position="0"> In section 2.1 we introduce our definitions for taxonomies, relations, and the taxonomic constraints that enforce dependencies between relations; in section 2.2 we give a probabilistic model for defining the conditional probability of a set of relational evidence given a taxonomy; in section 2.3 we formulate a local search algorithm to find the taxonomy maximizing this conditional probability; and in section 2.4 we extend our framework to deal with lexical ambiguity.</Paragraph> <Section position="1" start_page="801" end_page="801" type="sub_section"> <SectionTitle> 2.1 Taxonomies, Relations, and Taxonomic Constraints </SectionTitle> <Paragraph position="0"> We define a taxonomy $\mathcal{T}$ as a set of pairwise relations $\mathcal{R}$ over some domain of objects $D_\mathcal{T}$. For example, the relations in WordNet include hypernymy, holonymy, verb entailment, and many others; the objects of WordNet between which these relations hold are its word senses, or synsets. Each relation $R \in \mathcal{R}$ is a set of ordered or unordered pairs of objects $(i,j) \in D_\mathcal{T}$; we write $R_{ij} \in \mathcal{T}$ if the relationship $R$ holds over the objects $(i,j)$ in $\mathcal{T}$.</Paragraph> </Section> <Section position="2" start_page="801" end_page="801" type="sub_section"> <SectionTitle> Relations for Hyponym Acquisition </SectionTitle> <Paragraph position="0"> For the case of hyponym acquisition, the objects in our taxonomy are WordNet synsets. In this paper we focus on two of the many possible relationships between senses: the hypernym relation and the coordinate term relation. We treat the hypernym or ISA relation as atomic; we use the notation $H^n_{ij}$ if a sense $j$ is the $n$-th ancestor of a sense $i$ in the hypernym hierarchy, and simply $H_{ij}$ to indicate that $j$ is an ancestor of $i$ at some unspecified level. Two senses are typically considered to be &quot;coordinate terms&quot; or &quot;taxonomic sisters&quot; if they share an immediate parent in the hypernym hierarchy. We generalize this notion of siblinghood to state that two senses $i$ and $j$ are $(m,n)$-cousins if their closest least common subsumer (LCS) is within exactly $m$ and $n$ links, respectively. (A least common subsumer $LCS(i,j)$ is defined as a synset that is an ancestor in the hypernym hierarchy of both $i$ and $j$ which has no child that is also an ancestor of both $i$ and $j$. When there is more than one LCS, due to multiple inheritance, we refer to the closest LCS, i.e., the LCS that minimizes the maximum distance to $i$ and $j$.) We use the notation $C^{mn}_{ij}$ to denote that $i$ and $j$ are $(m,n)$-cousins. Thus coordinate terms are $(1,1)$-cousins. Technically the hypernym relation may also be seen as a special case of this representation: an immediate parent in the hypernym hierarchy is a $(1,0)$-cousin, and the $k$-th ancestor is a $(k,0)$-cousin.</Paragraph> </Section>
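To make the $(m,n)$-cousin computation concrete, here is a minimal Python sketch over a toy hypernym hierarchy; the parent map, function names, and examples are our own illustration, not the paper's code or WordNet data.

```python
# Sketch: computing the (m,n)-cousin relation in a toy hypernym hierarchy.
# The parent map and all names here are illustrative, not from the paper.

TOY_PARENTS = {          # child synset -> immediate hypernym (single inheritance)
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
}

def ancestors(synset):
    """Return {ancestor: distance} for all hypernym ancestors, plus self at 0."""
    chain, dist = {synset: 0}, 0
    while synset in TOY_PARENTS:
        synset = TOY_PARENTS[synset]
        dist += 1
        chain[synset] = dist
    return chain

def mn_cousins(i, j):
    """Return (m, n) such that i and j are (m,n)-cousins, or None if unrelated.

    The closest LCS is the common ancestor minimizing the maximum distance
    to i and j, following the definition above."""
    anc_i, anc_j = ancestors(i), ancestors(j)
    common = set(anc_i) & set(anc_j)
    if not common:
        return None
    lcs = min(common, key=lambda a: max(anc_i[a], anc_j[a]))
    return anc_i[lcs], anc_j[lcs]

assert mn_cousins("dog", "wolf") == (1, 1)    # coordinate terms
assert mn_cousins("dog", "canine") == (1, 0)  # immediate hypernym as (1,0)-cousin
assert mn_cousins("dog", "cat") == (2, 2)
```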
<Section position="3" start_page="801" end_page="801" type="sub_section"> <SectionTitle> Taxonomic Constraints </SectionTitle> <Paragraph position="0"> A semantic taxonomy such as WordNet enforces certain taxonomic constraints which disallow particular taxonomies $\mathcal{T}$. For example, the ISA transitivity constraint in WordNet requires that each synset inherit the hypernyms of its hypernym, and the part-inheritance constraint requires that each synset inherit the meronyms of its hypernyms.</Paragraph> <Paragraph position="1"> For the case of hyponym acquisition we enforce the following two taxonomic constraints on the hypernym and $(m,n)$-cousin relations: $$\text{(1)}\quad H^m_{ij} \wedge H^n_{jk} \Rightarrow H^{m+n}_{ik} \qquad \text{(2)}\quad C^{mn}_{ij} \Leftrightarrow \exists k.\; k = LCS(i,j) \wedge H^m_{ik} \wedge H^n_{jk}.$$ Constraint (1) requires that each synset inherit the hypernyms of its direct hypernym; constraint (2) simply defines the $(m,n)$-cousin relation in terms of the atomic hypernym relation.</Paragraph> <Paragraph position="2"> The addition of any new hypernym relation to a preexisting taxonomy will usually necessitate the addition of a set of other novel relations, as implied by the taxonomic constraints. We refer to the full set of novel relations implied by a new link $R_{ij}$ as $I(R_{ij})$; we discuss the efficient computation of the set of implied links for the purpose of hyponym acquisition in section 3.3.</Paragraph> </Section>
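As an illustration of the implied set $I(R_{ij})$ under constraint (1), the following is a minimal sketch, under the assumption that hypernym links are stored as (descendant, ancestor, distance) triples and that the existing link set is already transitively closed; the representation and names are ours, not the paper's.

```python
# Sketch: computing the implied set I(H^n_ij) under the ISA transitivity
# constraint (1). Links are (descendant, ancestor, distance) triples; the
# representation and the function name are our own illustrative assumptions.

def implied_links(i, j, n, H):
    """All hypernym triples implied by adding H^n_ij to a transitively
    closed link set H, via H^m_ai ^ H^n_ij => H^{m+n}_aj (constraint 1)."""
    below = [(a, m) for (a, b, m) in H if b == i] + [(i, 0)]  # descendants of i
    above = [(b, k) for (a, b, k) in H if a == j] + [(j, 0)]  # ancestors of j
    implied = {(a, b, m + n + k) for (a, m) in below for (b, k) in above}
    return implied - set(H)

# Example: a closed set with dog ISA canine ISA carnivore; adding puppy ISA dog
# implies puppy's links to all of dog's ancestors.
H = {("dog", "canine", 1), ("canine", "carnivore", 1), ("dog", "carnivore", 2)}
print(sorted(implied_links("puppy", "dog", 1, H)))
# -> [('puppy', 'canine', 2), ('puppy', 'carnivore', 3), ('puppy', 'dog', 1)]
```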
<Section position="4" start_page="801" end_page="802" type="sub_section"> <SectionTitle> 2.2 A Probabilistic Formulation </SectionTitle> <Paragraph position="0"> We propose that the event $R_{ij} \in \mathcal{T}$ has some prior probability $P(R_{ij} \in \mathcal{T})$, with $P(R_{ij} \in \mathcal{T}) + P(R_{ij} \notin \mathcal{T}) = 1$. We define the probability of the taxonomy as a whole as the joint probability of its component relations; given a partition of all possible relations $\mathcal{R} = \{A, B\}$, where the relations in $A$ are in $\mathcal{T}$ and the relations in $B$ are not, we define: $$P(\mathcal{T}) = \prod_{R_{ij} \in A} P(R_{ij} \in \mathcal{T}) \prod_{R_{ij} \in B} P(R_{ij} \notin \mathcal{T}).$$</Paragraph> <Paragraph position="1"> We assume that we have some set of observed evidence $E$ consisting of observed features over pairs of objects in some domain $D_E$; we begin with the assumption that our features are over pairs of words, and that the objects in the taxonomy also correspond directly to words. (In section 2.4 we drop this assumption, extending our model to manage lexical ambiguity.) Given a set of features $E^R_{ij} \in E$, we assume we have some model for inferring $P(R_{ij} \in \mathcal{T} \mid E^R_{ij})$, i.e., the posterior probability of the event $R_{ij} \in \mathcal{T}$ given the corresponding evidence $E^R_{ij}$ for that relation. For example, evidence for the hypernym relation $E^H_{ij}$ might be the set of all observed lexico-syntactic patterns containing $i$ and $j$ in all sentences in some corpus.</Paragraph> <Paragraph position="2"> For simplicity we make the following independence assumptions. First, we assume that each item of observed evidence $E^R_{ij}$ is independent of all other observed evidence given the taxonomy $\mathcal{T}$: $$P(E \mid \mathcal{T}) = \prod_{E^R_{ij} \in E} P(E^R_{ij} \mid \mathcal{T}).$$ Further, we assume that each item of observed evidence $E^R_{ij}$ depends on the taxonomy $\mathcal{T}$ only by way of the corresponding relation $R_{ij}$, i.e.: $$P(E^R_{ij} \mid \mathcal{T}) = \begin{cases} P(E^R_{ij} \mid R_{ij} \in \mathcal{T}) & \text{if } R_{ij} \in \mathcal{T}, \\ P(E^R_{ij} \mid R_{ij} \notin \mathcal{T}) & \text{otherwise.} \end{cases}$$ For example, if our evidence $E^H_{ij}$ is a set of observed lexico-syntactic patterns indicative of hypernymy between two words $i$ and $j$, we assume that whatever dependence the relations in $\mathcal{T}$ have on our observations may be explained entirely by dependence on the existence or non-existence of the single hypernym relation $H_{ij}$.</Paragraph> <Paragraph position="3"> Applying these two independence assumptions, we may express the conditional probability of our evidence given the taxonomy: $$P(E \mid \mathcal{T}) = \prod_{R_{ij} \in A} P(E^R_{ij} \mid R_{ij} \in \mathcal{T}) \prod_{R_{ij} \in B} P(E^R_{ij} \mid R_{ij} \notin \mathcal{T}).$$ Rewriting the conditional probability in terms of our estimates of the posterior probabilities, using Bayes rule, we have: $$P(E \mid \mathcal{T}) = \prod_{R_{ij} \in A} \frac{P(R_{ij} \in \mathcal{T} \mid E^R_{ij})\, P(E^R_{ij})}{P(R_{ij} \in \mathcal{T})} \prod_{R_{ij} \in B} \frac{P(R_{ij} \notin \mathcal{T} \mid E^R_{ij})\, P(E^R_{ij})}{P(R_{ij} \notin \mathcal{T})}.$$</Paragraph> <Paragraph position="4"> Within our model we define the goal of taxonomy induction to be to find the taxonomy $\hat{\mathcal{T}}$ that maximizes the conditional probability of our observations $E$ given the relationships of $\mathcal{T}$, i.e., to find $$\hat{\mathcal{T}} = \arg\max_{\mathcal{T}} P(E \mid \mathcal{T}).$$</Paragraph> </Section> <Section position="5" start_page="802" end_page="803" type="sub_section"> <SectionTitle> 2.3 Local Search Over Taxonomies </SectionTitle> <Paragraph position="0"> We propose a search algorithm for finding $\hat{\mathcal{T}}$ for the case of hyponym acquisition. We assume we begin with some initial (possibly empty) taxonomy $\mathcal{T}$. We restrict our consideration of possible new taxonomies to those created by the single operation ADD-RELATION$(R_{ij}, \mathcal{T})$, which adds the single relation $R_{ij}$ to $\mathcal{T}$.</Paragraph> <Paragraph position="1"> We define the multiplicative change $\Delta_\mathcal{T}(R_{ij})$ to the conditional probability $P(E \mid \mathcal{T})$ given the addition of a single relation $R_{ij}$: $$\Delta_\mathcal{T}(R_{ij}) = \frac{P(E \mid \mathcal{T} \cup \{R_{ij}\})}{P(E \mid \mathcal{T})} = k \cdot \frac{P(R_{ij} \in \mathcal{T} \mid E^R_{ij})}{1 - P(R_{ij} \in \mathcal{T} \mid E^R_{ij})}.$$ Here $k$ is the inverse odds of the prior on the event $R_{ij} \in \mathcal{T}$; we consider this to be a constant, independent of $i$, $j$, and the taxonomy $\mathcal{T}$.</Paragraph> <Paragraph position="2"> To enforce the taxonomic constraints in $\mathcal{T}$, for each application of the ADD-RELATION operator we must add all new relations in the implied set $I(R_{ij})$ not already in $\mathcal{T}$. Thus we define the multiplicative change of the full set of implied relations as the product over all new relations: $$\Delta_\mathcal{T}(I(R_{ij})) = \prod_{R_{kl} \in I(R_{ij})} \Delta_\mathcal{T}(R_{kl}).$$</Paragraph> <Paragraph position="3"> This definition leads to the following best-first search algorithm for hyponym acquisition: at each iteration, the new taxonomy is defined as the union of the previous taxonomy $\mathcal{T}$ and the set of novel relations implied by the relation $R_{ij}$ that maximizes $\Delta_\mathcal{T}(I(R_{ij}))$, and thus maximizes the conditional probability of the evidence, over all possible single relations; a sketch of this procedure is given below.</Paragraph> </Section>
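The paper's pseudocode figure for this best-first loop is not preserved in this extraction; the sketch below reconstructs it from the description above. The constant, the `posterior` and `implied_set` callables, and all names are illustrative stubs, not the authors' implementation.

```python
# Sketch of the best-first search described above: at each iteration, add
# the candidate relation whose implied set most increases P(E|T).
# `posterior(link)` should return P(R in T | E) for a link; `implied_set(r, T)`
# should return I(r) with respect to taxonomy T. Both are assumed stubs.

K_INV_PRIOR_ODDS = 1e-3   # k: inverse prior odds of R_ij in T (a constant)

def delta(p):
    """Multiplicative change for one relation with posterior p = P(R in T | E)."""
    return K_INV_PRIOR_ODDS * p / (1.0 - p)

def best_first_search(taxonomy, candidates, posterior, implied_set, n_links):
    """Greedily add n_links relations, each time taking the candidate R_ij
    maximizing the product of delta() over its novel implied set I(R_ij)."""
    for _ in range(n_links):
        if not candidates:
            break
        def gain(r):
            g = 1.0
            for link in implied_set(r, taxonomy) - taxonomy:
                g *= delta(posterior(link))
            return g
        best = max(candidates, key=gain)
        if gain(best) <= 1.0:         # no candidate improves P(E|T); stop
            break
        taxonomy |= implied_set(best, taxonomy)
        candidates.discard(best)
    return taxonomy
```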
<Section position="6" start_page="802" end_page="803" type="sub_section"> <SectionTitle> 2.4 Dealing with Lexical Ambiguity </SectionTitle> <Paragraph position="0"> Since word senses are not directly observable, if the objects in the taxonomy are word senses (as in WordNet) we must extend our model to allow for a many-to-many mapping (e.g., a word-to-sense mapping) between $D_E$ and $D_\mathcal{T}$. For this setting we assume we know the function $senses(i)$, mapping the word $i$ to all of $i$'s possible corresponding senses.</Paragraph> <Paragraph position="1"> We assume that each set of word-pair evidence $E^R_{ij}$ we possess is in fact sense-pair evidence $E^R_{k_0 l_0}$ for a specific pair of senses $k_0 \in senses(i)$, $l_0 \in senses(j)$. Further, we assume that a new relation between two words is probable only between the correct sense pair, i.e.: $$P(R_{kl} \in \mathcal{T} \mid E^R_{ij}) = 0 \quad \text{for } (k,l) \neq (k_0, l_0).$$ When computing the conditional probability of a specific new relation $R_{kl} \in I(R_{ab})$, we assume that the relevant sense pair $(k_0, l_0)$ is the one which maximizes the probability of the new relation, i.e., for $k \in senses(i)$, $l \in senses(j)$: $$(k_0, l_0) = \arg\max_{k,l} P(R_{kl} \in \mathcal{T} \mid E^R_{ij}).$$</Paragraph> <Paragraph position="2"> Our independence assumptions for this extension need only be changed slightly; we now assume that the evidence $E^R_{ij}$ depends on the taxonomy $\mathcal{T}$ via only a single relation between sense pairs, $R_{kl}$. Using this revised independence assumption, the derivation of best-first search over taxonomies for hyponym acquisition remains unchanged. One side effect of this revised independence assumption is that the addition of the single &quot;sense-collapsed&quot; relation $R_{kl}$ to the taxonomy $\mathcal{T}$ will explain the evidence $E^R_{ij}$ for the relation over the words $i$ and $j$, now that such evidence has been revealed to concern only the specific senses $k$ and $l$.</Paragraph> </Section> </Section> <Section position="5" start_page="803" end_page="805" type="metho"> <SectionTitle> 3 Extending WordNet </SectionTitle> <Paragraph position="0"> We demonstrate the ability of our model to use evidence from multiple relations to extend WordNet with novel noun hyponyms. While in principle we could use any number of relations, for simplicity we consider two primary sources of evidence: the probability of two words in WordNet being in a hypernym relation, and the probability of two words in WordNet being in a coordinate relation.</Paragraph> <Paragraph position="1"> In sections 3.1 and 3.2 we describe the construction of our hypernym and coordinate classifiers, respectively; in section 3.3 we outline the efficient algorithm we use to perform local search over hyponym-extended WordNets; and in section 3.4 we give an example of the implicit structure-based word sense disambiguation performed within our framework.</Paragraph> <Section position="1" start_page="803" end_page="804" type="sub_section"> <SectionTitle> 3.1 Hyponym Classification </SectionTitle> <Paragraph position="0"> Our classifier for the hypernym relation is derived from the &quot;hypernym-only&quot; classifier described in (Snow et al., 2005). The features used for predicting the hypernym relationship are obtained by parsing a large corpus of newswire and encyclopedia text with MINIPAR (Lin, 1998). From the resulting dependency trees the evidence $E^H_{ij}$ for each word pair $(i,j)$ is constructed; the evidence takes the form of a vector of counts of how often each labeled syntactic dependency path was found as the shortest path connecting $i$ and $j$ in some dependency tree. The labeled training set is constructed by marking the collected feature vectors as positive &quot;known hypernym&quot; or negative &quot;known non-hypernym&quot; examples using WordNet 2.0; 49,922 feature vectors were labeled as positive training examples, and 800,828 noun pairs were labeled as negative training examples. The model for predicting $P(H_{ij} \mid E^H_{ij})$ is then trained using logistic regression, predicting the noun-pair hypernymy label from WordNet from the feature vector of lexico-syntactic patterns.</Paragraph>
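A minimal sketch of a classifier of this kind, assuming scikit-learn and toy dependency-path counts in place of the paper's corpus-derived features:

```python
# Sketch: logistic regression over dependency-path count features, in the
# spirit of the hypernym classifier described above. The data is toy data;
# scikit-learn is an assumption (any logistic regression would do).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: word pairs (i, j); columns: counts of each labeled dependency path
# observed as the shortest path between i and j in some parse.
X = np.array([
    [3, 0, 1],   # e.g. a path like "X such as Y" seen 3 times
    [0, 2, 0],
    [4, 1, 2],
    [0, 0, 1],
])
y = np.array([1, 0, 1, 0])  # 1 = known hypernym pair (from WordNet), 0 = not

clf = LogisticRegression().fit(X, y)
# Posterior P(H_ij | E^H_ij) for a new word pair's path-count vector:
print(clf.predict_proba([[2, 0, 1]])[0, 1])
```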
<Paragraph position="1"> The hypernym classifier described above predicts the probability of the generalized hypernym-ancestor relation over words, $P(H_{ij} \mid E^H_{ij})$. For the purposes of taxonomy induction, we would prefer an ancestor-distance-specific set of classifiers over senses, i.e., for $k \in senses(i)$, $l \in senses(j)$, the set of classifiers estimating $\{P(H^1_{kl} \mid E^H_{ij}),\, P(H^2_{kl} \mid E^H_{ij}),\, \ldots\}$.</Paragraph> <Paragraph position="2"> One problem that arises from directly assigning the probability $P(H^n_{ij} \mid E^H_{ij}) = P(H_{ij} \mid E^H_{ij})$ for all $n$ is the possibility of adding a novel hyponym to an overly specific hypernym, which might still satisfy $P(H^n_{ij} \mid E^H_{ij})$ for a very large $n$. In order to discourage unnecessary overspecification, we penalize each probability $P(H^k_{ij} \mid E^H_{ij})$ by a factor $\lambda^{k-1}$ for some $\lambda < 1$, and renormalize: $P(H^k_{ij} \mid E^H_{ij}) \propto \lambda^{k-1} P(H_{ij} \mid E^H_{ij})$. In our experiments we set $\lambda = 0.95$.</Paragraph> </Section> <Section position="2" start_page="804" end_page="804" type="sub_section"> <SectionTitle> 3.2 (m,n)-cousin Classification </SectionTitle> <Paragraph position="0"> The classifier for learning coordinate terms relies on the notion of distributional similarity, i.e., the idea that two words with similar meanings will be used in similar contexts (Hindle, 1990). We extend this notion to suggest that words with similar meanings should be near each other in a semantic taxonomy, and in particular will likely share a hypernym as a near parent.</Paragraph> <Paragraph position="1"> Our classifier for (m,n)-cousins is derived from the algorithm and corpus given in (Ravichandran et al., 2005). In that work an efficient randomized algorithm is derived for computing clusters of similar nouns. We use a set of more than 1000 distinct clusters of English nouns collected by their algorithm over 70 million web pages, with each noun $i$ having a score representing its cosine similarity to the centroid $c$ of the cluster to which it belongs, $\cos(\theta(i,c))$. (As a preprocessing step we hand-edit the clusters to remove those containing non-English words, terms related to adult content, and other webpage-specific clusters.)</Paragraph> <Paragraph position="2"> We use the cluster scores of noun pairs as input to our own algorithm for predicting the (m,n)-cousin relationship between the senses of two words $i$ and $j$. If two words $i$ and $j$ appear in a cluster together, with cluster centroid $c$, we set our single coordinate input feature to be the minimum cluster score $\min(\cos(\theta(i,c)), \cos(\theta(j,c)))$, and zero otherwise. For each such noun pair feature, we construct a labeled training set of (m,n)-cousin relation labels from WordNet 2.1. We define a noun pair $(i,j)$ to be a &quot;known (m,n)-cousin&quot; if for some senses $k \in senses(i)$, $l \in senses(j)$, $C^{mn}_{kl} \in$ WordNet; if more than one such relation exists, we assume the relation with the smallest sum $m + n$, breaking ties by the smallest absolute difference $|m - n|$. We consider all such labeled relationships from WordNet with $0 \leq m,n \leq 7$; pairs of words that have no corresponding pair of synsets connected in the hypernym hierarchy, or with $\min(m,n) > 7$, are assigned to a single class $C_\emptyset$. Further, due to the symmetry of the similarity score, we merge each class $C^{mn} = C^{mn} \cup C^{nm}$; this implies that the resulting classifier will predict, as expected given a symmetric input, $P(C^{mn}_{kl} \mid E^C_{ij}) = P(C^{nm}_{kl} \mid E^C_{ij})$.</Paragraph> <Paragraph position="3"> We find 333,473 noun synset pairs in our training set with similarity score greater than 0.15. We next apply softmax regression to learn a classifier that predicts $P(C^{mn}_{ij} \mid E^C_{ij})$, predicting the WordNet class labels from the single similarity score derived from the noun pair's cluster similarity.</Paragraph> </Section>
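A corresponding sketch of the softmax classifier over the single similarity feature, again with scikit-learn and toy labels standing in for the WordNet-derived training set:

```python
# Sketch: softmax regression from one similarity feature to merged
# (m,n)-cousin classes, as described above. Scores and labels are toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Single feature: min cosine similarity of the pair to its cluster centroid.
X = np.array([[0.82], [0.45], [0.20], [0.71], [0.16]])
# Class labels: merged (m,n)-cousin classes, e.g. "C11" for coordinate terms,
# "C_null" for pairs unconnected in the hypernym hierarchy.
y = np.array(["C11", "C12", "C_null", "C11", "C_null"])

# Multinomial logistic regression is softmax regression.
clf = LogisticRegression().fit(X, y)
print(dict(zip(clf.classes_, clf.predict_proba([[0.6]])[0])))
```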
<Section position="3" start_page="804" end_page="805" type="sub_section"> <SectionTitle> 3.3 Details of our Implementation </SectionTitle> <Paragraph position="0"> Hyponym acquisition is among the simplest and most straightforward of the possible applications of our model; here we show how we efficiently implement our algorithm for this problem. First, we identify the set of all word pairs $(i,j)$ over which we have hypernym and/or coordinate evidence and which might represent the addition of a novel hyponym to the WordNet 2.1 taxonomy (i.e., pairs with a known noun hypernym and an unknown hyponym, or with a known noun coordinate term and an unknown coordinate term). This yields a list of 95,000 single links over the threshold $P(R_{ij}) > 0.12$.</Paragraph> <Paragraph position="1"> For each unknown hyponym $i$ we may have several pieces of evidence; for example, for the unknown term continental we have 21 relevant pieces of hypernym evidence, with links to possible hypernyms {carrier, airline, unit, . . .}, and 5 pieces of coordinate evidence, with links to possible coordinate terms {airline, american eagle, airbus, . . .}.</Paragraph> <Paragraph position="2"> For each proposed hypernym or coordinate link involving the novel hyponym $i$, we compute the set of candidate hypernyms for $i$; in practice we consider all senses of the immediate hypernym $j$ for each potential novel hypernym, and all senses of the coordinate term $k$ and its first two hypernym ancestors for each potential coordinate.</Paragraph> <Paragraph position="3"> In the continental example, from the 26 individual pieces of evidence over words we construct the set of 99 unique synsets that we will consider as possible hypernyms; these include the two senses of the word airline, the ten senses of the word carrier, and so forth.</Paragraph> <Paragraph position="4"> Next, we iterate through each of the possible hypernym synsets $l$ under which we might add the new word $i$; for each synset $l$ we compute the change in taxonomy score resulting from adding the implied relations $I(H^1_{il})$ required by the taxonomic constraints of $\mathcal{T}$. Since typically our set of all evidence involving $i$ will be much smaller than the set of possible relations in $I(H^1_{il})$, we may efficiently check, for each word $w$ for which we have some evidence $E^R_{iw}$ and for each sense $s \in senses(w)$, whether $s$ participates in some relation with $i$ in the set of implied relations $I(H^1_{il})$. If there is more than one such sense $s \in senses(w)$, we add to $I(H^1_{il})$ the single relationship $R_{is}$ that maximizes the taxonomy likelihood, i.e., $\arg\max_{s \in senses(w)} \Delta_\mathcal{T}(R_{is})$.</Paragraph> </Section>
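A small sketch of the candidate-scoring loop just described, reusing the `delta` function from the earlier best-first sketch; `in_implied`, `senses`, and `posterior` are illustrative stubs, not the authors' code:

```python
# Sketch: scoring a candidate hypernym synset l for a novel word i, as in
# the procedure above: multiply in, for each evidence word w, only the
# delta of the best participating sense (the argmax over s in senses(w)).

def score_candidate(i, l, evidence_words, senses, in_implied, delta, posterior):
    """Approximate Delta_T(I(H^1_il)) over the evidence words for i."""
    gain = delta(posterior((i, l)))              # the proposed link itself
    for w in evidence_words:
        participating = [s for s in senses(w) if in_implied(i, l, s)]
        if participating:                        # keep only the best sense of w
            gain *= max(delta(posterior((i, s))) for s in participating)
    return gain

def best_hypernym_synset(i, candidate_synsets, **stubs):
    """Choose the candidate synset maximizing the taxonomy-likelihood change."""
    return max(candidate_synsets, key=lambda l: score_candidate(i, l, **stubs))
```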
<Section position="4" start_page="805" end_page="805" type="sub_section"> <SectionTitle> 3.4 Hypernym Sense Disambiguation </SectionTitle> <Paragraph position="0"> A major strength of our model is its ability to correctly choose the sense of a hypernym to which to add a novel hyponym, despite collecting evidence over untagged word pairs. Word sense disambiguation is an implicit side effect of our algorithm: since the algorithm adds the single link which, together with its implied links, yields the most likely taxonomy, and since each distinct synset in WordNet has a different immediate neighborhood of relations, the algorithm simply disambiguates each node based on its surrounding structural information.</Paragraph> <Paragraph position="1"> As an example of sense disambiguation in practice, consider our example of continental. Suppose we are iterating through each of the 99 possible synsets under which we might add continental as a hyponym, and we come to the synset airline#n#2 in WordNet 2.1, i.e., &quot;a commercial organization serving as a common carrier.&quot; In this case we will iterate through each piece of hypernym and coordinate evidence; we find that the relation H(continental, carrier) is satisfied with high probability for the specific synset carrier#n#5, the grandparent of airline#n#2; thus the factor $\Delta_\mathcal{T}(H^3(\textrm{continental}, \textrm{carrier\#n\#5}))$ is included in the product over the set of implied relations, $\Delta_\mathcal{T}\big(I(H^1(\textrm{continental}, \textrm{airline\#n\#2}))\big)$. Suppose we instead evaluate the first synset of airline, i.e., airline#n#1, with the gloss &quot;a hose that carries air under pressure.&quot; For this synset none of the other 20 relationships directly implied by the hypernym or coordinate evidence are implied by adding the single link $H^1(\textrm{continental}, \textrm{airline\#n\#1})$; thus the resulting change in the set of implied links given by the correct &quot;carrier&quot; sense of airline is much higher than that of the &quot;hose&quot; sense. In fact it is the largest of all the 99 considered hypernym links for continental; $H^1(\textrm{continental}, \textrm{airline\#n\#2})$ is link #18,736 added to the taxonomy by our algorithm.</Paragraph> </Section> </Section> </Paper>