<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0901"> <Title>Hiding a Semantic Hierarchy in a Markov Model</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> There' have been a number of attempts to derive selectional preferences using parsed corpora and a semantic class hierarchy. Our work is closely related to that of (Resnik, 1993). His system provides a distribution over classes conditioned on a predicate-role pair: p( clv, r ). It estimates p( clv , r ) as f(v,r,c)/~c' f(v,r,c'), where f(v,r,c) is in turn approximated by allocating the frequency of the co-occurrence tuple (v, r, n) among the classes C(n) to which the senses of n belong. For example, suppose the word bread has two senses, BREAD and MONEY. Suppose further that BREAD is a hyponym of BAKED-GOODS, FOOD, ARTIFACT, and TOP, and MONEY is a hyponym solely of TOP. Then C(bread) is (BREAD, BAKED-GOODS, FOOD, ARTIFACT, TOP, MONEY). Tokens of bread are taken as ambiguous evidence for all concepts in C(bread); the weight of evidence is divided uniformly across C(bread). Hence each token of (eat, obj, bread) counts as 1/6 of a token of (eat, obj, BREAD), 1/6 of a token of (eat, obj, BAKED-GOODS), and so on. Such a uniform allotment is does not reflect empirical distributions of senses, which are Zipf-like, but does produce reasonable results. It is important to note that Resnik is not very explicit about how the probability p(clv, r) is to be interpreted; there is no explicit stochastic generation model involved.</Paragraph> <Paragraph position="1"> Resnik uses p(clv, r) to quantify selectional preference by comparing it to p(c), the marginal probability of class c appearing as an argument. He measures the difference between these distributions as their relative entropy (D). The total amount of &quot;se- null lection&quot; that a predicate v imposes on the filler of role r is quantified as D(p(c\[v,r)\[\[p(c)). The selectional preference of v for c in role r is quantified as the contribution of the c to the total amount of selection:</Paragraph> <Paragraph position="3"> The class or classes produced as the output for the predicate are those with the highest selpref value.</Paragraph> <Paragraph position="4"> Other work on the induction of selectional preferences includes (Li and Abe, 1995). They characterize the selectional restriction of a predicate with a horizontal cut through a semantic hierarchy, and use the principle of Minimum Description Length (MDL) to choose a cut that optimally balances simplicity and descriptive adequacy. More specifically, a cut is a set of concepts that partition the set of nouns belonging to the hierarchy. A cut is deemed simpler if it cuts the hierarchy at a higher place (i.e., the cut contains fewer concepts), and descriptive adequacy is measured by comparing the actual distribution of nouns filling a slot (v, r) to the closest approximation one can obtain by estimating p(nlc ) for only the concepts c in the cut. Again, the intended stochastic generation model is not clear.</Paragraph> <Paragraph position="5"> As mentioned, the interpretation of expressions such as p(clv, r) is obscure in these previous models. Without clarity about what stochastic process is producing the data, it is difficult to gauge how well probabilities are being estimated. In addition, * having an explicit stochastic generation model enables one to do a number of things. 
<Paragraph position="4"> Other work on the induction of selectional preferences includes (Li and Abe, 1995). They characterize the selectional restriction of a predicate as a horizontal cut through a semantic hierarchy, and use the principle of Minimum Description Length (MDL) to choose a cut that optimally balances simplicity and descriptive adequacy. More specifically, a cut is a set of concepts that partitions the set of nouns belonging to the hierarchy. A cut is deemed simpler if it cuts the hierarchy at a higher place (i.e., the cut contains fewer concepts), and descriptive adequacy is measured by comparing the actual distribution of nouns filling a slot (v, r) to the closest approximation one can obtain by estimating p(n|c) for only the concepts c in the cut. Again, the intended stochastic generation model is not clear.</Paragraph>
<Paragraph position="5"> As mentioned, the interpretation of expressions such as p(c|v,r) is obscure in these previous models. Without clarity about what stochastic process is producing the data, it is difficult to gauge how well probabilities are being estimated. In addition, having an explicit stochastic generation model enables one to do a number of things. First, one can experiment with different methods of eliminating word sense ambiguity in the training corpus in a principled fashion. Second, it is often possible to calculate a number of useful distributions.</Paragraph>
<Paragraph position="6"> From our model, the following distributions can be efficiently estimated: Pr(word | predicate, role), Pr(word | semantic-class, predicate, role), and Pr(word-sense | word, predicate, role). These distributions can be used directly to help solve ambiguity resolution problems such as syntactic structure disambiguation. In addition, the Pr(word | predicate, role) distribution can be seen as a very specific language model, i.e., a language model for the head of the argument of the predicate.</Paragraph> </Section>
<Section position="5" start_page="2" end_page="3" type="metho"> <SectionTitle> 3 Our Stochastic Generation Model </SectionTitle>
<Paragraph position="0"> Our model generates co-occurrence tuples (e.g., (eat, obj, beef)) as follows. The probability p(v,r,n) of a co-occurrence tuple can be expressed as p(v,r) p(n|v,r). Our central concern is the conditional probability p(n|v,r). We associate a separate HMM with each pair (v,r) in order to characterize the distribution p(n|v,r). Thus, the HMM for (eat, obj) would be different from that for (drink, subj): the general structure of the HMM would be the same, but the parameters would be different.</Paragraph>
<Paragraph position="1"> The states and transitions of the HMMs are identified with the nodes and arcs of a given semantic class hierarchy. The nodes of the hierarchy represent semantic classes (concepts), and the arcs represent hyponymy (that is, the &quot;is-a&quot; relation). Some concepts are expressible as words: these concepts are word senses. A sense may be expressible by multiple words (synonyms) and, conversely, a single word may be an expression of more than one sense (word sense ambiguity). For expository reasons, we assume that all and only the terminal nodes of the hierarchy are word senses. In actuality, the only constraint our system places on the shape of the hierarchy is that it have a single root.</Paragraph>
<Paragraph position="2"> A &quot;run&quot; of one of our HMMs begins at the root of the semantic hierarchy. A child concept is chosen in accordance with the HMM's transition probabilities. This is done repeatedly until a terminal node (word sense) c is reached, at which point a word w is emitted in accordance with the probability of expressing sense c as word w. Hence, each HMM &quot;run&quot; can be identified with a path through the hierarchy from the root to a word sense, plus the word that was generated from the word sense. Also, every observation sequence generated by our HMMs consists of a single noun: each run leads to a final state, at which point exactly one word is emitted.</Paragraph>
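<Paragraph> The generation process can be sketched in a few lines of Python (our own illustration; the miniature hierarchy and all probabilities below are invented for the example):</Paragraph>
<Paragraph>
import random

# Toy parameters of one (v,r) model; every (v,r) pair shares this
# hierarchy shape but has its own probabilities.
TRANSITIONS = {                      # interior state -> [(child, prob)]
    "TOP":       [("FOOD", 0.9), ("COGNITION", 0.1)],
    "FOOD":      [("FLESH", 0.5), ("BREAD", 0.5)],
    "COGNITION": [("ESSENCE", 1.0)],
}
EMISSIONS = {                        # leaf sense -> [(word, prob)]
    "FLESH":   [("meat", 1.0)],
    "BREAD":   [("bread", 0.7), ("bagel", 0.3)],
    "ESSENCE": [("meat", 1.0)],      # "meat" is ambiguous between senses
}

def pick(pairs):
    """Sample one item from a list of (item, probability) pairs."""
    r = random.random()
    for item, p in pairs:
        r -= p
        if r < 0:
            return item
    return pairs[-1][0]

def generate():
    """One HMM run: walk root-to-leaf, then emit exactly one word."""
    state, path = "TOP", ["TOP"]
    while state in TRANSITIONS:      # interior states are non-emitting
        state = pick(TRANSITIONS[state])
        path.append(state)
    return path, pick(EMISSIONS[state])

path, word = generate()
print(" -> ".join(path), ":", word)
</Paragraph>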
<Paragraph position="3"> More formally, a concept graph is given, together with an expressibility relation from nodes to words. The nodes of the graph are identified with concepts C = {c1, ..., cn}, and the expressibility relation relates concepts to words W = {w1, ..., wm}. The HMM consists of a set of states {q1, ..., qn}, which we identify with the nodes of the concept graph; a set of possible emissions, which we identify with W ∪ {ε} (that is, we permit non-emitting states); and three parameter matrices:</Paragraph>
<Paragraph position="4"> A = {aij}, the transition probabilities. The value aij represents the probability of making a transition from state qi to state qj. aij is nonzero only if there is an arc in the concept graph from concept ci to concept cj.</Paragraph>
<Paragraph position="5"> B = {bj(k)}, the emission probabilities. The value bj(k) represents the probability of emitting word wk while in state qj. States corresponding to nonterminal nodes in the concept graph are non-emitting (that is, they emit ε with probability 1), and states corresponding to terminal nodes are emitting states (they emit ε with probability 0).</Paragraph>
<Paragraph position="6"> π = {πi}, the initial state distribution. πi is identically 1 for the start state (corresponding to the root node), and 0 for all other states.</Paragraph>
<Paragraph position="7"> As mentioned, we associate an HMM with each pair (v,r). Each HMM has the same structure, determined by the semantic hierarchy. Where they differ is in the values of the associated parameters.</Paragraph>
<Paragraph position="8"> To estimate parameters, we require a training sample of observation sequences. Since each observation sequence consists of a single word, a training sample is simply a collection of word tokens. The training sample consists of the nouns filling the associated &quot;slot&quot; (v,r); that is, a token of the noun n is included in the training sample for each token of the tuple (v,r,n) that occurs in the corpus. Table 2 provides an example corpus.</Paragraph>
<Paragraph position="9"> This approach permits us to address both word sense disambiguation and selectional preference. An ambiguous word is one that could have been generated by means of more than one state sequence. For a given ambiguous word n appearing in a slot (v,r), we can readily compute the posterior probability that word sense c was used to generate n, according to the (v,r) model. We can disambiguate by choosing the word sense with maximum posterior probability, or we can use the probabilities in a more sophisticated model that uses more contextual information than just the slot in which the word appears.</Paragraph>
<Paragraph position="10"> Selectional preferences for (v,r) can be extracted from these models by calculating the distribution over classes p(c|v,r) from the model trained for (v,r) and the distribution p(c) from a model trained on all nouns. One can then follow Resnik and use selpref(v,r,c) as defined above. These distributions can be calculated by considering our HMMs with additional transitions going from all leaf states to the root state. Such HMMs are ergodic, and thus the probability of being in a particular state at a time t converges to a single value as t approaches ∞. These steady-state probabilities can be put entirely in terms of the parameters of the model. Thus, once an HMM has been trained, the steady-state probabilities can be easily calculated. Because of the correspondence between states and classes, these steady-state distributions can be interpreted as distributions over classes.</Paragraph>
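<Paragraph> As a sketch of that computation (our own illustration; the transition matrix reuses the toy model above, and we solve the stationarity equation directly rather than iterating):</Paragraph>
<Paragraph>
import numpy as np

states = ["TOP", "FOOD", "COGNITION", "FLESH", "BREAD", "ESSENCE"]
# Row i holds state i's outgoing distribution; the leaf rows carry the
# added probability-1 transitions back to the root that make the chain
# ergodic. All numbers are the toy values used earlier.
T = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0],   # TOP
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],   # FOOD
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # COGNITION
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # FLESH   -> TOP
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # BREAD   -> TOP
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # ESSENCE -> TOP
])

# Steady state: solve p T = p subject to sum(p) = 1.
n = len(states)
A = np.vstack([T.T - np.eye(n), np.ones((1, n))])
b = np.zeros(n + 1)
b[-1] = 1.0
p, *_ = np.linalg.lstsq(A, b, rcond=None)
print({s: round(float(x), 3) for s, x in zip(states, p)})
# The visit probabilities of the class states are read off as the
# class distribution p(c|v,r) for this (v,r) model.
</Paragraph>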
<Paragraph position="11"> As mentioned earlier, another way of thinking about selectional preference is as a distribution over words. For example, the selectional preference of the verb eat for its direct object would be expressed by high probabilities for words like breakfast, meat, and bread and low probabilities for words like thought, computer, and break. This conception of selectional preference is related to language modeling in speech recognition. In fact, the selectional preference of a predicate-role pair can be thought of as a very specific language model. This way of thinking about selectional preferences is useful because it points to possible applications in speech recognition.</Paragraph> </Section>
<Section position="6" start_page="3" end_page="3" type="metho"> <SectionTitle> 4 Parameter Estimation </SectionTitle>
<Paragraph position="0"> We had originally hoped that after turning our semantic hierarchy into an HMM as described above, we could simply run the standard forward-backward algorithm on the training corpus and obtain a useful model. Unfortunately, there are a number of reasons why this does not work. We will describe these problems and our attempted solutions in the context of disambiguating the words in the training data with multiple word senses, a fundamental task in the estimation of selectional preferences. In each of the three subsections below we describe a problem we discovered and an attempted solution.</Paragraph>
<Paragraph position="1"> In the end, we were not able to produce a system that performed better than Resnik's system on his word-sense disambiguation evaluation. This evaluation is an indirect way of testing whether the training method is disambiguating the training corpora correctly. However, when we derived from our models a ranked list of classes using p(c|v,r) and divergence as described above, we obtained very good lists. We present some representative lists and the results on Resnik's evaluation in section 5. In addition, we think the attempted solutions are instructive and provide insight into the nature of the problem and the behavior of the EM algorithm.</Paragraph>
<Section position="1" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.1 Smoothing </SectionTitle>
<Paragraph position="0"> It was our original hope that, by treating the choice of word sense as just another hidden variable in the HMM, word-sense disambiguation would be accomplished as a side effect of EM estimation. In fact, however, there is no pressure in the model in favor of parameter settings in which occurrences of an ambiguous word are all accounted for by a single word sense. If the initial parameter settings account for an ambiguous word as a mixture of word senses, the converged model does likewise. This should come as no surprise to those with experience using EM, but it is not usually stated very clearly in the literature: the EM algorithm estimates a mixture model and (intuitively speaking) strongly prefers mixtures containing small amounts of many solutions over mixtures that are dominated by any one solution.</Paragraph>
<Paragraph position="1"> For example, consider Figure 2. We assume a miniature training corpus containing one instance each of four words: meat, apple, bagel, and cheese. The word meat is ambiguous, having both the sense ESSENCE and the sense FLESH. The training corpus is perfectly accounted for by the weights in Figure 2, and this is indeed a fixed point of the EM algorithm.</Paragraph>
<Paragraph position="2"> One would like to introduce some pressure toward consolidating word occurrences under a single word sense. Further, one would like the set of word senses one ends up with to be as closely related as possible. In Figure 2, for example, one would like the word meat to shift as much of its weight as possible to the sense FLESH, not the sense ESSENCE.</Paragraph>
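<Paragraph> The fixed-point behavior is easy to verify numerically (a toy check of our own, using the weights of Figure 2):</Paragraph>
<Paragraph>
# One token of "meat" can come from two root-to-leaf paths, each with
# total weight 1/8 under the Figure 2 settings (toy numbers).
paths = {"ESSENCE": 0.125, "FLESH": 0.125}
total = sum(paths.values())
posterior = {sense: p / total for sense, p in paths.items()}
print(posterior)   # {'ESSENCE': 0.5, 'FLESH': 0.5}
# The E step splits the token evenly between the two senses, so the
# M step reproduces exactly the same weights: EM exerts no pressure
# to consolidate the token under a single sense.
</Paragraph>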
<Paragraph position="3"> We sought to accomplish this in a natural way by smoothing transition probabilities, as follows. The transition probabilities out of a given state constitute a probability distribution. At a given iteration of the EM algorithm, the &quot;empirical&quot; distribution for a given state is the distribution of counts across outgoing transitions, where the counts are estimated using the model produced by the previous iteration. (Hence the scare quotes around empirical; for want of a better term, let us call this distribution pseudo-empirical.)</Paragraph>
<Paragraph position="4"> For example, assume the parameter settings shown in Figure 2 to be the output of the previous iteration, and assume that each word appears once in the training corpus. Then the (estimated) count for the path through transition FOOD → FLESH is 1/2, and the count for the paths through transitions FOOD → FRUIT, FOOD → BREAD, and FOOD → DAIRY is 1 each. Hence, the total count for the state FOOD is 3.5. Dividing each transition count by the count for state FOOD yields the pseudo-empirical probabilities {1/7, 2/7, 2/7, 2/7}.</Paragraph>
<Paragraph position="5"> The pseudo-empirical probabilities would normally be installed as transition weights in the new model. Instead, we mix them with the uniform distribution {1/4, 1/4, 1/4, 1/4}. Let p(t) be the pseudo-empirical probability of transition t, and let u(t) be the uniform probability of transition t. Instead of setting the new weight for t to p(t), we set it to ε u(t) + (1 − ε) p(t).</Paragraph>
<Paragraph position="6"> Crucially, we make the mixing parameter ε a function of the total count for the state. Intuitively, if there is a lot of empirical evidence for the distribution, we rely on it, and if there is not much empirical evidence, we mix in a larger proportion of the uniform distribution. To be precise, we compute ε as 1/(c + 1), for c the total count of the state. This has the desirable property that ε is 1 when c is 0, and decreases toward 0 with increasing c.</Paragraph>
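<Paragraph> This update is direct to implement (a minimal sketch of our own; counts holds the expected outgoing-transition counts of one state, as produced by the E step):</Paragraph>
<Paragraph>
def smoothed_transition_probs(counts):
    """Mix the pseudo-empirical distribution with the uniform one,
    weighting the uniform part by eps = 1/(c + 1), c the state count."""
    c = sum(counts)                    # total count for the state
    n = len(counts)
    eps = 1.0 / (c + 1.0)              # eps = 1 when there is no evidence
    pseudo = [x / c for x in counts] if c > 0 else [1.0 / n] * n
    return [eps * (1.0 / n) + (1.0 - eps) * p for p in pseudo]

# The FOOD state of the running example: counts 1/2, 1, 1, 1, so
# c = 3.5, eps = 2/9, pseudo-empirical {1/7, 2/7, 2/7, 2/7}.
print(smoothed_transition_probs([0.5, 1, 1, 1]))
</Paragraph>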
<Paragraph position="7"> It is probably not immediately obvious how smoothing in this manner helps to prune undesired word senses. To explain, consider the two paths that account for meat in Figure 2: one leading through the word sense ESSENCE and the other leading through the word sense FLESH. In the &quot;previous&quot; model (i.e., the weights shown), each of those paths has the same weight (namely, 1/8); hence each instance of the word meat in the training corpus is taken as evidence in equal parts for the word senses ESSENCE and FLESH.</Paragraph>
<Paragraph position="8"> The difference lies in the states COGNITION and FOOD. The words apple, bagel, and cheese, along with half of meat, provide evidence for the state FOOD, giving it a total count of 3 1/2; but the only evidence for state COGNITION is the other half of meat, giving it a total count of 1/2. The new distribution for COGNITION therefore has a large admixture of the uniform distribution, whereas the distribution for FOOD has a much smaller uniform component.</Paragraph>
<Paragraph position="9"> The large proportion of uniform probability for the state COGNITION causes much of its probability mass to be &quot;bled off&quot; onto siblings of ESSENCE (not shown, but indicated by the additional outgoing edges from COGNITION). Since none of these siblings is attested in the training corpus, this makes COGNITION's fit to the training corpus very poor. Intuitively, this creates pressure for TOP to reduce the weight it apportions to COGNITION and increase its weight for FOOD; doing so improves the model's overall fit to the training corpus. This decreases the relative count for the word sense ESSENCE in the next iteration, increasing the pressure to shift weight from COGNITION to FOOD. Ultimately, an equilibrium is reached in which most of the count for the word meat is assigned to the word sense FLESH. (What prevents a total shift to the word sense FLESH is smoothing at TOP, which keeps a small amount of weight on COGNITION. In a large hierarchy, this translates to a vanishingly small amount of weight on ESSENCE.)</Paragraph> </Section>
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Sense Balancing </SectionTitle>
<Paragraph position="0"> In Figure 2, our smoothing method produces the desired bias for the corpus meat, apple, bagel, cheese. However, in different circumstances the bias produced is not the desired one. Consider training the hierarchy in Figure 3 on a corpus made up of a single token of meat.</Paragraph>
<Paragraph position="1"> The hierarchy in Figure 3 differs from the hierarchy in Figure 2 in that meat has three senses, two of which share a prefix path, i.e., the transition from TOP to FOOD. When training on the corpus of one token of meat, 2/3 of the count goes down the FOOD side and the other third down the COGNITION side; thus, with respect to the forward-backward algorithm, there is little difference between the current example and the previous one. Therefore, the two senses of meat under FOOD will be preferred. Intuitively this is wrong, because there is no information in the corpus from which to derive a bias for any one sense, and we would like our parameter settings to reflect this. Nor is this merely a borderline case: if meat is very frequent, as in the corpus in Table 2, such an a priori bias for certain senses of meat could easily drown out the bias that should result from the other words in the corpus.</Paragraph>
<Paragraph position="2"> In concrete terms, the problem is the shared path prefix that exists for the senses under FOOD, namely the transition from TOP to FOOD. More abstractly, the problem is that the hierarchy is not balanced with respect to the senses of meat; if there were another sense under COGNITION, a sister of ESSENCE, there would be no problem (see Figure 4). One can simulate such a phantom sense within the forward-backward algorithm.</Paragraph>
<Paragraph position="3"> First, the counts for the transitions on the prefix path have to be reduced. This can be done by modifying the E step such that the expectation Êw(Xi→j) for the random variable Xi→j, which corresponds to the transition from state i to state j for a single token of word w, is calculated as Êw(Xi→j) = Ew(Xi→j) / D(j, w), where Ew(·) is the expectation based on the model and corpus, and D(j, w) is the number of unique paths starting at j and ending in a state that can generate w. One then sums over all tokens of the corpus to get the expectation for the corpus.</Paragraph>
<Paragraph position="4"> The second step is to reduce the probability of the paths to the sister sense of the phantom sense, e.g., the path COGNITION → ESSENCE. This can be achieved by increasing the normalization factor used in the M step to Âw = Ew(Xi) (D(r, w) − D(i, w)), where Ew(Xi) is the expected count for state i and r is the starting state of the model, i.e., the state corresponding to the root of the hierarchy. Once again, we focus on the contribution of a single token of a word w, and thus the normalization factor used in the M step would be the sum of Âw over the tokens in the corpus. The exception to this formula occurs when D(r,w) − D(i,w) = 0, in which case Âw = Ew.</Paragraph>
<Paragraph position="5"> There are other ways of modifying the algorithm to simulate the phantom sense. However, this method is easy and efficient to implement, since the E and M steps remain simple local calculations; the only global information comes through the function D, which can be computed easily and efficiently.</Paragraph>
<Paragraph position="6"> Another kind of sense imbalance is shown in Figure 5. This imbalance can be corrected by further modifying the E step as follows: Êw(Xi→j) = Ew(Xi→j) / (D(j, w) U(j)), where U(j) is the number of unique paths from j up to the root.</Paragraph> </Section>
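<Paragraph> A small sketch of the first correction (our own rendering of the formulas above, which are partly reconstructed from the text; the hierarchy is the three-sense meat example of Figure 3):</Paragraph>
<Paragraph>
# D(j, w): number of unique paths from state j down to a sense of w.
CHILDREN = {
    "TOP": ["FOOD", "COGNITION"],
    "FOOD": ["FLESH1", "FLESH2"],
    "COGNITION": ["ESSENCE"],
}
SENSES = {"meat": {"FLESH1", "FLESH2", "ESSENCE"}}

def D(j, w):
    if j in SENSES[w]:
        return 1
    return sum(D(c, w) for c in CHILDREN.get(j, []))

def balanced_expectation(raw_count, j, w):
    """Divide the raw E-step count for a transition into j by D(j, w)."""
    return raw_count / D(j, w)

# One token of "meat": the raw E step sends 2/3 down TOP -> FOOD (two
# senses) and 1/3 down TOP -> COGNITION (one sense)...
print(balanced_expectation(2/3, "FOOD", "meat"))        # 1/3
print(balanced_expectation(1/3, "COGNITION", "meat"))   # 1/3
# ...after the correction both prefix transitions carry equal weight,
# as if a phantom sense had balanced the hierarchy.
</Paragraph>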
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Length and Width Balancing </SectionTitle>
<Paragraph position="0"> Most of the example hierarchies and models we have considered so far have been balanced with respect to length and width, i.e., the length of the paths to the generating states has been uniform, and the number of transitions out of a state has been uniform across states. It turns out that uniform length and width are important characteristics with respect to our modified forward-backward algorithm: shorter paths are preferred to longer ones (see Figure 6), and paths that go through states with few exiting transitions are preferred to ones that go through states with many (see Figure 7). In fact, short paths are preferred to longer ones by the standard forward-backward algorithm as well, since in an HMM the probabilities of the events in a sequence are multiplied to get the probability of the sequence as a whole.</Paragraph>
<Paragraph position="1"> Width only comes into play when one introduces smoothing. Remember that in our smoothing we mix in the uniform probability. Consider the transitions coming out of the state COGNITION in Figure 7: there are four transitions, and thus the uniform probability would be 1/4. In contrast, the transitions coming out of the state FOOD in the same figure number only two, and thus the uniform probability would be 1/2. If there are many transitions, each receives a smaller share of the mixed-in uniform probability than if there were few.</Paragraph>
<Paragraph position="2"> We can solve the problem by balancing the hierarchy: all paths that result in generating a symbol should be of the same length, and all distributions should contain the same number of members. As in the previous section, we can simulate this balancing by modifying the forward-backward algorithm.</Paragraph>
<Paragraph position="3"> First, to balance for width, the smoothing can be modified as follows: instead of mixing in the uniform probability for a particular parameter, always mix in the same probability, namely the uniform probability of the largest distribution, u_max (i.e., that of the state with the largest number of exiting transitions; in Figure 7, this maximum uniform probability would be 1/4). Thus the smoothing formula becomes ε u_max + (1 − ε) p(t). This modification has the following effect: it is as if there were always the same number of transitions out of a class. Width balancing for emission parameters is performed in an analogous fashion.</Paragraph>
<Paragraph position="4"> Let us turn to length balancing. Conceptually, in order to balance for length, extra transitions and states need to be added to short paths so that they are as long as the maximum-length path of the hierarchy. (It should be noted that we are only concerned with paths that end in a state that generates words.) The extension of short paths can be simulated by multiplying the probability of a path p by a factor that depends on its length: factor(p) = u_max^(length_max − length(p)). This additional factor can be worked into the forward and backward variable calculations, so there is no loss in efficiency. It is thus as if length_max − length(p) states have been added, and each of these states has 1/u_max exiting transitions.</Paragraph>
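<Paragraph> Both corrections fit in a few lines (our own sketch; u_max, the path lengths, and the counts are toy values in the spirit of Figures 6 and 7, and the length-balancing factor is reconstructed from the text):</Paragraph>
<Paragraph>
U_MAX = 0.25       # uniform probability of the widest state (4 transitions)
LENGTH_MAX = 4     # longest root-to-leaf path in the toy hierarchy

def width_balanced_probs(counts):
    """As before, but always mix in U_MAX instead of the state's own
    uniform probability; the result is deliberately sub-normalized, as
    if the missing transitions existed."""
    c = sum(counts)
    eps = 1.0 / (c + 1.0)
    return [eps * U_MAX + (1.0 - eps) * (x / c) for x in counts]

def length_balanced_path_prob(prob, length):
    """Multiply a path's probability by U_MAX^(LENGTH_MAX - length), as
    if dummy states padded it out to the maximum length."""
    return prob * U_MAX ** (LENGTH_MAX - length)

print(width_balanced_probs([0.5, 1, 1, 1]))        # the FOOD state again
print(length_balanced_path_prob(0.125, 3))         # a length-3 path
</Paragraph>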
</Section> </Section>
<Section position="8" start_page="3" end_page="7" type="metho"> <SectionTitle> 5 Preliminary Results </SectionTitle>
<Paragraph position="0"> As mentioned above, we tested our trained models on a word-sense disambiguation evaluation, reasoning that if they performed poorly on this evaluation, then they must not be disambiguating the training corpus very well. The bottom line is that we were not able to advance the state of the art: the performance results are comparable to, but not better than, those obtained by Resnik. We used the training sets, test sets, and evaluation method described in (Resnik, 1997).1 Table 3 presents performance results. The Random method simply picks a sense at random from a uniform distribution. The First Sense method always picks the most common sense as listed in WordNet. The HMM smoothed method uses models trained with smoothing but no balancing modifications. HMM balanced uses smoothing and all three balancing modifications.</Paragraph>
<Paragraph position="1"> Next we give examples of the preferences derived from trained models for three verbs, represented as weights on classes. These are typical rather than best-case examples. We have not yet attempted any formal evaluation of these lists.</Paragraph>
<Paragraph position="2"> 1 We thank Philip Resnik for providing us with the training and test data that he used in the above-mentioned work.</Paragraph> </Section> </Paper>