<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0112">
  <Title>A Probabilistic Disambiguation Method Based on Psycholinguistic Principles</Title>
  <Section position="4" start_page="141" end_page="142" type="metho">
    <SectionTitle>
2 Psycholinguistic Principles of Disambiguation
</SectionTitle>
    <Paragraph position="0"> In this section, we introduce the psycholinguistic principles of disambiguation. Kimball has proposed the Right Association Principle (RAP) (Kimball, 1973), which states that (in English) a phrase on the right should be attached to the nearest phrase on the left if possible. Hobbs &amp; Bear have generalized RAP to the Attach Low and Parallel Principle (ALPP) (Hobbs and Bear, 1990).</Paragraph>
    <Paragraph position="1"> ALPP states that a phrase on the right should be attached to the nearest phrase on the left if possible, and that phrases should be attached to a phrase in parallel if possible. (When we refer to ALPP. we ordinarily mean just the part concerning attachments in parallel. ) Ford et ah have proposed the Lexica\] Preference Rule (LPR) which states that an interpretation is to be preferred whose case frame assumes more semantically consistent values (Ford et al., 1982). Classically, lexical preference is realized by checking consistencies between 'semantic features' of slots and those of slot vMues, namely the 'selectionM restrictions' (Katz and Fodor, 1963). The realization of lexical preference in terms of selectional restrictions has some disadvantages, however. Interpretations obtained in an analysis cannot, for example, be ranked in their preferential order. Thus one cannot adopt a strategy of retaining the N most plausible interpretations in an analysis, which is the most widely accepted practice at present. In fact it is more appropriate to treat the lexical preference as a kind of score representing the association between slots and their values. In the present paper, we refer to this kind of score as 'lexical preference.' For the same reason, we also treat 'syntactic preference' as a kind of score.</Paragraph>
    <Paragraph position="2"> LPR is a lexical semantic principle, while RAP and ALPP are syntactic ones, and in psycholinguistics it is commonly claimed that LPR overrides RAP and ALPP (Hobbs and Bear, 1990). Let us consider some examples of LPR and RAP in this regard. For the sentence I ate ice cream with a spoon, (1) there are two interpretations; one is 'I ate ice cream using a spoon' and the other 'I ate ice cream and a spoon.' In this sentence, a human speaker would certainly assume the former interpretation over  the latter. From the psycholinguistic perspective, this can be explained in the following way: the former interpretation has a stronger lexical preference than the latter, and thus is to be preferred according to LPR. Moreover, since LPR overrides RAP, the preference is solely determined by LPR.</Paragraph>
    <Paragraph position="3"> For the sentence John phoned a man in Chicago, (2) there are two interpretations; one is 'John phoned a man who is in Chicago' and the other 'John, while in Chicago, phoned a man.' In this sentence, a human speaker would probably assume the former interpretation over the latter. The two interpretations have an equal lexical preference value, and thus the preference of the two cannot be determined by LPR. After LPR fails to work, the former interpretation is to be preferred according to RAP, because 'a man' is closer to 'in Chicago' than 'phone' in the sentence.</Paragraph>
    <Paragraph position="4"> LPR implies that (in natural language) one should communicate as relevantly as possible, while RAP and ALPP implies that one should communicate as efficiently as possible. Although the phenomena governed by these principles vary from language to language, the principles themselves, we think, are language independent, and thus can be regarded as fundamental principles of human communication. According to Whittemore et al. and Hobbs &amp; Bear, nearly all of the ambiguities can be resolved by first applying LPR and then RAP and ALPP (Hobbs and Bear, 1990; Whittemore et al., 1990). These observations motivate us strongly to implement these principles for disambiguation purposes.</Paragraph>
    <Paragraph position="5"> While there are also other principles proposed in the literature, including the Minimal Attachment Principle (Frazier and Fodor, 1979), they are generally either not highly functional or are covered by the above three principles in any case (Hobbs and Bear, 1990; Whittemore et al., 1990). The necessity of developing a disambiguation method with learning ability has recently come to be widely recognized. The realization of such a method would make it possible to (a) save the cost of defining knowledge by hand (b) do away with the subjectivity inherent in human definition (c) make it easier to adapt a natural language analysis system to a new domain. We think that a probabilistic approach is especially attractive because it is able to employ a principled methodology for acquiring the knowledge necessary for disambiguation. In our research, we implement LPR, RAP and ALPP by means of a probabilistic methodology.</Paragraph>
  </Section>
  <Section position="5" start_page="142" end_page="144" type="metho">
    <SectionTitle>
3 LPR and Lexical Likelihood
</SectionTitle>
    <Paragraph position="0"> In this section, we briefly describe our LPR-based probabilistic disambiguation method.</Paragraph>
    <Section position="1" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
3.1 The three-word probability
</SectionTitle>
      <Paragraph position="0"> We refer to a syntactic tree and its corresponding case frame, as obtained in an analysis, 'an interpretation. '3 After analyzing the sentence in (1), for example, we obtain the case frames of the interpretations:</Paragraph>
      <Paragraph position="2"> The value assumed by a case slot of a case frame of a verb can be viewed as being generated according a conditional probability distribution:  where random variable v takes on a value of a set of verbs, n a value of a set of nouns, and .s a value of a set of slot names. Similarly, the value assumed by a case slot of a case frame of a noun can be viewed as being generated by a conditional probability distribution: P(nln , s). We call this kind of conditional probability the 'three-word probability.' Moreover, we assume that the three-word probabilities in the case frame of an interpretation are mutually independent, and define the geometric mean of the three-word probabilities as the 'lexical likelihood' of the interpretation:</Paragraph>
      <Paragraph position="4"> where Pi is the ith three-word probability in the case frame of interpretation I, and m the number of three-word probabilities in it. The lexical likelihood values of the two interpretations in (3) and (4) are thus calculated as</Paragraph>
      <Paragraph position="6"> In disambiguation, we simply rank the interpretations according to their lexical likelihood values. If a verb (or a noun) has a strong tendency to require a certain noun as the value of its case frame slot, the estimated three-word probability for such a co-currence will be very high. To prefer an interpretation with a higher lexical likelihood value, then, is to prefer it based on its lexical preference.</Paragraph>
      <Paragraph position="7"> Specifically, in order to perform pp-attachment disambiguation in analysis of sentences like (1), we need only calculate and compare the values of P(spoonleat, with ) and P(spoonlice_cream,with ).</Paragraph>
      <Paragraph position="8"> In sentences like A number of companies sell and buy by computer, (9) the number of three-word probabilities in each of its respective case frames will be different. If we were to define a lexical likelihood as the product of the three-word probabilities in the case frame of an interpretation, an interpretation with fewer case slots would be preferred. We use the definition of lexical likelihood described above to avoid this problem. 4</Paragraph>
    </Section>
    <Section position="2" start_page="143" end_page="144" type="sub_section">
      <SectionTitle>
3.2 The data sparseness problem
</SectionTitle>
      <Paragraph position="0"> Hindle &amp; Rooth have previously proposed resolving pp-attachment ambiguities with 'two-word probabilities' (Hindle and Rooth, 1991), e.g., P(withlice_cream),P(withleat), but these are not accurate enough to represent lexical preference. For example, in the sentences, Britain reopened the embassy in December, Britain reopened the embassy in Teheran, (10) the pp-attachment sites of the two prepositional phrases are different. The attachment sites would be determined to be the same, however, if we were to use two-word probabilities (c:f.(Resnik, 1993)), and thus the ambiguity of only one of the sentences can be resolved. It is very likely, however, that this kind of ambiguity could be resolved satisfactorily by using the three-word probabilities. The number of para.meters that need to be estimated increases drastically when we use three-word probabilities, and the data available for estimation of the probability parameters usually are 4An alternative for resolving coordinate structure ambiguities is to employ a method which examines the similarity that exists between conjuncts (c.f.(Kutohashi and Na~ao, 1994; Resnik, 1993)).</Paragraph>
      <Paragraph position="1">  not sufficient in practice. If we employ the Maximum Likelihood Estimator, we may find most of the parameters are estimated to be 0: a problem often referred to, in statistical natural language processing, as the 'data sparseness problem.' (the motivation for using the two-word probabilities in (Hindle and Rooth, 1991) appears to be a desire to avoid the data sparseness problem. ) One may expect this problem to be less severe in the future, when more data are available. However, as data size increases, new words may appear, and the number of parameters that need to be estimated may increase as well. Thus, the data sparseness problem is unlikely to be resolved. A number of methods have been proposed, however, to cope with the data sparseness problem. Chang et al., for instance, have proposed replacing words with word classes and using class-based co-occurrence probabilities (Chang et al., 1992). However, forcibly replacing words with certain word classes is too loose an approximation, which, in practice, could seriously degrade disambiguation results. Resnik has defined a probabilistic measure called 'selectional association' in terms of the word classes existing in a given thesaurus. While Resnik's method is based on an interesting intuition, the justification of this method from the viewpoint of statistics is still not clear. We have devised a method of estimating the three-word probabilities in an efficient and theoretically sound way (Li and Abe, 1995). Our method selects optimal word classes according to the distribution of given data, and smoothes the three-word probabilities using the selected classes. Experimental results indicate that our method improves upon or is at least as effective as existing methods.</Paragraph>
      <Paragraph position="2"> Using our method of estimating (smoothing) probabilities, we can cope with the data sparseness problem. However, for the same reason as described above, the data sparseness problem cannot be resolved completely. We propose combining the use of three-word probabilities and that of two-word probabilities. Specifically, we first use the lexical likelihood value calculated as the geometric mean of the three-word probabilities of an interpretation, and when the lexical likelihood values of obtained interpretations are equal, including the case in which all of them are 0, we use the lexical likelihood value calculated as the geometric mean of the two-word probabilities of an interpretation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="144" end_page="148" type="metho">
    <SectionTitle>
4 RAP,ALPP, and Syntactic Likelihood
</SectionTitle>
    <Paragraph position="0"> In this section, we describe our probabilistic disambiguation method based on RAP and ALPP.</Paragraph>
    <Section position="1" start_page="144" end_page="144" type="sub_section">
      <SectionTitle>
4.1 The deterministic approach
</SectionTitle>
      <Paragraph position="0"> Shieber has previously proposed incorporating RAP into the mechanism of a shift-reduce parser (Shieber, 1983). When RAP is implemented, the parser prefers shift to reduce whenever a 'shiftreduce conflict' occurs. The advantage of this deterministic approach is its simple mechanism, while the disadvantage is that although it can output the most preferred interpretation, it cannot rank interpretations in their preferential order. In order to be able to rank interpretations in this way, it is necessary to construct a parser which operates stochastically, not deterministically.</Paragraph>
    </Section>
    <Section position="2" start_page="144" end_page="146" type="sub_section">
      <SectionTitle>
4.2 Formalizing a syntactic preference
</SectionTitle>
      <Paragraph position="0"> In this subsection, we formalize a syntactic preference based on RAP and ALPP. While we borrow from the terminology of HPSG (Pollard and Sag, 1987) in our reference to 'head' categories, we also use the single term 'modifier' categories to refer to categories which HPSG would classify as being either 'complements' or 'adjuncts.' We refer to that word which exhibits the subcategory feature of a category to be that category's 'head word.' Let us consider a simple case in which we are dealing with a modifier category M, a head category H, and the head word of H, w. We first apply CFG rule L -- H, M to H and M, yielding category L (see Figure l(a)). We refer to the number of words in a given sequence as 'distance.'  As may be seen in Figure l(a), the distance between M and w is d. RAP prefers an interpretation with a smaller d. Thus, syntactic preference can be represented by a monotonically decreasing function of d. Since in English the head word w of category H tends to locate near its left corner, we can approximate d as l, the number of words contained in H. In this paper, we call the number of words contained in a category the 'length' of that category. In addition, syntactic preference also depends on type of head category and modifier category. Assuming that 1 is known to be 5, if H is a verb phrase and M is a prepositional phrase, the preference value is likely to be high, but if H is a noun phrase and M is a prepositional phrase, it is likely to be low. Since category type can be specified within a CFG rule, syntactic preference can be defined as a function of a CFG rule.</Paragraph>
      <Paragraph position="1"> Syntactic preference based on RAP can be formalized, then, as a function of CFG rule L -- H, M and length l, namely, S(l, (L ~ H, M)). (11) Suppose that categories R1 and R2 form a coordinate structure, and 11 and 12 are the lengths of R1 and R2, respectively. ALPP prefers categories forming a coordinate structure to be of equal length (see Figure l(b)). Preference value will be high when ll equals 12, and syntactic preference based on ALPP 's can be defined as</Paragraph>
      <Paragraph position="3"> Further, suppose that categories R1, R2,..., Rk are combined into category A, and 11,12,..., lk are the lengths of R1, R2,..., Rk, respectively. Syntactic preference of the attachment can then be defined as S(ll, I2,..., Ik, (L ~ R1, R2,..., Rk)). (13) Note that (13) contains (11) and (12). Furthermore, we assume that the attachments in the syntactic tree of an interpretation are mutually independent, and we define the product (or the sum, depending on the preference function) of the syntactic preference values of the attachments in the syntactic tree of the interpretation as the syntactic preference of the interpretation:</Paragraph>
      <Paragraph position="5"> where Si denotes the syntactic preference value of the ith attachment in the syntactic tree of interpretation I, and m the number of attachments in it.</Paragraph>
    </Section>
    <Section position="3" start_page="146" end_page="147" type="sub_section">
      <SectionTitle>
4.3 The length probability
</SectionTitle>
      <Paragraph position="0"> We now consider how to specify the syntactic preference function in (13). As there are any number of ways to formulate the function (note the fact that syntactic preference is also a function of a CFG rule.), it is nearly impossible to find the most suitable formula experimentally. To cope with this problem, we used machine learning techniques (recall the merits of using machine learning techniques in disambiguation, as described in Section 2). Specifically, we have defined a probability model to calculate syntactic preference. Suppose that attachments represented by CFG rules and lengths are extracted from the correct syntactic trees in training data, and the frequency of each kind of attachment is obtained as f(ll, 12, .... 4, (L ~ R1, R2,. * *, Rk )), (15) where L ~ R1, R2,..., Rk denotes a CFG rule, and 11,12,..., Ik denote the lengths of R1, R2,. * *, Rk~ respectively. RAP prefers an interpretation attached to a nearer phrase, while ALPP prefers interpretations with attachments that are low and in parallel. Many such attachments may be observed in the training data, and we can formulate the frequencies of attachments (15) as a syntactic preference. Considering the fact that individual rules will be applied with different frequency, it is desirable to modify the syntactic preference to</Paragraph>
      <Paragraph position="2"> where f((L -+ R1, R2,..., Rk)) denotes the frequence of application of CFG rule L --+ R1, R~,..., Rk.</Paragraph>
      <Paragraph position="3"> This is precisely the 'length probability' we propose in this paper.</Paragraph>
      <Paragraph position="4"> Let us now define the length probability more formally. Suppose that an attachment is obtained after the application of C FG rule L -+ R1, R2,. * *, Rk, the lengths of R1, R2,..., Rk are 11,12,. *., 4, respectively. The attachment can be viewed as being generated by the following conditional distribution: null</Paragraph>
      <Paragraph position="6"> We call this kind of conditional probability the 'length probability.' 6 Furthermore, the syntactic likelihood of an interpretation is defined as the geometric mean of the length probabilities of the attachments in the syntactic tree of the interpretation, assuming that the attachments are mutually independent:</Paragraph>
      <Paragraph position="8"> where Pi is the ith length probability in the syntactic tree of interpretation I, and m the number of length probabilities in it. We define syntactic likelihood as the geometric mean of the length probabilities, rather than as the product of the length probabilities, in order to factor out the effect of the different number of attachments in the syntactic trees of individual interpretations. When training the length probabilities, the parameters in (16) may be estimated using the frequences in (15).</Paragraph>
      <Paragraph position="9"> Next, let us consider a simple example illustrating how the operation of this model indicates the functioning of RAP. For the phrase shown in Figure 2(a), there are two interpretations; RAP  hand side of a CFG rule, and N - the maximum value of lengths of a category on the left-hand side of the rule: ~i=k-1 k - 1 - 1 = k - 1. As k is very small (in our case k &lt; 3), the number of parameters in a length probability model is of N's polynomial order.</Paragraph>
      <Paragraph position="10">  would necessarily prefer the former. The difference between the syntactic likelihood values of the two interpretations is solely determined by</Paragraph>
      <Paragraph position="12"> First, let us compare the left-hand length probabilities of (19) and (20). Both represent an attachment of NP to P, and the length of P is 1 in both terms. Thus the two estimated probabilities may not differ so greatly. Next, compare the right-hand length probabilities in (19) and (20). While both represent an attachment of PP to NP, the length of NP of the former is 2 and that of the latter is 5. Thus the second length probability in (19) is likely to be higher than that in (20), as in training data there are more phrases attached to nearby phrases than are attached to distant ones.</Paragraph>
      <Paragraph position="13"> Therefore, when we use only the syntactic likelihood to perform disambiguation, we can expect the former interpretation in Figure 2(a) to be preferred, i.e., we have an indication of the functioning of RAP.</Paragraph>
      <Paragraph position="14"> Let us consider another example illustrating how the operation of the length probability model indicates the functioning of ALPP. For the sentence shown in Figure 2(b), there are two interpretations; ALPP would necessarily prefer the former. The difference between the syntactic likelihood values of the two interpretations is solely determined by</Paragraph>
      <Paragraph position="16"> First, let us compare the left-hand length probabilities in (21) and (22). Both represent an attachment of PP to VP, but the length of VP of the former is 3 and that of the latter is I. The left-hand probability in (21) is likely to be lower than that in (22). Next, compare the right-hand length probabilities in (21) and (22). Both represent a coordinate structure consisting of VPs. The lengths of VPs in the latter are equal, while the lengths of VPs in the former are not. Thus the right-hand probability in (21) is likely to be higher than that in (22). Moreover, the difference between the right-hand probabilities is likely to be higher than that between the left-hand probabilities, and thus the syntactic likelihood value of the former interpretation will be higher than that of the latter. Therefore, when we use only the syntactic likelihood to perform disambiguation, we can expect the former interpretation in Figure 2(b) to be preferred.</Paragraph>
    </Section>
    <Section position="4" start_page="147" end_page="148" type="sub_section">
      <SectionTitle>
4.4 Related work
</SectionTitle>
      <Paragraph position="0"> Another approach to disambiguation is to define a probability model and to rank interpretations on the basis of syntactic parsing. One method of this type employs the well-known PCFG (Probabilistic Context Free Grammar) model (Fujisaki, 1989; Jelinek et al., 1990; Lari and Young, 1990). In PCFG, a CFG rule having the form of a, -- 3 is associated with a conditional probability P(~Ia), and the likelihood of a syntactic tree is defined as the product of the conditional probabilities of the rules which are applied in the derivation of that tree. Other methods have also been proposed.</Paragraph>
      <Paragraph position="1"> Magerman ~ Marcus, for instance, have proposed making use of a conditional probability model specifying a conditional probability of a CFG rule, given the part-of-speech trigram it dominates and its parent rule (Magerman and Marcus, 1991). Black et al. have defined a richer model to utilize all the information in the top-down derivation of a non-terminal (Black et al., 1992). Briscoe &amp; Carroll have proposed using a probabilistic model specific to LR parsing (Briscoe and Carroll, 1993).</Paragraph>
      <Paragraph position="2">  The advantage of the syntactic parsing approach is that it mGv embody heuristics (principles) effective in disambiguation, which would not have been thought of by humans, but it also risks not embodying heuristics (principles) already known to be effective in disambiguation. For example, the two interpretations of the noun phrase shown in Figure 2(a) have an equal likelihood value, if we employ PCFG, although the former would be preferred according to RAP.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="148" end_page="148" type="metho">
    <SectionTitle>
5 The Back-Off Method
</SectionTitle>
    <Paragraph position="0"> Having defined a lexical likelihood based on LPR and a syntactic likelihood based on RAP and ALPP, we may next consider how to combine the two kinds of likelihood in disambiguation. One choice is to calculate total preference as a weighted average of likelihood values, as proposed in (Alshawi and Carter, 1995). However since LPR overrides RAP and ALPP, a simpler approach is to adopt the back-off method, i.e., to rank interpretations/1 and I2 as follows:  1. if Plex(I1)- Pl=(Is) &gt; r/ then /1 &gt;/2 2. else if Plex(I2) - Plex(I1) &gt; 7/ then Is &gt;/1 3. else if P~yn(I1)- P~yn(Is) &gt; r then /1 &gt; Is 4. else if P~yn(Is)- P~yn(I1) &gt; r then /2 &gt;/1 (23)  where/1 and/2 denote any two interpretations, Pl=() denotes the lexical likelihood of an interpretation, and Psyn() the syntactic likelihood of an interpretation. ~ &gt; 0 and r &gt; 0 are thresholds (in the experiment described later, both are set to 0). Note that in lines 3 and 4, IPtex(I1)-Pzex(I2)l &lt; r I holds. Further note that the preferential order cannot be determined (or can only be determined</Paragraph>
    <Paragraph position="2"/>
  </Section>
class="xml-element"></Paper>