<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1054"> <Title>SEMANTIC CLASSES AND SYNTACTIC AMBIGUITY</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> The problem of syntactic ambiguity is a pervasive one. As Church and Patil [2] point out, the class of "every way ambiguous" constructions -- those for which the number of analyses is the number of binary trees over the terminal elements -- includes such frequent constructions as prepositional phrases, coordination, and nominal compounds. They suggest that until it has more useful constraints for resolving ambiguities, a parser can do little better than to efficiently record all the possible attachments and move on.</Paragraph> <Paragraph position="1"> In general, it may be that such constraints can only be supplied by analysis of the context, domain-dependent knowledge, or other complex inferential processes. However, we will suggest that in many cases, syntactic ambiguity can be resolved with the help of an extremely limited form of semantic knowledge, closely tied to the lexical items in the sentence.</Paragraph> <Paragraph position="2"> We focus on two relationships: selectional preference and semantic similarity. From one perspective, the proposals here can be viewed as an attempt to provide new formalizations for familiar but seldom carefully defined linguistic notions; elsewhere we demonstrate the utility of this approach in linguistic explanation [11]. From another perspective, the work reported here can be viewed as an attempt to generalize statistical natural language techniques based on lexical associations, using knowledge-based rather than distributionally derived word classes.</Paragraph> <Paragraph position="3"> * This research has been supported by an IBM graduate fellowship and by DARPA grant N00014-90-J-1863. The comments of Eric Brill, Marti Hearst, Jamie Henderson, Aravind Joshi, Mark Liberman, Mitch Marcus, Michael Niv, and David Yarowsky are gratefully acknowledged.</Paragraph> </Section> <Section position="4" start_page="0" end_page="278" type="metho"> <SectionTitle> 2. CLASS-BASED STATISTICS </SectionTitle> <Paragraph position="0"> A number of researchers have explored using lexical co-occurrences in text corpora to induce word classes [1, 5, 9, 12], with results that are generally evaluated by inspecting the semantic cohesiveness of the distributional classes that result. In this work, we are investigating the alternative of using WordNet, an explicitly semantic, broad-coverage lexical database, to define the space of semantic classes. Although WordNet is subject to the attendant disadvantages of any hand-constructed knowledge base, we have found that it provides an acceptable foundation upon which to build corpus-based techniques [10]. This affords us a clear distinction between domain-independent and corpus-specific sources of information, and a well-understood taxonomic representation for the domain-independent knowledge.</Paragraph> <Paragraph position="1"> Although WordNet includes data for several parts of speech, and encodes numerous semantic relationships (meronymy, antonymy, verb entailment, etc.), in this work we use only the noun taxonomy -- specifically, the mapping from words to word classes, and the traditional IS-A relationship between classes. For example, the word newspaper belongs to the classes (newsprint) and (paper), among others, and these are immediate subclasses of (material) and (publisher), respectively. Class frequencies are estimated on the basis of lexical frequencies in text corpora. The frequency of a class c is estimated using the lexical frequencies of its members, as follows:</Paragraph> <Paragraph position="2"> \( \widehat{\mathrm{freq}}(c) = \sum_{n \in \mathrm{words}(c)} \mathrm{freq}(n), \qquad \mathrm{words}(c) = \{\, n \mid n \text{ is subsumed by } c \,\} \)</Paragraph> <Paragraph position="3"> The class probabilities used in the section that follows can then be estimated by simply normalizing (MLE) or by other estimation methods.</Paragraph> </Section>
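To make the word-to-class mapping and the class-frequency estimate concrete, here is a minimal Python sketch. It is an illustration, not the paper's implementation: it uses NLTK's WordNet interface (a much later WordNet version than the 1993 database), the lexical counts are invented toy values, and the function names are ours.

```python
from collections import Counter
from nltk.corpus import wordnet as wn

def noun_classes(word):
    """All WordNet classes subsuming some noun sense of `word`
    (the word's own synsets plus their hypernym closure)."""
    classes = set()
    for sense in wn.synsets(word, pos=wn.NOUN):
        classes.add(sense)
        classes.update(sense.closure(lambda s: s.hypernyms()))
    return classes

def class_frequencies(lexical_counts):
    """freq(c) = sum of freq(n) over the nouns n subsumed by class c."""
    freq = Counter()
    for noun, count in lexical_counts.items():
        for c in noun_classes(noun):
            freq[c] += count
    return freq

# Toy lexical frequencies standing in for corpus counts:
lexical_counts = {"newspaper": 120, "wine": 300, "gasoline": 80}
freq = class_frequencies(lexical_counts)

# MLE class probabilities: normalize by the total number of noun tokens.
total = sum(lexical_counts.values())
prob = {c: f / total for c, f in freq.items()}
```

Because a noun contributes its full count to every class that subsumes any of its senses, a class high in the taxonomy accumulates the counts of all the nouns below it, which is exactly what makes general classes more probable than specific ones.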
<Section position="5" start_page="278" end_page="279" type="metho"> <SectionTitle> 3. CONCEPTUAL RELATIONSHIPS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="278" end_page="278" type="sub_section"> <SectionTitle> 3.1. Selectional Preference </SectionTitle> <Paragraph position="0"> The term "selectional preference" has been used by linguists to characterize the source of anomaly in sentences such as (1b), and more generally to describe a class of restrictions on co-occurrence that is orthogonal to syntactic constraints.</Paragraph> <Paragraph position="1"> (1) a. John admires sincerity.
    b. Sincerity admires John.</Paragraph> <Paragraph position="2"> (2) a. Mary drank some wine.
    b. Mary drank some gasoline.
    c. Mary drank some pencils.
    d. Mary drank some sadness.</Paragraph> <Paragraph position="3"> Although selectional preference is traditionally formalized in terms of feature agreement using notations like [+Animate], such formalizations often fail to specify the set of allowable features, or to capture the gradedness of qualitative differences such as those in (2).</Paragraph> <Paragraph position="4"> As an alternative, we have proposed the following formalization of selectional preference [11]: Definition. The selectional preference of w for C is the relative entropy (Kullback-Leibler distance) between the prior distribution Pr(C) and the posterior distribution Pr(C | w):</Paragraph> <Paragraph position="5"> \( \mathrm{Sel}(w) = D\big( \Pr(C \mid w) \,\big\|\, \Pr(C) \big) = \sum_{c} \Pr(c \mid w) \log \frac{\Pr(c \mid w)}{\Pr(c)} \)</Paragraph> <Paragraph position="6"> Here w is a word with selectional properties, C ranges over semantic classes, and co-occurrences are counted with respect to a particular argument -- e.g., verbs and direct objects, nominal modifiers and the head noun they modify, and so forth. Intuitively, this definition works by comparing the distribution of argument classes without knowing what the word is (e.g., the a priori likelihood of classes in direct object position) to the distribution with respect to the word. If these distributions are very different, as measured by relative entropy, then the word has a strong influence on what can or cannot appear in that argument position, and we say that it has a strong selectional preference for that argument.</Paragraph> <Paragraph position="7"> The "goodness of fit" between a word and a particular class of arguments is captured by the following definition: Definition. The selectional association of w with c is the contribution c makes to the selectional preference of w:</Paragraph> <Paragraph position="8"> \( A(w, c) = \frac{1}{\mathrm{Sel}(w)} \Pr(c \mid w) \log \frac{\Pr(c \mid w)}{\Pr(c)} \)  (4)</Paragraph> <Paragraph position="9"> The selectional association A(w1, w2) of two words is taken to be the maximum of A(w1, c) over all classes c to which w2 belongs.</Paragraph> <Paragraph position="10">
VERB   ARGUMENT   "BEST" ARGUMENT CLASS     A
drink  wine       (beverage)                 0.088
drink  gasoline   (substance)                0.075
drink  pencil     (object)                   0.030
drink  sadness    (psychological_feature)   -0.001

The above table illustrates how this definition captures the qualitative differences in example (2). The "best" class for an argument is the class that maximizes selectional association. Notice that finding that class represents a form of sense disambiguation using local context (cf. [15]): of all the classes to which the noun wine belongs -- including (alcohol), (substance), (red), and (color), among others -- the class (beverage) is the sense of wine most appropriate as a direct object for drink.</Paragraph> </Section>
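A minimal sketch of these two measures, assuming class distributions have already been estimated (the distributions below are invented toy numbers, not the paper's data, and the function names are ours):

```python
import math

def selectional_preference(prior, posterior):
    """Sel(w): relative entropy D( Pr(C|w) || Pr(C) ) over argument classes.
    `prior` and `posterior` map class -> probability."""
    return sum(p * math.log(p / prior[c])
               for c, p in posterior.items() if p > 0.0)

def selectional_association(prior, posterior, c):
    """A(w, c): the share of Sel(w) contributed by class c (may be negative
    when c is less likely as an argument of w than it is a priori)."""
    sel = selectional_preference(prior, posterior)
    p = posterior.get(c, 0.0)
    if p == 0.0 or sel == 0.0:
        return 0.0
    return p * math.log(p / prior[c]) / sel

# Invented class distributions for the direct-object slot of "drink":
prior = {"beverage": 0.05, "substance": 0.25, "object": 0.55, "feeling": 0.15}
posterior = {"beverage": 0.60, "substance": 0.25, "object": 0.10, "feeling": 0.05}

best = max(posterior, key=lambda c: selectional_association(prior, posterior, c))
# -> "beverage": the class maximizing selectional association
```

Note how the sign of each term falls out of the definition: a class that is more likely as an argument of the word than a priori contributes positively, mirroring the negative association of (psychological_feature) with drink in the table above.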
<Section position="2" start_page="278" end_page="279" type="sub_section"> <SectionTitle> 3.2. Semantic Similarity </SectionTitle> <Paragraph position="0"> Any number of factors influence judgements of semantic similarity between two nouns. Here we propose to use only one source of information: the relationship between classes in the WordNet IS-A taxonomy. Intuitively, two noun classes can be considered similar when there is a single, specific class that subsumes them both -- if you have to travel very high in the taxonomy to find a class that subsumes both classes, in the extreme case all the way to the top, then they cannot have all that much in common. For example, (nickel) and (dime) are both immediately subsumed by (coin), whereas the most specific superclass that (nickel) and (mortgage) share is (possession).</Paragraph> <Paragraph position="1"> The difficulty, of course, is how to determine which superclass is "most specific." Simply counting IS-A links in the taxonomy can be misleading, since a single link can represent a fine-grained distinction in one part of the taxonomy (e.g., (zebra) IS-A (equine)) and a very large distinction elsewhere (e.g., (carcinogen) IS-A (substance)).</Paragraph> <Paragraph position="2"> Rather than counting links, we use the information content of a class to measure its specificity (i.e., -log Pr(c)); this permits us to define noun similarity as follows: Definition. The semantic similarity of n1 and n2 is</Paragraph> <Paragraph position="3"> \( \mathrm{sim}(n_1, n_2) = \sum_{i} \alpha_i \left[ -\log \Pr(c_i) \right] \)  (5)</Paragraph> <Paragraph position="4"> where {c_i} is the set of classes dominating both n1 and n2. The α_i, which sum to 1, are used to weight the contribution of each class -- for example, in accordance with word sense probabilities. In the absence of word sense constraints we can compute the "globally" most specific class simply by setting α_i to 1 for the class maximizing -log Pr(c_i), and 0 otherwise. For example, according to that "global" measure, sim(nickel, dime) = 12.71 (= -log Pr((coin))) and sim(nickel, mortgage) = 7.61 (= -log Pr((possession))).</Paragraph> </Section> </Section>
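A sketch of the "global" measure, reusing the class-probability mapping from the earlier class-frequency sketch (the helper names are ours, and the taxonomy is NLTK's modern WordNet rather than the 1993 version):

```python
import math
from nltk.corpus import wordnet as wn

def all_classes(word):
    """All classes dominating some noun sense of `word`."""
    cs = set()
    for s in wn.synsets(word, pos=wn.NOUN):
        cs.add(s)
        cs.update(s.closure(lambda x: x.hypernyms()))
    return cs

def similarity(n1, n2, prob):
    """'Global' sim: information content -log Pr(c) of the most specific
    class dominating both nouns; `prob` maps class -> probability."""
    shared = all_classes(n1) & all_classes(n2)
    return max((-math.log(prob[c]) for c in shared if prob.get(c)),
               default=0.0)
```

NLTK's `synset.res_similarity` implements essentially this information-content measure for a pair of synsets, given an information-content dictionary; the sketch above additionally maximizes over the sense pairs of the two words, as the "global" measure requires.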
<Section position="6" start_page="279" end_page="282" type="metho"> <SectionTitle> 4. SYNTACTIC AMBIGUITY </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="279" end_page="280" type="sub_section"> <SectionTitle> 4.1. Coordination and Nominal Compounds </SectionTitle> <Paragraph position="0"> Having proposed formalizations of selectional preference and semantic similarity as information-theoretic relationships involving conceptual classes, we now turn to the application of these ideas to the resolution of syntactic ambiguity.</Paragraph> <Paragraph position="1"> Ambiguous coordination is a common source of parsing difficulty. In this study, we investigated the application of class-based statistical methods to a particular subset of coordinations, noun phrase conjunctions of the form noun1 and noun2 noun3, as in (3):

(3) a. a (bank and warehouse) guard
    b. a (policeman) and (park guard)

Such structures admit two analyses, one in which noun1 and noun2 are the two heads being conjoined (3a) and one in which the conjoined heads are noun1 and noun3 (3b).</Paragraph> <Paragraph position="2"> As pointed out by Kurohashi and Nagao [7], similarity of form and similarity of meaning are important cues to conjoinability. In English, similarity of form is to a great extent captured by agreement in number:

(4) a. several business and university groups
    b. several businesses and university groups

Semantic similarity of the conjoined heads also appears to play an important role:

(5) a. a television and radio personality
    b. a psychologist and sex researcher

In addition, for this particular construction, the appropriateness of noun-noun modification for noun1 and noun3 is relevant:

(6) a. mail and securities fraud
    b. corn and peanut butter</Paragraph> <Paragraph position="3"> We investigated the roles of these cues by conducting a disambiguation experiment using the definitions in the previous section. Two sets of 100 noun phrases of the form [NP noun1 and noun2 noun3] were extracted from the Wall Street Journal (WSJ) corpus in the Penn Treebank and disambiguated by hand, with one set to be used for development and the other for testing.3 A set of simple transformations was applied to all WSJ data, including the mapping of all proper names to the token someone, the expansion of month abbreviations, and the reduction of all nouns to their root forms. (Footnote 3: Hand disambiguation was necessary because the Penn Treebank does not encode NP-internal structure. These phrases were disambiguated using the full sentence in which they occurred, plus the previous and following sentence, as context.)</Paragraph> <Paragraph position="4"> Similarity of form, defined as agreement of number, was determined using a simple analysis of suffixes in combination with WordNet's database of nouns and noun exceptions. Similarity of meaning was determined "globally" as in equation (5) and the example that followed; noun class probabilities were estimated using a sample of approximately 800,000 noun occurrences in Associated Press newswire stories. For the purpose of determining semantic similarity, nouns not in WordNet were treated as instances of the class (thing). Appropriateness of noun-noun modification was determined using selectional association as defined in equation (4), with co-occurrence frequencies calculated using a sample of approximately 15,000 noun-noun compounds extracted from the WSJ corpus. (This sample did not include the test data.) Both selection of the modifier for the head and selection of the head for the modifier were considered.</Paragraph>
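The paper does not spell out the suffix analysis, but a rough sketch of the number-agreement test under that description can lean on NLTK's WordNet morphological analyzer (`morphy`), which consults WordNet's noun exception lists; the function names are ours:

```python
from nltk.corpus import wordnet as wn

def is_plural_noun(noun):
    """Crude number check: treat a form as plural when WordNet's
    morphological analysis maps it to a different base form."""
    base = wn.morphy(noun, wn.NOUN)
    return base is not None and base != noun

def same_number(n1, n2):
    """Form similarity as agreement in number."""
    return is_plural_noun(n1) == is_plural_noun(n2)

# same_number("businesses", "groups") -> True   (both plural)
# same_number("business", "groups")   -> False  (singular vs. plural)
```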
<Paragraph position="5"> Each of the three sources of information -- form similarity, meaning similarity, and modification relationships -- was used alone as a disambiguation strategy, as follows:

* Form:
  - If noun1 and noun2 match in number and noun1 and noun3 do not, then conjoin noun1 and noun2;
  - if noun1 and noun3 match in number and noun1 and noun2 do not, then conjoin noun1 and noun3;
  - otherwise remain undecided.

* Meaning:
  - If sim(noun1, noun2) > sim(noun1, noun3), then conjoin noun1 and noun2;
  - if sim(noun1, noun3) > sim(noun1, noun2), then conjoin noun1 and noun3;
  - otherwise remain undecided.

* Modification:
  - If A(noun1, noun3) > τ, a threshold, or if A(noun3, noun1) > τ, then conjoin noun1 and noun3;
  - if A(noun1, noun3) < τ and A(noun3, noun1) < τ, then conjoin noun1 and noun2;
  - otherwise remain undecided.

The three strategies were also combined, by (a) "backing off" (considering the strategies in that order, use the first strategy that isn't undecided); (b) taking a "vote" among the three strategies and choosing the majority; (c) classifying using the results of a linear regression; and (d) constructing a decision tree classifier.</Paragraph>
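A minimal sketch of the "backing off" combination, assuming each individual strategy is a function that returns a bracketing or None when undecided; the `same_number` helper from the earlier sketch stands in for the form test, and the names and fallback behavior are ours:

```python
def form_strategy(n1, n2, n3):
    """Form: decide by number agreement, else remain undecided (None)."""
    if same_number(n1, n2) and not same_number(n1, n3):
        return "conjoin noun1 and noun2"
    if same_number(n1, n3) and not same_number(n1, n2):
        return "conjoin noun1 and noun3"
    return None

def back_off(np, strategies):
    """Use the first strategy, in order, that is not undecided."""
    for strategy in strategies:
        answer = strategy(*np)
        if answer is not None:
            return answer
    # If every strategy is undecided, fall back to the bracketing
    # favored in the development set (an assumption on our part).
    return "conjoin noun1 and noun2"

# back_off(("businesses", "university", "groups"), [form_strategy])
# -> "conjoin noun1 and noun3"  ("several businesses and university groups")
```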
<Paragraph position="6"> The training set contained a bias in favor of conjoining noun1 and noun2, so a "default" strategy -- always choosing that bracketing -- was used as a baseline. Not surprisingly, the individual strategies perform reasonably well on the instances they can classify, but recall is poor; the strategy based on similarity of form is highly accurate, but arrives at an answer only half the time. Of the combined strategies, the "backing off" approach succeeds in answering 95% of the time and achieving 81.1% precision -- a reduction of 44.4% in the baseline error rate.</Paragraph> <Paragraph position="7"> We have recently begun to investigate the disambiguation of more complex coordinations of the form [NP noun1 noun2 and noun3 noun4], which permit five possible bracketings:

(7) a. freshman ((business and marketing) major)
    b. (food (handling and storage)) procedures
    c. ((mail fraud) and bribery) charges
    d. Clorets (gum and (breath mints))
    e. (baby food) and (puppy chow)

These bracketings comprise two groups, those that conjoin noun2 and noun3 (a-c) and those that conjoin noun2 and noun4 (d-e). Rather than tackling the five-way disambiguation problem immediately, we began with an experimental task of classifying a noun phrase as belonging to one of these two groups.</Paragraph> <Paragraph position="8"> We examined three classification strategies. First, we used the form-based strategy described above. Second, as before, we used a strategy based on semantic similarity; this time, however, selectional association was used to determine the α_i in equation (5), incorporating modifier-head relationships into the semantic similarity strategy. Third, we used "backing off" (from form similarity to semantic similarity) to combine the two individual strategies. As before, one set of items was used for development, and another set (89 items) was set aside for testing. As a baseline, results were evaluated against a simple default strategy of always choosing the group that was more common in the development set.</Paragraph> <Paragraph position="9"> In this case, the default strategy defined using the development set was misleading, leading to worse than chance precision. However, even if default choices were made using the bias found in the test set, precision would be only 55.1%. The results make it clear that the strategies using form and meaning are far more accurate, and that combining them leads to good coverage and precision.</Paragraph> <Paragraph position="10"> The pattern of results in these two experiments demonstrates a significant reduction in syntactic misanalyses for this construction as compared to the simple baseline, and it confirms that form, meaning, and modification relationships all play a role in disambiguation. In addition, these results confirm the effectiveness of the proposed definitions of selectional preference and semantic similarity.</Paragraph> </Section> <Section position="2" start_page="280" end_page="282" type="sub_section"> <SectionTitle> 4.2. Prepositional Phrase Attachment </SectionTitle> <Paragraph position="0"> Prepositional phrase attachment represents another important form of parsing ambiguity. Empirical investigation [14] suggests that lexical preferences play an important role in disambiguation, and Hindle and Rooth [5] have demonstrated that these preferences can be acquired and utilized using lexical co-occurrence statistics.</Paragraph> <Paragraph position="1">

(8) a. They foresee little progress in exports.
    b. [VP foresee [NP little progress [PP in exports]]]
    c. [VP foresee [NP little progress] [PP in exports]]

Given an example such as (8a), Hindle and Rooth's "lexical association" strategy chooses between bracketings (8b) and (8c) by comparing Pr(in | foresee) with Pr(in | progress) and evaluating the direction and significance of the difference between the two conditional probabilities. The object of the preposition is ignored, presumably because the data would be far too sparse if it were included.</Paragraph> <Paragraph position="2"> As Hearst and Church [4] observe, however, the object of the preposition can provide crucial information for determining attachment, as illustrated in (9):

(9) a. Britain reopened its embassy in December.
    b. Britain reopened its embassy in Teheran.

Hoping to overcome the sparseness problem and use this information, we formulated a strategy of "conceptual association," according to which the objects of the verb and preposition are treated as members of semantic classes and the two potential attachment sites are evaluated using class-based rather than lexical statistics.</Paragraph> <Paragraph position="3"> The alternative attachment sites -- verb-attachment and noun-attachment -- were evaluated according to the following criteria:

\( \mathrm{vscore}(v, \mathrm{PP}) = I(v; \mathrm{PP}) = \log \frac{\Pr(v, \mathrm{PP})}{\Pr(v)\,\Pr(\mathrm{PP})} \)  (6)

\( \mathrm{nscore}(\mathit{class}_1, \mathrm{PP}) = I(\mathit{class}_1; \mathrm{PP}) = \log \frac{\Pr(\mathit{class}_1, \mathrm{PP})}{\Pr(\mathit{class}_1)\,\Pr(\mathrm{PP})} \)  (7)

where PP is an abbreviation for (preposition, class2), and class1 and class2 are classes to which the object of the verb and object of the preposition belong, respectively. These scores were used rather than conditional probabilities Pr(PP | v) and Pr(PP | class1) because, given a set of possible classes to use as class2 (e.g., export is a member of (export), (commerce), (group_action), and (human_action)), conditional probability will always favor the most general class. In contrast, comparing equations (6) and (7) with equation (4), the verb- and noun-attachment scores resemble the selectional association of the verb and noun with the prepositional phrase.</Paragraph>
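A sketch of these class-based scores, under stated assumptions: `probs` is a hypothetical lookup object supplying joint and marginal probabilities estimated from training data (not an interface from the paper), and classes are represented as plain strings.

```python
import math

def mutual_information(p_joint, p_x, p_y):
    """Pointwise mutual information: log [ Pr(x, y) / (Pr(x) Pr(y)) ]."""
    return math.log(p_joint / (p_x * p_y))

def attachment_scores(verb, prep, verb_obj_classes, prep_obj_classes, probs):
    """For each classification class2 of the preposition's object, pair
    vscore = I(verb; PP) with nscore = I(class1; PP), where class1 is
    chosen to maximize I(class1; PP), as described in the text."""
    pairs = []
    for class2 in prep_obj_classes:
        pp = (prep, class2)
        v = mutual_information(probs.joint(verb, pp),
                               probs.p(verb), probs.p(pp))
        n = max(mutual_information(probs.joint(c1, pp),
                                   probs.p(c1), probs.p(pp))
                for c1 in verb_obj_classes)
        pairs.append((v, n))
    return pairs
```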
<Paragraph position="4"> Because nouns belong to many classes, we required some way to combine scores obtained under different classifications. Rather than considering the entire cross-product of classifications for the object of the verb and the object of the preposition, we chose first to consider all possible classifications of the object of the preposition, and then to classify the object of the verb by choosing class1 so as to maximize I(class1; PP). For example, sentence (8a) yields one (vscore, nscore) pair for each classification of exports. The "conceptual association" strategy merges evidence from alternative classifications in an extremely simple way: by performing a paired-samples t-test on the nscores and vscores, and preferring attachment to the noun if t is positive, and to the verb if negative. A combined strategy uses this preference if t is significant at p < .1, and otherwise uses the lexical association preference. For example (8a), t(3) = 3.57, p < .05, with (8b) being the resulting choice of bracketing.</Paragraph>
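A sketch of that merging step, using SciPy's paired-samples t-test over the (vscore, nscore) pairs from the previous sketch; the significance threshold follows the text, and the function name is ours:

```python
from scipy.stats import ttest_rel

def conceptual_preference(pairs, alpha=0.1):
    """Paired-samples t-test over per-classification (vscore, nscore) pairs.
    Positive t favors noun attachment, negative favors verb attachment;
    return None when not significant, so a combined strategy can fall
    back to the lexical association preference."""
    vscores = [v for v, _ in pairs]
    nscores = [n for _, n in pairs]
    t, p = ttest_rel(nscores, vscores)
    if p < alpha:
        return "noun" if t > 0 else "verb"
    return None
```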
<Paragraph position="5"> We evaluated this technique using the Penn Treebank Wall Street Journal corpus, comparing the performance of lexical association alone (LA), conceptual association alone (CA), and the combined strategy (COMBINED) on a held-out set of 174 ambiguous cases. We also measured performance when the individual strategies were constrained to answer only when confident (|t| > 2.1 for lexical association, p < .1 for conceptual association).</Paragraph> <Paragraph position="6"> Despite the fact that this experiment used an order of magnitude less training data than Hindle and Rooth's, their lexical association strategy performed quite a bit better than in the experiments reported in [5], presumably because this experiment used hand-disambiguated rather than heuristically disambiguated training data.</Paragraph> <Paragraph position="7"> In this experiment, the bottom-line performance of the conceptual association strategy is worse than that of lexical association, and the combined strategy yields at best a marginal improvement. However, several observations are in order.</Paragraph> <Paragraph position="8"> First, the coverage and precision achieved by conceptual association demonstrate some utility of class information, since the lexical data are impossibly sparse when the object of the preposition is included. Second, a qualitative evaluation of what conceptual association actually did shows that it is capturing relevant relationships for disambiguation.

(10) To keep his schedule on track, he flies two personal secretaries in from Little Rock to augment his staff in Dallas.

For example, augment and in never co-occur in the training corpus, and neither do staff and in; as a result, the lexical association strategy makes an incorrect choice for the ambiguous verb phrase in (10). However, the conceptual association strategy makes the correct choice on the basis of the class-based scores for this example.</Paragraph> <Paragraph position="9"> Third, mutual information appears to be a successful way to select appropriate classifications for the direct object, given a classification of the object of the preposition. For example, despite the fact that staff belongs to 25 classes in WordNet -- including (musical_notation) and (rod), for instance -- the classes to which it is assigned seem contextually appropriate. Finally, it is clear that in many instances the paired t-test, which effectively takes an unweighted average over multiple classifications, is a poor way to combine sources of evidence.</Paragraph> <Paragraph position="10"> In two additional experiments, we examined the effect of semantic classes on robustness, since presumably a domain-independent source of noun classes should be able to mitigate the effects of a mismatch between training data and test data. In the first of these experiments, we used the WSJ training material and tested on 173 instances from Associated Press newswire. These additional experiments demonstrate large increases in coverage when confident (55-65%) with only moderate decreases in precision (< 5%). Overall, the results of the three experiments seem promising, and suggest that further work on conceptual association will yield improvements over disambiguation strategies using lexical association alone.</Paragraph> </Section> </Section> </Paper>