<?xml version="1.0" standalone="yes"?> <Paper uid="J98-2002"> <Title>Generalizing Case Frames Using a Thesaurus and the MDL Principle</Title> <Section position="8" start_page="229" end_page="238" type="evalu"> <SectionTitle> 4. Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="229" end_page="231" type="sub_section"> <SectionTitle> 4.1 Experiment 1: A Qualitative Evaluation </SectionTitle> <Paragraph position="0"> We applied our generalization method to large corpora and inspected the obtained tree cut models to see if they agreed with human intuition. In our experiments, we extracted verbs and their case frame slots (verb, slot_name, slot_value triples) from the tagged texts of the Wall Street Journal corpus (ACL/DCI CD-ROM1), consisting of 126,084 sentences, using existing techniques (specifically, those in Smadja [1993]), and then applied our method to generalize the slot_values. Table 6 shows some example triple data for the direct object slot of the verb eat. [Footnote 9: There are several possible measures that one could take to address this issue, including the incorporation of absolute frequencies of the words (inside and outside the particular slot in question). This is outside the scope of the present paper, and we simply refer the interested reader to one possible approach (Abe and Li 1996).]</Paragraph> <Paragraph position="1"> Table 6: Example input data (for the direct object slot of eat).

eat arg2 food 3        eat arg2 lobster 1    eat arg2 seed 1
eat arg2 heart 2       eat arg2 liver 1      eat arg2 plant 1
eat arg2 sandwich 2    eat arg2 crab 1       eat arg2 elephant 1
eat arg2 meal 2        eat arg2 rope 1       eat arg2 seafood 1
eat arg2 amount 2      eat arg2 horse 1      eat arg2 mushroom 1
eat arg2 night 2       eat arg2 bug 1        eat arg2 ketchup 1
eat arg2 lunch 2       eat arg2 bowl 1       eat arg2 sawdust 1
eat arg2 snack 2       eat arg2 month 1      eat arg2 egg 1
eat arg2 jam 2         eat arg2 effect 1     eat arg2 sprout 1
eat arg2 diet 1        eat arg2 debt 1       eat arg2 nail 1
eat arg2 pizza 1       eat arg2 oyster 1
</Paragraph> <Paragraph position="3"> There were some extraction errors present in the data, but we chose not to remove them, because in general there will always be extraction errors, and a realistic evaluation should leave them in.</Paragraph> <Paragraph position="4"> When generalizing, we used the noun taxonomy of WordNet (version 1.4) (Miller 1995) as our thesaurus. The noun taxonomy of WordNet has the structure of a directed acyclic graph (DAG); each of its nodes stands for a word sense (a concept) and often contains several words having the same word sense. WordNet thus deviates from our notion of a thesaurus, namely, a tree in which each leaf node stands for a noun, each internal node stands for the class of nouns below it, and a noun is uniquely represented by a leaf node. We therefore took a few measures to deal with this.</Paragraph> <Paragraph position="5"> First, we modified our algorithm Find-MDL so that it can be applied to a DAG: the modified Find-MDL effectively copies each subgraph having multiple parents (and its associated data) so that the DAG is transformed into a tree structure. Note that with this modification it is no longer guaranteed that the output model is optimal. Next, we dealt heuristically with the issue of word-sense ambiguity by equally dividing the observed frequency of a noun among all the nodes containing that noun. Finally, when an internal node contained nouns actually occurring in the data, we assigned the frequencies of all the nodes below it to that internal node and excised the whole subtree (subgraph) below it. The last of these measures, in effect, defines the &quot;starting cut&quot; of the thesaurus from which to begin generalizing. Since (word senses of) nouns that occur in natural language tend to concentrate in the middle of a taxonomy, the starting cut given by this method usually falls around the middle of the thesaurus. [Footnote 10: Cognitive scientists have observed that concepts in the middle of a taxonomy tend to be more important with respect to learning, recognition, and memory, and that their linguistic expressions occur more frequently in natural language, a phenomenon known as basic level primacy. See Lakoff (1987).]</Paragraph>
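<Paragraph> To make the first two of these measures concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions rather than the implementation used in the paper: the toy DAG, the senses mapping, and all identifiers (children, split_frequencies, copy_to_tree) are hypothetical, and real WordNet access would of course require additional machinery.

from collections import defaultdict
from itertools import count

# Hypothetical toy DAG: node -> children. Here "fruit" has two
# parents, so it is reachable from both "food" and "plant_part".
children = {
    "entity": ["food", "plant_part"],
    "food": ["fruit"],
    "plant_part": ["fruit"],
    "fruit": [],
}

# noun -> nodes (word senses) containing it, plus observed counts.
senses = {"apple": ["fruit"]}
observed = {"apple": 3}

def split_frequencies(observed, senses):
    """Divide each noun's observed frequency equally among all nodes
    containing it (the paper's heuristic for word-sense ambiguity)."""
    freq = defaultdict(float)
    for noun, n in observed.items():
        for node in senses[noun]:
            freq[node] += n / len(senses[noun])
    return freq

uid = count()

def copy_to_tree(node, children, freq):
    """Transform the DAG into a tree: a node reachable through several
    parents is duplicated once per path, together with its associated
    data, so that downstream code can assume a tree structure."""
    return {
        "id": next(uid),  # fresh identity for this copy
        "label": node,
        "freq": freq.get(node, 0.0),
        "children": [copy_to_tree(c, children, freq) for c in children[node]],
    }

tree = copy_to_tree("entity", children, split_frequencies(observed, senses))

As the paper notes, this copying step is what makes optimality of the resulting model no longer guaranteed: the same underlying WordNet node can now appear in several places in the tree. </Paragraph>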
<Paragraph position="6"> Figure 9 shows the starting cut and the resulting cut in WordNet for the direct object slot of eat with respect to the data in Table 6, where /.../ denotes a node in WordNet. The starting cut consists of the nodes /plant .../, /food/, etc., which are the highest nodes containing values of the direct object slot of eat. Since /food/ has significantly higher frequencies than its neighbors /solid/ and /fluid/, the generalization stops there according to MDL. In contrast, the nodes under /life_form .../ have relatively small differences in their frequencies, and thus they are generalized to the node /life_form .../. The same is true of the nodes under /artifact/. Since /amount .../ has a much higher frequency than its neighbors /time/ and /space/, the generalization does not go up higher. All of these results seem to agree with human intuition, indicating that our method results in an appropriate level of generalization.</Paragraph> <Paragraph position="7"> Figure 9: An example generalization result (for the direct object slot of eat).</Paragraph> <Paragraph position="8"> Table 7 shows generalization results for the direct object slot of eat and some other arbitrarily selected verbs, where classes are sorted in descending order of their probability values. (Classes with probabilities less than 0.05 are discarded due to space limitations.) Table 8 shows the computation time required (on a SPARC &quot;Ultra 1&quot; workstation) to obtain the results shown in Table 7. (The computation time for loading WordNet was excluded, since loading need be done only once.) Even though the noun taxonomy of WordNet is a large thesaurus containing approximately 50,000 nodes, our method still manages to generalize case slots efficiently using it. The table also shows the average number of levels generalized for each slot, namely, the average number of links between a node in the starting cut and its ancestor node in the resulting cut. (For example, the number of levels generalized for /plant .../ is one in Figure 9.) One can see that a significant amount of generalization is performed by our method: on average, the resulting tree cut is about 5 levels higher than the starting cut.</Paragraph>
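<Paragraph> The generalization decision illustrated by Figure 9 can be summarized in a schematic sketch. The description-length formulas below are a simplified rendering of the tree cut model of the earlier sections (a parameter term plus a data term, with each class's probability spread uniformly over the leaves below it); the exact parameter-count bookkeeping of Find-MDL differs, and the toy tree and all identifiers here are hypothetical.

import math

def node(label, freq, children=()):
    return {"label": label, "freq": float(freq), "children": list(children)}

# Hypothetical miniature slice of a thesaurus with leaf frequencies:
root = node("food", 0, [
    node("fruit", 0, [node("apple", 5), node("orange", 4)]),
    node("dish", 0, [node("pizza", 9), node("stew", 0)]),
])

def total_freq(t):
    return t["freq"] + sum(total_freq(c) for c in t["children"])

def leaves(t):
    return 1 if not t["children"] else sum(leaves(c) for c in t["children"])

N = total_freq(root)  # 18 observations in this toy sample

def desc_len(cut):
    """Two-part code length of a cut: a schematic parameter term for
    the k-1 free class probabilities, plus the data term, where each
    class probability f/N is spread uniformly over its leaves."""
    param = (len(cut) - 1) / 2.0 * math.log2(N)
    data = 0.0
    for t in cut:
        f = total_freq(t)
        if f > 0:
            data -= f * math.log2((f / N) / leaves(t))
    return param + data

def find_mdl(t):
    """Recursively choose between the single-node cut [t] and the
    union of the children's best cuts, whichever code is shorter."""
    if not t["children"]:
        return [t]
    lower = [c for ch in t["children"] for c in find_mdl(ch)]
    return min(([t], lower), key=desc_len)

print([t["label"] for t in find_mdl(root)])  # ['fruit', 'pizza', 'stew']

The toy run mirrors the behavior described above: apple and orange, whose frequencies differ little, are generalized to fruit, while pizza and stew, whose frequencies differ sharply, are kept apart, just as /food/ was kept apart from its lower-frequency neighbors. </Paragraph>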
</Section> <Section position="2" start_page="231" end_page="238" type="sub_section"> <SectionTitle> 4.2 Experiment 2: PP-Attachment Disambiguation </SectionTitle> <Paragraph position="0"> Case frame patterns obtained by our method can be used in various tasks in natural language processing. In this paper, we test their effectiveness in a structural (PP-attachment) disambiguation experiment.</Paragraph> <Paragraph position="1"> Disambiguation Methods. It has been empirically verified that the use of lexical semantic knowledge is effective in structural disambiguation, such as the PP-attachment problem (Hobbs and Bear 1990; Whittemore, Ferrara, and Brunner 1990). Many probabilistic methods have been proposed in the literature to address the PP-attachment problem using lexical semantic knowledge; in our view, these can be classified into three types.</Paragraph> <Paragraph position="2"> The first approach (Hindle and Rooth 1991, 1993) takes doubles of the form (verb, prep) and (noun1, prep), like those in Table 9, as training data for acquiring semantic knowledge, and judges the attachment sites of the prepositional phrases in quadruples of the form (verb, noun1, prep, noun2), e.g., (see, girl, with, telescope), based on the acquired knowledge. Hindle and Rooth (1991) proposed the use of the lexical association measure calculated from such doubles. More specifically, they estimate P(prep | verb) and P(prep | noun1) and calculate the so-called t-score, which is a measure of the statistical significance of the difference between P(prep | verb) and P(prep | noun1). If the t-score indicates that the former probability is significantly larger, then the prepositional phrase is attached to verb; if the latter probability is significantly larger, it is attached to noun1; otherwise, no decision is made.</Paragraph> <Paragraph position="3"> The second approach (Sekine et al. 1992; Chang, Luo, and Su 1992; Resnik 1993a; Grishman and Sterling 1994; Alshawi and Carter 1994) takes triples (verb, prep, noun2) and (noun1, prep, noun2), like those in Table 10, as training data for acquiring semantic knowledge and performs PP-attachment disambiguation on quadruples. For example, Resnik (1993a) proposes the use of the selectional association measure calculated from such triples, as described in Section 2. More specifically, his method compares max_{Class_i ∋ noun2} A(Class_i | verb, prep) and max_{Class_i ∋ noun2} A(Class_i | noun1, prep), that is, the maxima over all classes Class_i containing noun2, to make disambiguation decisions.</Paragraph> <Paragraph position="4"> The third approach (Brill and Resnik 1994; Ratnaparkhi, Reynar, and Roukos 1994; Collins and Brooks 1995) receives quadruples (verb, noun1, prep, noun2) together with labels indicating which way the PP-attachment goes, like those in Table 11, and learns a disambiguation rule for resolving PP-attachment ambiguities. For example, Brill and Resnik (1994) propose a method they call transformation-based error-driven learning (see also Brill [1995]). Their method first learns IF-THEN type rules, where the IF parts represent conditions like (prep is with) and (verb is see), and the THEN parts represent transformations from (attach to verb) to (attach to noun1), or vice versa. The first rule is always a default decision, and all the other rules indicate transformations (changes of attachment sites) subject to various IF conditions.</Paragraph> <Paragraph position="5"> Table 11: Example quadruples and labels.

see girl in park ADV
see man with telescope ADV
see girl with scarf ADN
</Paragraph> <Paragraph position="6"> We note that, for the disambiguation problem, the first two approaches are basically unsupervised learning methods, in the sense that the training data are merely positive examples for both types of attachments, which could in principle be extracted from pure corpus data with no human intervention. (For example, one could just use unambiguous sentences.) The third approach, on the other hand, is a supervised learning method, which requires labeled data prepared by a human being.</Paragraph>
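<Paragraph> To make the first approach concrete, here is a hedged sketch of a t-score-based decision. The estimator below is one standard form of the t-score for comparing two proportions; Hindle and Rooth's actual estimator differs in details such as smoothing, so this is an assumption-laden illustration with toy counts, not their method.

import math

def t_score(k1, n1, k2, n2):
    """t-score for the difference of two estimated proportions:
    p1 estimates P(prep | verb) from k1 co-occurrences out of n1,
    p2 estimates P(prep | noun1) from k2 out of n2."""
    p1, p2 = k1 / n1, k2 / n2
    var = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
    return (p1 - p2) / math.sqrt(var) if var > 0 else 0.0

def decide(k1, n1, k2, n2, threshold=1.28):  # 1.28: 90 percent level
    t = t_score(k1, n1, k2, n2)
    if t > threshold:
        return "verb"
    if -t > threshold:
        return "noun1"
    return None  # no decision

# Toy counts: "with" follows "see" in 30 of 200 cases, and follows
# the noun "girl" in 2 of 150 cases:
print(decide(30, 200, 2, 150))  # "verb" under these hypothetical counts

The 1.28 threshold corresponds to the 90 percent significance level used for the LA.t variant reported later in this section. </Paragraph>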
<Paragraph position="7"> The generalization method we propose falls into the second category, although it can also be used as a component in a combined scheme with many of the above methods (see Brill and Resnik [1994]; Alshawi and Carter [1994]). We estimate P(noun2 | verb, prep) and P(noun2 | noun1, prep) from training data consisting of triples and compare them: if the former exceeds the latter (by a certain margin), we attach the prepositional phrase to verb; if the latter exceeds the former (by the same margin), we attach it to noun1. In the experiments described below, we compare the performance of our proposed method, which we refer to as MDL, against the methods proposed by Hindle and Rooth (1991), Resnik (1993b), and Brill and Resnik (1994), referred to respectively as LA, SA, and TEL.</Paragraph> <Paragraph position="8"> Data Set. We used the bracketed corpus of the Penn Treebank (Wall Street Journal corpus) (Marcus, Santorini, and Marcinkiewicz 1993) as our data. First, we randomly selected one of the 26 directories of the WSJ files as the test data and used what remained as the training data. We repeated this process 10 times and obtained 10 sets of data consisting of different training data and test data. We used these 10 data sets to conduct cross-validation, as described below.</Paragraph> <Paragraph position="9"> From the test data in each data set, we extracted (verb, noun1, prep, noun2) quadruples using &quot;tgrep,&quot; the extraction tool provided with the Penn Treebank. At the same time, we obtained the answer for the PP-attachment site of each quadruple. We did not double-check whether the answers provided in the Penn Treebank were actually correct. Then, from the training data of each data set, we extracted (verb, prep) and (noun1, prep) doubles, and (verb, prep, noun2) and (noun1, prep, noun2) triples, using tools we developed ourselves. We also extracted quadruples from the training data, as before. We then applied 12 heuristic rules to further preprocess the data, including (1) changing the inflected form of a word to its stem form, (2) replacing numerals with the word number, (3) replacing integers between 1,900 and 2,999 with the word year, and (4) replacing co., ltd., etc. with the words company, limited, etc.[11] After preprocessing, some minor errors still remained, which we did not remove further, for lack of a good method for doing so automatically. Table 12 shows the number of different types of data obtained by the above process.</Paragraph> <Paragraph position="10"> For MDL, we generalized noun2 given the (verb, prep, noun2) and (noun1, prep, noun2) triples as training data for each data set, using WordNet as the thesaurus in the same manner as in Experiment 1. When disambiguating, we actually compared P(Class1 | verb, prep) and P(Class2 | noun1, prep), where Class1 and Class2 are classes in the output tree cut models dominating noun2, in place of P(noun2 | verb, prep) and P(noun2 | noun1, prep). [Footnote 12: Recall that a node in WordNet represents a word sense and not a word; noun2 can belong to several classes in the thesaurus. We thus use max_{Class_i ∋ noun2} P(Class_i | verb, prep) and max_{Class_i ∋ noun2} P(Class_i | noun1, prep) in place of P(Class1 | verb, prep) and P(Class2 | noun1, prep).] We found that doing so gives a slightly better result. For SA, we employed a somewhat simplified version in which noun2 is generalized given the (verb, prep, noun2) and (noun1, prep, noun2) triples using WordNet, and max_{Class_i ∋ noun2} A(Class_i | verb, prep) and max_{Class_i ∋ noun2} A(Class_i | noun1, prep) are compared for disambiguation: if the former exceeds the latter, then the prepositional phrase is attached to verb, and otherwise to noun1. For LA, we estimated P(prep | verb) and P(prep | noun1) from the training data of each data set and compared them for disambiguation.</Paragraph>
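<Paragraph> The decision rule used for MDL, together with the class lookup of footnote 12, can be sketched as follows. This is a schematic rendering of the comparison described above, not the paper's code; the lookup tables (classes_of, cut_prob) and all values are hypothetical, and the margin 0.05 is simply one of the threshold values used later for Figure 10.

def attach(p_verb_side, p_noun_side, margin=0.05):
    """Attach to verb if the verb-side probability exceeds the
    noun1-side probability by the margin, to noun1 in the symmetric
    case, and abstain otherwise."""
    if p_verb_side - p_noun_side > margin:
        return "verb"
    if p_noun_side - p_verb_side > margin:
        return "noun1"
    return None  # no decision; a default rule may be applied later

def best_class_prob(noun2, classes_of, cut_prob):
    """Footnote 12: noun2 may belong to several WordNet classes, so
    take the maximum tree cut probability over classes containing it."""
    return max(cut_prob.get(c, 0.0) for c in classes_of[noun2])

classes_of = {"telescope": ["instrument", "device"]}  # hypothetical
cut_verb = {"instrument": 0.22}   # toy P(Class | see, with)
cut_noun = {"device": 0.03}       # toy P(Class | girl, with)
print(attach(best_class_prob("telescope", classes_of, cut_verb),
             best_class_prob("telescope", classes_of, cut_noun)))
# prints "verb" for these toy values
</Paragraph>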
<Paragraph position="11"> We then evaluated the results achieved by the three methods in terms of accuracy and coverage. Here, coverage refers to the proportion, as a percentage, of the test quadruples on which the disambiguation method could make a decision, and accuracy refers to the proportion of correct decisions among them.</Paragraph> <Paragraph position="12"> In Figure 10, we plot the accuracy-coverage curves for the three methods. In plotting these curves, the attachment site is determined by simply checking whether the difference between the appropriate measures for the two alternatives, be they probabilities or selectional association values, exceeds a threshold. For each method, the threshold was set successively to 0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, and 0.75. When the difference between the two measures is less than the threshold, we rule that no decision can be made. These curves were obtained by averaging over the 10 data sets. We also implemented the exact method proposed by Hindle and Rooth (1991), which makes its disambiguation judgments using the t-score. Figure 10 shows the result as LA.t, where the threshold for the t-score is set to 1.28 (a significance level of 90 percent). From Figure 10 we see that, with respect to the accuracy-coverage curves, MDL outperforms both SA and LA throughout, while SA is better than LA.</Paragraph>
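<Paragraph> The accuracy and coverage numbers behind curves like those in Figure 10 can be computed with a short sketch. The data format (triples of verb-side score, noun1-side score, and gold label) is a hypothetical stand-in for the per-quadruple measures described above.

def accuracy_coverage(scored, threshold):
    """scored: list of (score_verb, score_noun1, gold), gold being
    "verb" or "noun1".  A decision is made only when the two measures
    differ by at least the threshold; accuracy is computed over the
    decided cases and coverage over all cases."""
    decided = correct = 0
    for s_v, s_n, gold in scored:
        if abs(s_v - s_n) >= threshold:
            decided += 1
            guess = "verb" if s_v > s_n else "noun1"
            if guess == gold:
                correct += 1
    coverage = 100.0 * decided / len(scored)
    accuracy = 100.0 * correct / decided if decided else 0.0
    return accuracy, coverage

# Sweeping the thresholds listed above traces one accuracy-coverage curve:
toy = [(0.30, 0.10, "verb"), (0.20, 0.25, "noun1"), (0.15, 0.14, "verb")]
for th in (0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.75):
    acc, cov = accuracy_coverage(toy, th)
    print(f"threshold={th}: accuracy={acc:.1f}%, coverage={cov:.1f}%")

As the threshold grows, coverage falls while (typically) accuracy rises, which is exactly the trade-off the curves in Figure 10 display. </Paragraph>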
<Paragraph position="13"> Next, we tested the method of applying a default rule after applying each method, that is, attaching (prep, noun2) to verb for the part of the test data on which no decision was made by the method in question. [Footnote 13: Interestingly, for the entire data set it is more favorable to attach (prep, noun2) to noun1, but for what remains after applying LA and MDL, it turns out to be more favorable to attach (prep, noun2) to verb.] We refer to these combined methods as MDL+Default, SA+Default, LA+Default, and LA.t+Default. Table 13 shows the results, again averaged over the 10 data sets.</Paragraph> <Paragraph position="14"> Finally, we used transformation-based error-driven learning (TEL) to acquire transformation rules for each data set and applied the obtained rules to disambiguate the test data. The average number of rules obtained for a data set was 2,752.3. Table 13 shows the disambiguation result averaged over the 10 data sets. From Table 13, we see that TEL performs the best, edging out the second-place MDL+Default by a small margin, followed by LA+Default and SA+Default. Below we discuss further observations concerning these results.</Paragraph> <Paragraph position="15"> MDL and SA. According to our experimental results, the accuracy and coverage of MDL appear to be somewhat better than those of SA. As Resnik (1993b) pointed out, the use of the selectional association measure, based on the ratio P(C | v, r) / P(C), seems to be appropriate for cognitive modeling. Our experiments show, however, that the generalization method currently employed by Resnik has a tendency to overfit the data. Table 14 shows example generalization results for MDL (with classes with probability less than 0.05 discarded) and SA. Note that MDL tends to select a tree cut closer to the root of the thesaurus tree. This is probably the key reason why MDL has a wider coverage than SA for the same degree of accuracy. One may be concerned that MDL is &quot;overgeneralizing&quot; here [Footnote 14: Note that in Experiment 1 there were more data available, and thus the data were more appropriately generalized.], but as shown in Figure 10, its disambiguation accuracy does not seem to be degraded.</Paragraph> <Paragraph position="16"> Another problem that must be dealt with concerning SA is how to remove noise (resulting, for example, from erroneous extraction) from the generalization results. Since SA estimates the ratio between two probability values, namely P(C | v, r) / P(C), the generalization result may be led astray if one of the estimates of P(C | v, r) and P(C) is unreliable. For instance, the high estimated value for /drop, bead, pearl/ at protect against shown in Table 14 is rather odd, and arises because the estimate of P(C) is unreliable (too small). This problem apparently costs SA a nonnegligible drop in disambiguation accuracy. In contrast, MDL does not suffer from this problem, since a high estimated probability value is possible only with high frequency, which cannot result merely from extraction errors. Consider, for example, the occurrence of car in the data shown in Figure 8, which has presumably resulted from an erroneous extraction. The effect of this datum gets washed away, as the estimated probability of VEHICLE, to which car has been generalized, is negligible.</Paragraph> <Paragraph position="17"> On the other hand, SA has a merit not shared by MDL: its use of the association ratio factors out the effect of the absolute frequencies of words and focuses on their co-occurrence relation. Since both MDL and SA have pros and cons, it would be desirable to develop a methodology that combines the merits of the two methods (cf. Abe and Li [1996]).</Paragraph> <Paragraph position="18"> MDL and LA. LA makes its disambiguation decision while completely ignoring noun2. As Resnik (1993b) pointed out, if we hope to improve disambiguation performance by increasing the training data, we need a richer model, such as those used in MDL and SA. We found that 8.8% of the quadruples in our entire test data were such that they shared the same verb, prep, and noun1 but had different noun2, and their PP-attachment sites went both ways in the same data, i.e., both to verb and to noun1. Clearly, for these examples the PP-attachment site cannot be reliably determined without knowing noun2. Table 15 shows some of these examples. (We adopted the attachment sites given in the Penn Treebank, without correcting apparently wrong judgments.)</Paragraph> <Paragraph position="19"> Table 15: Some hard examples for LA.

Attached to verb               Attached to noun1
acquire interest in year       acquire interest in firm
buy stock in trade             buy stock in index
ease restriction on export     ease restriction on type
forecast sale for year         forecast sale for venture
make payment on million        make payment on debt
meet standard for resistance   meet standard for car
reach agreement in august      reach agreement in principle
show interest in session       show interest in stock
win verdict in winter          win verdict in case
</Paragraph>
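<Paragraph> A short sketch can illustrate the kind of check behind the 8.8% figure above: group the test quadruples by (verb, prep, noun1) and count those whose attachment goes both ways within the same group. The quadruple format with an explicit label is a hypothetical stand-in for the test data described earlier.

from collections import defaultdict

def conflicting_fraction(quadruples):
    """Fraction of quadruples sharing (verb, prep, noun1) whose
    attachment labels go both ways, so that noun2 is needed to
    decide.  Each item is (verb, noun1, prep, noun2, label), with
    label "verb" or "noun1"."""
    labels = defaultdict(set)
    for verb, noun1, prep, noun2, label in quadruples:
        labels[(verb, prep, noun1)].add(label)
    conflicted = sum(
        1 for verb, noun1, prep, noun2, label in quadruples
        if len(labels[(verb, prep, noun1)]) == 2
    )
    return conflicted / len(quadruples)

# Two of the Table 15 pairs plus one unambiguous case:
data = [
    ("buy", "stock", "in", "trade", "verb"),
    ("buy", "stock", "in", "index", "noun1"),
    ("see", "girl", "with", "scarf", "noun1"),
]
print(conflicting_fraction(data))  # 2/3 for this toy sample
</Paragraph>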
<Paragraph position="20"> MDL and TEL. We chose TEL as an example of the quadruple approach. This method was designed specifically for the purpose of resolving PP-attachment ambiguities, and it seems to perform slightly better than ours. As we remarked earlier, however, the input data required by our method (triples) could be generated automatically from unparsed corpora using existing heuristic rules (Brent 1993; Smadja 1993), although for the experiments reported here we used a parsed corpus. Thus it would seem to be easier to obtain more data in the future for MDL and other methods based on unsupervised learning. Also note that our method of generalizing the values of a case slot can be used for purposes other than disambiguation.</Paragraph> </Section> </Section> </Paper>