<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2124"> <Title>Word Clustering and Disambiguation Based on Co-occurrence Data</Title> <Section position="9" start_page="752" end_page="754" type="evalu"> <SectionTitle> 8 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="752" end_page="752" type="sub_section"> <SectionTitle> 8.1 Qualitative evaluation </SectionTitle> <Paragraph position="0"> In this experiment, we used heuristic rules to extract verbs and the head words of their direct objects from the lagged texts of the WSJ corpus (ACL/DCI CD- null ROM1) consisting of 126,084 sentences.</Paragraph> <Paragraph position="1"> -- s~are, a~et. data -- stock. ~no, secur~ -- inc..corp..co.</Paragraph> <Paragraph position="2"> i bourne, home -- DenK. group, firm pr~e. tax -- money, ca~ -- c~lr. vl~llicle -- profit, risk -- so.are, network -- pressure+ power We then constructed a number of thesauruses based on these data, using our method. Fig. 3 shows a part of a thesaurus for 100 randomly selected nouns, based on their appearances as direct objects of 20 randomly selected verbs. The thesaurus seems to agree with human intuition to some degree, although it is constructed based on a relatively small amount of co-occurrence data. For example, 'stock,' 'security,' and 'bond' are classified together, despite the fact that their absolute frequencies in the data vary a great deal (272, 59, and 79, respectively.) The results demonstrate a desirable feature of our method, namely, it classifies words based solely on the similarities in co-occurrence data, and is not affected by the absolute frequencies of the words.</Paragraph> </Section> <Section position="2" start_page="752" end_page="753" type="sub_section"> <SectionTitle> 8.2 Compound noun disambiguation </SectionTitle> <Paragraph position="0"> We extracted compound noun doubles (e.g., 'data base') from the tagged texts of the WSJ corpus and used them as training data, and then conducted structural disambiguation on compound noun triples (e.g., 'data base system').</Paragraph> <Paragraph position="1"> We first randomly selected 1,000 nouns from the corpus, and extracted compound noun doubles containing those nouns as training data and compound noun triples containing those nouns as test data. There were 8,604 training data and 299 test data. We hand-labeled the test data with the correct disambiguation 'answers.' We performed clustering on the nouns on the left position and the nouns on the right position in the training data by using both our method ('2D-Clustering') and Brown et al's method ('Brown'). We actually implemented an extended version of their method, which separately conducts clustering for nouns on the left and those on the right (which should only improve the performance).</Paragraph> <Paragraph position="2"> We next conducted structural disambiguation on the test data, using the probabilities estimated based on 2D-Clustering and Brown. We also tested the method of using the probabilities estimated based on word co-occurrences, denoted as 'Word-based.' Fig. 4 shows the results in terms of accuracy and coverage, where coverage refers to the percentage of test data for which the disambiguation method was able to make a decision. Since for Brown the number of classes finally created has to be designed in advance, we tried a number of alternatives and obtained results for each of them. (Note that, for 2D-Clustering, the optimal number of classes is au- null Tab. 
<Paragraph position="3"> Tab. 1 shows the final results of all of the above methods combined with 'Default,' in which we attach the first noun to the neighboring noun whenever a method cannot make a decision. We see that 2D-Clustering+Default performs best. These results demonstrate a desirable aspect of 2D-Clustering, namely, its ability to automatically select the most appropriate level of clustering, resulting in neither over-generalization nor under-generalization.</Paragraph>
</Section>
<Section position="3" start_page="753" end_page="754" type="sub_section">
<SectionTitle> 8.3 PP-attachment disambiguation </SectionTitle>
<Paragraph position="0"> We extracted triples (e.g., 'see, with, telescope') from the bracketed data of the WSJ corpus (Penn Tree Bank) and conducted PP-attachment disambiguation on quadruples. We randomly generated ten sets of data consisting of different training and test data and conducted experiments through 'tenfold cross validation,' i.e., all of the experimental results reported below were obtained by taking averages over the ten trials.</Paragraph>
<Paragraph position="1"> We constructed word classes using our method ('2D-Clustering') and the method of Brown et al. ('Brown'). For both methods, following the proposal of (Tokunaga et al., 1995), we conducted clustering separately with respect to each of the 10 most frequently occurring prepositions (e.g., 'for,' 'with,' etc.); we did not cluster words for the rarely occurring prepositions. We then performed disambiguation based on 2D-Clustering and Brown. We also tested the method of using the probabilities estimated based on word co-occurrences, denoted 'Word-based.' Next, rather than using the conditional probabilities estimated by our method, we used only the noun thesauruses constructed by our method, and applied the method of (Li and Abe, 1995) to estimate the best 'tree cut models' within the thesauruses, in order to estimate conditional probabilities like those in (5). ((Li and Abe, 1995) define a 'tree cut model' in a given thesaurus with conditional probabilities attached to all the nodes in the tree cut; they use MDL to select the best tree cut model.) We call the disambiguation method using these probability values 'NounClass-2DC.' We also tried the analogous method of using thesauruses constructed by the method of (Li and Abe, 1996) and estimating the best tree cut models within them (this is exactly the disambiguation method proposed in that paper). Finally, we tried using a hand-made thesaurus, WordNet (this is the same as the disambiguation method used in (Li and Abe, 1995)). We denote these two methods as 'Li-Abe96' and 'WordNet,' respectively.</Paragraph>
<Paragraph position="2"> Tab. 2 shows the results for all of these methods in terms of coverage and accuracy.</Paragraph>
<Paragraph position="3"> We then enhanced each of these methods with a default rule to be used when a decision cannot be made, indicated as '+Default.' Tab. 3 shows the results of these experiments.</Paragraph>
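Before the observations, a sketch of the disambiguation step itself may help. The following Python fragment shows one plausible way to decide an attachment site from class-based conditional probabilities, with one model per frequent preposition as described above. It is a sketch under assumptions, not the exact procedure used in the experiments; all identifiers are hypothetical, and the behavior on abstention (returning 'default') is likewise an assumption, since the text does not specify the direction of the default rule.

    def cond_prob(noun2, head, model):
        """P(noun2 | head), estimated through word classes as
        P(c2 | c_head) * P(noun2 | c2). 'model' holds hypothetical
        mappings from words to classes and from classes to
        probabilities; returns None for words outside the model."""
        c_head = model["head_class"].get(head)
        c2 = model["arg_class"].get(noun2)
        if c_head is None or c2 is None:
            return None
        return model["class_cond"].get((c_head, c2), 0.0) * model["word_in_class"].get(noun2, 0.0)

    def pp_attach(verb, noun1, prep, noun2, models, default=None):
        """Attach 'prep noun2' to the verb or to noun1 in a quadruple.
        'models' maps each frequent preposition to a pair of
        class-based conditional models; rare prepositions, unseen
        words, and ties fall through to 'default' (None = abstain,
        which is what the coverage figures count)."""
        m = models.get(prep)
        if m is None:
            return default
        p_verb = cond_prob(noun2, verb, m["verb"])   # verb attachment
        p_noun = cond_prob(noun2, noun1, m["noun"])  # noun attachment
        if p_verb is None or p_noun is None or p_verb == p_noun:
            return default
        return "verb" if p_verb > p_noun else "noun"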
<Paragraph position="4"> We can make a number of observations from these results. (1) 2D-Clustering achieves broader coverage than NounClass-2DC; this is because, in estimating the probabilities used for disambiguation, the former exploits more information than the latter. (2) For Brown, we show here only its best result, which happens to be the same as the result for 2D-Clustering; to obtain this result, however, we had to conduct a number of tests to find the best level of clustering, whereas for 2D-Clustering this was done once and automatically. Compared with Li-Abe96, 2D-Clustering clearly performs better. We therefore conclude that our method improves on each of these previous clustering methods in one way or another. (3) 2D-Clustering outperforms WordNet in terms of accuracy, but not in terms of coverage. This seems reasonable: an automatically constructed thesaurus is more domain-dependent and therefore captures domain-specific features better, which helps achieve higher accuracy; on the other hand, with the relatively small amount of training data available, its coverage is smaller than that of a general-purpose hand-made thesaurus.</Paragraph>
<Paragraph position="5"> These results indicate that it makes sense to combine automatically constructed thesauruses with a hand-made thesaurus, as we proposed in Section 7. We then tested this combined method, denoted '2D-Clustering+WordNet+Default,' and we see that it performs best of all. (See Tab. 3.) Finally, for comparison, we tested the 'transformation-based error-driven learning' proposed in (Brill and Resnik, 1994), a state-of-the-art method for PP-attachment disambiguation. Tab. 3 shows the result for this method as 'Brill-Resnik.' We see that our disambiguation method also performs better than Brill-Resnik. (Note further that Brill and Resnik's method requires quadruples as training data, whereas ours requires only triples.)</Paragraph>
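The combination '2D-Clustering+WordNet+Default' can be read as a back-off cascade: consult the automatically constructed thesaurus first, fall back to the hand-made thesaurus when it abstains, and apply the default rule when both abstain. The sketch below shows this control flow only; it is an assumption about how the combination is wired, and the decider names are hypothetical.

    def decide_with_backoff(item, deciders, default):
        """Apply each decider in order; the first non-None answer
        wins, and the default rule answers when all abstain."""
        for decide in deciders:
            answer = decide(item)
            if answer is not None:
                return answer
        return default

    # Hypothetical usage for a PP-attachment quadruple:
    # decide_with_backoff(quad, [cluster_decider, wordnet_decider], default="noun")

</Section>
</Section>
</Paper>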