File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1326_metho.xml
Size: 19,811 bytes
Last Modified: 2025-10-06 14:07:26
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1326"> <Title>One Sense per Collocation and Genre/Topic Variations</Title> <Section position="3" start_page="207" end_page="207" type="metho"> <SectionTitle> 1 Resources used </SectionTitle> <Paragraph position="0"> The DSO collection (Ng and Lee, 1996) focuses on 191 frequent and polysemous words (nouns and verbs), and contains around 1,000 sentences per word. Overall, there are 112,800 sentences, in which 192,874 occurrences of the target words were hand-tagged with WordNet senses (Miller et al., 1990).</Paragraph> <Paragraph position="1"> The DSO collection was built with examples from the Wall Street Journal (WSJ) and the Brown Corpus (BC). The Brown Corpus is balanced, and the texts are classified according to some predefined categories (cf. Table 1). The examples from the Brown Corpus comprise 78,080 occurrences of word senses, and the examples from the WSJ 114,794 occurrences.</Paragraph> <Paragraph position="2"> The sentences in the DSO collection were tagged with parts of speech using TnT (Brants, 2000) trained on the Brown Corpus itself.</Paragraph> <Paragraph position="3">
A. Press: Reportage
B. Press: Editorial
C. Press: Reviews (theatre, books, music, dance)
D. Religion
E. Skills and Hobbies
F. Popular Lore
G. Belles Lettres, Biography, Memoirs, etc.
H. Miscellaneous
J. Learned
K. General Fiction
L. Mystery and Detective Fiction
M. Science Fiction
N. Adventure and Western Fiction
P. Romance and Love Story
R. Humor
Table 1: List of categories of texts from the Brown Corpus, divided into informative prose (top) and imaginative prose (bottom).

1.1 Categories in the Brown Corpus and genre/topic variation

The Brown Corpus manual (Francis & Kucera, 1964) does not detail the criteria followed to set the categories in Table 1: "The samples represent a wide range of styles and varieties of prose... The list of main categories and their subdivisions was drawn up at a conference held at Brown University in February 1963."</Paragraph> <Paragraph position="4"> These categories have been previously used in genre detection experiments (Karlgren & Cutting, 1994), where each category was used as a genre. We think that the categories not only reflect genre variations but also topic variations (e.g. the Religion category follows topic distinctions rather than genre). Nevertheless, we are aware that some topics can be covered in more than one category. Unfortunately, there is no topically tagged corpus which also has word sense tags. We thus speak of genre and topic variation, knowing that further analysis would be needed to measure the effect of each of them.</Paragraph> </Section> <Section position="4" start_page="207" end_page="208" type="metho"> <SectionTitle> 2 Experimental setting </SectionTitle> <Paragraph position="0"> In order to analyze and compare the behavior of several kinds of collocations (cf. Section 3), Yarowsky (1993) used a measure of entropy as well as the results obtained when tagging held-out data with the collocations organized as decision lists (cf. Section 4). As Yarowsky shows, both measures correlate closely, so we only used the experimental results of decision lists. When comparing the performance of decision lists trained on two different corpora (or sub-corpora) we always take an equal amount of examples per word from each corpus. This is done to discard the amount-of-data factor. As usual, we use 10-fold cross-validation when training and testing on the same corpus. No significance tests could be found for our comparison, as training and test sets differ. Because of the large number of experiments involved, we focused on 21 verbs and nouns (cf. Table 2).

Word PoS #Senses #Ex. BC #Ex. WSJ
Age N 5 243 248
Art N 4 200 194
Body N 9 296 110
Car N 5 357 1093
Child N 6 577 484
Cost N 3 317 1143
Head N 28 432 434
Interest N 8 364 1115
Line N 28 453 880
Point N 20 442 249
State N 6 757 706
Thing N 11 621 805
Work N 6 596 825
Become V 4 763 736
Table 2: Words studied, with part of speech, number of senses, and number of examples in the BC and the WSJ.</Paragraph> </Section>
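A minimal sketch of this evaluation setup, assuming the hand-tagged examples are available as simple Python records; the names `Example`, `balance_examples`, and `cross_validation_folds` are ours and not part of the DSO distribution, and the sampling shown is only one plausible way to enforce the equal-amount-of-data constraint.

```python
import random
from dataclasses import dataclass

@dataclass
class Example:
    word: str      # target word, e.g. "state"
    corpus: str    # "BC" or "WSJ"
    sense: str     # hand-tagged WordNet sense
    tokens: list   # PoS-tagged sentence as (word, tag) pairs

def balance_examples(examples, seed=0):
    """Keep an equal number of examples per target word from each corpus,
    so that comparisons are not biased by the amount of data."""
    random.seed(seed)
    by_word_corpus = {}
    for ex in examples:
        by_word_corpus.setdefault((ex.word, ex.corpus), []).append(ex)
    balanced = []
    for word in {w for (w, _) in by_word_corpus}:
        bc = by_word_corpus.get((word, "BC"), [])
        wsj = by_word_corpus.get((word, "WSJ"), [])
        n = min(len(bc), len(wsj))
        balanced += random.sample(bc, n) + random.sample(wsj, n)
    return balanced

def cross_validation_folds(examples, n_folds=10, seed=0):
    """Plain 10-fold cross-validation over examples (document boundaries are
    ignored here; see Section 7 for the document-based variant)."""
    random.seed(seed)
    shuffled = examples[:]
    random.shuffle(shuffled)
    for i in range(n_folds):
        test = shuffled[i::n_folds]
        train = [ex for j, ex in enumerate(shuffled) if j % n_folds != i]
        yield train, test
```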
<Section position="5" start_page="208" end_page="208" type="metho"> <SectionTitle> 3 Collocations considered </SectionTitle> <Paragraph position="0"> For the sake of this work we take a broad definition of collocation, and the collocations were classified into three subsets: local content-word collocations, local part-of-speech and function-word collocations, and global content-word collocations. If a stricter linguistic perspective were taken, rather than collocations we should speak about co-occurrence relations.</Paragraph> <Paragraph position="1"> In fact, only local content-word collocations would adhere to this narrower view.</Paragraph> <Paragraph position="2"> We only considered those collocations that could be easily extracted from a part-of-speech tagged corpus, such as word to left, word to right, etc. Local content-word collocations comprise bigrams (word to left, word to right) and trigrams (two words to left, two words to right, and one word to each side of the target). At least one of those words needs to be a content word. Local function-word collocations also comprise all kinds of bigrams and trigrams, as before, but the words need to be function words. Local PoS collocations take the part of speech of the words in the bigrams and trigrams. Finally, global content-word collocations comprise the content words around the target word in two different contexts: a window of 4 words around the target word, and all the words in the sentence. Table 3 summarizes the collocations used. These collocations have been used in other word sense disambiguation research and are also referred to as features (Gale et al., 1993; Ng & Lee, 1996; Escudero et al., 2000).</Paragraph> <Paragraph position="3"> Compared to Yarowsky (1993), who also took into account grammatical relations, we only share the content-word-to-left and the content-word-to-right collocations.</Paragraph> <Paragraph position="4"> We did not lemmatize content words, and we therefore take the form of the target word into account. For instance, governing body and governing bodies are different collocations for the purposes of this paper.</Paragraph> </Section>
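The collocation types just listed can be read as a feature-extraction recipe over a PoS-tagged sentence. The sketch below illustrates one way to enumerate them; the helper names (`extract_collocations`, `is_content`) and the Penn-style content-word tag prefixes are our assumptions, and details such as how patterns at sentence boundaries are handled may differ from the actual implementation.

```python
CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB")   # assumed tags for nouns, verbs, adjectives, adverbs

def is_content(tag):
    return tag.startswith(CONTENT_PREFIXES)

def extract_collocations(tokens, i, window=4):
    """tokens: PoS-tagged sentence as a list of (word, tag) pairs; i: index of the
    target word. Returns a list of (feature_type, feature_value) pairs."""
    feats = []
    n = len(tokens)

    # Local bigram/trigram patterns, given as the context positions around the target
    patterns = {
        "word_left":       [i - 1],
        "word_right":      [i + 1],
        "two_words_left":  [i - 2, i - 1],
        "two_words_right": [i + 1, i + 2],
        "left_and_right":  [i - 1, i + 1],
    }
    for name, ctx in patterns.items():
        if any(p < 0 or p >= n for p in ctx):
            continue                                   # pattern falls outside the sentence
        positions = sorted(ctx + [i])                  # keep the (non-lemmatized) target form
        value = "_".join(tokens[p][0] for p in positions)
        if any(is_content(tokens[p][1]) for p in ctx):
            feats.append(("local_content_" + name, value))
        else:
            feats.append(("local_function_" + name, value))
        feats.append(("local_pos_" + name, "_".join(tokens[p][1] for p in positions)))

    # Global content-word collocations: a +/-4 word window and the whole sentence
    for p, (word, tag) in enumerate(tokens):
        if p == i or not is_content(tag):
            continue
        if abs(p - i) <= window:
            feats.append(("window_content_word", word))
        feats.append(("sentence_content_word", word))
    return feats
```

For instance, for the target bodies in governing bodies, the word-to-left pattern yields the local content-word collocation governing_bodies, which, as noted above, is kept distinct from governing_body.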
<Section position="6" start_page="208" end_page="209" type="metho"> <SectionTitle> 4 Adaptation of decision lists to n-way ambiguities </SectionTitle> <Paragraph position="0"> Decision lists as defined in (Yarowsky, 1993; 1994) are a simple means to solve ambiguity problems. They have been successfully applied to accent restoration, word sense disambiguation and homograph disambiguation (Yarowsky, 1994; 1995; 1996). In order to build decision lists, the training examples are processed to extract the features (each feature corresponds to a kind of collocation), which are weighted with a log-likelihood measure. The list of all features ordered by log-likelihood values constitutes the decision list. We adapted the original formula in order to accommodate ambiguities higher than two:

\[ \mathrm{weight}(sense_i, feature_k) = \log \frac{\Pr(sense_i \mid feature_k)}{\sum_{j \neq i} \Pr(sense_j \mid feature_k)} \]</Paragraph> <Paragraph position="2"> When testing, the decision list is checked in order, and the feature with the highest weight that is present in the test sentence selects the winning word sense. For this work we also considered negative weights, which were not possible for two-way ambiguities.</Paragraph> <Paragraph position="3"> The probabilities have been estimated using the maximum likelihood estimate, smoothed with a simple method: when the denominator in the formula is 0 we replace it with 0.1. It is not clear how the smoothing technique proposed in (Yarowsky, 1993) could be extended to n-way ambiguities.</Paragraph> <Paragraph position="4"> More details of the implementation can be found in (Agirre & Martinez, 2000).</Paragraph> </Section>
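A compact sketch of how such a decision list can be built and applied, following the weight formula and the 0.1 smoothing described above. The function names (`build_decision_list`, `apply_decision_list`) are ours, and the features are assumed to be the collocations sketched in Section 3; this is an illustration of the described procedure, not the authors' actual code (cf. Agirre & Martinez, 2000, for the real implementation).

```python
import math
from collections import Counter, defaultdict

def build_decision_list(tagged_examples):
    """tagged_examples: list of (features, sense) pairs, where features is the list
    of collocations extracted from one training example.
    Returns (weight, feature, sense) triples sorted by decreasing weight."""
    counts = defaultdict(Counter)                  # feature -> Counter over senses
    for features, sense in tagged_examples:
        for f in set(features):                    # count each feature once per example
            counts[f][sense] += 1

    decision_list = []
    for f, sense_counts in counts.items():
        total = sum(sense_counts.values())
        for sense, c in sense_counts.items():
            p = c / total                          # MLE of Pr(sense | feature)
            rest = (total - c) / total             # sum of Pr(other senses | feature)
            denom = rest if rest > 0 else 0.1      # smoothing: replace a zero denominator by 0.1
            decision_list.append((math.log(p / denom), f, sense))

    decision_list.sort(key=lambda entry: -entry[0])   # negative weights are kept as well
    return decision_list

def apply_decision_list(decision_list, features, default_sense=None):
    """The highest-weight feature present in the test example selects the sense;
    if no feature matches, the classifier abstains (or falls back to default_sense)."""
    present = set(features)
    for weight, f, sense in decision_list:
        if f in present:
            return sense
    return default_sense
```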
<Section position="7" start_page="209" end_page="210" type="metho"> <SectionTitle> 5 In-corpus experiments: collocations are weak (80%) </SectionTitle> <Paragraph position="0"> We extracted the collocations in the Brown Corpus section of the DSO corpus and, using 10-fold cross-validation, tagged the same corpus. Training and testing examples were thus from the same corpus. The same procedure was followed for the WSJ part. The results are shown in Tables 4 and 5. We can observe the following:
* The best kinds of collocations are local content-word collocations, especially if two words from the context are taken into consideration, but the coverage is low.</Paragraph> <Paragraph position="1"> Function words to the right and left also attain remarkable precision.</Paragraph> <Paragraph position="2"> * Collocations are stronger in the WSJ, surely due to the fact that the BC is balanced and therefore includes more genres and topics.</Paragraph> <Paragraph position="3"> This is a first indicator that genre and topic variations have to be taken into account.</Paragraph> <Paragraph position="4"> * Collocations for fine-grained word senses are noticeably weaker than those reported by Yarowsky (1993) for two-way ambiguous words. Yarowsky reports 99% precision, while our highest results do not reach 80%. It has to be noted that the test and training examples come from the same corpus, which means that, for some test cases, there are training examples from the same document. In some sense we can say that one sense per discourse comes into play. This point will be further explored in Section 7.</Paragraph> <Paragraph position="5">
1. state -- (the group of people comprising the government of a sovereign)
2. state, province -- (the territory occupied by one of the constituent administrative districts of a nation)
3. state, nation, country, land, commonwealth, res publica, body politic -- (a politically organized body of people under a single government)
4. state -- (the way something is with respect to its main attributes)
5. Department of State, State Department, State -- (the federal department that sets and maintains foreign policies)
6. country, state, land, nation -- (the territory occupied by a nation)
Figure 1: Word senses of the noun state.

In the rest of this paper, only the overall results for each subset of the collocations will be shown. We will pay special attention to local content collocations, as they are the strongest, and also the closest to strict definitions of collocation.</Paragraph> <Paragraph position="6"> As an example of the learned collocations, Table 6 shows some strong local content-word collocations for the noun state, and Figure 1 shows the word senses of state (6 out of the 8 senses are shown, as the rest were not present in the corpora).</Paragraph> </Section> <Section position="8" start_page="210" end_page="212" type="metho"> <SectionTitle> 6 Cross-corpora experiments: one sense per collocation in doubt </SectionTitle> <Paragraph position="1"> In these experiments we train on the Brown Corpus and tag the WSJ corpus, and vice versa. Tables 7 and 8, when compared to Tables 4 and 5, show a significant drop in performance (both precision and coverage) for all kinds of collocations (we only show the results for each subset of collocations). For instance, Table 7 shows a drop of .16 in precision for local content collocations when compared to Table 4.</Paragraph> <Paragraph position="2"> These results confirm those of Escudero et al. (2000), who conclude that the information learned in one corpus is not useful to tag the other.</Paragraph> <Paragraph position="3"> In order to analyze the reason for this performance degradation, we compared the local content-word collocations extracted from one corpus and the other. Table 9 shows the amount of collocations extracted from each corpus, how many of the collocations are shared on average, and how many of the shared collocations are in contradiction. The low amount of collocations shared between both corpora could explain the poor figures, but for some words (e.g. point) there is a worrying proportion of contradicting collocations.</Paragraph>
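One way to compute the kind of figures reported in Table 9 is sketched below, under the assumption that each corpus associates a collocation with a single sense (for example its majority sense in that corpus); the function name `compare_collocations` is ours, and how collocations seen with several senses inside one corpus are handled is left open.

```python
def compare_collocations(colls_bc, colls_wsj):
    """colls_bc, colls_wsj: dicts mapping a local content-word collocation to the
    sense it is associated with in that corpus. Returns the counts behind the
    shared and contradicting collocation figures."""
    shared = set(colls_bc) & set(colls_wsj)
    contradicting = {c for c in shared if colls_bc[c] != colls_wsj[c]}
    return {
        "collocations_bc": len(colls_bc),
        "collocations_wsj": len(colls_wsj),
        "shared": len(shared),
        "contradicting": len(contradicting),
    }
```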
<Paragraph position="4"> We inspected some of the contradicting collocations and saw that in all the cases they were caused by errors (or at least differing criteria) of the hand-taggers when dealing with words with difficult sense distinctions. For instance, Table 10 shows some collocations of point which receive contradictory senses in the BC and the WSJ. The collocation important point, for instance, is assigned the second sense (footnote 1) in all 3 occurrences in the BC, and the fourth sense (footnote 2) in all 2 occurrences in the WSJ. (Footnote 1: the second sense of point is defined as "the precise location of something; a spatially limited location". Footnote 2: the fourth sense is defined as "an isolated fact that is considered separately from the whole".)</Paragraph> <Paragraph position="5"> We can therefore conclude that the one sense per collocation hypothesis holds across corpora, as the contradictions found were due to tagging errors. The low amount of collocations in common would explain in itself the low figures on cross-corpora tagging.</Paragraph> <Paragraph position="6"> Still, we wanted to further study the reasons for the low number of collocations in common, which causes the low cross-corpora performance. We thought of several factors that could come into play:
a) As noted earlier, the training and test examples in the in-corpus experiments are taken at random, and they could be drawn from the same document. This could make the results appear better for the in-corpus experiments. On the contrary, in the cross-corpora experiments training and testing examples come from different documents.</Paragraph> <Paragraph position="7"> b) The genre and topic changes caused by the shift from one corpus to the other.</Paragraph> <Paragraph position="8"> c) Corpora have intrinsic features that cannot be captured by genre and topic variations alone.</Paragraph> <Paragraph position="9"> d) The size of the data, being small, would account for the low amount of collocations shared.</Paragraph> <Paragraph position="10"> We explore a) in Section 7 and b) in Section 8; c) and d) are commented on in Section 8.</Paragraph> <Paragraph position="11">

7 Drawing training and testing examples from the same documents affects performance

In order to test whether drawing training and testing examples from the same document explains the difference between the in-corpus performance and the low cross-corpora results, we performed the following experiment. Instead of organizing the 10 random subsets for cross-validation over the examples, we chose 10 subsets of the documents (also at random). This way, the testing and training examples are guaranteed to come from different documents.</Paragraph> <Paragraph position="12"> We also think that this experiment shows more realistic performance figures, as a real application cannot expect to find examples from the documents used for training. Unfortunately, there are no explicit document boundaries in either the BC or the WSJ.</Paragraph> <Paragraph position="13"> In the BC, we took files as documents, even if files might contain more than one excerpt from different documents. This guarantees that document boundaries are not crossed. It has to be noted that, following this organization, the target examples would share fewer examples from the same topic. The 168 files from the BC were divided into 10 subsets at random: we took 8 subsets with 17 files and 2 subsets with 16 files. For the WSJ, the only cue was the directory organization. In this case we were unsure about the meaning of this organization, but hand inspection showed that document boundaries did not cross discourse boundaries. The 61 directories were divided into 9 subsets with 6 directories and 1 subset with 7.</Paragraph> <Paragraph position="14"> Again, 10-fold cross-validation was used on these subsets, and the results in Tables 11 and 12 were obtained. The Δ column shows the change in precision with respect to Tables 5 and 6.</Paragraph> <Paragraph position="15"> Table 12 shows that, for the BC, precision and coverage, compared to Table 5, are degraded significantly. On the contrary, the results for the WSJ are nearly the same (cf. Tables 11 and 4).</Paragraph> <Paragraph position="16"> The results for the WSJ indicate that drawing training and testing data from the same or different documents does not in itself affect the results much. On the other hand, the results for the BC do degrade significantly. This could be explained by the greater variation in topic and genre between the files in the BC corpus. This will be further studied in Section 8.</Paragraph> <Paragraph position="17"> Table 13 summarizes the overall results on the WSJ and the BC for each of the different experiments performed. The figures show that drawing training and testing data from the same or different documents would not in any case explain the low figures in cross-corpora tagging.</Paragraph>
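A sketch of the document-level organization described in Section 7, assuming each example records which BC file or WSJ directory it comes from; `document_folds` and the `doc_of` accessor are hypothetical names, and the exact fold sizes (e.g. 8 subsets of 17 files and 2 of 16 for the BC) would follow from the shuffled document list rather than being fixed explicitly here.

```python
import random
from collections import defaultdict

def document_folds(examples, doc_of, n_folds=10, seed=0):
    """Split examples into n_folds so that all examples from the same document
    (a BC file or a WSJ directory, as returned by doc_of(example)) fall into the
    same fold, guaranteeing that training and test data never share a document."""
    random.seed(seed)
    docs = sorted({doc_of(ex) for ex in examples})
    random.shuffle(docs)
    fold_of_doc = {d: i % n_folds for i, d in enumerate(docs)}

    folds = defaultdict(list)
    for ex in examples:
        folds[fold_of_doc[doc_of(ex)]].append(ex)

    for i in range(n_folds):
        test = folds[i]
        train = [ex for j in range(n_folds) if j != i for ex in folds[j]]
        yield train, test
```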
<Paragraph position="18">

8 Genre and topic variation affects performance

Trying to shed some light on this issue, we observed that the category press:reportage is related to the genre/topics of the WSJ. We therefore designed the following experiment: we tagged each category in the BC with the decision lists trained on the WSJ, and also with the decision lists trained on the rest of the categories in the BC.</Paragraph> <Paragraph position="19"> Table 14 shows that the local content-word collocations trained on the WSJ attain the best precision and coverage for press:reportage, both compared to the results for the other categories, and compared to the results attained by the rest of the BC on press:reportage. That is:
* Of all the categories, the collocations from press:reportage are the most similar to those of the WSJ.</Paragraph> <Paragraph position="20"> * The WSJ contains collocations which are closer to those of press:reportage than those from the rest of the BC are.</Paragraph> <Paragraph position="21"> In other words, having related genres/topics helps corpora share collocations, and therefore warrants better word sense disambiguation performance.</Paragraph> <Paragraph position="22"> Table 14: results per BC category when tagged with decision lists trained on the WSJ and on the rest of the BC; for the overall and local content collocation subsets, the columns give precision (pr.), coverage (cov.), and change in precision (Δ pr.). Best precision results are shown in bold.</Paragraph> </Section> <Section position="9" start_page="212" end_page="213" type="metho"> <SectionTitle> 9 Reasons for cross-corpora degradation </SectionTitle> <Paragraph position="0"> The goal of Sections 7 and 8 was to explore the possible causes for the low number of collocations in common between the BC and the WSJ.</Paragraph> <Paragraph position="1"> Section 7 concludes that drawing the examples from different files is not the main reason for the degradation. This is especially true when the corpus has low genre/topic variation (e.g. the WSJ).</Paragraph> <Paragraph position="2"> Section 8 shows that sharing genre/topic is a key factor, as the WSJ corpus attains better results on the press:reportage category than the rest of the categories of the BC itself do. Texts on the same genre/topic share more collocations than texts on disparate genres/topics, even if they come from different corpora.</Paragraph> <Paragraph position="3"> This seems to also rule out explanation c) (cf. Section 6), as a good measure of topic/genre similarity would help overcome cross-corpora problems.</Paragraph> <Paragraph position="4"> That only leaves the low amount of data available for this study (explanation d). It is true that data scarcity can affect the number of collocations shared across corpora. We think that larger amounts of data will make this number grow, especially if the corpus draws texts from different genres and topics. Nevertheless, the figures in Table 14 indicate that even in those conditions genre/topic relatedness would help to find common collocations.</Paragraph> </Section> </Paper>