File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1326_intro.xml
Size: 5,426 bytes
Last Modified: 2025-10-06 14:01:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1326"> <Title>One Sense per Collocation and Genre/Topic Variations</Title> <Section position="2" start_page="0" end_page="207" type="intro"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper revisits the one sense per collocation hypothesis using fine-grained sense distinctions and two different corpora.</Paragraph> <Paragraph position="1"> We show that the hypothesis is weaker for fine-grained sense distinctions (70% vs.</Paragraph> <Paragraph position="2"> 99% reported earlier on 2-way ambiguities).</Paragraph> <Paragraph position="3"> We also show that one sense per collocation does hold across corpora, but that collocations vary from one corpus to the other, following genre and topic variations.</Paragraph> <Paragraph position="4"> This explains the low results when performing word sense disambiguation across corpora. In fact, we demonstrate that when two independent corpora share a related genre/topic, the word sense disambiguation results are better.</Paragraph> <Paragraph position="5"> Future work on word sense disambiguation will have to take into account genre and topic as important parameters in its models.</Paragraph> <Paragraph position="6"> Introduction In the early nineties, two famous papers claimed that the behavior of word senses in texts adhered to two principles: one sense per discourse (Gale et al., 1992) and one sense per collocation (Yarowsky, 1993).</Paragraph> <Paragraph position="7"> These hypotheses were shown to hold for some particular corpora (totaling 380 Mwords) on words with 2-way ambiguity. The word sense distinctions came from different sources (translations into French, homophones, homographs, pseudo-words, etc.), but no dictionary or lexical resource was linked to them. 
In the case of the one sense per collocation paper, several corpora were used, but nothing is said about whether the collocations hold across corpora.</Paragraph> <Paragraph position="8"> Since the papers were published, word sense disambiguation has moved to deal with fine-grained sense distinctions from widely recognized semantic lexical resources: ontologies like Sensus, Cyc, EDR, WordNet, EuroWordNet, etc. or machine-readable dictionaries like OALDC, Webster's, LDOCE, etc. This is due, in part, to the availability of public hand-tagged material, e.g. SemCor (Miller et al., 1993) and the DSO collection (Ng & Lee, 1996). We think that the old hypotheses should be tested under the conditions of this newly available data. This paper focuses on the DSO collection, which was tagged with WordNet senses (Miller et al., 1990) and comprises sentences extracted from two different corpora: the balanced Brown Corpus and the Wall Street Journal corpus.</Paragraph> <Paragraph position="9"> Krovetz (1998) has shown that the one sense per discourse hypothesis does not hold for fine-grained senses in SemCor and DSO. His results have been confirmed in our own experiments.</Paragraph> <Paragraph position="10"> We will therefore concentrate on the one sense per collocation hypothesis, considering these two questions: * Does the collocation hypothesis hold across corpora, that is, across genre and topic variations (compared to a single corpus, probably with little genre and topic variation)? * Does the collocation hypothesis hold for fine-grained sense distinctions (compared to homograph level granularity)? The experimental tools to test the hypothesis will be decision lists based on various kinds of collocational information. We will compare the performance across several corpora (the Brown Corpus and Wall Street Journal parts of the DSO collection), and also across different sections of the Brown Corpus, selected according to the genre and topics covered. 
We will also perform a direct comparison, using agreement statistics, of the collocations used and of the results obtained.</Paragraph> <Paragraph position="11"> This study has special significance at this point in word sense disambiguation research. A recent study (Agirre & Martinez, 2000) concludes that, for currently available hand-tagged data, the precision is limited to around 70% when tagging all words in a running text. In the course of extending available data, the efforts to use corpora tagged by independent teams of researchers have been shown to fail (Ng et al., 1999), as have some tuning experiments (Escudero et al., 2000) and an attempt to use examples automatically acquired from the Internet (Agirre & Martinez, 2000). All these studies overlooked the fact that the examples come from different genres and topics. Future work that takes into account the conclusions drawn in this paper will perhaps be able to automatically extend the number of examples available and tackle the acquisition problem. The paper is organized as follows. The resources used and the experimental settings are presented first. Section 3 presents the collocations considered and Section 4 explains how decision lists have been adapted to n-way ambiguities. Sections 5 and 6 show the in-corpus and cross-corpora experiments, respectively. Section 7 discusses the effect of drawing training and testing data from the same documents. Section 8 evaluates the impact of genre and topic variations, which is further discussed in Section 9. Finally, Section 10 presents some conclusions.</Paragraph> </Section> </Paper>