<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2710"> <Title>Annotating WordNet</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Prior work </SectionTitle> <Paragraph position="0"> Previous efforts to sense-tag corpora manually have demonstrated that the task is not trivial. To begin with, the possibility of distinguishing word senses definitively in general is recognized as being problematic (Atkins and Levin, 1988; Kilgarriff, 1997; Hanks, 2000); indeed the notion of a &quot;sense&quot; has itself been the subject of long debate (see the Hanks and Kilgarriff papers for two recent contributions). These are topics in need of serious consideration, but outside the scope of this paper. Certain issues emerge, however, that are particularly relevant to designing a sense-tagging task; these are the issues we concentrate on here.</Paragraph> <Paragraph position="1"> A word with more than one unrelated sense is called a homonym. &quot;Bank&quot; is the classic example, its unrelated senses being &quot;river bank&quot; and &quot;financial institution.&quot; WordNet does not make a distinction between homonymy and polysemy; a monosemous word in WordNet is therefore one which has neither related nor unrelated senses.</Paragraph> <Paragraph position="2"> The difficulties inherent in the sense-tagging task include the order in which words are presented for tagging, a word's degree of polysemy and part of speech, vagueness of the context, the order in which senses are presented, granularity of the senses, and the level of expertise of the person doing the tagging. Each will be addressed briefly in the following sections.</Paragraph> <Paragraph position="3"> 2.1 Targeted vs. sequential or 'lexical' vs. 'textual' There are two approaches one can take to the order in which words are tagged.
In the sequential approach, termed 'textual' by Kilgarriff (1998), the tagger proceeds through the text one word at a time, assigning the context-appropriate sense to each open class word as it is encountered. The targeted approach ('lexical' in Kilgarriff's terms) involves tagging all corpus instances of a pre-selected word, jumping through the text to each occurrence. The corpora produced by the SEMCOR and READER projects were tagged sequentially; the Kilo and HECTOR projects used the targeted approach (Kilo and READER are described in Miller et al. (1998), HECTOR in Atkins (1993), and SEMCOR in Fellbaum, ed. (1998)).</Paragraph> <Paragraph position="4"> In sequential tagging, the tagger is following the narrative, and so has the meaning of the text in mind when selecting senses for each new word encountered (context is foremost). In targeted tagging, the tagger has the various meanings of the word in mind when presented with each new context (sense distinctions are foremost).</Paragraph> <Paragraph position="5"> In their comparisons of the two approaches, Fellbaum et al. (2003) and Kilgarriff (1998) both conclude that sequential tagging increases the difficulty of the task by requiring the tagger to acquaint (and then reacquaint) themselves with the senses of a word each time they are confronted with it in the text. The targeted approach, on the other hand, enables the tagger to gain mastery of the sense-distinctions of a single word at a time, reducing the amount of effort required to tag each new instance. Miller et al. (1998) present a contrasting view. 
In evaluating the Kilo and READER tagging tasks, they find targeted tagging to be more tedious for the taggers than sequential tagging, and no faster, as time is needed to assimilate new contexts for each word occurrence.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Polysemy, POS, and sense order </SectionTitle> <Paragraph position="0"> In their 1997 paper, Fellbaum et al. analyzed the results of the SEMCOR project, in which part of the Brown Corpus (Kučera and Francis, 1967) was tagged to WordNet 1.4 senses. Their analysis identified three factors that influenced the difficulty level, and thus the accuracy, of the tagging task: degree of polysemy of the word being tagged, the word's part of speech, and the order in which the WordNet sense choices are displayed to the person doing the tagging.</Paragraph> <Paragraph position="1"> The effect of a high degree of polysemy is to present more choices to the tagger, usually with finer distinctions among the senses, increasing the difficulty of selecting one out of several closely-related senses.</Paragraph> <Paragraph position="2"> The correspondence of a word's part of speech with accuracy of tagging stems from the nature of the objects that words of a certain class denote. Words that refer to concepts that are concrete tend to have relatively fixed, easily distinguishable senses. Words with more abstract or generic referents tend to have a more flexible semantics, with meanings being assigned in context, and hence more difficult to pin down. Nouns tend to be in the former category, verbs in the latter. More abstract classes also tend to have a higher degree of polysemy, adding to the effect.</Paragraph> <Paragraph position="3"> Finally, the presentation of the sense choices in WordNet order, with the most frequent sense first, creates a bias towards selecting the first sense.
Their study shows that randomly ordering the senses removes this effect.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Granularity of senses </SectionTitle> <Paragraph position="0"> Palmer et al. (to appear, 2004) examine the relationship between manual and automatic tagging accuracy and the granularity of the sense inventory. Granularity has to do with fineness of distinctions made from a lexicographer's point of view, and not the number of senses that a word form exhibits in context. It is related to polysemy, in that the greater a word's degree of polysemy, the finer the distinctions that can be made in defining senses for that word. In their experiment, Palmer et al. have lexicographers group WordNet 1.75 senses according to syntactic and semantic criteria, which are used by taggers to tag corpus instances. An automatic WSD system trained on the data tagged using grouped senses shows a 10% overall improvement in performance over running it on data tagged without the groupings. Their study shows that the improvement came not from the smaller number of senses resulting from the groupings, but from the groupings themselves, which increased the manual tags' accuracy (defined as agreement between taggers), thereby increasing the accuracy of the systems that learned from them. This effect arises from the slippery nature of word senses and the impossibility of capturing them in neatly delimited, universally agreed-upon sense-boxes. New usages of words extending old meanings, vague contexts that select for multiple senses, and the limits of the tagger's own knowledge of a specialized domain, all defy the assignment of a single, unequivocal sense to a word's instance across annotators. Palmer et al.
propose sense groupings as a practical solution in these situations.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Tagger expertise </SectionTitle> <Paragraph position="0"> Finally, there is the question of whether novice taggers with adequate training can attain the level of accuracy of experienced lexicographers and linguists. Fellbaum et al. (1997) answer this in the negative. Their findings show novice tagger accuracy decreasing as the number of senses, or fineness of distinctions among the senses, increases. Level of expertise likely influenced the slow pace of tagging reported for the Kilo and READER projects, which employed novice taggers. During the tagging of the evaluation dataset for SENSEVAL-1, the highly experienced lexicographers who did the tagging reported that the time spent absorbing new contexts dropped off rapidly after a slow start-up period (Krishnamurthy and Nicholls, 2000).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 The present approach </SectionTitle> <Paragraph position="0"> To the extent that these difficulties can be addressed, we have attempted to do so. We feel the most accurate results can be obtained from the targeted approach using linguistically-trained taggers. The nature of the glosses (relatively short, completely self-contained) means that a fairly restricted context will need to be assimilated for each instance of a token, eliminating one factor of difficulty associated with the targeted approach. Since a definition is, by definition, unambiguous, the context provided by a gloss should, in theory, never be insufficient to disambiguate the words used within it. In this respect, the glosses differ from KWIC (Key Word In Context) lines in a concordance, with which they can be compared.
KWIC concordances, so named because they display corpus instances of a (key) word along with surrounding text, are used by lexicographers in a manner very similar to the targeted approach to sense-tagging. There the task is to define the word, to determine and delineate its senses given its contexts of use. One further difference to be exploited is the fact that, unlike a sentence in a typical corpus, a gloss is embedded within a network of WordNet relations. This means that immediate hypernym, domain category, and other relations can be made available to the user as additional disambiguation aids.</Paragraph> <Paragraph position="1"> The order of the senses will be scrambled in the manual tagging interface so as to prevent a bias towards the first sense listed. To avoid putting any additional burden on the tagger, the order of senses will be fixed at the beginning of the session, and kept constant until the tagger exits the program or selects another word to tag.</Paragraph> <Paragraph position="2"> Underspecified word senses are expressed in WordNet in the form of verb groups, which will be presented to the tagger in the sense display with the option to select either the entire group, or individual senses within the group. Where no appropriate grouping exists, and context and domain category are not enough to fully disambiguate, multiple tags can be assigned. Precise guidelines for when multiple senses can be assigned, and under what criteria, will need to be developed, and taggers will need to be extensively trained on them.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Annotating the glosses </SectionTitle> <Paragraph position="0"> There are six major components to the present sense-tagging system. They function in pipeline fashion, with the output of one being fed as input to the next.
Each pass through the data produces output in valid XML, the structure of which is covered in Subsection 3.1. The six components are: (1) a gloss parser, (2) a tokenizer, (3) an ignorable text chunker and classifier, (4) a WordNet collocation recognizer, (5) an automatic sense tagger, and (6) a manual tagging tool.</Paragraph> <Paragraph position="1"> Prior to and in conjunction with building the preprocessor (the first four components), analysis of the WordNet glosses was undertaken to determine what should be presented for tagging, and what was not to be tagged.</Paragraph> <Paragraph position="2"> Ignorable classes of word forms and multi-word forms were determined during this phase. These were used as a basis for the development of a stop list of words and phrases to ignore completely, and a second, semi-stop list that we have dubbed the &quot;wait list&quot;. The stop list is reserved for unequivocally closed-class words and phrases including prepositions, conjunctions, determiners, pronouns and modals, plus multi-word forms that function as prepositions (e.g., &quot;by means of&quot;). Words on the wait list will be held out from the automatic tagging stage for manual review and tagging later. Since WordNet covers only open-class parts of speech, word forms that have homographs in both open and closed-class parts of speech are on this list. During the manual tagging stage, the open-class senses will be tagged. Highly polysemous words such as &quot;be&quot; and &quot;have&quot; are also waitlisted. Many glosses also contain example sentences. While not an essential part of the semantic makeup of the synset, they do give some information about the illustrated word's sense-specific context of use, contributing to meaning in a different way. For this reason, we will be tagging the synset word (and only that word) of which the sentence is an exemplar.
By virtue of being located within the synset, the exemplified form is in effect automatically disambiguated--it's just a matter of assigning the tag.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The glosstag DTD </SectionTitle> <Paragraph position="0"> Development of the formal model for the sense-tagged glosses took the DTD from the SEMCOR project as a starting point (Landes et al., 1998). It went through several iterations of modification, first to accommodate the specifics of the dataset being tagged (WordNet glosses as opposed to open text), and then to refine the handling of WordNet collocations. Prior tagging efforts had employed the WordNet method of representing collocations as single word forms, with underscores replacing the spaces between words. While it is a practical solution that gives the collocation the same status and representational form that it has as an entry in WordNet, by treating a collocation as a &quot;word&quot;, we lose the fact that it is decomposable into smaller units. This renders difficult the coding of discontinuous collocations (that is, collocations interrupted by one or more intervening words, for example 'His performance blew the competition out of the water', where &quot;blow out of the water&quot; is a WordNet collocation). A scheme that enables collocations to be treated both as individual words and as multi-word units is therefore desirable, particularly if future parsing passes need to identify the internal structure of a collocation, as for distinguishing phrase heads from non-heads.</Paragraph> <Paragraph position="1"> The smallest structural unit, then, is a word (or piece of punctuation), marked as <cf> if it is part of a WordNet collocation, and as <wf> otherwise. 
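To illustrate, the discontinuous collocation above could be represented at both levels with a sketch like the following; the Token fields and the coll link are illustrative stand-ins, not the actual glosstag DTD attributes:

```python
# Sketch of the two-level scheme: tokens belonging to a WordNet collocation
# are <cf>'s sharing a collocation id, other tokens are <wf>'s. Field names
# ("kind", "coll") are illustrative, not the actual glosstag DTD attributes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str
    kind: str                   # "wf" or "cf"
    coll: Optional[str] = None  # id linking the <cf>'s of one collocation

# "His performance blew the competition out of the water", where
# "blow out of the water" is a discontinuous WordNet collocation.
tokens = [
    Token("His", "wf"), Token("performance", "wf"),
    Token("blew", "cf", coll="c1"), Token("the", "wf"),
    Token("competition", "wf"), Token("out", "cf", coll="c1"),
    Token("of", "cf", coll="c1"), Token("the", "cf", coll="c1"),
    Token("water", "cf", coll="c1"),
]

def collocation_surface(tokens, coll_id):
    """Reassemble a possibly discontinuous collocation into the single
    underscore-joined form WordNet uses for multi-word entries."""
    return "_".join(t.text for t in tokens if t.coll == coll_id)

print(collocation_surface(tokens, "c1"))  # blew_out_of_the_water
```

The point of the shared id is that the same token sequence supports both views: iterate over tokens for word-level processing, or group by coll for collocation-level processing.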
Attributes on the <wf> and <cf> elements identify each form uniquely in the gloss, and link together the constituent <cf>'s of a collocation.</Paragraph> <Paragraph position="2"> The major structural units of a gloss are <def>, <ex>, and <aux>. <def> contains the definitional portion of the gloss, the main interest of the tagging task. A <def> may be followed by one or more <ex>'s, each containing an example sentence. Auxiliary information, coded as <aux>, may precede or follow the <def>, or occur within it. (Auxiliary text is a cover term for a range of numeric and symbolic classes, such as dates, times, numbers, numeric ranges, measurements, and formulas, as well as parenthesized and other secondary text that is inessential to the meaning of a synset.) Figure 1 shows the marked up gloss for sense 11 of life (the gloss text is &quot;living things collectively;&quot;), as it looks after preprocessing.</Paragraph> <Paragraph position="3"> Prior to sense-tagging, the lemma attribute on the <wf> or head <cf> of a collocation is set to all possible lemma forms, as determined during lemmatization (explicated more fully below). (&quot;Head&quot; here refers simply to the first word in the collocation, and not the syntactic head; the head <cf> bears the lemma and sense-tag(s) for the entire collocation.) After sense-tagging, the lemma attribute is set to only the lemma of the word/collocation that it is tagged to; all other options are deleted.</Paragraph> <Paragraph position="4"> An <id> element representing the sense tag is inserted as a child of the <wf> (or <cf>); if multiple sense-tags are assigned, then multiple <id>'s are assigned, one for each tag.
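The lemma-attribute convention can be seen in Figure 1, e.g. lemma=&quot;living%1|live%2|living%3&quot;: candidates are pipe-separated, and the %N suffix appears to encode part of speech as in WordNet sense keys (1 = noun, 2 = verb, 3 = adjective, 4 = adverb). A minimal sketch of reading and pruning that attribute, under that assumption (the function names are invented):

```python
# Parse a pre-tagging lemma attribute such as "living%1|live%2|living%3" and
# prune it after tagging. The reading of %N as a WordNet-style POS number
# (1=noun, 2=verb, 3=adj, 4=adv) is an assumption, not documented DTD fact.
POS = {1: "noun", 2: "verb", 3: "adj", 4: "adv"}

def parse_lemma_attr(value):
    """Split a pre-tagging lemma attribute into (lemma, pos) candidates."""
    candidates = []
    for cand in value.split("|"):
        lemma, _, code = cand.partition("%")
        candidates.append((lemma, POS.get(int(code), "?")))
    return candidates

def prune_after_tagging(value, chosen):
    """After sense-tagging, keep only the candidate actually tagged to."""
    return "|".join(c for c in value.split("|") if c == chosen)

print(parse_lemma_attr("living%1|live%2|living%3"))
# [('living', 'noun'), ('live', 'verb'), ('living', 'adj')]
print(prune_after_tagging("living%1|live%2|living%3", "living%3"))  # living%3
```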
Figure 2 shows the sense-tagged gloss for life.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Preprocessing and automatic tagging </SectionTitle> <Paragraph position="0"> The preprocessing stage segments the gloss into chunks and tokenizes the gloss contents into words and WordNet collocations. The tokenization pass isolates word forms and disambiguates lexical from non-lexical punctuation.</Paragraph> <Paragraph position="1"> Lexical punctuation is retained as part of the word form; non-lexical punctuation is encoded as ignorable <wf>'s.</Paragraph> <Paragraph position="2"> Abbreviations and acronyms are recognized, contractions split, and stop list and wait list forms are handled. All <wf>'s other than punctuation are lemmatized, that is, they are reduced to their WordNet entry form using an in-house tool, moan, that was developed for this purpose.</Paragraph> <Paragraph position="3"> Part of speech is not disambiguated during preprocessing; therefore, lemmatizing assigns all potential lemma forms for all part of speech classes that moan returns for the token. Part of speech disambiguation will occur as a side-effect of sense-tagging, avoiding the introduction of errors related to POS-tagging. Lemmatizing serves two functions: first, when searching the database of glosses for the term being tagged, and second, when displaying the sense choices for a particular instance.</Paragraph> <Paragraph position="4"> The targeted tagging approach introduces the problem of locating all inflected forms of the word/collocation to be tagged.
Rather than build a tool to generate inflected forms, our solution was to pre-lemmatize the corpus and search on the lemma forms, on the assumption that while the search will overgenerate matches, it will not miss any. (Moan falls somewhere between a stemmer and a full morphological analyzer: it recognizes inflectional endings and restores a corpus instance to its possible lemma form(s), classified by part of speech and grammatical category of the inflectional suffix. The lemma form is the WordNet entry spelling, if the word is in WordNet.)</Paragraph> <Paragraph position="5"> Locating alternations in hyphenation will be handled in a similar way, via the pre-indexing of alternate forms of hyphenated words/collocations in WordNet.</Paragraph> <Paragraph position="6"> The ignorable text classifier recognizes ignorable text as described earlier, chunking multi-word terms and assigning attributes indicating semantic class. The markup will enable them to be treated as individual words or, alternatively, as a single form indicating the class, which will be of use should further parsing or semantic processing of the glosses be called for.</Paragraph> <Paragraph position="7"> The WordNet collocation recognizer, or globber, uses a bag-of-words approach to locate multi-word forms in the glosses. First, all possible collocations are pulled from the WordNet database. This list is then filtered by several criteria designed to exclude candidates that cannot be accurately identified automatically. The largest class of excluded words is that of phrasal verbs, which cannot easily be distinguished from verbs followed by prepositions heading prepositional phrases (compare &quot;Last year legal fees ate up our profits&quot; with &quot;Last night we ate up the street&quot;). Many of these will be globbed by hand in the early stages of manual tagging.</Paragraph> <Paragraph position="8"> Figure 1 (the marked up gloss for sense 11 of life, as it looks after preprocessing): <synset pos=&quot;n&quot; ofs=&quot;00005905&quot;> <gloss desc=&quot;wsd&quot;> <def> <wf wf-num=&quot;1&quot; tag=&quot;un&quot; lemma=&quot;living%1|live%2|living%3&quot;>living</wf> <wf wf-num=&quot;2&quot; tag=&quot;un&quot; lemma=&quot;thing%1|things%1&quot;>things</wf> <wf wf-num=&quot;3&quot; tag=&quot;un&quot; lemma=&quot;collectively%4&quot; sep=&quot;&quot;>collectively</wf> <wf wf-num=&quot;4&quot; type=&quot;punc&quot; tag=&quot;ignore&quot;>;</wf> </def> </gloss> </synset></Paragraph> <Paragraph position="9"> From this list of excluded words, we also generate a list of collocations that contain monosemous words. This list will later be used to prevent those words from being erroneously tagged in the automatic sense-tagging stage. The final list of words to be automatically globbed also takes into account variations in hyphenation and capitalization.</Paragraph> <Paragraph position="10"> Once the list is completed, the next step is to create an index of the glosses referenced by the lemmatized forms they contain. For each collocation, the globber calculates the intersection of the lists of glosses containing its constituent word forms. This list of possible collocations is then ordered by gloss.</Paragraph> <Paragraph position="11"> The final step of the globber iterates through each of the glosses, three passes per gloss. The first pass marks the monosemous words found in excluded collocations, without globbing the collocation. Pass two identifies multi-word forms that appear as consecutive <wf>'s in the text.
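The index-and-intersect step of the globber can be sketched as follows; the gloss data and function names are invented for illustration:

```python
# Sketch of the globber's candidate-generation step: an inverted index from
# lemmatized word form to the glosses containing it, intersected over a
# collocation's constituent words. Data and names are illustrative only.
from collections import defaultdict

def build_index(glosses):
    """Map each (lemmatized) word form to the set of gloss ids containing it."""
    index = defaultdict(set)
    for gid, words in glosses.items():
        for w in words:
            index[w].add(gid)
    return index

def candidate_glosses(index, collocation):
    """Bag-of-words test: a gloss can contain the collocation only if it
    contains every constituent word, so the search may overgenerate
    matches but will not miss any occurrence."""
    sets = [index.get(w, set()) for w in collocation.split("_")]
    return set.intersection(*sets) if sets else set()

glosses = {
    "g1": ["snake", "garter", "ribbon", "bull"],
    "g2": ["a", "garter", "holds", "up", "stockings"],
}
index = build_index(glosses)
print(sorted(candidate_glosses(index, "garter_snake")))  # ['g1']
```

The per-gloss candidate lists produced this way are what the three in-gloss passes then confirm or reject.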
The final pass attempts to locate disjoint collocations that follow certain set patterns of usage, such as &quot;ribbon, bull, and garter snakes&quot;, where &quot;ribbon snake&quot;, &quot;bull snake&quot;, and &quot;garter snake&quot; are all in WordNet. &quot;Garter snake&quot; is globbed in pass two, and parallel structure helps identify &quot;ribbon snake&quot; and &quot;bull snake&quot; in the third pass.</Paragraph> <Paragraph position="12"> After preprocessing is complete, the automatic sense tagger tags monosemous <wf>'s and <cf>'s to their WordNet senses. Words and collocations tagged by the automatic tagger are distinguished from manually tagged terms by an attribute in the markup.</Paragraph> <Paragraph position="13"> Sense-tagging the glosses to WordNet senses presupposes that all words used in the glosses (and all senses used of those words) exist as entries. The preprocessing and auto-tagging phase will therefore include a few dry runs to identify any typographical errors and words not covered; errors will be fixed, and open-class words or word senses will be added to WordNet as necessary.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Manual tagger interface </SectionTitle> <Paragraph position="0"> The single most important design consideration for the manual tagger interface is the repetitiveness inherent to the task. With approximately 550,000 polysemous open-class words and collocations in the glosses (with an average polysemy of 2.59 senses per word form), each tagger will tag hundreds of words in a day of work. We have made every effort to minimize the amount of mouse movement and the number of button presses required to tag each word.</Paragraph> <Paragraph position="1"> The layout of the program window is simple. The current search term is displayed in an entry box near the top of the screen.</Paragraph> <Paragraph position="2"> Below this box are two text boxes, for glosses and examples, respectively. Buttons used to alter the current tag or tags lie above the final text box, which is used to display and select the WordNet sense or senses for the current word.</Paragraph> <Paragraph position="3"> The tag status of each word or collocation in the gloss and example boxes is indicated through the use of color, font, and highlighting. Orange text indicates a term that has been automatically tagged, red type denotes a manually tagged word, words marked as ignorable are shown in black, and the remainder of the taggable text is blue.</Paragraph> <Paragraph position="4"> Words that are part of a collocation are underlined, and forms that match the targeted search term are bolded. The current selection is highlighted in yellow.</Paragraph> <Paragraph position="5"> There are several ways to navigate the glosses. For targeted tagging, the user chooses one or more senses, then clicks the 'Tag' button, assigning those senses and automatically jumping to the next untagged instance of the search word. Other buttons allow movement to the next or previous instance of the search term without altering the tag status of the current selection. The interface was designed with targeted tagging in mind, but the user can switch between targeted and sequential modes, to fix an improperly tagged word.</Paragraph> <Paragraph position="6"> The interface allows a user to filter the displayed senses by part of speech, to concentrate on the relevant options. When context is insufficient to fully disambiguate, a word or collocation can be tagged to more than one sense or to a WordNet verb group.
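The sense-display policy outlined in Section 2.5 (scrambled order, fixed for the duration of a session) together with the part-of-speech grouping can be sketched as follows; the function name and the seeding scheme are assumptions, not the actual implementation:

```python
# Sketch: senses are grouped by part of speech and shuffled within each
# group to counter first-sense bias; seeding the shuffle per session keeps
# the order constant until the tagger exits or switches search terms.
# The sense identifiers below are made up for illustration.
import random

def display_order(senses, session_seed):
    """senses: list of (pos, sense_id) pairs; returns them grouped by POS,
    shuffled within each group, deterministically for a given session."""
    rng = random.Random(session_seed)
    by_pos = {}
    for pos, sid in senses:
        by_pos.setdefault(pos, []).append(sid)
    ordered = []
    for pos in sorted(by_pos):
        group = by_pos[pos][:]
        rng.shuffle(group)
        ordered.extend((pos, sid) for sid in group)
    return ordered

senses = [("n", "life#1"), ("n", "life#11"), ("v", "live#2")]
# The same session seed always yields the same order; a new session reshuffles.
assert display_order(senses, 7) == display_order(senses, 7)
```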
To prevent bias caused by the order of the displayed senses, each time a new targeted search term is entered, the senses shown in the sense box are shuffled after being grouped by part of speech.</Paragraph> <Paragraph position="7"> During the targeted tagging process, the interface also enables a user to easily inspect and change the sense tags assigned to words other than the search term. The interface will display a box containing a tagged word's senses when the cursor is placed over it, providing useful information for disambiguating the search term. Additionally, if the user notices a tagging error, the mis-tagged word can be selected for editing. Errors and omissions of globbing can be corrected in a similar fashion. To &quot;un-glob&quot; a collocation, one need only select the collocation and click the &quot;un-glob&quot; button. To group separate <wf>s into a new collocation, a user can select each constituent form with the mouse and click the &quot;glob&quot; button. The interface will then provide a list of potential lemmas for the collocation, from which the user can select the appropriate choice.</Paragraph> </Section> </Section> </Paper>