<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0203">
  <Title>Desiderata for Tagging with WordNet Synsets or MCCA Categories</Title>
  <Section position="4" start_page="12" end_page="13" type="intro">
    <SectionTitle>
4 MCCA Categories and WordNet Synsets
</SectionTitle>
    <Paragraph position="0"> (McTavish, et al., 1995) and (McTavish, et al., 1997) suggest that MCCA categories recapitulate WordNet synsets. We used WordNet synsets in examining MCCA categories to determine their coherence, to characterize their relations with WordNet, and to understand the significance of these relations in the MCCA analysis of concepts and themes and in tagging with WordNet synsets.</Paragraph>
    <Paragraph position="1"> In the MCCA dictionary of 11,000 words, the average number of words in a category is 95, with a range from 1 to about 300. Using the DIMAP software (CL Research, 1997 - in preparation), we created sublexicons of individual categories, extracted WordNet synsets for these sublexicons, extracted information from the Merriam-Webster Concise Electronic Dictionary integrated with DIMAP, and attached lexical semantic information from other resources to entries in these sublexicons. (Footnote: ... lexicons for natural language processing, available from CL Research. Procedures used in this paper, applicable to any category analysis using DIMAP, are available at http://www.clres.com. The general principles of category development followed in these procedures are described in (Litkowski, in preparation).)</Paragraph>
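A minimal sketch of the sublexicon step described above, assuming the MCCA dictionary can be represented as a mapping from word to category name. The function names build_sublexicons and category_size_stats and the toy entries are hypothetical illustrations; this is not DIMAP's interface, and the WordNet synset lookup itself is omitted.

```python
from collections import defaultdict


def build_sublexicons(word_to_category):
    """Group dictionary words into one sorted word list per category."""
    sublexicons = defaultdict(list)
    for word, category in word_to_category.items():
        sublexicons[category].append(word)
    return {category: sorted(words) for category, words in sublexicons.items()}


def category_size_stats(sublexicons):
    """Return (average, smallest, largest) category size."""
    sizes = [len(words) for words in sublexicons.values()]
    return sum(sizes) / len(sizes), min(sizes), max(sizes)


# Hypothetical toy entries; the real MCCA dictionary has about 11,000 words.
toy_dictionary = {
    "the": "The",
    "artist": "Detached roles",
    "critic": "Detached roles",
    "observer": "Detached roles",
    "red": "Colors",
    "blue": "Colors",
}

sublexicons = build_sublexicons(toy_dictionary)
average_size, smallest, largest = category_size_stats(sublexicons)
print(sublexicons["Detached roles"])
print(average_size, smallest, largest)
```

Run over the full dictionary, the same statistics would correspond to the average category size and the size range reported in the text.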
    <Paragraph position="2"> We began with the hypothesis that the categories correspond to those developed by (Hearst &amp; Schütze, 1996) in creating categories from the WordNet noun hierarchy. We found that the MCCA categories were generally internally consistent, but with characteristics not intuitively obvious. As a result, we needed to articulate firm principles for characterizing the categories.</Paragraph>
    <Paragraph position="3"> Eleven categories (such as Have, Prepositions, You, I-Me, He, A-An, The) consist of only a few words from closed classes. The category The contains one word with an average expected frequency of 6 percent (with a range over the four contexts of 5.5 to 6.5 percent). The category Prepositions contains 18 words with an average expected frequency of 11.1 percent (with a range over the four contexts of 9.5 to 12.3 percent). About 20 categories (such as Implication, If, Colors, Object, Being) consist of a relatively small number of words (34, 22, 65, 11, 12, respectively) taken primarily from syntactically or semantically closed-class words (subordinating conjunctions, relativizers, the tops of WordNet, colors).</Paragraph>
    <Paragraph position="4"> The remaining 80 or so categories consist primarily of open-class words (nouns, verbs, adjectives, and adverbs), sprinkled with closed-class words (auxiliaries, subordinating conjunctions). These categories require more detailed analyses: Several categories correspond well to the Hearst &amp; Schütze model. The categories Functional roles, Detached roles, and Human roles align with subtrees rooted at particular nodes in the WordNet hierarchies. For example, Detached roles has a total of 66 words, with an average expected frequency of .16 percent and a range from .10 to .35 percent. The .35 percent frequency is for the analytic context; each of the other three contexts has an expected frequency of about .10 percent. The words in this category include: ACADEMIC, ARTIST, BIOLOGIST, CREATOR, CRITIC, HISTORIAN, INSTRUCTOR, OBSERVER, PHILOSOPHER, (Footnote: In general, we have found that assignment of only about 5 to 10 percent of the words in a category is questionable.)</Paragraph>
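The figures quoted for Detached roles can be checked with a short calculation, assuming the average expected frequency is the unweighted mean over the four contexts (an assumption; the text does not state how the average is computed). The keys for the three non-analytic contexts are placeholder names.

```python
# Expected frequencies (percent) for Detached roles, as reported in the text:
# .35 for the analytic context and about .10 for each of the other three.
# The keys "context_2" to "context_4" are placeholder names, not MCCA terms.
expected_frequency = {
    "analytic": 0.35,
    "context_2": 0.10,
    "context_3": 0.10,
    "context_4": 0.10,
}

# Assumption: the reported average is an unweighted mean over the contexts.
mean_frequency = sum(expected_frequency.values()) / len(expected_frequency)
print(round(mean_frequency, 2))
```

Under that assumption the mean comes to about .16 percent, which agrees with the reported average.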
  </Section>
</Paper>