<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1013"> <Title>A Categorial Variation Database for English</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Lexical relations describe relative relationships among different lexemes. Lexical relations are either hierarchi- [...] clude such information in their next version, but only for nouns and verbs (Christiane Fellbaum, p.c.), not other pairings such as noun-adjective or verb-preposition relationships. Discussions are currently underway for sharing the CatVar database with WordNet developers for more rapid development, extension, and mutual validation of both resources.</Paragraph> <Paragraph position="1"> tions (such as identity, overlap, synonymy and antonymy) (Cruse, 1986).</Paragraph> <Paragraph position="2"> WordNet is the most well-developed and widely used lexical database of English (Fellbaum, 1998). In WordNet, both types of lexical relations are specified among words with the same part of speech (verbs, nouns, adjectives and adverbs). WordNet has been used by many researchers for different purposes, ranging from the construction or extension of knowledge bases such as SENSUS (Knight and Luk, 1994) or the Lexical Conceptual Structure Verb Database (LVD) (Green et al., 2001) to the faking of meaning ambiguity as part of system evaluation (Bangalore and Rambow, 2000). In the context of these projects, one criticism of WordNet is its lack of cross-categorial links, such as verb-noun or noun-adjective relations. Mel'čuk approaches lexical relations by defining a lexical combinatorial zone that specifies semantically related lexemes through Lexical Functions (LF). These functions define a correspondence between a key lexical item and a set of related lexical items (Mel'čuk, 1988). There are two types of functions: paradigmatic and syntagmatic (Ramos et al., 1994). 
Paradigmatic LFs associate a lexical item with related lexical items. The relation can be semantic or syntactic. Semantic LFs include Syn-</Paragraph> <Paragraph position="4"> inine.</Paragraph> <Paragraph position="5"> Syntagmatic LFs specify collocations with a lexeme given a specified relationship. For example, there is an LF that returns a light verb associated with the LF's key: Light-Verb(attention) = pay. Other LFs specify certain semantic associations such as Intensify-Qualifier(escape) = narrow and Degradation(milk) = sour. Lexical Functions have been used in MT and Generation (e.g., Ramos et al., 1994).</Paragraph> <Paragraph position="6"> Although research on Lexical Functions provides an intriguing theoretical discussion, there are no large-scale resources available for categorial variations induced by lexical functions. This lack of resources should not suggest that the problem is too trivial to be worthy of investigation or that a solution would not be a significant contribution. On the contrary, categorial variations are necessary for handling many NLP problems. For example, in the context of MT, Habash et al. (2002) claim that 98% of all translation divergences (variations in how source and target languages structure meaning) involve some form of categorial variation. Moreover, most IR systems require some way to reduce variant words to common roots to improve the ability to match queries (Xu and Croft, 1998; Hull and Grefenstette, 1996; Krovetz, 1993).</Paragraph> <Paragraph position="7"> Given the lack of large-scale resources containing categorial variations, researchers frequently develop and use alternative algorithmic approximations of such a resource. These approximations can be divided into Reductionist (Analytical) and Expansionist (Generative) approximations. The former focus on the conversion of several surface forms into a common root. Stemmers such as the Porter stemmer (Porter, 1980) are a typical example. 
The latter, expansionist approaches, overgenerate possibilities and rely on a statistical language model to rank and select among them. The morphological generator in Nitrogen is an example of such an approximation (Langkilde and Knight, 1998).</Paragraph> <Paragraph position="8"> There are two types of problems with approximations of this type: (1) They are uni-directional and thus limited in usability: a stemmer cannot be used for generation and a morphological overgenerator cannot be used for stemming; (2) The crude approximating nature of such systems causes many problems in quality and efficiency from over-stemming/under-stemming or over-generation/under-generation. Consider, for example, the Porter stemmer, which stems commune, communication and communism to commun. And yet, it does not produce this same stem for communist or communicable (stemmed to [...] returns eleven variations including [...], [...] and [...]. Only two are correct ([...] and [...]). Such overgeneration, multiplied out at different points in a sentence, expands the search space exponentially, and, given various cut-offs in the search algorithm, incorrect forms might even appear among the top-ranked choices.</Paragraph> <Paragraph position="9"> Given these issues, our goal is to build a database of categorial variations that can be used with both expansionist and reductionist approaches without the cost of over- or under-stemming and over- or under-generation. The research reported herein is relevant to MT, IR, and lexicon construction.</Paragraph> </Section> </Paper>