<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2602">
<Title>Towards Full Automation of Lexicon Construction</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">A lexicon is a key resource for natural language processing, providing the link between the terms of a language and the semantic and syntactic properties with which they are associated. Like most resources of considerable value, a good lexicon can be difficult or expensive to obtain. This is particularly true if the lexicon needs to be specialized to a technical subject, an obscure language or dialect, or a highly idiomatic writing style. Motivated by the practical importance of these cases as well as the theoretical interest inherent in the problem, we have set out to develop methods for building a lexicon automatically, given only a corpus of text representative of the domain of interest.</Paragraph>
<Paragraph position="1">We represent the semantics of a term by an associated probability distribution over what we call a grounding space, which we define in various relatively conventional ways involving terms that occur in text in the vicinity of the term in question. It is well known that such distributions can represent meaning reasonably well, at least for meaning-comparison purposes (Landauer and Dumais, 1997). We add to this framework the notion that the more information such a distributional lexicon can capture, the more useful it is. This provides us with a mathematical concept of lexical optimization.</Paragraph>
<Paragraph position="2">We begin the lexicon construction process by applying a distributional clustering technique called information-theoretic co-clustering to make a first pass at grouping the most frequent terms in the corpus according to their most common syntactic part-of-speech category, as described in Section 2 along with illustrative results. We briefly describe the co-clustering algorithm in Section 2.1. In Section 3.1, we show that novel terms can be sensibly assigned to previously defined clusters using the same information-theoretic criterion that the co-clustering uses. Even though term clustering crudely ignores the fact that a term's part of speech generally varies with its context, it is clear from inspection that the clusters themselves correspond to corpus-adapted part-of-speech categories and can be used as such. In Section 3.2, we examine two approaches to incorporating context information.</Paragraph>
<Paragraph position="3">The most direct is to partition the contexts in which a term occurs into classes according to the information-theoretic criterion used in co-clustering, creating sense-disambiguated word-with-context-class pseudo-terms. We also discuss the use of Hidden Markov Models (HMMs) to capture contextual information. In Section 3.3 we apply the same principle in reverse to find multi-word units.</Paragraph>
<Paragraph position="4">We conclude in Section 3.5 with a discussion of possible improvements to our approach and possible extensions of it.</Paragraph>
</Section>
</Paper>
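
Editor's illustration (outside the paper's XML): the paragraph on grounding spaces describes representing each term as a probability distribution over nearby terms, and scoring a lexicon by how much information it captures. The Python sketch below is one minimal reading of that idea under stated assumptions; the co-occurrence window, the function names, and the use of mutual information over raw co-occurrence counts are choices made here, not the authors' implementation.

    from collections import Counter, defaultdict
    import math

    def cooccurrence_counts(tokens, window=2):
        """Joint counts n(term, context term); nearby terms play the role
        of the paper's 'grounding space' (one conventional choice)."""
        counts = defaultdict(Counter)
        for i, term in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[term][tokens[j]] += 1
        return counts

    def term_distribution(counts, term):
        """The probability distribution over the grounding space that
        stands in for the semantics of one term."""
        total = sum(counts[term].values())
        return {c: n / total for c, n in counts[term].items()}

    def mutual_information(counts):
        """I(Term; Context) in bits: one way to quantify how much
        information a distributional lexicon captures."""
        grand = sum(sum(c.values()) for c in counts.values())
        p_t = {t: sum(c.values()) / grand for t, c in counts.items()}
        p_c = Counter()
        for c in counts.values():
            for ctx, n in c.items():
                p_c[ctx] += n / grand
        return sum((n / grand) * math.log2((n / grand) / (p_t[t] * p_c[ctx]))
                   for t, c in counts.items() for ctx, n in c.items())

On this reading, mutual_information(cooccurrence_counts(corpus_tokens)) yields a single score that rises as terms become more predictive of their contexts, which matches the paper's framing of lexical optimization as capturing more information.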
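
Section 3.1, as previewed above, assigns novel terms to previously defined clusters by the same information-theoretic criterion used in co-clustering. A hedged sketch of one such criterion, not necessarily the authors' exact procedure: place a novel term in the cluster whose aggregate context distribution has the smallest KL divergence from the term's own distribution. The smoothing constant and helper names are hypothetical.

    import math

    def kl_divergence(p, q, eps=1e-12):
        """D(p || q) between two context distributions, lightly smoothed
        so that contexts unseen in q do not produce log(0)."""
        return sum(pv * math.log2(pv / (q.get(k, 0.0) + eps))
                   for k, pv in p.items() if pv > 0.0)

    def assign_to_cluster(term_dist, cluster_dists):
        """Assign a novel term to the existing cluster whose aggregate
        context distribution diverges least from the term's own."""
        return min(cluster_dists,
                   key=lambda c: kl_divergence(term_dist, cluster_dists[c]))

Here cluster_dists is assumed to map each cluster label to a pooled or averaged context distribution over its member terms; how that aggregate is formed is itself an assumption.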
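
Finally, the word-with-context-class pseudo-terms of Section 3.2 can be pictured with a toy rewrite, shown below; the term#k naming scheme and the context_class callback (which stands in for the paper's information-theoretic partitioning of contexts) are invented here for illustration.

    def pseudo_terms(tokens, context_class, window=2):
        """Rewrite each token as 'token#k', where k is the class assigned
        to its surrounding context; downstream clustering then treats each
        (word, context class) pair as a distinct, sense-disambiguated term."""
        out = []
        for i, term in enumerate(tokens):
            ctx = tuple(tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1])
            out.append(f"{term}#{context_class(term, ctx)}")
        return out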