<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2084"> <Title>Combining Association Measures for Collocation Extraction</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Lexical association measures are mathematical formulas that determine the strength of association between two or more words based on their occurrences and cooccurrences in a text corpus. They have a wide range of applications in natural language processing and computational linguistics, such as automatic collocation extraction (Manning and Schutze, 1999), bilingual word alignment (Mihalcea and Pedersen, 2003), and dependency parsing. Many association measures have been introduced in recent decades. An overview of the most widely used techniques is given, e.g., in Manning and Schutze (1999) or Pearce (2002). Several researchers have also attempted to compare existing methods and suggest different evaluation schemes, e.g. Kita (1994) and Evert (2001). A comprehensive study of the statistical aspects of word cooccurrences can be found in Evert (2004) or Krenn (2000).</Paragraph> <Paragraph position="1"> In this paper we present a novel approach to automatic collocation extraction based on combining multiple lexical association measures. We also address the evaluation of association measures by means of precision-recall graphs and mean average precision scores. Finally, we propose a step-wise feature selection algorithm that reduces the number of combined measures needed with respect to performance on held-out data.</Paragraph> <Paragraph position="2"> The term collocation has both a linguistic and a lexicographic character. It has various definitions, but none of them is widely accepted. 
We adopt the definition from Choueka (1988), who defines a collocational expression as &quot;a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components&quot;.</Paragraph> <Paragraph position="3"> This notion of collocation is relatively wide and covers a broad range of lexical phenomena such as idioms, phrasal verbs, light verb compounds, technological expressions, proper names, and stock phrases. Our motivation originates from machine translation: we want to capture all phenomena that may require special treatment in translation.</Paragraph> <Paragraph position="4"> The experiments presented in this paper were performed on Czech data, and our attention was restricted to two-word (bigram) collocations, primarily because of the limited scalability of some methods to higher-order n-grams, and also because experiments with longer expressions would require processing a much larger corpus to obtain enough evidence of the observed events.</Paragraph> </Section> </Paper>