<?xml version="1.0" standalone="yes"?> <Paper uid="P91-1036"> <Title>FROM N-GRAMS TO COLLOCATIONS: AN EVALUATION OF XTRACT</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> In the past, several approaches have been proposed to retrieve various types of collocations from the analysis of large samples of textual data. Pairwise associations (bigrams or 2-grams) (e.g., [Smadja, 1988], [Church and Hanks, 1989]) as well as n-word (n > 2) associations (or n-grams) (e.g., [Choueka et al., 1983], [Smadja and McKeown, 1990]) were retrieved. These techniques automatically produced large numbers of collocations along with statistical figures intended to reflect their relevance.</Paragraph> <Paragraph position="1"> However, none of these techniques provides functional information along with the collocation. Also, the results produced often contained improper word associations reflecting some spurious aspect of the training corpus that did not stand for true collocations. This paper addresses these two problems.</Paragraph> <Paragraph position="2"> Previous papers (e.g., [Smadja and McKeown, 1990]) introduced a set of techniques and a tool, Xtract, that produces various types of collocations from a two-stage statistical analysis of large textual corpora, briefly sketched in the next section. In Sections 3 and 4, we show how robust parsing technology can be used both to filter out a number of invalid collocations and to add useful syntactic information to the retained ones. This filter/analyzer is implemented in a third stage of Xtract that automatically goes over the output collocations to reject the invalid ones and label the valid ones with syntactic information. For example, if the first two stages of Xtract produce the collocation &quot;make-decision,&quot; the goal of this third stage is to identify it as a verb-object collocation.
If no such syntactic relation is observed, then the collocation is rejected. In Section 5 we present an evaluation of Xtract as a collocation retrieval system. The addition of the third stage of Xtract has been evaluated to raise the precision of Xtract from 40% to 80%, and it has a recall of 94%. In this paper we use examples related to the word &quot;takeover&quot; from a 10 million word corpus containing stock market reports originating from the Associated Press newswire.</Paragraph> </Section> </Paper>