<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1007">
  <Title>Retrieving Collocations from Text: Xtract</Title>
  <Section position="5" start_page="148" end_page="149" type="relat">
    <SectionTitle>
4. Related Work
</SectionTitle>
    <Paragraph position="0"> There has been a recent surge of research interest in corpus-based computational linguistics methods; that is, the study and elaboration of techniques using large real text as a basis. Such techniques have various applications. Speech recognition (Bahl, Jelinek, and Mercer 1983) and text compression (e.g., Bell, Witten, and Cleary 1989; Guazzo 1980) have been of long-standing interest, and some new applications are currently being investigated, such as machine translation (Brown et al. 1988), spelling correction (Mays, Damerau, and Mercer 1990; Church and Gale 1990), parsing (Debili 1982; Hindle and Rooth 1990). As pointed out by Bell, Witten, and Cleary (1989), these applications fall under two research paradigms: statistical approaches and lexical approaches. In the statistical approach, language is modeled as a stochastic process and the corpus is used to estimate probabilities. In this approach, a collocation is simply considered as a sequence of words (or n-gram) among millions of other possible sequences. In contrast, in the lexical approach, a collocation is an element of a dictionary among a few thousand other lexical items. Collocations in the lexicographic meaning are only dealt with in the lexical approach. Aside from the work we present in this paper, most of the work carried out within the lexical approach has been done in computer-assisted lexicography by Choueka, Klein, and Neuwitz (1983) and Church and his colleagues (Church and Hanks 1989). Both works attempted to automatically acquire true collocations from corpora. Our work builds on Choueka's, and has been developed contemporarily to Church's.</Paragraph>
    <Paragraph position="1"> Choueka, Klein, and Neuwitz (1983) proposed algorithms to automatically retrieve idiomatic and collocational expressions. A collocation, as defined by Choueka, is a sequence of adjacent words that frequently appear together. In theory the sequences can be of any length, but in actuality, they contain two to six words. In Choueka 5 Taken from the daily reports transmitted daily by The Associated Press newswire.</Paragraph>
    <Paragraph position="2">  Computational Linguistics Volume 19, Number 1 (1988), experiments performed on an 11 million-word corpus taken from the New York Times archives are reported. Thousands of commonly used expressions such as &amp;quot;fried chicken,&amp;quot; &amp;quot;casual sex,&amp;quot; &amp;quot;chop suey,&amp;quot; &amp;quot;home run,&amp;quot; and &amp;quot;Magic Johnson&amp;quot; were retrieved. Choueka's methodology for handling large corpora can be considered as a first step toward computer-aided lexicography. The work, however, has some limitations. First, by definition, only uninterrupted sequences of words are retrieved; more flexible collocations such as &amp;quot;make-decision,&amp;quot; in which the two words can be separated by an arbitrary number of words, are not dealt with. Second, these techniques simply analyze the collocations according to their observed frequency in the corpus; this makes the results too dependent on the size of the corpus. Finally, at a more general level, although disambiguation was originally considered as a performance task, the collocations retrieved have not been used for any specific computational task.</Paragraph>
    <Paragraph position="3"> Church and Hanks (1989) describe a different set of techniques to retrieve collocations. A collocation as defined in their work is a pair of correlated words. That is, a collocation is a pair of words that appear together more often than expected.</Paragraph>
    <Paragraph position="4"> Church et al. (1991) improve over Choueka's work as they retrieve interrupted as well as uninterrupted sequences of words. Also, these collocations have been used by an automatic parser in order to resolve attachment ambiguities (Hindle and Rooth 1990).</Paragraph>
    <Paragraph position="5"> They use the notion of mutual information as defined in information theory (Shannon 1948; Fano 1961) in a manner similar to what has been used in speech recognition (e.g., Ephraim and Rabiner 1990), or text compression (e.g., Bell, Witten, and Cleary 1989), to evaluate the correlation of common appearances of pairs of words. Their work, however, has some limitations too. First, by definition, it can only retrieve collocations of length two. This limitation is intrinsic to the technique used since mutual information scores are defined for two items. The second limitation is that many collocations identified in Church and Hanks (1989) do not really identify true collocations, but simply pairs of words that frequently appear together such as the pairs &amp;quot;doctor-nurse,&amp;quot; &amp;quot;doctor-bill,&amp;quot; &amp;quot;doctor-honorary,&amp;quot; &amp;quot;doctors-dentists,&amp;quot; &amp;quot;doctors-hospitals,&amp;quot; etc. These co-occurrences are mostly due to semantic reasons. The two words are used in the same context because they are of related meanings; they are not part of a single collocational construct.</Paragraph>
    <Paragraph position="6"> The work we describe in the rest of this paper is along the same lines of research. It builds on Choueka's work and attempts to remedy the problems identified above. The techniques we describe retrieve the three types of collocations discussed in Section 2, and they have been implemented in a tool, Xtract. Xtract retrieves interrupted as well as uninterrupted sequences of words and deals with collocations of arbitrary length (1 to 30 in actuality). The following four sections describe and discuss the techniques used for Xtract.</Paragraph>
  </Section>
class="xml-element"></Paper>