<?xml version="1.0" standalone="yes"?>
<Paper uid="P90-1032">
  <Title>AUTOMATICALLY EXTRACTING AND REPRESENTING COLLOCATIONS FOR LANGUAGE GENERATION*</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Language generation research on lexical choice has focused on syntactic and semantic constraints on word choice and word ordering. Collocational constraints, however, also play a role in how words can co-occur in the same sentence. Often, the use of one word in a particular context of meaning will require the use of one or more other words in the same sentence. While phrasal lexicons, in which lexical associations are pre-encoded (e.g., [Kukich 83], [Jacobs 85], [Danlos 87]), allow for the treatment of certain types of collocations, they also have problems. Phrasal entries must be compiled by hand, which is expensive and leaves the lexicon incomplete. Furthermore, phrasal entries tend to capture rather rigid, idiomatic expressions. In contrast, collocations vary tremendously in the number of words involved, in the syntactic categories of the words, in the syntactic relations between the words, and in how rigidly the individual words are used together. For example, in some cases, the words of a collocation must be adjacent, while in others they can be separated by a varying number of other words.</Paragraph>
    <Paragraph position="1"> *The research reported in this paper was partially supported by DARPA grant N00039-84-C-0165, by NSF grant IRT-84-51438 and by ONR grant N00014-89-J-1782.</Paragraph>
    <Paragraph position="2"> †Most of this work is also done in collaboration with Bell Communication Research, 445 South Street, Morristown, NJ. In this paper, we identify a range of collocations that are necessary for language generation, including open compounds of two or more words, predicative relations (e.g., subject-verb), and phrasal templates representing more idiomatic expressions. We then describe how Xtract automatically acquires the full range of collocations using a two-stage statistical analysis of large domain-specific corpora. Finally, we show how collocations can be efficiently represented in a flexible lexicon using a unification-based formalism. This is a word-based lexicon that has been macrocoded with collocational knowledge.</Paragraph>
    <Paragraph position="3"> Unlike a purely phrasal lexicon, we thus retain the flexibility of word-based lexicons, which allows collocations to be combined and merged in syntactically acceptable ways with other words or phrases of the sentence. Unlike pure word-based lexicons, we gain the ability to deal with a variety of phrasal entries. Furthermore, while there has been work on the automatic retrieval of lexical information from text [Garside 87], [Choueka 88], [Klavans 88], [Amsler 89], [Boguraev &amp; Briscoe 89], [Church 89], none of these systems retrieves the entire range of collocations that we identify, and no real effort has been made to use this information for language generation [Boguraev &amp; Briscoe 89].</Paragraph>
    <Paragraph position="4"> In the following sections, we describe the range of collocations that we can handle, the fully implemented acquisition method, the results obtained, and the representation of collocations in Functional Unification Grammars (FUGs) [Kay 79]. Our application domain is stock market reports, and the corpus on which our expertise is based consists of more than 10 million words taken from the Associated Press news wire.</Paragraph>
  </Section>
</Paper>