<?xml version="1.0" standalone="yes"?>
<Paper uid="P90-1032">
  <Title>AUTOMATICALLY EXTRACTING AND REPRESENTING COLLOCATIONS FOR LANGUAGE GENERATION*</Title>
  <Section position="4" start_page="0" end_page="252" type="metho">
    <SectionTitle>
SINGLE WORDS TO WHOLE
PHRASES: WHAT KIND OF
LEXICAL UNITS ARE NEEDED?
</SectionTitle>
    <Paragraph position="0"> Collocational knowledge indicates which members of a set of roughly synonymous words co-occur with other words and how they combine syntactically. These affinities cannot be predicted on the basis of semantic or syntactic rules, but can be observed with some regularity in text [Cruse 86]. We have found a range of collocations from word pairs to whole phrases, and as we shall show, this range will require a flexible method of representation.</Paragraph>
  </Section>
  <Section position="5" start_page="252" end_page="252" type="metho">
    <SectionTitle>
3 THE ACQUISITION METHOD:
Xtract
</SectionTitle>
    <Paragraph position="0"> Open Compounds. Open compounds involve uninterrupted sequences of words such as &quot;stock market,&quot; &quot;foreign exchange,&quot; &quot;New York Stock Exchange,&quot; &quot;The Dow Jones average of 30 industrials.&quot; They can include nouns, adjectives, and closed class words and are similar to the type of collocations retrieved by [Choueka 88] or [Amsler 89]. An open compound generally functions as a single constituent of a sentence. More open compound examples are given in figure 1. Predicative Relations consist of two (or several) words repeatedly used together in a similar syntactic relation. These lexical relations are harder to identify since they often correspond to interrupted word sequences in the corpus. They are also the most flexible in their use. This class of collocations is related to Mel'čuk's Lexical Functions [Mel'čuk 81], and Benson's L-type relations [Benson 86]. Within this class, Xtract retrieves subject-verb, verb-object, noun-adjective, verb-adverb, verb-verb and verb-particle predicative relations. Church [Church 89] also retrieves verb-particle associations.</Paragraph>
    <Paragraph position="1"> Such collocations require a representation that allows for a lexical function relating two or more words. Examples of such collocations are given in figure 2. Phrasal templates consist of idiomatic phrases containing one, several or no empty slots. They are extremely rigid and long collocations. These almost complete phrases are quite representative of a given domain. Due to their slightly idiosyncratic structure, we propose representing and generating them by simple template filling. Although some of these could be generated using a word-based lexicon, in general, their usage gives an impression of fluency that cannot be equaled with compositional generation alone. Xtract has retrieved several dozen such templates from our stock market corpus, including: &quot;The NYSE's composite index of all its listed common stocks rose</Paragraph>
  </Section>
  <Section position="6" start_page="252" end_page="253" type="metho">
    <SectionTitle>
*NUMBER* to *NUMBER*&quot;
</SectionTitle>
    <Paragraph position="0"> one or several words. The &quot;C/*&quot; sign means that the two words can be in any order.</Paragraph>
    <Paragraph position="1"> In order to produce sentences containing collocations, a language generation system must have knowledge about the possible collocations that occur in a given domain.</Paragraph>
    <Paragraph position="2"> In previous language generation work [Danlos 87], [Iordanskaja 88], [Nirenburg 88], collocations are identified and encoded by hand, sometimes using the help of lexicographers (e.g., Danlos' [Danlos 87] use of Gross' [Gross 75] work). This process is expensive and time-consuming, and its results are often incomplete. In this section, we describe how Xtract can automatically produce the full range of collocations described above.</Paragraph>
    <Paragraph position="3"> Xtract has two main components, a concordancing component, Xconcord, and a statistical component, Xstat. Given one or several words, Xconcord locates all sentences in the corpus containing them. Xstat is the co-occurrence compiler. Given Xconcord's output, it makes statistical observations about these words and other words with which they appear. Only statistically significant word pairs are retained. In [Smadja 89a] and [Smadja 88], we detail an earlier version of Xtract and its output, and in [Smadja 89b] we compare our results both qualitatively and quantitatively to the lexicon used in [Kukich 83]. Xtract has also been used for information retrieval in [Maarek &amp; Smadja 89]. In the updated version of Xtract we describe here, statistical significance is based on four parameters, instead of just one, and a second stage of processing has been added that looks for combinations of word pairs produced in the first stage, resulting in multiple word collocations.</Paragraph>
    <Paragraph position="4"> Stage one: In the first phase, Xconcord is called for a single open class word and its output is pipelined to Xstat, which then analyses the distribution of words in this sample. The output of this first stage is a list of tuples (w1, w2, distance, strength, spread, height, type), where (w1, w2) is a lexical relation between two open-class words (w1 and w2). Some results are given in Table 1. &quot;Type&quot; represents the syntactic categories of w1 and w2 (see footnote 3). &quot;Distance&quot; is the relative distance between the two words, w1 and w2 (e.g., a distance of 1 means w2 occurs immediately after w1 and a distance of -1 means it occurs immediately before it). A different tuple is produced for each statistically significant word pair and distance.</Paragraph>
    <Paragraph position="5"> Thus, if the same two words occur equally often separated by two different distances, they will appear twice in the list. &quot;Strength&quot; (also computed in the earlier version of Xtract) indicates how strongly the two words are related (see [Smadja 89a]). &quot;Spread&quot; is the distribution of the relative distance between the two words; thus, the larger the &quot;spread&quot; the more rigidly they are used in combination with one another. &quot;Height&quot; combines the factors of &quot;spread&quot; [Footnote 3: In order to get part-of-speech information we use a stochastic word tagger developed at AT&amp;T Bell Laboratories by Ken Church [Church 88].]</Paragraph>
  </Section>
  <Section position="7" start_page="253" end_page="255" type="metho">
    <SectionTitle>
Table 3: Concordances for &quot;composite index&quot;
</SectionTitle>
    <Paragraph position="0"> of all its listed common stocks fell 1.76 to 164.13.</Paragraph>
    <Paragraph position="1"> of all its listed common stocks fell 0.98 to 164.91.</Paragraph>
    <Paragraph position="2"> of all its listed common stocks fell 0.96 to 164.93.</Paragraph>
    <Paragraph position="3"> of all its listed common stocks fell 0.91 to 164.98.</Paragraph>
    <Paragraph position="4"> of all its listed common stocks rose 1.04 to 167.08.</Paragraph>
    <Paragraph position="5"> of all its listed common stocks rose 0.76 of all its listed common stocks rose 0.50 to 166.54.</Paragraph>
    <Paragraph position="6"> of all its listed common stocks rose 0.69 to 166.73.</Paragraph>
    <Paragraph position="7"> of all its listed common stocks fell 0.33 to 170.63.</Paragraph>
    <Paragraph position="8">  words for their &quot;distances&quot;. Church [Church 89] produces results similar to those presented in the table using a different statistical method. However, Church's method is mainly based on the computation of the &quot;strength&quot; attribute, and it does not take into account &quot;spread&quot; and &quot;height&quot;. As we shall see, these additional parameters are crucial for producing multiple word collocations and distinguishing between open compounds (words are adjacent) and predicative relations (words can be separated by varying distances).</Paragraph>
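The bookkeeping behind the &quot;distance&quot; attribute can be illustrated with a minimal sketch. It only counts how often a word appears at each signed position relative to w1 and keeps the frequent (word, distance) pairs. It is not Xstat itself: the strength, spread and height statistics and the restriction to open-class words are not modelled, and the names cooccurrence_by_distance, frequent_pairs, window and min_count are assumptions made for this illustration.

```python
from collections import Counter

def cooccurrence_by_distance(sentences, w1, window=5):
    """Count how often each word occurs at each signed distance from w1.

    A distance of 1 means the word occurs immediately after w1,
    a distance of -1 means it occurs immediately before it.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != w1:
                continue
            for d in range(-window, window + 1):
                j = i + d
                # Skip w1 itself and positions outside the sentence.
                if d != 0 and j >= 0 and len(tokens) > j:
                    counts[(tokens[j], d)] += 1
    return counts

def frequent_pairs(counts, w1, min_count=2):
    """Keep (w1, w2, distance, count) tuples seen at least min_count times.

    A raw count threshold is a placeholder for a real significance test.
    """
    return sorted((w1, w2, d, n) for (w2, d), n in counts.items() if n >= min_count)

# Toy corpus, invented for the example.
sentences = [
    "the composite index rose".split(),
    "the composite index fell".split(),
    "a composite score".split(),
]
pairs = frequent_pairs(cooccurrence_by_distance(sentences, "composite"), "composite")
print(pairs)
```

Because each (word, distance) pair is counted separately, the same two words can yield several tuples, one per distance, as described for stage one.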
    <Paragraph position="9"> Stage two: In the second phase, Xtract first uses the same components but in a different way. It starts with the pairwise lexical relations produced in Stage one to produce multiple word collocations, then classifies the collocations as one of three classes identified above, and finally attempts to determine the syntactic relations between the words of the collocation. To do this, Xtract studies the lexical relations in context, which is exactly what lexicographers do. For each entry of Table 1, Xtract calls Xconcord on the two words w1 and w2 to produce the concordances. Tables 2 and 3 show the concordances (output of Xconcord) for the input pairs: &quot;average-industrial&quot; and &quot;index-composite&quot;. Xstat then compiles information on the words surrounding both w1 and w2 in the corpus. This stage allows us to filter out incorrect associations such as &quot;blue-stocks&quot; or &quot;advancing-market&quot; and replace them with the appropriate ones, &quot;blue chip stocks,&quot; &quot;the broader market in the NYSE advancing issues.&quot;</Paragraph>
    <Paragraph position="10"> This stage also produces phrasal templates such as those given in the previous section. In short, stage two filters out inappropriate results and combines word pairs to produce multiple word combinations.</Paragraph>
    <Paragraph position="11"> To make the results directly usable for language generation we are currently investigating the use of a bottom-up parser in combination with stage two in order to classify the collocations according to syntactic criteria. For example, if the lexical relation involves a noun and a verb, it determines whether it is a subject-verb or a verb-object collocation. We plan to do this using a deterministic bottom-up parser developed at Bell Communications Research [Abney 89] to parse the concordances. The parser would analyse each sentence of the concordances and the parse trees would then be passed to Xstat.</Paragraph>
    <Paragraph position="12"> Sample results of Stage two are shown in Figures 1, 2 and 3. Figure 3 shows phrasal templates and open compounds. Xstat notices that the words &quot;composite&quot; and &quot;index&quot; are used very rigidly throughout the corpus. They almost always appear in one of the two sentences shown in Figure 3. The lexical relation composite-index thus produces two phrasal templates. For the lexical relation average-industrial Xtract produces an open compound collocation as illustrated in figure 3. Stage two also confirms pairwise relations. Some examples are given in figure 2. By examining the parsed concordances and extracting recurring patterns, Xstat produces all three types of collocations. [Figure 3, recovered entries: &quot;The NYSE's composite index of all its listed common stocks fell *NUMBER* to *NUMBER*&quot;; &quot;The NYSE's composite index of all its listed common stocks rose *NUMBER* to *NUMBER*&quot;; &quot;close-industrial&quot;: &quot;Five minutes before the close the Dow Jones average of 30 industrials&quot;]</Paragraph>
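How recurring patterns in concordances can surface a phrasal template may be sketched as follows. The sketch abstracts every number in a concordance line to the *NUMBER* slot and counts identical results. This is only an illustration of the idea of extracting recurring patterns, not the procedure Xstat actually applies; the function names and the min_count threshold are assumptions, and the input lines are taken from the &quot;composite index&quot; concordances.

```python
import re
from collections import Counter

# Integers or decimals such as 1.76 or 164.13.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def to_template(line):
    """Replace every number in a concordance line with the *NUMBER* slot."""
    return NUMBER.sub("*NUMBER*", line)

def recurring_templates(lines, min_count=2):
    """Return (template, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(to_template(line) for line in lines)
    return [(t, n) for t, n in counts.most_common() if n >= min_count]

concordances = [
    "of all its listed common stocks fell 1.76 to 164.13.",
    "of all its listed common stocks fell 0.98 to 164.91.",
    "of all its listed common stocks rose 1.04 to 167.08.",
    "of all its listed common stocks rose 0.50 to 166.54.",
    "of all its listed common stocks rose 0.69 to 166.73.",
]
templates = recurring_templates(concordances)
for template, count in templates:
    print(count, template)
```

On these lines the sketch yields one &quot;rose&quot; pattern and one &quot;fell&quot; pattern, which mirrors the two phrasal templates reported for composite-index.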
  </Section>
  <Section position="8" start_page="255" end_page="257" type="metho">
    <SectionTitle>
4 HOW TO REPRESENT THEM
FOR LANGUAGE GENERATION?
</SectionTitle>
    <Paragraph position="0"> Such a wide variety of lexical associations would be difficult to use with any of the existing lexicon formalisms.</Paragraph>
    <Paragraph position="1"> We need a flexible lexicon capable of using single word entries, multiple word entries as well as phrasal templates and a mechanism that would be able to gracefully merge and combine them with other types of constraints.</Paragraph>
    <Paragraph position="2"> The idea of a flexible lexicon is not novel in itself. The lexical representation used in [Jacobs 85] and later refined in [Besemer &amp; Jacobs 87] could also represent a wide range of expressions. However, in this language, collocational, syntactic and selectional constraints are mixed together into phrasal entries. This makes the lexicon both difficult to use and difficult to compile. In the following we briefly show how FUGs can be successfully used, as they offer a flexible declarative language as well as a powerful mechanism for sentence generation.</Paragraph>
    <Paragraph position="3"> We have implemented a first version of Cook, a surface generator that uses a flexible lexicon for expressing co-occurrence constraints. Cook uses FUF [Elhadad 90], an extended implementation of FUGs, to uniformly represent the lexicon and the syntax as originally suggested by Halliday [Halliday 66]. Generating a sentence is equivalent to unifying a semantic structure (Logical Form) with the grammar. The grammar we use is divided into three zones, the &quot;sentential,&quot; the &quot;lexical&quot; and the &quot;syntactic&quot; zone. Each zone contains constraints pertaining to a given domain and the input logical form is unified in turn with the three zones. As it is, full backtracking across the three zones is allowed.</Paragraph>
    <Paragraph position="4"> * The sentential zone contains the phrasal templates against which the logical form is unified first. A sentential entry is a whole sentence that should be used in a given context. This context is specified by subparts of the logical form given as input. When there is a match at this point, unification succeeds and generation is reduced to simple template filling.</Paragraph>
    <Paragraph position="5"> * The lexical zone contains the information used to lexicalize the input. It contains collocational information along with the semantic context in which to use it. This zone contains predicative and open compound collocations. Its role is to trigger phrases or words in the presence of other words or phrases.</Paragraph>
    <Paragraph position="6"> Figure 5 is a portion of the lexical grammar used in Cook. It illustrates the choice of the verb to be used when &quot;advancers&quot; is the subject. (See below for more detail.)</Paragraph>
    <Paragraph position="7"> * The syntactic zone contains the syntactic grammar.</Paragraph>
    <Paragraph position="8"> It is used last as it is the part of the grammar ensuring the correctness of the produced sentences.</Paragraph>
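The order in which the zones are consulted can be sketched as follows. A flat dictionary comparison stands in for unification, a single fallback function stands in for the lexical and syntactic zones, and there is no backtracking, so this is a deliberately simplified picture and not FUF or Cook. The feature names (predicate, direction, change, close), the template entry and the function names are all hypothetical.

```python
def matches(context, logical_form):
    """True when every feature required by the context has the same value in the input."""
    return all(logical_form.get(key) == value for key, value in context.items())

def generate(logical_form, sentential_zone, fallback):
    """Try the phrasal templates first; otherwise defer to the later zones."""
    for entry in sentential_zone:
        if matches(entry["context"], logical_form):
            # Generation reduces to template filling.
            return entry["template"].format(**logical_form)
    return fallback(logical_form)

# Hypothetical sentential entry built around one retrieved phrasal template.
sentential_zone = [
    {
        "context": {"predicate": "index-move", "direction": "rose"},
        "template": "The NYSE's composite index of all its listed "
                    "common stocks rose {change} to {close}.",
    },
]

def later_zones(logical_form):
    """Placeholder for lexicalization followed by the syntactic grammar."""
    return "(lexical and syntactic zones would build this sentence)"

sentence = generate(
    {"predicate": "index-move", "direction": "rose", "change": "1.04", "close": "167.08"},
    sentential_zone,
    later_zones,
)
print(sentence)
```

An input whose features match no sentential entry falls through to the placeholder, which corresponds to the case where no phrasal template expresses the semantics of the logical form.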
    <Paragraph position="9"> An example input logical form is given in Figure 4. In this example, the logical form represents the fact that on the New York Stock Exchange, the advancing issues (semantic representation or sem-R: c:winners) were ahead (predicate c:lead) of the losing ones (sem-R: c:losers) and that there were 3 times more winning issues than losing ones (ratio). In addition, it says that this ratio is of degree 2. A degree of 1 is considered a slim lead whereas a degree of 5 is a commanding margin. When unified with the grammar, this logical form produces the sentences given in Figure 6.</Paragraph>
    <Paragraph position="10"> As an example of how Cook uses and merges co-occurrence information with other kinds of knowledge, consider Figure 5. The figure is an edited portion of the lexical zone. It only includes the parts that are relevant to the choice of the verb when &quot;advancers&quot; is the subject. The lex and sem-R attributes specify the lexeme we are considering (&quot;advancers&quot;) and its semantic representation (c:winners).</Paragraph>
    <Paragraph position="11"> The semantic context (sem-context), which points to the logical form and its features, will then be used in order to select among the alternative classes of verbs. In the figure we only included two alternatives. Both are relative to the predicate p:lead but they are used with different values of the degree attribute. When the degree is 2, the first alternative, containing the verbs listed under SV-collocates (e.g. &quot;outnumber&quot;), will be selected. When the degree is 4, the second alternative, containing the verbs listed under SV-collocates (e.g. &quot;overpower&quot;), will be selected. All the verbal collocates shown in this figure have actually been retrieved by Xtract at a preceding stage.</Paragraph>
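The selection among alternatives can be illustrated with a small sketch. A plain dictionary stands in for the lexical-zone entry and an equality test stands in for unification with the semantic context. The layout of LEXICAL_ENTRY and the function verb_candidates are assumptions made for the illustration; only the lexeme, its sem-R value, the predicate, the two degree values and the two example verbs come from the description above, and the full verb lists of the figure are not reproduced.

```python
# Simplified stand-in for the lexical-zone entry for "advancers".
LEXICAL_ENTRY = {
    "lex": "advancers",
    "sem-R": "c:winners",
    "alternatives": [
        {"predicate": "p:lead", "degree": 2, "SV-collocates": ["outnumber"]},
        {"predicate": "p:lead", "degree": 4, "SV-collocates": ["overpower"]},
    ],
}

def verb_candidates(entry, sem_context):
    """Return the verbs of the first alternative compatible with the semantic context."""
    for alternative in entry["alternatives"]:
        same_predicate = alternative["predicate"] == sem_context.get("predicate")
        same_degree = alternative["degree"] == sem_context.get("degree")
        if same_predicate and same_degree:
            return list(alternative["SV-collocates"])
    return []  # no alternative applies

slim_lead = verb_candidates(LEXICAL_ENTRY, {"predicate": "p:lead", "degree": 2})
strong_lead = verb_candidates(LEXICAL_ENTRY, {"predicate": "p:lead", "degree": 4})
print(slim_lead, strong_lead)
```

In this simplified form a degree with no matching alternative returns an empty list, whereas a unification-based grammar would fail at that point and backtrack.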
    <Paragraph position="12"> The unification of the logical form of Figure 4 with the lexical grammar and then with the syntactic grammar will ultimately produce the sentences shown in Figure 6 among others. In this example, the sentential zone was not used since no phrasal template expresses its semantics. The verbs selected are all listed under the SV-collocates of the first alternative in Figure 5.</Paragraph>
    <Paragraph position="13"> We have been able to use Cook to generate several sentences in the domain of stock market reports using this method. However, this is still ongoing research and the scope of the system is currently limited. We are working on extending Cook's lexicon as well as on developing extensions that will allow flexible interaction among collocations.</Paragraph>
  </Section>
</Paper>