<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1036">
  <Title>FROM N-GRAMS TO COLLOCATIONS: AN EVALUATION OF XTRACT</Title>
  <Section position="4" start_page="0" end_page="279" type="metho">
    <SectionTitle>
2 FIRST 2 STAGES OF XTRACT,
PRODUCING N-GRAMS
</SectionTitle>
    <Paragraph position="0"> In afirst stage, Xtract uses statistical techniques to retrieve pairs of words (or bigrams) whose common appearances within a single sentence are correlated in the corpus. A bigram is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. Some bigrams produced by the first stage of Xtract are given in Table 1: the bigrams all contain the word &amp;quot;takeover&amp;quot; and an adjective. In the table, the distance parameter indicates the usual distance between the two words. For example, distance = 1 indicates that the two words are frequently adjacent in the corpus.</Paragraph>
    <Paragraph position="1"> In a second stage, Xtract uses the output bi-grams to produce collocations involving more than two words (or n-grams). It examines all the sentences containing the bigram and analyzes the statistical distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given  threshold. For example, the bigram &amp;quot;average-industrial&amp;quot; produces the n-gram &amp;quot;the Dow Jones industrial average&amp;quot; since the words are always used within this compound in the training corpus. Example. outputs of the second stage of Xtraet are given in Figure 1. In the figure, the numbers on the left indicate the frequency of the n-grams in the corpus, NN indicates that. a noun is expected at this position, AT indicates that an article is expected, NP stands for a proper noun and VBD stands for a verb in the past tense. See \[Smadja and McKeown, 1990\] and \[Smadja, 1991\] for more details on these two stages.</Paragraph>
  </Section>
  <Section position="5" start_page="279" end_page="279" type="metho">
    <SectionTitle>
3 STAGE THREE: SYNTACTICALLY
LABELING COLLOCATIONS
</SectionTitle>
    <Paragraph position="0"> In the past, Debili \[Debili, 1982\] parsed corpora of French texts to identify non-ambiguous predicate argument relations. He then used these relations for disambiguation in parsing. Since then, the advent of robust parsers such as Cass \[Abney, 1990\], Fidditeh \[Itindle, 1983\] has made it possible to process large amounts of text with good performance. This enabled Itindle and Rooth \[Hindle and Rooth, 1990\], to improve Debili's work by using bigram statistics to enhance the task of prepositional phrase attachment. Combining statistical and parsing methods has also been done by Church and his colleagues. In \[Church et al., 1989\] and \[Church'et ai., 1991\] they consider predicate argument relations in the form of questions such as What does a boat typically do? They are preprocessing a corpus with the Fiddlteh parser in order to statistically analyze the distribution of the predicates used with a given argument such as &amp;quot;boat.&amp;quot; Our goal is different, since we analyze a set of collocations automatically produced by Xtract to either enrich them with syntactic information or reject them.</Paragraph>
    <Paragraph position="1"> For example, if, bigram collocation produced by Xtract involves a noun and a verb, the role of Stage 3 of Xtract is to determine whether it is a subject-verb or a verb-object collocation. If no such relation can be identified, then the collocation is rejected. This section presents the algorithm for Xtract Stage 3 in some detail. For illustrative purposes we use the example words takeover and thwart with a distance of 2.</Paragraph>
    <Section position="1" start_page="279" end_page="279" type="sub_section">
      <SectionTitle>
3.1 DESCRIPTION OF THE ALGORITHM
</SectionTitle>
      <Paragraph position="0"> Input: A bigram with some distance information indicating the most probable distance between the two words. For example, takeover and thwart with a distance of 2.</Paragraph>
      <Paragraph position="1"> Output/Goah Either a syntactic label for the bigram or a rejection. In the case of takeover and thwart the collocation is accepted and its produced label is VO for verb-object.</Paragraph>
      <Paragraph position="2"> The algorithm works in the following 3 steps:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="279" end_page="280" type="metho">
    <SectionTitle>
STEP 1: PRODUCING THE CONCORDANCES
</SectionTitle>
    <Paragraph position="0"> All the sentences in the corpus that contain the two words in this given position are produced. This is done with a concord,acing program which is part of Xtraet (see \[Smadja, 1991\]). The sentences are labeled with part of speech information by preprocessing the corpus with an automatic stochastic tagger. 1  Each sentence is then processed by Cass, a bottom-up incremental parser \[Abney, 1990\]. 2 Cass takes input sentences labeled with part of speech and attempts to identify syntactic structure. One of Cass modules identifies predicate argument relations. We use this module to produce binary syntactic relations (or labels) such as &amp;quot;verb-object&amp;quot; (VO), %erb-subject&amp;quot; (VS), &amp;quot;noun-adjective&amp;quot; (N J), and &amp;quot;noun-noun&amp;quot; ( N N ). Consider Sentence (1) below and all the labels as produced  by Cass on it.</Paragraph>
    <Paragraph position="1"> (1) &amp;quot;Under the recapitalization plan it proposed to thwart the takeover.&amp;quot; label bigrarn SV it proposed</Paragraph>
    <Paragraph position="3"> For each sentence in the concordance set, from the output of Cass, Xtract determines the syntactic relation of the two words among VO, SV, N J, NN and assigns this label to the sentence. If no such relation is observed, Xtract associates the label U (for undefined) to the sentence. We note label\[ia~ the label associated</Paragraph>
  </Section>
  <Section position="7" start_page="280" end_page="282" type="metho">
    <SectionTitle>
STEP 3: LABELING THE COLLOCATION
</SectionTitle>
    <Paragraph position="0"> This last step consists of deciding on a label for the bigram from the set of label\[i~'.s. For this, we count the frequency of each label for the bigram and perform a statistical analysis of this distribution. A collocation is accepted if the two seed words are consistently used with the same syntactic relation. More precisely, the collocation is accepted if and only if there is a label 12 ~: U satisfying the following inequation: \[probability(labeliid \] = PS)&gt; T I in which T is a given threshold to be determined by the experimenter. A collocation is thus rejected if no valid label satisfies the inequation or if U satisfies it. Figure 2 lists some accepted collocations in the format produced by Xtract with their syntactic labels.</Paragraph>
    <Paragraph position="1"> For these examples, the threshold T was set to 80%.</Paragraph>
    <Paragraph position="2"> For each collocation, the first line is the output of the first stage of Xtract. It is the seed bigram with the distance between the two words. The second line is the output of the second stage of Xtract, it is a multiple word collocation (or n-gram). The numbers on the left indicate the frequency of occurrence of the n-gram in the corpus. The third line indicates the syntactic label as determined by the third stage of Xtract. Finally, the last lines simply list an example sentence and the position of the collocation in the sentence.</Paragraph>
    <Paragraph position="3"> Such collocations can then be used for various purposes including lexicography, spelling correction, speech recognition and language generation. Ill \[Smadja and McKeown, 1990\] and \[Smadja, 1991\] we describe how they are used to build a lexicon for language generation in the domain of stock market reports.</Paragraph>
    <Paragraph position="4"> The third stage of Xtract can thus be considered as a retrieval system which retrieves valid collocations from a set of candidates. This section describes an evaluation experiment of the third stage of Xtract as a retrieval system. Evaluation of retrieval systems is usually done with the help of two parameters: precision and recall \[Salton, 1989\]. Precision of a retrieval system is defined as the ratio of retrieved valid elements divided by the total number of retrieved elements \[Salton, 1989\]. It measures the quality of the retrieved material. Recall is defined as the ratio of retrieved valid elements divided by the total number of valid elements. It measures the effectiveness of the system. This section presents an evaluation of the retrieval performance of the third stage of Xtract.</Paragraph>
    <Section position="1" start_page="280" end_page="282" type="sub_section">
      <SectionTitle>
4.1 THE EVALUATION EXPERIMENT
</SectionTitle>
      <Paragraph position="0"> Deciding whether a given word combination is a valid or invahd collocation is actually a difficult task that is best done by a lexicographer. Jeffery Triggs is a lexicographer working for Oxford English Dictionary (OED) coordinating the North American Readers program of OED at Bell Communication Research. Jeffery Triggs agreed to manually go over several thousands collocations, a We randomly selected a subset of about 4,000 collocations that contained the information compiled by Xtract after the first 2 stages. This data set was then the subject of the following experiment.</Paragraph>
      <Paragraph position="1"> We gave the 4,000 collocations to evaluate to the lexicographer, asking him to select the ones that he 3I am grateful to Jeffery whose professionalism and kindness helped me understand some of the difficulty of lexicography. Without him this evaluation would not have been possible.</Paragraph>
      <Paragraph position="3"> would consider for a domain specific dictionary and to cross out the others. The lexicographer came up with three simple tags, YY, Y and N. Both Y and YY are good collocations, and N are bad collocations. The difference between YY and Y is that Y collocations are of better quality than YY collocations. YY collocations are often too specific to be included in a dictionary, or some words are missing, etc. After Stage 2, about 20% of the collocations are Y, about 20% are YY, and about 60% are N. This told us that the precision of Xtract at Stage 2 was only about 40 %.</Paragraph>
      <Paragraph position="4"> Although this would seem like a poor precision, one should compare it with the much lower rates currently in practice in lexicography. For the OED, for example, the first stage roughly consists of reading numerous documents to identify new or interesting expressions. This task is performed by professional readers. For the OED, the readers for the American program alone produce some 10,000 expressions a month. These lists are then sent off to the dictionary and go through several rounds of careful analysis before actually being submitted to the dictionary. The ratio of proposed candidates to good candidates is usually low. For example, out of the 10,000 expressions proposed each month, less than 400 are serious candidate for the OED, which represents a current rate of 4%. Automatically producing lists of candidate expressions could actually be of great help to lexicographers and even a precision of 40% would be helpful. Such lexicographic tools could, for example, help readers retrieve sublanguage specific expressions by providing them with lists of candidate collocations. The lexicographer then manually examines the list to remove the irrelevant data. Even low precision is useful for lexicographers as manual filtering is much faster than manual scanning of the documents \[Marcus, 1990\]. Such techniques are not able to replace readers though, as they are not designed to identify low frequency expressions, whereas a human reader immediately identifies interesting expressions with as few as one occurrence.</Paragraph>
      <Paragraph position="5"> The second stage of this experiment was to use Xtract Stage 3 to filter out and label the sample set of collocations. As described in Section 3, there are several valid labels (VO, VS, NN, etc.). In this experiment, we grouped them under a single label: T. There is only one non-valid label: U (for unlabeled}. A T collocation is thus accepted by Xtract Stage 3, and a U collocation is rejected. The results of the use of Stage 3 on the sample set of collocations are similar to the manual evaluation in terms of numbers: about 40% of the collocations were labeled (T) by Xtract Stage 3, and about 60% were rejected (U).</Paragraph>
      <Paragraph position="6"> Figure 3 shows the overlap of the classifications made by Xtract and the lexicographer. In the figure, the first diagram on the left represents the breakdown in T and U of each of the manual categories (Y - YY and N). The diagram on the right represents the breakdown in Y - YY and N of the the T and U categories. For example, the first column of the diagram on the left represents the application of Xtract Stage 3 on the YY collocations. It shows that 94% of the collocations accepted by the lexicographer were also accepted by Xtract. In other words, this means that the recall ofthe third stage of Xtract is 94%. The first column of the diagram on the right represents the lexicographic evaluation of the collocations automatically accepted by Xtract. It shows that about 80% of the T collocations were accepted by the lexicographer and that about 20% were rejected. This shows that precision was raised from 40% to 80% with the addition of Xtract Stage 3. In summary, these experiments allowed us to evaluate Stage 3 as a retrieval system. The results are:</Paragraph>
      <Paragraph position="8"/>
    </Section>
  </Section>
  <Section position="8" start_page="282" end_page="282" type="metho">
    <SectionTitle>
5 SUMMARY AND
CONTRIBUTIONS
</SectionTitle>
    <Paragraph position="0"> In this paper, we described a new set of techniques for syntactically filtering and labeling collocations. Using such techniques for post processing the set of collocations produced by Xtract has two major results. First, it adds syntax to the collocations which is necessary for computational use. Second, it provides considerable improvement to the quality of the retrieved collocations as the precision of Xtract is raised from 40% to 80% with a recall of 94%.</Paragraph>
    <Paragraph position="1"> By combining statistical techniques with a sophisticated robust parser we have been able to design and implement some original techniques for the automatic extraction of collocations. Results so far are very encouraging and they indicate that more efforts should be made at combining statistical techniques with more symbolic ones.</Paragraph>
  </Section>
class="xml-element"></Paper>