File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/93/j93-1007_metho.xml

Size: 51,161 bytes

Last Modified: 2025-10-06 14:13:23

<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-1007">
  <Title>Retrieving Collocations from Text: Xtract</Title>
  <Section position="6" start_page="149" end_page="154" type="metho">
    <SectionTitle>
5. Xtract: Introduction
</SectionTitle>
    <Paragraph position="0"> Xtract consists of a set of tools to locate words in context and make statistical observation to identify collocations. In the upgraded version we describe here, Xtract has been extended and refined. More information is computed and an effort has been made to extract more functional information. Xtract now works in three stages.</Paragraph>
    <Paragraph position="1"> The three-stage analysis is described in Sections 6, 7, and 8. In the first stage, described in Section 6, Xtract uses straight statistical measures to retrieve from a corpus pairwise lexical relations whose common appearance within a single sentence are correlated. A pair (or bigram) is retrieved if its frequency of occurrence is above a certain threshold and if the words are used in relatively rigid ways. The output of stage one is then passed to both the second and third stage in parallel. In the second</Paragraph>
    <Section position="1" start_page="150" end_page="150" type="sub_section">
      <Paragraph position="0"> stage, described in Section 7, Xtract uses the output bigrams to produce collocations involving more than two words (or n-grams). It analyzes all sentences containing the bigram and the distribution of words and parts of speech for each position around the pair. It retains words (or parts of speech) occupying a position with probability greater than a given threshold. For example, the bigram &amp;quot;average-industrial&amp;quot; produces the n-gram &amp;quot;the Dow Jones industrial average,&amp;quot; since the words are always used within rigid noun phrases in the training corpus. In the third stage, described in Section 8, Xtract adds syntactic information to collocations retrieved at the first stage and filters out inappropriate ones. For example, if a bigram involves a noun and a verb, this stage identifies it either as a subject-verb or as a verb-object collocation. If no such consistent relation is observed, then the collocation is rejected.</Paragraph>
      <Paragraph position="1"> 6. Xtract Stage One: Extracting Significant Bigrams According to Cruse's definition (Cruse 1986), a syntagmatic lexical relation consists of a pair of words whose common appearances within a single phrase structure are correlated. In other words, those two words appear together within a single syntactic construct more often than expected by chance. The first stage of Xtract attempts to identify such pairwise lexical relations and produce statistical information on pairs of words involved together in the corpus.</Paragraph>
      <Paragraph position="2"> Ideally, in order to identify lexical relations in a corpus one would need to first parse it to verify that the words are used in a single phrase structure. However, in practice, free-style texts contain a great deal of nonstandard features over which automatic parsers would fail. 6 Fortunately, there is strong lexicographic evidence that most syntagmatic lexical relations relate words separated by at most five other words (Martin, A1, and Van Sterkenburg 1983). In other words, most of the lexical relations involving a word w can be retrieved by examining the neighborhood of w, wherever it occurs, within a span of five (-5 and +5 around w) words. 7 In the work presented here, we use this simplification and consider that two words co-occur if they are in a single sentence and if there are fewer than five words between them.</Paragraph>
      <Paragraph position="3"> In this first stage, we thus use only statistical methods to identify relevant pairs of words. These techniques are based on the assumptions that if two words are involved in a collocation then: * the words must appear together significantly more often than expected by chance.</Paragraph>
      <Paragraph position="4"> * because of syntactic constraints the words should appear in a relatively rigid way. 8 These two assumptions are used to analyze the word distributions, and we base our filtering techniques on them.</Paragraph>
    </Section>
    <Section position="2" start_page="150" end_page="154" type="sub_section">
      <SectionTitle>
6.1 Presentation of the Method
</SectionTitle>
      <Paragraph position="0"> In this stage as well as in the two others, we often need part-of-speech information for several purposes. Stochastic part-of-speech taggers such as those in Church (1988) and  Computational Linguistics Volume 19, Number 1 Garside and Leech (1987) have been shown to reach 95-99% performance on free-style text. We preprocessed the corpus with a stochastic part-of-speech tagger developed at Bell Laboratories by Ken Church (Church 1988). 9 In the rest of this section, we describe the algorithm used for the first stage of Xtract in some detail. We assume that the corpus is preprocessed by a part of speech tagger and we note wi a collocate of w if the two words appear in a common sentence within a distance of 5 words.</Paragraph>
      <Paragraph position="1"> Step 1.1: Producing Concordances Input: The tagged corpus, a given word w.</Paragraph>
      <Paragraph position="2"> Output: All the sentences containing w.</Paragraph>
      <Paragraph position="3"> Description: This actually encompasses the task of identifying sentence boundaries, and the task of selecting sentences containing w. The first task is not simple and is still an open problem. It is not enough to look for a period followed by a blank space as, for example, abbreviations and acronyms such as S.B.F., U.S.A., and A.T.M. often pose a problem. The basic algorithm for isolating sentences is described and implemented by a finite-state recognizer. Our implementation could easily be improved in many ways. For example, it performs poorly on acronyms and often considers them as end of sentences; giving it a list of currently used acronyms such as N.B.A., E.I.K., etc., would significantly improve its performance.</Paragraph>
      <Paragraph position="4"> Step 1.2: Compile and Sort Input: Output of Step 1.1, i.e., a set of tagged sentences containing w.</Paragraph>
      <Paragraph position="5"> Output: A list of words wi with frequency information on how w and wi co-occur. This includes the raw frequency as well as the breakdown into frequencies for each possible position. See Table 2 for example outputs.</Paragraph>
      <Paragraph position="6"> Description: For each input sentence containing w, we make a note of its collocates and store them along with their position relative to w, their part of speech, and their frequency of appearance. More precisely, for each prospective lexical relation, or for each potential collocate wi, we maintain a data structure containing this information. The data structure is shown in Figure 5. It contains freqi, the frequency of appearance of wi with w so far in the corpus, PP, the part of speech of wi, and p~, (-5 _&lt; j &lt; 5, j ~ 0), the frequency of appearance of wi with w such that they are j words apart. The p~s represent the histogram of the frequency of appearances of w and wi in given positions. This histogram will be used in later stages.</Paragraph>
      <Paragraph position="7"> As an example, if sentence (9) is the current input to step 1.2 and w = takeover, then, the prospective lexical relations identified in sentence (9) are as shown in Table 3. 9. &amp;quot;The pill would make a takeover attempt more expensive by allowing the retailer's shareholders to ...&amp;quot; In Table 3, distance is the distance between &amp;quot;takeover&amp;quot; and wi, and PP is the part of speech of wi. The closed class words are not considered at this stage and the other 9 We are grateful to Ken Church and to Bell Laboratories for providing us with this tool.</Paragraph>
      <Paragraph position="9"> Figure 5 Data structure maintained at stage one by Xtract. words, such as &amp;quot;shareholders,&amp;quot; are rejected because they are more than five words away from &amp;quot;takeover.&amp;quot; For each of the above word pairs, we maintain the associated data structure as indicated in Figure 5. For takeover pill, for example, we would increment freqpill, and the p4 column in the histogram. Table 2 shows the output for the adjective collocates of the word &amp;quot;takeover.&amp;quot; Step 1.3: Analyze Input: Output of Step 1.2, i.e., a list of words wi with information on how often and how w and wi co-occur. See Table 2 for an example input.</Paragraph>
      <Paragraph position="10"> Output: Significant word pairs, along with some statistical information describing how strongly the words are connected and how rigidly they are used together. A separate (but similar) statistical analysis is done for each syntactic category of collocates. See Table 4 for an example output.</Paragraph>
      <Paragraph position="11"> Description: At this step, the statistical distribution of the collocates of w is analyzed, and the interesting word pairs are automatically selected. If part of speech information is available, a separate analysis is made depending on the part of speech of the collocates. This balances the fact that verbs, adjectives, and nouns are simply not equally frequent.</Paragraph>
      <Paragraph position="12"> For each word w, we first analyze the distribution of the frequencies freqi of its collocates wi, and then compute its average frequency f and standard deviation cr around f. We then replace freqi by its associated z-score ki. ki is called the strength of the word pair in Figure 4; it represents the number of standard deviation above the</Paragraph>
      <Paragraph position="14"> The collocates of &amp;quot;takeover&amp;quot; as retrieved from sentence (9).</Paragraph>
      <Paragraph position="15"> w wi distance PP takeover pill 4 N takeover make 2 V takeover attempt -1 N takeover expensive -3 J takeover allowing -5 V average of the frequency of the word pair w and wi and is defined as:</Paragraph>
      <Paragraph position="17"> Then, we analyze the distribution of the p)s and produce their average \]~i and variance Ui around \]~i. In Figure 4 spread represents Ui on a scale of 1 to 100. Ui characterizes the shape of the ~ histogram. If Ui is small, then the histogram will tend to be flat,</Paragraph>
    </Section>
    <Section position="3" start_page="154" end_page="154" type="sub_section">
      <Paragraph position="0"> which means that wi can be used equivalently in almost any position around w. In contrast, if Ui is large, then the histogram will tend to have peaks, which means that wi can only be used in one (or several) specific position around w. Ui is defined by:</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="154" end_page="163" type="metho">
    <SectionTitle>
U_i = (1/10) Σ_j (p_i^j - p̄_i)^2    (1b)
</SectionTitle>
    <Paragraph position="0"> These analyses are then used to sort out the retrieved data. First, using (la), collocates with strength smaller than a given threshold/co are eliminated. Then, using (lb), we filter out the collocates having a variance Ui smaller than a given threshold U0.</Paragraph>
    <Paragraph position="1"> Finally, we keep the interesting collocates by pulling out the peaks of the p'j distributions. These peaks correspond to the js such that the z-score of p~j is bigger than a given threshold kl. These thresholds have to be determined by the experimenter and are dependent on the use of the retrieved collocations. As described in Smadja (1991), for language generation we found that (k0, kl, U0) = (1, 1, 10) gave good results, but for other tasks different thresholds might be preferable. In general, the lower the threshold the more data are accepted, the higher the recall, and the lower the precision of the results. Section 10 describes an evaluation of the results produced with the above thresholds.</Paragraph>
    <Paragraph position="2"> More formally, a peak, or lexical relation containing w, at this point is defined as a tuple (wi, distance, strength, spread,j) verifying the following set of inequalities:  (c) strength = #eq~-f &gt; ko (C1) } spread &gt;_ Uo (C2)</Paragraph>
    <Paragraph position="4"> Some example results are given in Table 4.</Paragraph>
    <Paragraph position="5"> As shown in Smadja (1991), the whole first stage of Xtract as described above can be performed in O(S log S) time, in which S is the size of the corpus. The third step of counting frequencies and maintaining the data structure dominates the whole process and as pointed out by Ken Church (personal communication), it can be reduced to a sorting problem.</Paragraph>
    <Section position="1" start_page="154" end_page="156" type="sub_section">
      <SectionTitle>
6.2 What Exactly Is Filtered Out?
</SectionTitle>
      <Paragraph position="0"> The inequality set (C) is used to filter out irrelevant data, that is pairs of words supposedly not used consistently within a single syntactic structure. This section discusses the importance of each inequality in (C) on the filtering process.</Paragraph>
      <Paragraph position="1"> strength - freq -f &gt;_ ko (C1) (Y Condition (C1) helps eliminate the collocates that are not frequent enough. This condition specifies that the frequency of appearance of wi in the neighborhood of w must be at least one standard deviation above the average. In most statistical distributions, this thresholding eliminates the vast majority of the lexical relations. For example, for w = &amp;quot;takeover,&amp;quot; among the 3385 possible collocates only 167 were selected, which gives a proportion of 95% rejected. In the case of the standard normal distribution, this would reject some 68% of the cases. This indicates that the actual distribution of the  largest takeover 2 .82 60 collocates of &amp;quot;takeover&amp;quot; has a large kurtosis. 1deg Among the eliminated collocates were &amp;quot;dormant, dilute, ex., defunct,&amp;quot; which obviously are not typical of a takeover. Although these rejected collocations might be useful for applications such as speech recognition, for example, we do not consider them any further here. We are looking for recurrent combinations and not casual ones.</Paragraph>
      <Paragraph position="2"> spread &gt;&gt;_ Uo (C2) Condition (C2) requires that the histogram of the 10 relative frequencies of appearance of wi within five words of w (or p}s) have at least one spike. If the histogram is flat, it will be rejected by this condition. For example, in Figure 5, the histogram associated with w2 would be rejected, whereas the one associated with Wl or wi would be accepted. In Table 2, the histogram for &amp;quot;takeover-possible&amp;quot; is clearly accepted (there is a spike for p-l), whereas the one for &amp;quot;takeover-federal&amp;quot; is rejected. The assumption here is that, if the two words are repeatedly used together within a single syntactic construct, then they will have a marked pattern of co-appearance, i.e., they will not appear in all the possible positions with an equal probability. This actually eliminates pairs such as &amp;quot;telephone-television,&amp;quot; &amp;quot;bomb-soldier, .... trouble-problem,&amp;quot; &amp;quot;big-small,&amp;quot; and 10 The kurtosis of the distribution of the collocates probably depends on the word, and there is currently no agreement on the type of distribution that would describe them.</Paragraph>
    </Section>
    <Section position="2" start_page="156" end_page="156" type="sub_section">
      <Paragraph position="0"> &amp;quot;doctor-nurse&amp;quot; where the two words co-occur with no real structural consistency. The two words are often used together because they are associated with the same context rather than for pure structural reasons. Many collocations retrieved in Church and Hanks (1989) were of this type, as they retrieved doctors-dentists, doctors-nurses, doctorbills, doctors-hospitals, nurses-doctor, etc., which are not collocations in the sense defined above. Such collocations are not of interest for our purpose, although they could be useful for disambiguation or other semantic purposes. Condition (C2) filters out exactly this type of collocations.</Paragraph>
      <Paragraph position="1"> p~ ~ \]~i q- (kl x V~i) (C3) Condition (C3) pulls out the interesting relative positions of the two words. Conditions (C2) and (C1) eliminate rows in the output of Step 1.2. (See Figure 2). In contrast, Condition (C3) selects columns from the remaining rows. For each pair of words, one or several positions might be favored and thus result in several PI selected. For example, the pair &amp;quot;expensive-takeover&amp;quot; produced two different peaks, one with only one word in between &amp;quot;expensive&amp;quot; and &amp;quot;takeover,&amp;quot; and the other with two words. Example sentences containing the two words in the two possible positions are: &amp;quot; The provision is aimed at making a hostile takeover prohibitively expensive by enabling Borg Warner's stockholders to buy the...&amp;quot; &amp;quot;The pill would make a takeover attempt more expensive by allowing the retailer's shareholders to buy more company stock...&amp;quot; Let us note that this filtering method is an original contribution of our work.</Paragraph>
      <Paragraph position="2"> Other works such as Church and Hanks (1989) simply focus on an evaluation of the correlation of appearance of a pair of words, which is roughly equivalent to condition (C1). (See next section). However, taking note of their pattern of appearance allows us to filter out more irrelevant collocations with (C2) and (C3). This is a very important point that will allow us to filter out many invalid collocations and also produce more functional information at stages 2 and 3. A graphical interpretation of the filtering method used for Xtract is given in Smadja (1991).</Paragraph>
      <Paragraph position="3"> 7. Xtract Stage Two: From 2-Grams to N-Grams The role of the second stage of Xtract is twofold. It produces collocations involving more than two words, and it filters out some pairwise relations. Stage 2 is related to the work of Choueka (1988), and to some extent to what has been done in speech recognition (e.g., Bahl, Jelinek, and Mercer 1983; Merialdo 1987; Ephraim and Rabiner 1990).</Paragraph>
    </Section>
    <Section position="3" start_page="156" end_page="158" type="sub_section">
      <SectionTitle>
7.1 Presentation of the Method
</SectionTitle>
      <Paragraph position="0"> In this second stage, Xtract uses the same components used for the first stage but in a different way. It starts with the pairwise lexical relations produced in stage 1 and produces multiple word collocations, such as rigid noun phrases or phrasal templates, from them. To do this, Xtract studies the lexical relations in context, which is exactly what lexicographers do. For each bigram identified at the previous stage, Xtract examines all instances of appearance of the two words and analyzes the distributions of words and parts of speech in the surrounding positions.</Paragraph>
      <Paragraph position="1"> Input: Output of Stage 1. Similar to Table 4, i.e., a list of bigrams with their statistical information as computed in stage 1.</Paragraph>
      <Paragraph position="2">  Identical to Stage 1, Step 1.1. Given a pair of words w and wi, and an integer specifying the distance of the two words, n This step produces all the sentences containing them in the given position. For example, given the bigram takeover-thwart and the distance 2, this step produces sentences like: &amp;quot;Under the recapitalization plan it proposed to thwart the takeover.&amp;quot; Step 2.2: Compile and Sort Identical to Stage 1, Step 1.2. We compute the frequency of appearance of each of the collocates of w by maintaining a data structure similar to the one given in Figure 5, Step 2.3: Analyze and Filter Input: Output of Step 2.2.</Paragraph>
      <Paragraph position="3"> Output: N-grams such as in Figure 8.</Paragraph>
      <Paragraph position="4"> Discussion: Here, the analyses are simpler than for Stage 1. We are only interested in percentage frequencies and we only compute the moment of order 1 of the frequency distributions.</Paragraph>
      <Paragraph position="5"> Tables produced in Step 2.2 (such as in Figure 5) are used to compute the frequency of appearance of each word in each position around w. For each of the possible relative distances from w, we analyze the distribution of the words and only keep the words occupying the position with a probability greater than a given threshold T. 12 If part of speech information is available, the same analysis is also performed with parts of speech instead of actual words. In short, a word w or a part of speech pos is kept in the final n-gram at position i if and only if it satisfies the following inequation:</Paragraph>
      <Paragraph position="7"> p(e) denotes the probability of event e. Consider the examples given in Figures 6 and 7 that show the concordances (output of step 2.1) for the input pairs: &amp;quot;average-industrial&amp;quot; and &amp;quot;index-composite.&amp;quot; In Figure 6, the same words are always used from position -4 to position 0.</Paragraph>
      <Paragraph position="8"> However, at position +1, the words used are always different. &amp;quot;Dow&amp;quot; is used at position -3 in more than 90% of the cases. It is thus part of the produced rigid noun phrases.</Paragraph>
      <Paragraph position="9"> But &amp;quot;down&amp;quot; is only used a couple of times (out of several hundred) at position +1, 11 The distance is actually optional and can be given in various ways. We can specify the word order, the maximum distance, the exact distance, etc. 12 This threshold must also be determined by the experimenter. In the following we use T = 0.75. As discussed previously, the choice of the threshold is arbitrary, and the general rule is that the lower the threshold, the higher the recall and the lower the precision of the results. The choice of 0.75 is based on the manual observations of several samples and it has effected the overall results, as discussed in Section 10.</Paragraph>
      <Paragraph position="10">  of all its'l'isted common stocks fell 1.76 to 164.13.</Paragraph>
      <Paragraph position="11"> of all its listed common stocks fell 0.98 to 164.91.</Paragraph>
      <Paragraph position="12"> of all its listed common stocks fell 0.96 to 164.93.</Paragraph>
      <Paragraph position="13"> of all its listed common stocks fell 0.91 to 164.98.</Paragraph>
      <Paragraph position="14"> of all its listed common stocks rose 1.04 to 167.08.</Paragraph>
      <Paragraph position="15"> of all its listed common stocks rose 0.76 of all its listed common stocks rose 0.50 to 166.54.</Paragraph>
      <Paragraph position="16"> of all its listed common stocks rose 0.69 to 166.73.</Paragraph>
      <Paragraph position="17"> of all its listed common stocks fell 0.33 to 170.63.</Paragraph>
      <Paragraph position="18"> &amp;quot;the NYSE's composite index of all its listed common stocks  the same. In all the example sentences in which &amp;quot;composite&amp;quot; and &amp;quot;index&amp;quot; are adjacent, the two words are used within a bigger construct of 11 words (also called an 11-gram). However, if we look at position +8 for example, we see that although the words used are different, in all the cases they are verbs. Thus, after the 11-gram we expect to find a verb. In short, Figure 7 helps us produce both the rigid noun phrases &amp;quot;The NYSE's composite index of all its listed common stocks,&amp;quot; as well as the phrasal template &amp;quot;The NYSE's composite index of all its listed common stocks *VERB* *NUMBER* to *NUMBER*.&amp;quot; Figure 8 shows some sample phrasal templates and rigid noun phrases that were produced at this stage. The leftmost column gives the input lexical relations. Some other examples are given in Figure 3.</Paragraph>
    </Section>
    <Section position="4" start_page="158" end_page="162" type="sub_section">
      <SectionTitle>
7.2 Discussion
</SectionTitle>
      <Paragraph position="0"> The role of stage 2 is to filter out many lexical relations and replace them by valid ones. It produces both phrasal templates and rigid noun phrases. For example, associations such as &amp;quot;blue-stocks, &amp;quot; &amp;quot;air-controller,&amp;quot; or &amp;quot;advancing-market&amp;quot; were filtered out  &amp;quot;The NYSE's composite index of all its listed common stocks fell *NUMBER* to *NUMBER*&amp;quot; &amp;quot;the NYSE's composite index of all its listed common stocks rose *NUMBER* to *NUMBER*.&amp;quot; &amp;quot;Five minutes before the close the Dow Jones average of 30 industrials was up/down *NUMBER* to/from *NUMBER*&amp;quot;  Example output collocations of stage two.</Paragraph>
      <Paragraph position="1"> and respectively replaced by: &amp;quot;blue chip stocks,&amp;quot; &amp;quot;air traffic controllers,&amp;quot; and &amp;quot;the broader market in the NYSE advancing issues.&amp;quot; Thus stage 2 produces n-word collocations from two-word associations. Producing n-word collocations has already been done (e.g., Choueka 1988). 13 The general method used by Choueka is the following: for each length n, (1 &lt; n &lt; 6), produce all the word sequences of length n and sort them by frequency. On a 12 million-word corpus, Choueka retrieved 10 collocations of length six, 115 collocations of length five, 1,024 collocations of length four, 4,777 of length three, and some 15,973 of length two. The threshold imposed was 14. The method we presented in this section has three main advantages when compared to a straight n-gram method like Choueka's.</Paragraph>
      <Paragraph position="2">  1. Stage 2 retrieves phrasal templates in addition to simple rigid noun phrases. Using part of speech information, we allow categories and words in our templates, thus retrieving a more flexible type of collocation. It is not clear how simple n-gram techniques could be adapted to obtain the same results.</Paragraph>
      <Paragraph position="3"> 2. Stage 2 gets rid of subsumed m-grams of a given n-gram (m &lt; n). Since  stage 2 works from bigrams, and produces the biggest n-gram containing it, there is no m-gram (m &lt; n) produced that is subsumed by it. For example, although &amp;quot;shipments of arms to Iran&amp;quot; is a collocation of length five, &amp;quot;arms to Iran&amp;quot; is not an interesting collocation. It is not opaque, and does not constitute a modifier-modified syntactic relation. A straight n-gram method would retrieve both, as well as many other subsumed m-grams, such as &amp;quot;of arms to Iran.&amp;quot; A sophisticated filtering method would then be necessary to eliminate the invalid ones (See Choueka 1988). Our method avoids this problem and only produces the biggest possible n-gram, namely: &amp;quot;shipment of arms to Iran.&amp;quot;  3. Stage 2 is a simple way of compiling n-gram data. Retrieving an 11-gram by the methods used in speech, for example, would require a great deal 13 Similar approaches have been done for several applications such as Bahl, Jelinek, and Mercer (1983) and Cerf-Danon et al. (1989) for speech recognition, and Morris and Cherry (1975), Angell (1983), Kukich (1990), and Mays, Damerau, and Mercer (1990) for spelling correction (with letters instead of words).  of CPU time and space. In a 10 million-word corpus, with about 60,000 different words, there are about 3.6 x 109 possible bigrams, 2.16 x 1014 trigrams, and 3 x 1033 7-grams. This rapidly gets out of hand. Choueka, for example, had to stop at length six. In contrast, the rigid noun phrases we retrieve are of arbitrary length and are retrieved very easily and in one pass. The method we use starts from bigrams and produces the biggest possible subsuming n-gram. It is based on the fact that if an n-gram is statistically significant, then the included bigrams must also be significant. For example, to identify &amp;quot;The Dow Jones average of 30 industrials,&amp;quot; a traditional n-gram method would compare it to the other 7-grams and determine that it is significant. In contrast, we start from an included significant bigram (for example, &amp;quot;Dow-30&amp;quot;) and we directly retrieve the surrounding n-grams. 14 8. Xtract Stage Three: Adding Syntax to the Collocations The collocations as produced in the previous stages are already useful for lexicography. For computational use, however, functional information is needed. For example, the collocations should have some syntactic properties. It is not enough to say that &amp;quot;make&amp;quot; goes with &amp;quot;decision&amp;quot;; we need to know that &amp;quot;decision&amp;quot; is used as the direct object of the verb.</Paragraph>
      <Paragraph position="4"> The advent of robust parsers such as Cass (Abney 1990) and Fidditch (Hindle 1983) has made it possible to process large text corpora with good performance and thus combine statistical techniques with more symbolic analysis. In the past, some similar attempts have been done. Debili (1982) parsed corpora of French texts to identify nonambiguous predicate argument relations. He then used these relations for disambiguation. Hindle and Rooth (1990) later refined this approach by using bigram statistics to enhance the task of prepositional phrase attachment. Church et al. (1989, 1991) have yet another approach; they consider questions such as what does a boat typically do? They are preprocessing a corpus with the Fidditch parser (Hindle 1983) in order to produce a list of verbs that are most likely associated with the subject &amp;quot;boat.&amp;quot; Our goal here is different, as we analyze collocations automatically produced by the first stage of Xtract to either add syntactic information or reject them. For example, if a lexical relation identified at stage 1 involves a noun and a verb, the role of stage 3 is to determine whether it is a subject-verb or a verb-object collocation. If no such consistent relation is observed, then the collocation is rejected. Stage 3 uses a parser but it does not require a complete parse tree. Given a number of sentences, Xtract only needs to know pairwise syntactic (modifier-modified) relations. The parser we used in the experiment reported here is Cass (Abney 1989, 1990), a bottom-up incremental parser. Cass 15 takes input sentences labeled with part of speech and attempts to identify syntactic structure. One of the subtasks performed by Cass is to identify predicate argument relations, and this is the task we are interested in here. Stage 3 works in the following three steps.</Paragraph>
      <Paragraph position="5"> 14 Actually, this 7-gram could be retrieved several times, one for each pair of open class word it contains. But a simple sorting algorithm gets rid of such repetitions.</Paragraph>
      <Paragraph position="6"> 15 The parser developed at Bell Communication Research by Steve Abney, Cass stands for Cascaded Analysis of Syntactic Structure. We are grateful to Steve for helping us with the use of Cass and customizing its output for us.</Paragraph>
      <Paragraph position="7">  Identical to what we did at Stage 2, Step 2.1. Given a pair of words w and wi, a distance of the two words (optional), and a tagged corpus, Xtract produces all the (tagged) sentences containing them in the given position specified by the distance. Step 3.2: Parse Input: Output of Step 3.1. A set of tagged sentences each containing both w and wi. Output: For each sentence, a set of syntactic labels such as those shown in Figure 9. Discussion: Cass is called on the concordances. From Cass output, we only retrieve binary syntactic relations (or labels) such as &amp;quot;verb-object&amp;quot; or &amp;quot;verb-subject,&amp;quot; &amp;quot;nounadjective,&amp;quot; and &amp;quot;noun-noun.&amp;quot; To simplify, we abbreviate them respectively: VO, SV, NJ, NN. For sentence (10) below, for example, the labels produced are shown in Figure 9. 10. &amp;quot;Wall Street faced a major test with stock traders returning to action for the first time since last week's epic selloff and investors awaited signs of life from the 5-year-old bull market.&amp;quot; Step 3.3: Label Sentences Input: A set of sentences each associated with a set of labels as shown in Figure 9. Output: Collocations with associated syntactic labels as shown in Figure 10.</Paragraph>
      <Paragraph position="8"> Discussion: For any given sentence containing both w and wi, two cases are possible: either there is a label for the bigram (w, wi), or there is none. For example, for sentence (10), there is a syntactic label for the bigram faced-test, but there is none for the bigram stock-returning. Faced-test enters into a verb object relation, and stock-returning does not enter into any type of relation. If no label is retrieved for the bigram, it means that the parser could not identify a relation between the two words. In this case we introduce a new label: U (for undefined) to label the bigram. At this point, we associate with the sentence the label for the bigram (w, wi). With each of the input sentences, we associate a label for the bigram (w, wi). For example, the label associated with sentence (10) for the bigram faced-test would be VO. A list of labeled sentences for the bigram w = &amp;quot;rose&amp;quot; and wi = &amp;quot;prices&amp;quot; is shown in Figure 10.</Paragraph>
    </Section>
    <Section position="5" start_page="162" end_page="162" type="sub_section">
      <Paragraph position="0"> Some Concordances for (rose, prices) label ... when they rose pork prices 1.1 percent ... VO Analysts said stock prices rose because of a rally in Treasury bonds. SV Bond prices rose because many traders took the report as a signal ... SV Stock prices rose in moderate trading today with little news ... SV Bond prices rose in quiet trading SV Stock prices rose sharply Friday in response to a rally in ... SV ... soft drink prices rose 0.5 percent ... SV Stock prices rose broadly in early trading today as a rising dollar ... SV Figure 10 Producing the &amp;quot;prices \[\] rose,&amp;quot; SV predicative relation at stage 3. Step 3.4: Filter and Label Collocation Input: A set of sentences containing w and wi each associated with a label as shown in Figure 10.</Paragraph>
      <Paragraph position="1"> Output: Labeled collocations as shown in Figure 11.</Paragraph>
      <Paragraph position="2"> Discussion on Step 3.4: At this step, we count the frequencies of each possible label identified for the bigram (w} wi) and perform a statistical analysis of order two for this distribution. We compute the average frequency for the distribution of labels: ~ and the standard deviation crt. We finally apply a filtering method similar to (C2). Let t be a possible label. We keep t if and only if it satisfies inequality (4b) similar to (4a) given before:</Paragraph>
      <Paragraph position="4"> A collocation is thus accepted if and only if it has a label g satisfying inequality (4b), and g # U. Similarly, a collocation is rejected if no label satisfies inequality (4b) or if U satisfies it.</Paragraph>
      <Paragraph position="5"> Figure 10 shows part of the output of Step 3.3 for w = &amp;quot;rose&amp;quot; and wi = &amp;quot;prices.&amp;quot; As shown in the figure, SV labels are a large majority. Thus, we would label the relation price-rose as an SV relation. An example output of this stage is given in Figure 11.</Paragraph>
      <Paragraph position="6"> The bigrams labeled U were rejected at this stage.</Paragraph>
      <Paragraph position="7"> Stage 3 thus produces very useful results. It filters out collocations and rejects more than half of them, thus improving the quality of the results. It also labels the collocations it accepts, thus producing a more functional and usable type of knowledge. For example, if the first two stages of Xtract produce the collocation &amp;quot;make-decision,&amp;quot; the third stage identifies it as a verb-object collocation. If no such relation can be observed, then the collocation is rejected. The produced collocations are not simple word associations but complex syntactic structures. Labeling and filtering are two useful tasks for automatic use of collocations as well as for lexicography. The whole of stage 3 (both as a filter and as a labeler) is an original contribution of our work.</Paragraph>
      <Paragraph position="8"> Retrieving syntactically labeled collocations is a relatively new concern. Moreover, filtering greatly improves the quality of the results. This is also a possible use of the emerging new parsing technology.</Paragraph>
    </Section>
    <Section position="6" start_page="162" end_page="163" type="sub_section">
      <SectionTitle>
8.1 Xtract: The Toolkit
</SectionTitle>
      <Paragraph position="0"> Xtract is actually a library of tools implemented using standard C-Unix libraries. The toolkit has several utilities useful for analyzing corpora. Without making any effort  to make Xtract efficient in terms of computing resources, the first stage as well as the second stage of Xtract only takes a few minutes to run on a ten-megabyte (pre-tagged) corpus. Xtract is currently being used at Columbia University for various lexical tasks. And it has been tested on many corpora, among them: several ten-megabyte corpora of news stories, a corpus, consisting of some twenty megabytes of New York Times articles, which has already been used by Choueka (1988), the Brown corpus (Francis and Ku~era 1982), a corpus of the proceedings of the Canadian Parliament, also called the Hansards corpus, which amounts to several hundred megabytes. We are currently working on packaging Xtract to make it available to the research community. The packaged version will be portable, reusable, and faster than the one we used to write this paper. 16 We evaluate the filtering power of stage 3 in the evaluation section, Section 10.</Paragraph>
      <Paragraph position="1"> Section 9 presents some results that we obtained with the three stages of Xtract.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="163" end_page="168" type="metho">
    <SectionTitle>
9. Some Results
</SectionTitle>
    <Paragraph position="0"> Results obtained from The Jerusalem Post corpus have already been reported (e.g., Smadja 1991). Figure 12 gives some results for the three-stage process of Xtract on a 10 million-word corpus of stock market reports taken from the Associated Press newswire. The collocations are given in the following format. The first line contains the bigrams with the distance, so that &amp;quot;sales fell -1&amp;quot; says that the two words under consideration are &amp;quot;sales&amp;quot; and &amp;quot;fell,&amp;quot; and that the distance we are considering is -1. The first line is thus the output of stage 1. The second line gives the output of stage 2, i.e., the n-grams. For example, &amp;quot;takeover-thwart&amp;quot; is retrieved as &amp;quot;44 ..... to thwart AT takeover NN ....... &amp;quot; AT stands for article, NN stands for nouns, and 44 is the number of times this collocation has been retrieved in the corpus. The third line gives the retrieved tags for this collocation, so that the syntactic relation between &amp;quot;takeover&amp;quot; and &amp;quot;thwart&amp;quot; is an SV relation. And finally, the last line is an example sentence containing the collocation. Output of the type of Figure 12 is automatically produced. This kind of output is about as far as we have gone automatically. Any further analysis and/or use of the collocations would probably require some manual intervention.</Paragraph>
    <Paragraph position="1"> 16 Please contact the author if you are interested in getting a copy of the software.</Paragraph>
    <Paragraph position="2">  now that Robins has agreed it makes sense to sell the company we are finally down to the real questions How much will the company bring in the open market and how much of that amount will the claimants allow to go to shareholders? steps take 1  Some complete output on the stock market corpus.</Paragraph>
    <Paragraph position="3"> For the 10 million-word stock market corpus, there were some 60,000 different word forms. Xtract has been able to retrieve some 15,000 collocations in total. We would like to note, however, that Xtraet has only been effective at retrieving collocations for words appearing at least several dozen times in the corpus. This means that low-frequency words were not productive in terms of collocations. Out of the 60,000 words in the corpus, only 8,000 were repeated more than 50 times. This means that for a target  Computational Linguistics Volume 19, Number 1</Paragraph>
    <Paragraph position="5"> Overlap of the manual and automatic evaluations lexicon of size N = 8,000, one should expect at least as many collocations to be added, and Xtract can help retrieve most of them.</Paragraph>
    <Paragraph position="6"> 10. A Lexicographic Evaluation The third stage of Xtract can thus be considered as a retrieval system that retrieves valid collocations from a set of candidates. This section describes an evaluation experiment of the third stage of Xtract as a retrieval system as well as an evaluation of the overall output of Xtract. Evaluation of retrieval systems is usually done with the help of two parameters: precision and recall (Salton 1989). Precision of a retrieval system is defined as the ratio of retrieved valid elements divided by the total number of retrieved elements (Salton 1989). It measures the quality of the retrieved material. Recall is defined as the ratio of retrieved valid elements divided by the total number of valid elements. It measures the effectiveness of the system. This section presents an evaluation of the retrieval performance of the third stage of Xtract.</Paragraph>
    <Paragraph position="7"> Deciding whether a given word combination is a valid or invalid collocation is actually a difficult task that is best done by a lexicographer. Jeffery Triggs is a lexicographer working for the Oxford English Dictionary (OED) coordinating the North American Readers program of OED at Bell Communication Research. Jeffery Triggs agreed to go over manually several thousands of collocations. 17 In order to have an unbiased experiment we had to be able to evaluate the performance of Xtract against a human expert. We had to have the lexicographer and Xtract perform the same task. To do this in an unbiased way we randomly selected a subset of about 4,000 collocations after the first two stages of Xtract. This set of collocations thus contained some good collocations and some bad ones. This data set was then evaluated by the lexicographer and the third stage of Xtract. This allowed 17 1 am grateful to Jeffery, whose professionalism and kindness helped me understand some of the difficulty of lexicography. Without him this evaluation would not have been possible.</Paragraph>
    <Section position="1" start_page="166" end_page="168" type="sub_section">
      <Paragraph position="0"> us to evaluate the performances of the third stage of Xtract and the overall quality of the total output of Xtract in a single experiment. The experiment was as follows: We gave the 4,000 collocations to evaluate to the lexicographer, asking him to select the ones that he would consider for a domain-specific dictionary and to cross out the others. The lexicographer came up with three simple tags, YY, Y, and N. Both Y and YY include good collocations, and N includes bad collocations. The difference between YY and Y is that Y collocations are of better quality than YY collocations.</Paragraph>
      <Paragraph position="1"> YY collocations are often too specific to be included in a dictionary, or some words are missing, etc. After stage 2, about 20% of the collocations are Y, about 20% are YY, and about 60% are N. This told us that the precision of Xtract at stage 2 was only about 40%.</Paragraph>
      <Paragraph position="2"> Although this would seem like a poor precision, one should compare it with the much lower rates currently in practice in lexicography. For compiling new entries for the OED, for example, the first stage roughly consists of reading numerous documents to identify new or interesting expressions. This task is performed by professional readers. For the OED, the readers for the American program alone produce some 10,000 expressions a month. These lists are then sent off to the dictionary and go through several rounds of careful analysis before actually being submitted to the dictionary.</Paragraph>
      <Paragraph position="3"> The ratio of proposed candidates to good candidates is usually low. For example, out of the 10,000 expressions proposed each month, fewer than 400 are serious candidates for the OED, which represents a current rate of 4%. Automatically producing lists of candidate expressions could actually be of great help to lexicographers, and even a precision of 40% would be helpful. Such lexicographic tools could, for example, help readers retrieve sublanguage-specific expressions by providing them with lists of candidate collocations. The lexicographer then manually examines the list to remove the irrelevant data. Even low precision is useful for lexicographers, as manual filtering is much faster than manual scanning of the documents (Marcus 1990). Such techniques are not able to replace readers, though, as they are not designed to identify low-frequency expressions, whereas a human reader immediately identifies interesting expressions with as few as one occurrence.</Paragraph>
      <Paragraph position="4"> The second stage of this experiment was to use Xtract stage 3 to filter out and label the sample set of collocations. As described in Section 8, there are several valid labels (VO~ VS~ NN, etc.). In this experiment, we grouped them under a single label: T.</Paragraph>
      <Paragraph position="5"> There is only one nonvalid label: U (for unlabeled). A T collocation is thus accepted by Xtract stage 3, and a U collocation is rejected. The results of the use of stage 3 on the sample set of collocations are similar to the manual evaluation in terms of numbers: about 40% of the collocations were labeled (T) by Xtract stage 3, and about 60% were rejected (U).</Paragraph>
      <Paragraph position="6"> Figure 13 shows the overlap of the classifications made by Xtract and the lexicographer. In the figure, the first diagram on the left represents the breakdown in T and U of each of the manual categories (Y-YY and N). The diagram on the right represents the breakdown in Y-YY and N of the T and U categories. For example, the first column of the diagram on the left represents the application of Xtract stage 3 on the YY collocations. It shows that 94% of the collocations accepted by the lexicographer were also accepted by Xtract. In other words, this means that the recall of the third stage of Xtract is 94%. The first column of the diagram on the right represents the lexicographic evaluation of the collocations automatically accepted by Xtract. It shows that about 80% of the T collocations were accepted by the lexicographer and that about 20% were rejected. This shows that precision was raised from 40% to 80% with the addition of Xtract stage 3. In summary, these experiments allowed us to evaluate Stage 3 as a retrieval system. The results are: precision = 80% and recall -- 94%.</Paragraph>
      <Paragraph position="7">  Top associations with &amp;quot;price&amp;quot; in NYT, DJ, and AP.</Paragraph>
      <Paragraph position="8"> 11. Influence of the Corpus on the Results In this section, we discuss the extent to which the results are dependent on the corpus used. To illustrate our purpose here, we are using results collected from three different corpora. The first one, DJ, for Dow Jones, is the corpus we used in this paper; it contains (mostly) stock market stories taken from the Associated Press newswire. DJ contains 8-9 million words. The second corpus, NYT, contains articles published in the New York Times during the years 1987 and 1988. The articles are on various subjects. This is the same corpus that was used by Choueka (1988). NYT contains 12 million words.</Paragraph>
      <Paragraph position="9"> The third corpus, AP, contains stories from the Associated Press newswire on various domains such as weather reports, politics, health, finances, etc. AP is 4 million words. Figure 14 represents the top 10 word associations retrieved by Xtract stage 1 for the three corpora with the word &amp;quot;price.&amp;quot; In this figure, d represents the distance between the two words and w represents the weight associated with the bigram. The weight is a combined index of the statistical distribution as discussed in Section 6, and it evaluates the collocation. There are several differences and similarities among the three columns of the figure in terms of the words retrieved, the order of the words retrieved, and the values of w. We identified two main ways in which the results depend on the corpus.</Paragraph>
      <Paragraph position="10"> We discuss them in turn.</Paragraph>
      <Paragraph position="11"> 11.1 Results Are Dependent on the Size of the Corpus From the different corpora we used, we noticed that our statistical methods were not effective for low-frequency words. More precisely, the statistical methods we use do not seem to be effective on low frequency words (fewer than 100 occurrences). If the word is not frequently used in the corpus or if the corpus is too small, then the distribution of its collocates will not be big enough. For example, from AP, which contains about 1,000 occurrences of the word &amp;quot;rain,&amp;quot; Xtract produced over 170 collocations at stage 1 involving it. In contrast, DJ only contains some 50 occurrences of &amp;quot;rain &amp;quot;is and Xtract could only produce a few collocations with it. Some collocations with &amp;quot;rain&amp;quot; and &amp;quot;hurricane&amp;quot; extracted from AP are listed in Figure 15. Both words are high-frequency words in AP and low-frequency words in DJ.</Paragraph>
      <Paragraph position="12"> 18 The corpus actually contains some stories not related to Wall Street.</Paragraph>
    </Section>
    <Section position="2" start_page="168" end_page="168" type="sub_section">
      <Paragraph position="0"> In short, to build a lexicon for a computational linguistics application in a given domain, one should make sure that the important words in the domain are frequent enough in the corpus. For a subdomain of the stock market describing only the fluctuations of several indexes and some of the major events of the day at Wall Street, a corpus of 10 million words appeared to be sufficient. This 10 million-token corpus contains only 5,000 words each repeated more than 100 times.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>