<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0909">
  <Title>Acquiring Collocations for Lexical Choice between Near-Synonyms</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Differential collocations
</SectionTitle>
    <Paragraph position="0"> For each cluster of near-synonyms, we now have the words that occur in preferred collocations with each near-synonym. We need to check whether these words collocate with the other near-synonyms in the same cluster. For example, if daunting task is a preferred collocation, we check whether daunting collocates with the other near-synonyms of task.</Paragraph>
    <Paragraph position="1"> We use the Web as a corpus for differential collocations. We don't use the BNC corpus to rank less-preferred and anti-collocations, because their absence in BNC may be due to chance. We can assume that the Web (the portion retrieved by search engines) is big enough that a negative result can be trusted.</Paragraph>
    <Paragraph position="2"> We use an interface to AltaVista search engine to count how often a collocation is found. (See Table 2 for an example.5) A low number of co-occurrences indicates a less-preferred collocation. But we also need to consider how frequent the two words in the collocation are. We use the differential t-test to find collocations that best distinguish between two near-synonyms (Church et al., 1991), but we use the Web as a corpus. Here we don't have part-of-speech tags but this is not a problem because in the previous step we selected collocations with the right part-of-speech for the near-synonym. We approximate the number of occurrences of a word on the Web with the number of documents containing the word.</Paragraph>
    <Paragraph position="3"> The t-test can also be used in the hypothesis testing method to rank collocations. It looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample was drawn from a normal distribution with mean u. It measures the difference between observed ( -x) and expected means, scaled by the variance of the data (s2), which in turn is scaled by the sample size (N).</Paragraph>
    <Paragraph position="5"> We are interested in the Differential t-test, which can be used for hypothesis testing of differences. It compares the means of two normal populations:</Paragraph>
    <Paragraph position="7"> Here the null hypothesis is that the average difference is u = 0.Therefore -x u = u = -x1 -x2. In the denominator we add the variances of the two populations. null If the collocations of interest are xw and yw (or similarly wx and wy), then we have the approxima-</Paragraph>
    <Paragraph position="9"> If w is a word that collocates with one of the near-synonyms in a cluster, and x is each of the near5The search was done on 13 March 2002.</Paragraph>
    <Paragraph position="10"> synonyms, we can approximate the mutual information relative to w:</Paragraph>
    <Paragraph position="12"> where P(w) was dropped because it is the same for various x (we cannot compute if we keep it, because we don't know the total number of bigrams on the Web).</Paragraph>
    <Paragraph position="13"> We use this measure to eliminate collocations wrongly selected in step 1. We eliminate those with mutual information lower that a threshold. We describe the way we chose this threshold (Tmi) in section 5.</Paragraph>
    <Paragraph position="14"> We are careful not to consider collocations of a near-synonym with a wrong part-of-speech (our collocations are tagged). But there is also the case when a near-synonym has more than one major sense. In this case we are likely to retrieve collocations for senses other than the one required in the cluster. For example, for the cluster job, task, duty, etc., the collocation import/N duty/N is likely to be for a different sense of duty (the customs sense). Our way of dealing with this is to disambiguate the sense used in each collocations (we assume one sense per collocation), by using a simple Lesk-style method (Lesk, 1986). For each collocation, we retrieve instances in the corpus, and collect the content words surrounding the collocations. This set of words is then intersected with the context of the near-synonym in CTRW (that is the whole entry). If the intersection is not empty, it is likely that the collocation and the entry use the near-synonym in the same sense. If the intersection is empty, we don't keep the collocation.</Paragraph>
    <Paragraph position="15"> In step 3, we group the collocations of each near-synonym with a given collocate in three classes, based on the t-test values of pairwise collocations.</Paragraph>
    <Paragraph position="16"> We compute the t-test between each collocation and the collocation with maximum frequency, and the t-test between each collocation and the collocation with minimum frequency (see Table 2 for an example). Then, we need to determine a set of thresholds that classify the collocations in the three groups: preferred collocations, less preferred collocations, and anti-collocations. The procedure we use in this step is detailed in section 5.</Paragraph>
    <Paragraph position="17">  hits for the collocation daunting x, where x is one of the near-synonyms in the first column. The third column shows the mutual information, the fourth column, the differential t-test between the collocation with maximum frequency (daunting task) and daunting x, and the last column, the t-test between daunting x and the collocation with minimum frequency (daunting hitch).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We obtained 15,813,685 bigrams. From these, 1,350,398 were distinct and occurred at least 4 times.</Paragraph>
    <Paragraph position="1"> We present some of the top-ranked collocations for each measure in the Appendix. We present the rank given by each measure (1 is the highest), the value of the measure, the frequency of the collocation, and the frequencies of the words in the collocation. null We selected collocations for all 914 clusters in CTRW (5419 near-synonyms in total). An example of collocations extracted for the near-synonym task is:  where the numbers are, in order, the rank given by the measure and the value of the measure.</Paragraph>
    <Paragraph position="2"> We filtered out the collocations using MI on the Web (step 2), and then we applied the differential t-test (step 3). Table 2 shows the values of MI between daunting x and x, where x is one of the near-synonyms of task. It also shows t-test val-Near-synonyms daunting particular tough task p p p job ? p p  ues between (some) pairs of collocations. Table 3 presents an example of results for differential collocations, where p marks preferred collocations, ? marks less-preferred collocations, and marks anticollocations. null Before proceeding with step 3, we filtered out the collocations in which the near-synonym is used in a different sense, using the Lesk method explained above. For example, suspended/V duty/N is kept while customs/N duty/N and import/N duty/N are rejected. The disambiguation part of our system was run only for a subset of CTRW, because we have yet to evaluate it. The other parts of our system were run for the whole CTRW. Their evaluation is described in the next section.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> Our evaluation has two purposes: to get a quantitative measure of the quality of our results, and to choose thresholds in a principled way.</Paragraph>
    <Paragraph position="1"> As described in the previous sections, in step 1 we selected potential collocations from BNC (the ones selected by at least two of the five measures).</Paragraph>
    <Paragraph position="2"> Then, we selected collocations for each of the near-synonyms in CTRW (step 2). We need to evaluate the MI filter (step 3), which filters out the bigrams that are not true collocations, based on their mutual information computed on the Web. We also need to evaluate step 4, the three way classification based on the differential t-test on the Web.</Paragraph>
    <Paragraph position="3"> For evaluation purposes we selected three clusters from CTRW, with a total of 24 near-synonyms. For these, we obtained 916 collocations from BNC according to the method described in section 2.</Paragraph>
    <Paragraph position="4"> We had two human judges reviewing these collocations to determine which of them are true collocations and which are not. We presented the collocations to the judges in random order, and each collocation was presented twice. The first judge was consistent (judged a collocation in the same way both times it appeared) in 90.4% of the cases. The second judge was consistent in 88% of the cases. The agreement between the two judges was 67.5% (computed in a strict way, that is we considered agreement only when the two judges had the same opinion including the cases when they were not consistent). The consistency and agreement figures show how difficult the task is for humans.</Paragraph>
    <Paragraph position="5"> We used the data annotated by the two judges to build a standard solution, so we can evaluate the results of our MI filter. In the standard solution a bigram was considered a true collocation if both judges considered it so. We used the standard solution to evaluate the results of the filtering, for various values of the threshold Tmi. That is, if a bigram had the value of MI on the Web lower than a threshold Tmi, it was filtered out. We choose the value of Tmi so that the accuracy of our filtering program is the highest. By accuracy we mean the number of true collocations (as given by the standard solution) identified by our program over the total number of bigrams we used in the evaluation. The best accuracy was 70.7% for Tmi = 0.0017. We used this value of the threshold when running our programs for all CTRW.</Paragraph>
    <Paragraph position="6"> As a result of this first part of the evaluation, we can say that after filtering collocations based on MI on the Web, approximately 70.7% of the remaining bigrams are true collocation. This value is not absolute, because we used a sample of the data for the evaluation. The 70.7% accuracy is much better than a baseline (approximately 50% for random choice).</Paragraph>
    <Paragraph position="7"> Table 4 summarizes our evaluation results.</Paragraph>
    <Paragraph position="8"> Next, we proceeded with evaluating the differential t-test three-way classifier. For each cluster, for each collocation, new collocations were formed from the collocate and all the near-synonyms in the cluster. In order to learn the classifier, and to evaluate its results, we had the two judges manually classify a sample data into preferred collocations, less-preferred collocations, and anti-collocations. We used 2838 collocations obtained for the same three clusters from 401 collocations (out of the initial 916) that remained after filtering. We built a standard solution for this task, based on the classifications of Step Baseline Our system Filter (MI on the Web) 50% 70.7% Dif. t-test classifier 71.4% 84.1%  both judges. When the judges agreed, the class was clear. When they did not agree, we designed simple rules, such as: when one judge chose the class preferred collocation, and the other judge chose the class anti-collocation, the class in the solution was less-preferred collocation. The agreement between judges was 80%; therefore we are confident that the quality of our standard solution is high. We used this standard solution as training data to learn a decision tree6 for our three-way classifier. The features in the decision tree are the t-test between each collocation and the collocation from the same group that has maximum frequency on the Web, and the t-test between the current collocation and the collocation that has minimum frequency (as presented in Table 2). We could have set aside a part of the training data as a test set. Instead, we did 10-fold cross validation to quantify the accuracy on unseen data. The accuracy on the test set was 84.1% (compared with a baseline that chooses the most frequent class, anti-collocations, and achieves an accuracy of 71.4%). We also experimented with including MI as a feature in the decision tree, and with manually choosing thresholds (without a decision tree) for the three-way classification, but the accuracy was lower than 84.1%.</Paragraph>
    <Paragraph position="9"> The three-way classifier can fix some of the mistakes of the MI filter. If a wrong collocation remained after the MI filter, the classifier can classify it in the anti-collocations class.</Paragraph>
    <Paragraph position="10"> We can conclude that the collocational knowledge we acquired has acceptable quality.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Word Association
</SectionTitle>
    <Paragraph position="0"> We performed a second experiment, where we looked for long distance co-occurrences (words that co-occur in a window of size K). We call these associations, and they include the lexical collocations  We use BSP with the option of looking for bi-grams in a window larger than 2. For example if the window size is 3, and the text is vaccine/N cure/V available/A, the extracted bigrams are vaccine/N cure/V, cure/V available/A, and vaccine/N available/A. We would like to choose a large (415) window size; the only problem is the increase in computation time. We look for associations of a word in the paragraph, not only in the sentence. Because we look for bigrams, we may get associations that occur to the left or to the right of the word. This is an indication of strong association.</Paragraph>
    <Paragraph position="1"> We obtained associations similar to those presented by Church et al.(1991) for the near-synonyms ship and boat. Church et al. suggest that a lexicographer looking at these associations can infer that a boat is generally smaller than a ship, because they are found in rivers and lakes, while the ships are found in seas. Also, boats are used for small jobs (e.g., fishing, police, pleasure), whereas ships are used for serious business (e.g., cargo, war). Our intention is to use the associations to automatically infer this kind of knowledge and to validate acquired knowledge.</Paragraph>
    <Paragraph position="2"> For our purpose we need only very strong associations, and we don't want words that associate with all near-synonyms in a cluster. Therefore we test for anti-associations using the same method we used in section 3, with the difference that the query asked to AltaVista is: x NEAR y (where x and y are the words of interest).</Paragraph>
    <Paragraph position="3"> Words that don't associate with a near-synonym but associate with all the other near-synonyms in a cluster can tell us something about its nuances of meaning. For example terrible slip is an antiassociation, while terrible associates with mistake, blunder, error. This is an indication that slip is a minor error.</Paragraph>
    <Paragraph position="4"> Table 5 presents some preliminary results we obtained with K = 4 (on half the BNC and then on the Web), for the differential associations of boat (where p marks preferred associations, ? marks less-preferred associations, and marks antiassociations). We used the same thresholds as for our experiment with collocations.</Paragraph>
    <Paragraph position="5"> Near-synonyms fishing club rowing boat p p p vessel p</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Related work
</SectionTitle>
    <Paragraph position="0"> There has been a lot of work done in extracting collocations for different applications. We have already mentioned some of the most important contributors.</Paragraph>
    <Paragraph position="1"> Like Church et al.(1991), we use the t-test and mutual information, but unlike them we use the Web as a corpus for this task (and a modified form of mutual information), and we distinguish three types of collocations (preferred, less-preferred, and anticollocations). null We are concerned with extracting collocations for use in lexical choice. There is a lot of work on using collocations in NLG (but not in the lexical choice sub-component). There are two typical approaches: the use of phrasal templates in the form of canned phrases, and the use of automatically extracted collocations for unification-based generation (McKeown and Radev, 2000).</Paragraph>
    <Paragraph position="2"> Statistical NLG systems (such as Nitrogen (Langkilde and Knight, 1998)) make good use of the most frequent words and their collocations. But such a system cannot choose a less-frequent synonym that may be more appropriate for conveying desired nuances of meaning, if the synonym is not a frequent word.</Paragraph>
    <Paragraph position="3"> Finally, there is work related to ours from the point of view of the synonymy relation.</Paragraph>
    <Paragraph position="4"> Turney (2001) used mutual information to detect the best answer to questions about synonyms from Test of English as a Foreign Language (TOEFL) and English as a Second Language (ESL). Given a problem word (with or without context), and four alternative words, the question is to choose the alternative most similar in meaning with the problem word. His work is based on the assumption that two synonyms are likely to occur in the same document (on the Web). This can be true if the author needs to avoid repeating the same word, but not true when the synonym is of secondary importance in a text.</Paragraph>
    <Paragraph position="5"> The alternative that has the highest PMI-IR (pointwise mutual information for information retrieval) with the problem word is selected as the answer. We used the same measure in section 3 -- the mutual information between a collocation and a collocate that has the potential to discriminate between nearsynonyms. Both works use the Web as a corpus, and a search engine to estimate the mutual information scores.</Paragraph>
    <Paragraph position="6"> Pearce (2001) improves the quality of retrieved collocations by using synonyms from WordNet (Pearce, 2001). A pair of words is considered a collocation if one of the words significantly prefers only one (or several) of the synonyms of the other word. For example, emotional baggage is a good collocation because baggage and luggage are in the same synset and emotional luggage is not a collocation. As in our work, three types of collocations are distinguished: words that collocate well; words that tend to not occur together, but if they do the reading is acceptable; and words that must not be used together because the reading will be unnatural (anti-collocations). In a similar manner with (Pearce, 2001), in section 3, we don't record collocations in our lexical knowledge-base if they don't help discriminate between near-synonyms. A difference is that we use more than frequency counts to classify collocations (we use a combination of t-test and MI).</Paragraph>
    <Paragraph position="7"> Our evaluation was partly inspired by Evert and Krenn (2001). They collect collocations of the form noun-adjective and verb-prepositional phrase. They build a solution using two human judges, and use the solution to decide what is the best threshold for taking the N highest-ranked pairs as true collocations. In their experiment MI behaves worse that other measures (LL, t-test), but in our experiment MI on the Web achieves good results.</Paragraph>
  </Section>
class="xml-element"></Paper>