<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0909">
  <Title>Acquiring Collocations for Lexical Choice between Near-Synonyms</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia
</SectionTitle>
    <Paragraph position="0"> ... generates all the possible sentences with a given meaning, and ranks them according to the degree to which they satisfy a set of preferences given as input (these are the denotational, attitudinal, and stylistic nuances mentioned above). We can refine the ranking so that it favors good collocations and penalizes sentences containing words that do not collocate well.</Paragraph>
    <Paragraph position="1"> We acquire collocates of all near-synonyms in CTRW from free text. Unlike other researchers, who rely on a single measure to rank collocations, we combine several statistical measures.</Paragraph>
    <Paragraph position="2"> Then we acquire knowledge about less-preferred collocations and anti-collocations2. For example, daunting task is a preferred collocation, while daunting job is less preferred (it should not be used in lexical choice unless there is no better alternative), and daunting duty is an anti-collocation (it must not be used in lexical choice). Like Church et al. (1991), we use the t-test and mutual information; unlike them, we use the Web as a corpus for this task, we distinguish three different types of collocations, and we apply sense disambiguation to collocations.</Paragraph>
    <Paragraph position="3"> Collocations are defined in different ways by different researchers. For us collocations consist of consecutive words that appear together much more often than by chance. We also include words separated by a few non-content words (short-distance co-occurrence in the same sentence).</Paragraph>
    <Paragraph position="4"> We are interested in collocations to be used in lexical choice. Therefore we need to extract lexical collocations (between open-class words), not grammatical collocations (which can contain closed-class words, for example put on). For now, we consider only two-word fixed collocations; in future work we will consider longer and more flexible collocations. We are also interested in acquiring words that strongly associate with our near-synonyms, especially words that associate with only one of the near-synonyms in a cluster. Using these strong associations, we plan to learn about nuances of near-synonyms in order to validate and extend our lexical knowledge-base of near-synonym differences.</Paragraph>
    <Paragraph position="5"> In our first experiment, described in sections 2 and 3 (with results in section 4, and evaluation in section 5), we acquire knowledge about the collocational behaviour of the near-synonyms. In step 1 (section 2), we acquire potential collocations from the British National Corpus (BNC)3, combining several measures. In section 3 we present the remaining steps: (step 2) selecting collocations for the near-synonyms in CTRW; (step 3) filtering out wrongly selected collocations using mutual information on the Web; (step 4) for each cluster, composing new collocations by combining the collocate of one near-synonym with the other near-synonyms in the cluster, and applying the differential t-test to classify them into preferred collocations, less-preferred collocations, and anti-collocations. Section 6 sketches our second experiment, involving word associations. The last two sections present related work, and conclusions and future work.
2 This term was introduced by Pearce (2001).</Paragraph>
    <Paragraph position="7"> 2 Extracting collocations from free text
For the first experiment we acquired collocations for near-synonyms from a corpus. We experimented with 100 million words from the Wall Street Journal (WSJ). Some of our near-synonyms appear very few times (10.64% appear fewer than 5 times) and 6.87% of them do not appear at all in the WSJ (due to its business domain). Therefore we need a more general corpus. We used the 100-million-word BNC. Only 2.61% of our near-synonyms do not occur, and only 2.63% occur between 1 and 5 times.</Paragraph>
    <Paragraph position="8"> Many of the near-synonyms appear in more than one cluster, with different parts of speech. We experimented with extracting collocations from raw text, but we decided to use a part-of-speech tagged corpus because we need to extract only the collocations relevant to each cluster of near-synonyms. The BNC is a good choice of corpus for us because it has been tagged (automatically, by the CLAWS tagger).</Paragraph>
    <Paragraph position="9"> We preprocessed the BNC by removing all words tagged as closed-class. To reduce computation time, we also removed words that are not useful for our purposes, such as proper names (tagged NP0). If we keep the proper names, they are likely to be among the highest-ranked collocations.</Paragraph>
    <Paragraph position="10"> There are many statistical methods that can be used to identify collocations. Four general methods are presented by Manning and Schütze (1999). The first one, based on frequency of co-occurrence, does not consider the length of the corpus; part-of-speech filtering is needed to obtain useful collocations. The second method considers the mean and variance of the distance between two words, and can compute flexible collocations (Smadja, 1993). The third method is hypothesis testing, which uses statistical tests to decide whether the words occur together with probability higher than chance (it tests whether we can reject the null hypothesis that the two words occurred together by chance). The fourth method is (pointwise) mutual information, an information-theoretical measure.</Paragraph>
    <Paragraph position="11"> We use Ted Pedersen's Bigram Statistics Package4. BSP is a suite of programs to aid in analyzing bigrams in a corpus (newer versions allow N-grams). The package can compute bigram frequencies and various statistics to measure the degree of association between two words: mutual information (MI), Dice, chi-square (χ2), log-likelihood (LL), and Fisher's exact test.</Paragraph>
    <Paragraph position="13"> We briefly describe the methods we use in our experiments, for the two-word case. Each bigram xy can be viewed as having two features represented by the binary variables X and Y. The joint frequency distribution of X and Y is described in a contingency table. Table 1 shows an example for the bigram daunting task. n11 is the number of times the bigram xy occurs; n12 is the number of times x occurs in bigrams at the left of words other than y; n21 is the number of times y occurs in bigrams after words other than x; and n22 is the number of bigrams containing neither x nor y. In Table 1 the variable X denotes the presence or absence of daunting in the first position of a bigram, and Y denotes the presence or absence of task in the second position of a bigram. The marginal distributions of X and Y are the row and column totals obtained by summing the joint frequencies: n+1 = n11 + n21, n1+ = n11 + n12, and n++ is the total number of bigrams.</Paragraph>
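As a minimal illustration of the contingency-table counts just defined, the sketch below builds the 2x2 table for a bigram from its joint and marginal frequencies. The counts are invented for the example; they are not BNC data.

```python
# Hypothetical illustration of the 2x2 contingency table described
# above. The counts are invented; they do not come from the BNC.
def contingency(n11, n1p, np1, npp):
    """Build the 2x2 table from the joint count n11, the marginals
    n1+ (x in first position) and n+1 (y in second position), and
    the total number of bigrams n++."""
    n12 = n1p - n11              # x followed by a word other than y
    n21 = np1 - n11              # y preceded by a word other than x
    n22 = npp - n11 - n12 - n21  # bigrams containing neither x nor y
    return [[n11, n12], [n21, n22]]

# e.g. the bigram "daunting task", with invented counts
table = contingency(n11=50, n1p=80, np1=200, npp=1_000_000)
```

Note that n22 is derived from the other three cells and the total, mirroring the marginal sums in the text.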
    <Paragraph position="14"> The BSP tool counts, for each bigram in a corpus, how many times it occurs, how many times the first word occurs at the left of any bigram (n1+), and how many times the second word occurs at the right of any bigram (n+1).</Paragraph>
    <Paragraph position="19"> Mutual information, I(x;y), compares the probability of observing word x and word y together (the joint probability) with the probabilities of observing x and y independently (the probability of their occurring together by chance) (Church and Hanks, 1991).</Paragraph>
    <Paragraph position="21"> The probabilities can be approximated by: P(x) = n1+/n++, P(y) = n+1/n++, and P(xy) = n11/n++, so that I(x;y) = log2 (n11 n++ / (n1+ n+1)).</Paragraph>
    <Paragraph position="23"> The Dice coefficient is related to mutual information and is calculated as: Dice(x,y) = 2 n11 / (n1+ + n+1).</Paragraph>
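The two association measures above follow directly from the contingency counts. The sketch below is a minimal illustration (not the BSP implementation), again with invented counts:

```python
import math

# Minimal sketch of pointwise mutual information and the Dice
# coefficient from contingency-table counts (invented numbers).
def pmi(n11, n1p, np1, npp):
    # I(x;y) = log2(P(xy) / (P(x) P(y)))
    #        = log2(n11 * n++ / (n1+ * n+1))
    return math.log2((n11 * npp) / (n1p * np1))

def dice(n11, n1p, np1):
    # Dice(x,y) = 2 * n11 / (n1+ + n+1)
    return 2 * n11 / (n1p + np1)

mi_score = pmi(50, 80, 200, 1_000_000)   # large value: strong association
dice_score = dice(50, 80, 200)
```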
    <Paragraph position="25"> The next methods fall under hypothesis testing. Pearson's chi-square and the log-likelihood ratio measure the divergence of the observed (nij) and expected (mij) sample counts (i = 1,2; j = 1,2). The expected values are for the model that assumes independence (that is, that the null hypothesis is true). For each cell in the contingency table, the expected count is: mij = ni+ n+j / n++. The measures are calculated as (Pedersen, 1996): χ2 = Σi,j (nij - mij)^2 / mij and LL = 2 Σi,j nij log (nij / mij). The log-likelihood ratio is considered more appropriate for sparse data than chi-square.</Paragraph>
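The chi-square and log-likelihood computations over a 2x2 table can be sketched as follows; the example table is invented, not drawn from the corpus:

```python
import math

# Sketch of Pearson's chi-square and the log-likelihood ratio for a
# 2x2 contingency table; the expected counts m_ij assume independence.
def chi2_ll(table):
    npp = sum(sum(row) for row in table)
    chi2 = ll = 0.0
    for i in range(2):
        for j in range(2):
            ni_p = sum(table[i])              # row total n_{i+}
            np_j = table[0][j] + table[1][j]  # column total n_{+j}
            m = ni_p * np_j / npp             # expected count m_{ij}
            n = table[i][j]
            chi2 += (n - m) ** 2 / m
            if n > 0:                         # 0 * log(0) counts as 0
                ll += n * math.log(n / m)
    return chi2, 2 * ll

chi2, ll = chi2_ll([[10, 20], [30, 40]])  # invented example table
```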
    <Paragraph position="28"> Fisher's exact test is a significance test that is considered to be more appropriate for sparse and skewed samples of data than statistics such as the log-likelihood ratio or Pearson's chi-square test (Pedersen, 1996). Fisher's exact test is computed by fixing the marginal totals of a contingency table and then determining the probability of each of the possible tables that could result in those marginal totals; it is therefore computationally expensive. The formula is: P = (n1+! n2+! n+1! n+2!) / (n++! n11! n12! n21! n22!).</Paragraph>
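With the marginals fixed, the probability of a single table is hypergeometric, and the factorial formula above reduces to a ratio of binomial coefficients. This sketch computes the probability of one table (counts invented for illustration):

```python
from math import comb

# Sketch of Fisher's exact probability of one 2x2 table with fixed
# marginals (the hypergeometric distribution); counts are invented.
def fisher_p(n11, n12, n21, n22):
    # Equivalent to n1+! n2+! n+1! n+2! / (n++! n11! n12! n21! n22!)
    n1p, n2p = n11 + n12, n21 + n22
    np1 = n11 + n21
    npp = n1p + n2p
    return comb(n1p, n11) * comb(n2p, n21) / comb(npp, np1)
```

The test itself sums these probabilities over the tables at least as extreme as the observed one; summing over all tables with the given marginals yields 1, which is a handy sanity check.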
    <Paragraph position="30"> Because these five measures rank collocations in different ways (as the results in the Appendix will show), and have different advantages and drawbacks, we decided to combine them in choosing collocations. For each near-synonym, we choose as potential collocations those selected by at least two of the measures. For each measure we need to choose a threshold T, and consider as selected collocations only the T highest-ranked bigrams (where T can differ for each measure). By choosing higher thresholds we increase the precision (reduce the chance of accepting wrong collocations).</Paragraph>
    <Paragraph position="31"> By choosing lower thresholds we get better recall.</Paragraph>
    <Paragraph position="32"> If we opt for low recall, we may not get many collocations for some of the near-synonyms. Because there is no principled way of choosing these thresholds, we prefer to choose lower thresholds (the first 200,000 collocations selected by each measure, except for Fisher's measure, for which we take all 435,000 collocations ranked 1) and to filter out later (in step 2) the bigrams that are not true collocations, using mutual information on the Web.</Paragraph>
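The voting scheme described above (keep a bigram if at least two measures rank it above their thresholds) can be sketched as follows. All rankings, thresholds, and bigrams here are invented for illustration, not the actual BNC rankings:

```python
# Hypothetical sketch of the voting step: keep as a potential
# collocation any bigram ranked above the per-measure threshold by
# at least two measures. All data below are invented.
def select(rankings, thresholds, min_votes=2):
    """rankings: measure -> list of bigrams, best-ranked first.
    thresholds: measure -> T, how many top bigrams that measure keeps."""
    votes = {}
    for measure, ranked in rankings.items():
        for bigram in ranked[:thresholds[measure]]:
            votes[bigram] = votes.get(bigram, 0) + 1
    return {b for b, v in votes.items() if v >= min_votes}

rankings = {
    "MI":   [("daunting", "task"), ("red", "herring"), ("of", "the")],
    "LL":   [("daunting", "task"), ("of", "the"), ("red", "herring")],
    "Dice": [("red", "herring"), ("daunting", "task"), ("of", "the")],
}
selected = select(rankings, {"MI": 2, "LL": 2, "Dice": 2})
```

Here ("of", "the") is dropped because only one measure ranks it within its threshold, mirroring the intent of the two-measure agreement rule.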
  </Section>
</Paper>