<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2003">
  <Title>An Extensive Empirical Study of Collocation Extraction Methods</Title>
  <Section position="3" start_page="13" end_page="13" type="metho">
    <SectionTitle>
2 Collocation extraction
</SectionTitle>
    <Paragraph position="0"> Most methods for collocation extraction are based on verification of typical collocation properties.</Paragraph>
    <Paragraph position="1"> These properties are formally described by mathematical formulas that determine the degree of association between components of collocation. Such formulas are called association measures and compute an association score for each collocation candidate extracted from a corpus. The scores indicate a chance of a candidate to be a collocation. They can be used for ranking or for classification - by setting a threshold. Finding such a threshold depends on the intended application.</Paragraph>
    <Paragraph position="2"> The most widely tested property of collocations is non-compositionality: If words occur together more often than by a chance, then this is the evidence that they have a special function that is not simply explained as a result of their combination (Manning and Schutze, 1999). We think of a corpus as a randomly generated sequence of words that is viewed as a sequence of word pairs. Occurrence frequencies of these bigrams are extracted and kept in contingency tables (Table 1a). Values from these tables are used in several association measures that reflect how much the word coocurrence is accidental. A list of such measures is given in Table 2 and includes: estimation of bigram and unigram probabilities (rows 3-5), mutual information and derived measures (611), statistical tests of independence (12-16), likelihood measures (17-18), and various other heuristic association measures and coefficients (19-57).</Paragraph>
    <Paragraph position="3"> Another frequently tested property is taken directly from the definition that a collocation is a syntactic and semantic unit. For each bigram occurring in the corpus, information of its empirical context (frequencies of open-class words occurring within a specified context window) and left and right immediate contexts (frequencies of words immediately preceding or following the bigram) is extracted (Table 1b). By determining the entropy of the immediate contexts of a word sequence, the association measures rank collocations according to the assumption that they occur as units in a (informationtheoretically) noisy environment (Shimohata et al., 1997) (58-62). By comparing empirical contexts of a word sequence and its components, the association measures rank collocations according to the as-</Paragraph>
    <Paragraph position="5"> b) Cw empirical context of w Cxy empirical context of xy Clxy left immediate context of xy Crxy right immediate context of xy  marginal frequencies for a bigram xy; -w stands for any word except w; [?] stands for any word; N is a total number of bigrams. The table cells are sometimes referred as fij. Statistical tests of independence work with contingency tables of expected frequencies ^f(xy)=f(x[?])f([?]y)/N. b) Different notions of empirical contexts.</Paragraph>
    <Paragraph position="6"> sumption that semantically non-compositional expressions typically occur in different contexts than their components (Zhai, 1997). Measures (63-76) have information theory background and measures (77-84) are adopted from the field of information retrieval. Context association measures are mainly used for extracting idioms.</Paragraph>
    <Paragraph position="7"> Besides all the association measures described above, we also take into account other recommended measures (1-2) (Manning and Schutze, 1999) and some basic linguistic characteristics used for filtering non-collocations (85-87). This information can be obtained automatically from morphological taggers and syntactic parsers available with reasonably high accuracy for many languages.</Paragraph>
  </Section>
  <Section position="4" start_page="13" end_page="15" type="metho">
    <SectionTitle>
3 Empirical evaluation
</SectionTitle>
    <Paragraph position="0"> Evaluation of collocation extraction methods is a complicated task. On one hand, different applications require different setting of association score thresholds. On the other hand, methods give different results within different ranges of their association scores. We need a complex evaluation scheme covering all demands. In such a case, Evert (2001) and other authors suggest using precision and recall measures on a full reference data or on n-best lists.</Paragraph>
    <Paragraph position="1"> Data. All the presented experiments were performed on morphologically and syntactically annotated Czech text from the Prague Dependency Tree-bank (PDT) (HajiVc et al., 2001). Dependency trees were broken down into dependency bigrams consisting of: lemmas and part-of-speech of the components, and type of dependence between the components. null For each bigram type we counted frequencies in its contingency table, extracted empirical and immediate contexts, and computed all the 84 association measures from Table 2. We processed 81 614 sen- null # Name Formula 1. Mean component offset 1n Pni=1 di 2. Variance component offset 1n[?]1 Pni=1 `di[?] -d'2 3. Joint probability P(xy) 4. Conditional probability P(y|x) 5. Reverse conditional prob. P(x|y) star6. Pointwise mutual inform. log P(xy)</Paragraph>
    <Paragraph position="3"> 75. Skew divergence D(p(w|Cx)||a(w|Cy)+(1[?]a)p(w|Cx)) 76. Reverse skew divergence D(p(w|Cy)||ap(w|Cx)+(1[?]a)p(w|Cy)) 77. Phrase word coocurrence 12(f(x|Cxy)f(xy) + f(y|Cxy)f(xy) ) 78. Word association 12(f(x|Cy)[?]f(xy)f(xy) + f(y|Cx)[?]f(xy)f(xy) )</Paragraph>
    <Paragraph position="5"> star79. in boolean vector space zi=d(f(wi|Cz)) 80. in tf vector space zi=f(wi|Cz) 81. in tf*idf vector space zi=f(wi|Cz)* Ndf(wi);df(wi)=|{x:wiepsilon1Cx}| Dice context similarity: 12(dice(cx,cxy)+dice(cy,cxy))</Paragraph>
    <Paragraph position="7"> tences with 1 255 590 words and obtained a total of 202 171 different dependency bigrams.</Paragraph>
    <Paragraph position="8"> Krenn (2000) argues that collocation extraction methods should be evaluated against a reference set of collocations manually extracted from the full candidate data from a corpus. However, we reduced the full candidate data from PDT to 21 597 bigram by filtering out any bigrams which occurred 5 or less times in the data and thus we obtained a reference data set which fulfills requirements of a sufficient size and a minimal frequency of observations which is needed for the assumption of normal distribution required by some methods.</Paragraph>
    <Paragraph position="9"> We manually processed the entire reference data set and extracted bigrams that were considered to be collocations. At this point we applied part-of-speech filtering: First, we identified POS patterns that never form a collocation. Second, all dependency bigrams having such a POS pattern were removed from the reference data and a final reference set of 8 904 bi-grams was created. We no longer consider bigrams with such patterns to be collocation candidates.</Paragraph>
    <Paragraph position="10"> This data set contained 2 649 items considered to be collocations. The a priori probability of a bi-gram to be a collocation was 29.75 %. A stratified one-third subsample of this data was selected as test data and used for evaluation and testing purposes in this work. The rest was taken apart and used as training data in later experiments.</Paragraph>
    <Paragraph position="11"> Evaluation metrics. Since we manually annotated the entire reference data set we could use the suggested precision and recall measures (and their harmonic mean F-measure). A collocation extraction method using any association measure with a given threshold can be considered a classifier and the measures can be computed in the following way: Precision = #correctly classified collocations#total predicted as collocations Recall = #correctly classified collocations#total collocations The higher these scores, the better the classifier is.</Paragraph>
    <Paragraph position="12"> By changing the threshold we can tune the classifier performance and &amp;quot;trade&amp;quot; recall for precision. Therefore, collocation extraction methods can be thoroughly compared by comparing their precision-recall curves: The closer the curve to the top right corner, the better the method is.</Paragraph>
    <Paragraph position="13">  Results. Presenting individual results for all of the 84 association measures is not possible in a paper of this length. Therefore, we present precision-recall graphs only for the best methods from each group mentioned in Section 2; see Figure 1. The baseline system that classifies bigrams randomly, operates with a precision of 29.75 %. The overall best result was achieved by Pointwise mutual information: 30 % recall with 85.5 % precision (F-measure 44.4), 60 % recall with 78.4 % precision (F-measure 68.0), and 90 % recall with 62.5 % precision (F-measure 73.8).</Paragraph>
  </Section>
  <Section position="5" start_page="15" end_page="16" type="metho">
    <SectionTitle>
4 Statistical classification
</SectionTitle>
    <Paragraph position="0"> In the previous section we mentioned that collocation extraction is a classification problem. Each method classifies instances of the candidate data set according to the values of an association score. Now we have several association scores for each candidate bigram and want to combine them together to achieve better performance. A motivating example is depicted in Figure 3: Association scores of Point-wise mutual information and Cosine context similarity are independent enough to be linearly combined to provide better results. Considering all association measures, we deal with a problem of high-dimensional classification into two classes.</Paragraph>
    <Paragraph position="1"> In our case, each bigram x is described by the attribute vector x=(x1,...,x87) consisting of linguistic features and association scores from Table 2.</Paragraph>
    <Paragraph position="2"> Now we look for a function assigning each bigram one class : f(x)-{collocation, non-collocation}.</Paragraph>
    <Paragraph position="3"> The result of this approach is similar to setting a threshold of the association score in methods us- null sion. By moving this boundary we can tune the classifier output (a 5 % stratified sample of the test data is displayed). ing one association measure, which is not very usefull for our purpose. Some classification methods, however, output also the predicted probability P(xiscollocation) that can be considered a regular association measure as described above. Thus, the classification method can be also tuned by changing a threshold of this probability and can be compared with other methods by the same means of precision and recall.</Paragraph>
    <Paragraph position="4"> One of the basic classification methods that gives a predicted probability is Logistic linear regression.</Paragraph>
    <Paragraph position="5"> The model defines the predicted probability as:</Paragraph>
    <Paragraph position="7"> where the coefficients bi are obtained by the iteratively reweighted least squares (IRLS) algorithm which solves the weighted least squares problem at each iteration. Categorial attributes need to be transformed to numeric dummy variables. It is also recommended to normalize all numeric attributes to have zero mean and unit variance.</Paragraph>
    <Paragraph position="8"> We employed the datamining software Weka by Witten and Frank (2000) in our experiments. As training data we used a two-third subsample of the reference data described above. The test data was the same as in the evaluation of the basic methods.</Paragraph>
    <Paragraph position="9"> By combining all the 87 attributes, we achieved the results displayed in Table 3 and illustrated in Figure 3. At a recall level of 90 % the relative increase in precision was 35.2 % and at a precision level of  i) logistic linear regression on the full set of 87 attributes and ii) on the selected subset with 17 attributes. The thin unlabeled curves refer to the methods from the 17 selected attributes Attribute selection. In the final step of our experiments, we attempted to reduce the attribute space of our data and thus obtain an attribute subset with the same prediction ability. We employed a greedy step-wise search method with attribute subset evaluation via logistic regression implemented in Weka. It performs a greedy search through the space of attribute subsets and iteratively merges subsets that give the best results until the performance is no longer improved. null We ended up with a subset consisting of the following 17 attributes: (6, 10, 21, 25, 31, 56, 58, 61, 71, 73, 74, 79, 82, 83, 84, 85, 86) which are also marked in  in Table 3 and precision-recall graphs of the selected attributes and their combinations are in Figure 3.</Paragraph>
  </Section>
class="xml-element"></Paper>