<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2084">
  <Title>Combining Association Measures for Collocation Extraction</Title>
  <Section position="4" start_page="0" end_page="651" type="metho">
    <SectionTitle>
2 Reference data
</SectionTitle>
    <Paragraph position="0"> The first step in our work was to create a reference data set. Krenn (2000) suggests that collocation extraction methods should be evaluated against a reference set of collocations manually extracted from the full candidate data from a corpus. To keep the experiments from being biased by the underlying data preprocessing (part-of-speech tagging, lemmatization, and parsing), we extracted the reference data from the morphologically and syntactically annotated Prague Dependency Treebank 2.0, which contains about 1.5 million words annotated on the analytical layer (PDT 2.0, 2006). A corpus of this size is certainly not sufficient for real-world applications, but we found it adequate for our evaluation purposes: a larger corpus would have made the manual collocation extraction task infeasible.</Paragraph>
    <Paragraph position="1"> Dependency trees from the corpus were broken down into dependency bigrams consisting of the lemmas of the head word and its modifier, their part-of-speech pattern, and the dependency type. From 87 980 sentences containing 1 504 847 words, we obtained a total of 635 952 different dependency bigram types. Only 26 450 of them occur in the data more than five times. The less frequent bigrams do not meet the requirement of sufficient evidence of observations needed by some methods used in this work (they assume a normal distribution of observations and become unreliable when dealing with rare events) and were not included in the evaluation. We must, however, agree with Moore (2004), who argues that these cases comprise the majority of all the data (the Zipfian phenomenon) and thus should not be excluded from real-world applications. Finally, we filtered out all bigrams with part-of-speech patterns that never form a collocation (conjunction-preposition, preposition-pronoun, etc.) and obtained a list of 12 232 dependency bigrams, further called collocation candidates.</Paragraph>
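    <Paragraph> The two filtering steps described above can be sketched roughly as follows. This is only an illustration: the tuple layout, the function name `collocation_candidates`, and the `excluded_patterns` argument are hypothetical, and the actual list of excluded part-of-speech patterns is not given in this section.

```python
from collections import Counter

def collocation_candidates(bigrams, excluded_patterns, min_occurrences=6):
    """Count dependency bigram types and apply the two filters.

    bigrams: iterable of (head_lemma, modifier_lemma, pos_pattern, dep_type)
             tuples, one per dependency edge (an assumed layout).
    excluded_patterns: POS patterns assumed never to form a collocation.
    """
    counts = Counter(bigrams)
    return {
        bigram: count
        for bigram, count in counts.items()
        # "more than five times" means a count of at least six
        if count >= min_occurrences and bigram[2] not in excluded_patterns
    }
```

 Applied to the counts reported above, such a filter would reduce 635 952 bigram types first to 26 450 frequent ones and then to the 12 232 collocation candidates.
    </Paragraph>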
    <Section position="1" start_page="651" end_page="651" type="sub_section">
      <SectionTitle>
2.1 Manual annotation
</SectionTitle>
      <Paragraph position="0"> The list of collocation candidates was manually processed by three trained linguists, working in parallel and independently, with the aim of identifying collocations as defined by Choueka. To simplify and clarify the work, they were instructed to select those bigrams that can be assigned to one of these categories:
 • idiomatic expressions: studená válka (cold war), visí otazník (question mark is hanging ~ open question)
 • technical terms: předseda vlády (prime minister), očitý svědek (eye witness)
 • support verb constructions: mít pravdu (to be right), učinit rozhodnutí (make decision)
 • names of persons, locations, and other entities: Pražský hrad (Prague Castle), Červený kříž (Red Cross)
 • stock phrases: zásadní problém (major problem), konec roku (end of the year)
 The first (expected) observation was that the interannotator agreement over all the categories was rather poor: Cohen's κ between annotators ranged from 0.29 to 0.49, which demonstrates that the notion of collocation is very subjective, domain-specific, and somewhat vague. Three annotators were used in order to get a more precise and objective idea of what can be considered a collocation by combining the outcomes from multiple annotators. Only those bigrams that all three annotators independently recognized as collocations (of any type) were considered true collocations. The reference data set contains 2 557 such bigrams, which is 20.9% of all candidates. For the two-category distinction (collocation vs. non-collocation), κ ranged from 0.52 to 0.58.</Paragraph>
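      <Paragraph> For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's κ for two annotators and derives the unanimous-vote labels used for the reference set. The function names are hypothetical and the toy labels are invented; this is not the annotation tooling used in the study.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    # chance agreement from the two annotators' marginal label frequencies
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # undefined (division by zero) when chance agreement is exactly 1
    return (observed - expected) / (1 - expected)

def unanimous(labels_per_annotator):
    """1 only where every annotator marked the bigram as a collocation."""
    return [int(all(votes)) for votes in zip(*labels_per_annotator)]

kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
reference = unanimous([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
```
      </Paragraph>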
      <Paragraph position="1"> The data was split into six stratified samples. Five folds were used for five-fold cross validation and average performance estimation. The remaining one fold was put aside and used as held-out data in experiments described in Section 5.</Paragraph>
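      <Paragraph> One simple way to obtain six stratified samples is to deal the shuffled members of each class round-robin into the folds, as sketched below. The function name, the seed, and the round-robin strategy are assumptions made for illustration; the section does not say how the split was actually produced.

```python
import random

def stratified_folds(labels, k=6, seed=0):
    """Split item indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # deal the members of this class round-robin over the folds
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# five folds would serve cross validation, the sixth would be held out
folds = stratified_folds([1] * 12 + [0] * 48)
cross_validation_folds, held_out = folds[:5], folds[5]
```
      </Paragraph>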
    </Section>
  </Section>
  <Section position="5" start_page="651" end_page="654" type="metho">
    <SectionTitle>
3 Association measures
</SectionTitle>
    <Paragraph position="0"> In the context of collocation extraction, lexical association measures are formulas determining the degree of association between collocation components. They compute an association score for each collocation candidate extracted from a corpus. The scores indicate the potential for a candidate to be a collocation. They can be used for ranking (candidates with high scores at the top), or for classification (by setting a threshold and discarding all bigrams below this threshold).</Paragraph>
    <Paragraph position="1"> If some words occur together more often than by chance, then this may be evidence that they have a special function that is not simply explained as a result of their combination (Manning and Schütze, 1999). This property is known in linguistics as non-compositionality. We think of a corpus as a randomly generated sequence of words that is viewed as a sequence of word pairs (dependency bigrams in our case). Occurrence frequencies and marginal frequencies are used in several association measures that reflect how much the word cooccurrence is accidental. Such measures include: estimation of joint and conditional bigram probabilities (Table 1, 1-3), mutual information and derived measures (4-9), statistical tests of independence (10-14), likelihood measures (15-16), and various other heuristic association measures and coefficients (17-55) originating in different research fields.</Paragraph>
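    <Paragraph> To make the role of occurrence and marginal frequencies concrete, the sketch below computes the frequency expected under independence and pointwise mutual information (measure 4) from raw counts. The function names and the choice of a base-2 logarithm are assumptions; the counts in the usage lines are invented.

```python
import math

def expected_frequency(f_x, f_y, n):
    """Bigram frequency expected if head and modifier were independent."""
    return f_x * f_y / n

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information from counts.

    f_xy: frequency of the bigram xy
    f_x:  marginal frequency of x as the first component
    f_y:  marginal frequency of y as the second component
    n:    total number of bigrams
    """
    # log of observed over expected frequency (base 2 is an assumption)
    return math.log2(f_xy * n / (f_x * f_y))

score = pmi(10, 20, 50, 1000)   # observed 10 against 1 expected
```

 A bigram seen exactly as often as independence predicts gets a score of zero; positive scores indicate co-occurrence above chance.
    </Paragraph>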
    <Paragraph position="2"> By determining the entropy of the immediate context of a word sequence (words immediately preceding or following the bigram), the association measures (56-60) rank collocations according to the assumption that they occur as (syntactic) units in an (information-theoretically) noisy environment (Shimohata et al., 1997). By comparing empirical contexts of a word sequence and of its components (open-class words occurring within a specified context window), the association measures rank collocations according to the assumption that semantically non-compositional expressions typically occur as (semantic) units in different contexts than their components (Zhai, 1997).</Paragraph>
    <Paragraph position="4"> Table 1 (excerpt):
 # Name: Formula
 1. Joint probability: $P(xy)$
 *2. Conditional probability: $P(y|x)$
 3. Reverse conditional prob.: $P(x|y)$
 4. Pointwise mutual inform.: $\log \frac{P(xy)}{P(x*)P(*y)}$
 5. Mutual dependency (MD): $\log P(xy)$ …
 67. Confusion probability: $\sum_w \frac{P(x|C_w)P(y|C_w)P(w)}{P(x*)}$
 *68. Reverse confusion prob.: $\sum$ …</Paragraph>
    <Paragraph position="6"> 71. KL divergence: $\sum_w P(w|C_x) \log \frac{P(w|C_x)}{P(w|C_y)}$
 72. Reverse KL divergence: $\sum_w P(w|C_y) \log \frac{P(w|C_y)}{P(w|C_x)}$
 *73. Skew divergence: $D(p(w|C_x) \,||\, \alpha p(w|C_y) + (1-\alpha) p(w|C_x))$
 74. Reverse skew divergence: $D(p(w|C_y) \,||\, \alpha p(w|C_x) + (1-\alpha) p(w|C_y))$
 75. Phrase word cooccurrence: $\frac{1}{2}\left(\frac{f(x|C_{xy})}{f(xy)} + \frac{f(y|C_{xy})}{f(xy)}\right)$
 76. Word association: $\frac{1}{2}\left(\frac{f(x|C_y) - f(xy)}{f(xy)} + \frac{f(y|C_x) - f(xy)}{f(xy)}\right)$</Paragraph>
    <Paragraph position="8"> Cosine context similarity:
 *77. in boolean vector space: $z_i = \delta(f(w_i|C_z))$
 78. in tf vector space: $z_i = f(w_i|C_z)$
 79. in tf*idf vector space: $z_i = f(w_i|C_z) \cdot \frac{N}{df(w_i)}$; $df(w_i) = |\{x : w_i \in C_x\}|$
 Dice context similarity: $\frac{1}{2}(\mathrm{dice}(c_x, c_{xy}) + \mathrm{dice}(c_y, c_{xy}))$</Paragraph>
    <Paragraph position="10"> 80. in boolean vector space: $z_i = \delta(f(w_i|C_z))$
 81. in tf vector space: $z_i = f(w_i|C_z)$
 82. in tf*idf vector space: $z_i = f(w_i|C_z) \cdot \frac{N}{df(w_i)}$; $df(w_i) = |\{x : w_i \in C_x\}|$</Paragraph>
    <Paragraph position="12"> A contingency table contains observed frequencies and marginal frequencies for a bigram $xy$; $\bar{w}$ stands for any word except $w$; $*$ stands for any word; $N$ is the total number of bigrams. The table cells are sometimes referred to as $f_{ij}$. Statistical tests of independence work with contingency tables of expected frequencies $\hat{f}(xy) = f(x*)f(*y)/N$.</Paragraph>
    <Paragraph position="13"> $C_w$: empirical context of $w$; $C_{xy}$: empirical context of $xy$; $C^l_{xy}$: left immediate context of $xy$; $C^r_{xy}$: right immediate context of $xy$. * denotes those measures selected by the model reduction algorithm discussed in Section 5.
 Figure 1: Vertical averaging of precision-recall curves. Thin curves represent individual non-averaged curves obtained by Pointwise mutual information (4) on five data folds. Legend entry: Averaged precision curve.</Paragraph>
    <Paragraph position="15"> Measures (61-74) have an information-theoretic background, and measures (75-82) are adopted from the field of information retrieval.</Paragraph>
    <Section position="1" start_page="653" end_page="653" type="sub_section">
      <SectionTitle>
3.1 Evaluation
</SectionTitle>
      <Paragraph position="0"> Collocation extraction can be viewed as classification into two categories. By setting a threshold, any association measure becomes a binary classifier: bigrams with higher association scores fall into one class (collocations), the rest into the other class (non-collocations). Performance of such classifiers can be measured, for example, by accuracy, the fraction of correct predictions. However, the proportion of the two classes in our case is far from equal and we want to distinguish classifier performance between them. In this case, several authors, e.g. Evert (2001), suggest using precision, the fraction of positive predictions that are correct, and recall, the fraction of actual positives that are correctly predicted. The higher the scores, the better the classification.</Paragraph>
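      <Paragraph> A minimal sketch of how a thresholded association score yields precision and recall is given below. The function name is hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the classifier 'score >= threshold'.

    labels: 1 for true collocations, 0 otherwise (at least one 1 assumed).
    """
    predicted = [s >= threshold for s in scores]
    true_positives = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    n_predicted = sum(predicted)
    n_positive = sum(labels)
    # with no positive predictions, precision is set to 1.0 by convention here
    precision = true_positives / n_predicted if n_predicted else 1.0
    recall = true_positives / n_positive
    return precision, recall

p, r = precision_recall([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0], threshold=0.5)
```
      </Paragraph>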
    </Section>
    <Section position="2" start_page="653" end_page="653" type="sub_section">
      <SectionTitle>
3.2 Precision-recall curves
</SectionTitle>
      <Paragraph position="0"> Since choosing a classification threshold depends primarily on the intended application and there is no principled way of finding it (Inkpen and Hirst, 2002), we can measure performance of association measures by precision-recall scores within the entire interval of possible threshold values. In this manner, individual association measures can be thoroughly compared by their two-dimensional precision-recall curves visualizing the quality of ranking without committing to a classification threshold. The closer the curve stays to the top and right, the better the ranking procedure is.</Paragraph>
      <Paragraph position="1"> Figure 2: Cross-validated and averaged precision-recall curves of selected association measures (numbers in brackets refer to Table 1). Legend entries include: Cosine context similarity in boolean vector space (77); Unigram subtuple measure (39).</Paragraph>
      <Paragraph position="2"> Precision-recall curves are very sensitive to data (see Figure 1). In order to obtain a good estimate of their shapes, cross validation and averaging are necessary: all cross-validation folds with scores for each instance are combined and a single curve is drawn. Averaging can be done in three ways: vertical (fixing recall, averaging precision), horizontal (fixing precision, averaging recall), and combined (fixing threshold, averaging both precision and recall) (Fawcett, 2003). Vertical averaging, as illustrated in Figure 1, worked reasonably well in our case and was used in all experiments.</Paragraph>
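      <Paragraph> Vertical averaging can be sketched as follows: each fold yields its own precision-recall points, precision is read off at fixed recall levels, and the readings are averaged over folds. The function names are hypothetical, and the interpolation rule (best precision among points reaching the requested recall) is one common convention assumed here, not necessarily the one used in the experiments. Tied scores are not treated specially.

```python
def pr_points(scores, labels):
    """(recall, precision) after each rank of the score-sorted list."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_positive = sum(labels)  # assumed to be at least 1
    true_positives = 0
    points = []
    for rank, i in enumerate(order, start=1):
        true_positives += labels[i]
        points.append((true_positives / n_positive, true_positives / rank))
    return points

def precision_at_recall(points, recall_level):
    """Best precision among points that reach the requested recall."""
    return max(p for r, p in points if r >= recall_level)

def vertical_average(folds, recall_grid):
    """Average precision over folds at each fixed recall level.

    folds: list of (scores, labels) pairs, one per data fold.
    """
    curves = [pr_points(scores, labels) for scores, labels in folds]
    return [
        sum(precision_at_recall(c, level) for c in curves) / len(curves)
        for level in recall_grid
    ]
```
      </Paragraph>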
    </Section>
    <Section position="3" start_page="653" end_page="654" type="sub_section">
      <SectionTitle>
3.3 Mean average precision
</SectionTitle>
      <Paragraph position="0"> Visual comparison of precision-recall curves is a powerful evaluation tool in many research fields (e.g. information retrieval). However, it has a serious weakness. One can easily compare two curves that never cross one another: the curve that dominates the other over the entire interval of recall is obviously better. When the curves cross, the judgment is not so obvious. Also, significance tests on the curves are problematic.</Paragraph>
      <Paragraph position="1"> Only well-defined one-dimensional quality measures can rank evaluated methods by their performance. We adopt such a measure from information retrieval (Hull, 1993). For each cross-validation data fold we define average precision (AP) as the expected value of precision for all possible values of recall (assuming uniform distribution) and mean average precision (MAP) as the mean of this measure computed over the data folds. Significance testing in this case can be realized by a paired t-test or by the more appropriate nonparametric paired Wilcoxon test.</Paragraph>
      <Paragraph position="2"> Due to the unreliable precision scores for low recall and their fast changes for high recall, estimation of AP should be limited only to some narrower recall interval, e.g. &lt;0.1,0.9&gt;.
 Figure 3: a) … from Table 1. The solid points correspond to measures selected by the model reduction algorithm from Section 5. b) Visualization of p-values from the significance tests of difference between each method pair (order is the same for both graphs). The darker points correspond to p-values greater than α = 0.1 and indicate methods with statistically indistinguishable performance (measured by paired Wilcoxon test on values of average precision obtained from five independent data folds).</Paragraph>
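      <Paragraph> The definitions of AP and MAP on a restricted recall interval can be approximated numerically as sketched below. The function names, the uniform grid of 81 recall levels, and the interpolation rule are assumptions made for illustration; the exact estimator used in the experiments is not specified in this section.

```python
def average_precision(scores, labels, low=0.1, high=0.9, steps=81):
    """Mean interpolated precision over a uniform recall grid on [low, high]."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_positive = sum(labels)  # assumed to be at least 1
    true_positives = 0
    points = []
    for rank, i in enumerate(order, start=1):
        true_positives += labels[i]
        points.append((true_positives / n_positive, true_positives / rank))
    grid = [low + (high - low) * k / (steps - 1) for k in range(steps)]
    # small tolerance guards against rounding in the grid values
    return sum(max(p for r, p in points if r >= g - 1e-12) for g in grid) / steps

def mean_average_precision(folds):
    """MAP: the mean of AP over (scores, labels) pairs, one per data fold."""
    return sum(average_precision(s, y) for s, y in folds) / len(folds)
```

 The per-fold AP values, rather than the single MAP figure, are what a paired significance test between two measures would compare.
      </Paragraph>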
    </Section>
    <Section position="4" start_page="654" end_page="654" type="sub_section">
      <SectionTitle>
3.4 Experiments and results
</SectionTitle>
      <Paragraph position="0"> In the initial experiments, we implemented all 82 association measures from Table 1, processed all morphologically and syntactically annotated sentences from PDT 2.0, and computed scores of all the association measures for each dependency bigram in the reference data. For each association measure and each of the five evaluation data folds, we computed precision-recall scores and drew an averaged precision-recall curve. Curves of some well-performing methods are depicted in Figure 2. Next, for each association measure and each data fold, we estimated scores of average precision on the narrower recall interval &lt;0.1,0.9&gt;, computed mean average precision, ranked the association measures according to MAP in descending order, and depicted the result in Figure 3 a). Finally, we applied a paired Wilcoxon test, detected measures with statistically indistinguishable performance, and visualized this information in Figure 3 b).</Paragraph>
      <Paragraph position="1"> A baseline system ranking bigrams randomly operates with average precision of 20.9%. The best performing method for collocation extraction measured by mean average precision is cosine context similarity in boolean vector space (77) (MAP 66.49%), followed by 16 other association measures with nearly identical performance (Figure 3 a). They include some popular methods well known to perform reliably in this task, such as pointwise mutual information (4), Pearson's χ² test (10), z score (13), odds ratio (27), or squared log likelihood ratio (16).</Paragraph>
      <Paragraph position="2"> The interesting point to note is that, in terms of MAP, context similarity measures, e.g. (77), slightly outperform measures based on simple occurrence frequencies, e.g. (39). In a more thorough comparison by precision-recall curves, we observe that the former clearly dominates the latter in the first half of the recall interval, and vice versa in the second half (Figure 2). This is a case where MAP is not a sufficient metric for comparing the performance of association measures. It is also worth pointing out that even if two methods have the same precision-recall curves, the actual bigram rank order can be very different. The existence of such non-correlated (in terms of ranking) measures will be essential in the following sections.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="654" end_page="656" type="metho">
    <SectionTitle>
4 Combining association measures
</SectionTitle>
    <Paragraph position="0"> Each collocation candidate $x_i$ can be described by the feature vector $x_i = (x_{i1}, \ldots, x_{i82})^T$ consisting of 82 association scores from Table 1 and assigned a label $y_i \in \{0,1\}$ which indicates whether the bigram is considered to be a collocation ($y = 1$) or not ($y = 0$). We look for a ranker function $f(x) \to \mathbb{R}$ that determines the strength of lexical association between components of bigram $x$ and hence has the character of an association measure.</Paragraph>
    <Paragraph position="1"> This allows us to compare it with other association measures by the same means of precision-recall curves and mean average precision. Further, we present several classification methods and demonstrate how they can be employed for ranking, i.e. what function can be used as a ranker. For references see Venables and Ripley (2002).</Paragraph>
    <Section position="1" start_page="654" end_page="655" type="sub_section">
      <SectionTitle>
4.1 Linear logistic regression
</SectionTitle>
      <Paragraph position="0"> An additive model for binary response is represented by a generalized linear model (GLM) in the form of logistic regression: $\mathrm{logit}(\pi) = \beta_0 + \beta^T x$</Paragraph>
      <Paragraph position="2"> Table 2: … measures: average precision (AP) for fixed recall values and mean average precision (MAP) on the narrower recall interval with relative improvement in the last column (values in %).</Paragraph>
      <Paragraph position="3"> where $\mathrm{logit}(\pi) = \log(\pi/(1-\pi))$ is a canonical link function for odds-ratio and $\pi \in (0,1)$ is a conditional probability for positive response given a vector $x$. The estimation of $\beta_0$ and $\beta$ is done by the maximum likelihood method, which is solved by the iteratively reweighted least squares algorithm. The ranker function in this case is defined as the predicted value $\hat{\pi}$, or equivalently (due to the monotonicity of the logit link function) as the linear combination $\hat{\beta}_0 + \hat{\beta}^T x$.</Paragraph>
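      <Paragraph> The equivalence of the two ranker definitions can be checked on a toy example: because the logistic function is strictly increasing, sorting by the linear combination and sorting by the predicted probability give the same order. The coefficients and candidate vectors below are invented, and the function names are hypothetical; no model fitting is shown.

```python
import math

def linear_score(x, b0, b):
    """Linear predictor b0 + b^T x."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))

def predicted_probability(x, b0, b):
    """Inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_score(x, b0, b)))

def ranking(candidates, key):
    """Candidate indices ordered from highest to lowest key value."""
    return sorted(range(len(candidates)), key=lambda i: -key(candidates[i]))

b0, b = -1.0, [2.0, -0.5]                      # made-up coefficients
xs = [[0.1, 0.3], [1.2, 0.0], [0.5, 2.0]]      # made-up feature vectors
by_score = ranking(xs, lambda x: linear_score(x, b0, b))
by_probability = ranking(xs, lambda x: predicted_probability(x, b0, b))
```
      </Paragraph>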
    </Section>
    <Section position="2" start_page="655" end_page="655" type="sub_section">
      <SectionTitle>
4.2 Linear discriminant analysis
</SectionTitle>
      <Paragraph position="0"> The basic idea of Fisher's linear discriminant analysis (LDA) is to find a one-dimensional projection defined by a vector $c$ so that for the projected combination $c^T x$ the ratio of the between variance $B$ to the within variance $W$ is maximized: $\max_c \frac{c^T B c}{c^T W c}$. After projection, $c^T x$ can be directly used as a ranker.</Paragraph>
    </Section>
    <Section position="3" start_page="655" end_page="655" type="sub_section">
      <SectionTitle>
4.3 Support vector machines
</SectionTitle>
      <Paragraph position="0"> For technical reasons, let us now change the labels to $y_i \in \{-1,+1\}$. The goal in support vector machines (SVM) is to estimate a function $f(x) = \beta_0 + \beta^T x$ and find a classifier $y(x) = \mathrm{sign}(f(x))$, which can be solved through convex optimization with $\lambda$ as a regularization parameter. The hinge loss function $L(y, f(x)) = [1 - y f(x)]_+$ is active only for positive values (i.e. bad predictions) and therefore is very suitable for ranking models with $\hat{\beta}_0 + \hat{\beta}^T x$ as a ranker function. Setting the regularization parameter $\lambda$ is crucial for both the estimators $\hat{\beta}_0, \hat{\beta}$ and further classification (or ranking). As an alternative to an often inappropriate grid search, Hastie (2004) proposed an effective algorithm which fits the entire SVM regularization path $[\beta_0(\lambda), \beta(\lambda)]$ and gave us the option to choose the optimal value of $\lambda$. As an objective function we used the total amount of loss on training data.
 Figure 4: Precision-recall curves of selected methods combining all association measures compared with curves of two best measures employed individually on the same data sets. Legend entries include: Cosine context similarity in boolean vector space (77); Unigram subtuple measure (39).</Paragraph>
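      <Paragraph> The sketch below evaluates a regularized hinge-loss objective of the kind an SVM minimizes. The exact objective is not reproduced in this section, so the form used here, the sum of hinge losses plus (λ/2) times the squared norm of β, is an assumption based on the usual formulation; the function names and the toy data are likewise hypothetical.

```python
def hinge_loss(y, fx):
    """[1 - y f(x)]_+ : zero for predictions on the correct side of the margin."""
    return max(0.0, 1.0 - y * fx)

def svm_objective(b0, b, xs, ys, lam):
    """Total hinge loss plus a ridge penalty (assumed form, see above)."""
    loss = sum(
        hinge_loss(y, b0 + sum(bj * xj for bj, xj in zip(b, x)))
        for x, y in zip(xs, ys)
    )
    penalty = 0.5 * lam * sum(bj * bj for bj in b)
    return loss + penalty

value = svm_objective(0.0, [1.0], [[2.0], [0.5]], [1, -1], lam=2.0)
```

 In the toy call, the first point lies beyond the margin and contributes no loss, while the second is misclassified and contributes 1.5.
      </Paragraph>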
    </Section>
    <Section position="4" start_page="655" end_page="655" type="sub_section">
      <SectionTitle>
4.4 Neural networks
</SectionTitle>
      <Paragraph position="0"> Assuming the most common model of neural networks (NNet) with one hidden layer, the aim is to find inner weights $w_{jh}$ and outer weights $w_{hi}$ for</Paragraph>
      <Paragraph position="2"> where $h$ ranges over units in the hidden layer. Activation functions $\phi_h$ and function $\phi_0$ are fixed.</Paragraph>
      <Paragraph position="3"> Typically, $\phi_h$ is taken to be the logistic function $\phi_h(z) = \exp(z)/(1+\exp(z))$ and $\phi_0$ to be the indicator function $\phi_0(z) = I(z &gt; \Delta)$ with $\Delta$ as a classification threshold. For ranking we simply set $\phi_0(z) = z$. Parameters of neural networks are estimated by the backpropagation algorithm. The loss function can be based either on least squares or on maximum likelihood. To avoid problems with convergence of the algorithm we used the former. The tuning parameter of a classifier is then the number of units in the hidden layer.</Paragraph>
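      <Paragraph> A forward pass through a network of this kind, used as a ranker, might look like the sketch below. The bias terms, the argument layout, and the function names are assumptions, since the network equation itself is not reproduced in this section; training by backpropagation is not shown.

```python
import math

def logistic(z):
    """Hidden-unit activation exp(z) / (1 + exp(z)), written equivalently."""
    return 1.0 / (1.0 + math.exp(-z))

def nnet_score(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """Ranking score of a one-hidden-layer network with identity output.

    hidden_weights: one weight list per hidden unit (inner weights)
    output_weights: one weight per hidden unit (outer weights)
    """
    hidden = [
        logistic(bias + sum(w * xj for w, xj in zip(weights, x)))
        for weights, bias in zip(hidden_weights, hidden_bias)
    ]
    # for ranking, the output function is the identity
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))

score = nnet_score([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 3.0], 0.5)
```
      </Paragraph>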
    </Section>
    <Section position="5" start_page="655" end_page="656" type="sub_section">
      <SectionTitle>
4.5 Experiments and results
</SectionTitle>
      <Paragraph position="0"> To avoid incommensurability of association measures in our experiments, we used a common pre-processing technique for multivariate standardization: we centered values of each association measure towards zero and scaled them to unit variance.</Paragraph>
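      <Paragraph> Standardization of a single association measure can be sketched as below. Using the population standard deviation (rather than the sample one) is an assumption, and the function name is hypothetical; a constant column would need special handling because its standard deviation is zero.

```python
import statistics

def standardize(column):
    """Center values at zero and scale them to unit variance."""
    mean = statistics.fmean(column)
    deviation = statistics.pstdev(column)  # zero for a constant column
    return [(value - mean) / deviation for value in column]

z = standardize([1.0, 2.0, 3.0])
```

 Each of the 82 score columns would be transformed separately, so that measures on very different scales become comparable inputs to the combination models.
      </Paragraph>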
      <Paragraph position="1"> Precision-recall curves of all methods were obtained by vertical averaging in five-fold cross validation on the same reference data as in the earlier experiments. Mean average precision was computed from average precision values estimated on the recall interval &lt;0.1,0.9&gt;. In each cross-validation step, four folds were used for training and one fold for testing.</Paragraph>
      <Paragraph position="2"> All methods performed very well in comparison with individual measures. The best result was achieved by a neural network with five units in the hidden layer with 80.81% MAP, which is a 21.53% relative improvement compared to the best individual association measure. More complex models, such as neural networks with more than five units in the hidden layer and support vector machines with higher order polynomial kernels, were highly overfitted on the training data folds, and better results were achieved by simpler models. Detailed results of all experiments are given in Table 2, and precision-recall curves of selected methods are depicted in Figure 4.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="656" end_page="657" type="metho">
    <SectionTitle>
5 Model reduction
</SectionTitle>
    <Paragraph position="0"> Combining association measures by any of the presented methods is reasonable and helps in the collocation extraction task. However, the combination models are too complex in the number of predictors used. Some association measures are very similar (analytically or empirically) and as predictors perhaps even redundant. Such measures have no use in the models, make their training harder, and should be excluded. Principal component analysis applied to the evaluation data showed that 95% of its total variance is explained by only 17 principal components and 99.9% is explained by 42 of them. This gives us the idea that we should be able to significantly reduce the number of variables in our models with no (or relatively small) degradation in their performance.</Paragraph>
    <Section position="1" start_page="656" end_page="656" type="sub_section">
      <SectionTitle>
5.1 The algorithm
</SectionTitle>
      <Paragraph position="0"> A straightforward, but in our case hardly feasible, approach is an exhaustive search through the space of all possible subsets of all association measures.</Paragraph>
      <Paragraph position="1"> Another option is a heuristic step-wise algorithm iteratively removing one variable at a time until some stopping criterion is met. Such algorithms are not very robust, are sensitive to data, and are generally not recommended. However, we tried to avoid these problems by initializing our step-wise algorithm by clustering similar variables and choosing one predictor from each cluster as a representative of variables with the same contribution to the model. Thus we remove the highly correlated predictors and continue with the step-wise procedure.</Paragraph>
      <Paragraph position="2"> The algorithm starts with the hierarchical clustering of variables in order to group those with a similar contribution to the model, measured by the absolute value of Pearson's correlation coefficient. After $82 - d$ iterations, variables are grouped into $d$ non-empty clusters and one representative from each cluster is selected as a predictor into the initial model. This selection is based on individual predictor performance on held-out data.
 Figure 5: Precision-recall curves of four NNet models from the model reduction process with different number of predictors compared with curves of two best individual methods. Legend entries include: Cosine context similarity in boolean vector space (77); Unigram subtuple measure (39).</Paragraph>
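      <Paragraph> The clustering step can be sketched as a plain agglomerative procedure: every variable starts in its own cluster and the two most similar clusters are merged until d clusters remain, so that n variables need n - d merges. Average linkage, the function names, and the data layout are assumptions; the section does not state which linkage was used.

```python
import math

def pearson(u, v):
    """Pearson's correlation coefficient of two equally long sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def cluster_variables(columns, d):
    """Group variables into d clusters by absolute correlation.

    columns: dict mapping a variable name to its list of values.
    """
    names = list(columns)
    similarity = {
        (a, b): abs(pearson(columns[a], columns[b]))
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]

    def linkage(pair):
        # average pairwise similarity between two clusters (assumed linkage)
        first, second = pair
        total = sum(similarity[(a, b)] for a in first for b in second)
        return total / (len(first) * len(second))

    while len(clusters) > d:
        pairs = [
            (clusters[i], clusters[j])
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        first, second = max(pairs, key=linkage)
        rest = [c for c in clusters if c is not first and c is not second]
        clusters = rest + [first + second]
    return clusters

def representatives(clusters, performance):
    """Pick the best individually performing variable from each cluster."""
    return [max(cluster, key=performance.get) for cluster in clusters]
```
      </Paragraph>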
      <Paragraph position="3"> Then, the algorithm continues with d predictors in the initial model and in each iteration removes a predictor causing minimal degradation of performance measured by MAP on held-out data. The algorithm stops when the difference becomes significant - either statistically (by paired Wilcoxon test) or practically (set by a human).</Paragraph>
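      <Paragraph> The step-wise part of the algorithm can be sketched as greedy backward elimination. Here `evaluate` stands for any function returning MAP on held-out data for a model trained with the given predictors; it, the function name, and the fixed tolerance `max_drop` (a stand-in for the practical stopping criterion, measured against the initial model) are hypothetical. The statistical stopping criterion based on the Wilcoxon test is not shown.

```python
def backward_eliminate(predictors, evaluate, max_drop):
    """Repeatedly drop the predictor whose removal hurts the score least.

    Stops when the best reduced model falls more than max_drop below
    the score of the initial model.
    """
    current = list(predictors)
    reference = evaluate(current)
    while len(current) > 1:
        trials = [[p for p in current if p != q] for q in current]
        scores = [evaluate(trial) for trial in trials]
        best = max(range(len(trials)), key=scores.__getitem__)
        if reference - scores[best] > max_drop:
            break
        current = trials[best]
    return current

# toy stand-in for held-out MAP: an additive score per predictor
weights = {"a": 0.5, "b": 0.01, "c": 0.3, "d": 0.0}
kept = backward_eliminate(
    ["a", "b", "c", "d"],
    lambda subset: sum(weights[p] for p in subset),
    max_drop=0.05,
)
```
      </Paragraph>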
    </Section>
    <Section position="2" start_page="656" end_page="657" type="sub_section">
      <SectionTitle>
5.2 Experiments and results
</SectionTitle>
      <Paragraph position="0"> We performed the model reduction experiment on the neural network with five units in the hidden layer (the best performing combination method).</Paragraph>
      <Paragraph position="1"> The similarity matrix for hierarchical clustering was computed on the held-out data and parameter d (number of initial predictors) was experimentally set to 60. In each iteration of the algorithm, we used four data folds (out of the five used in previous experiments) for fitting the models and the held-out fold to measure the performance of these models and to select the variable to be removed.</Paragraph>
      <Paragraph position="2"> The new model was cross-validated on the same five data folds as in the previous experiments.</Paragraph>
      <Paragraph position="3"> Precision-recall curves for some intermediate models are shown in Figure 5. We can conclude that we were able to reduce the NNet model to about 17 predictors without statistically significant difference in performance. The corresponding association measures are marked in Table 1 and highlighted in Figure 3a). They include measures from the entire range of individual mean average precision values.</Paragraph>
    </Section>
  </Section>
</Paper>