<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1025"> <Title>Methods for the Qualitative Evaluation of Lexical Association Measures</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Base Data </SectionTitle> <Paragraph position="0"> The base data for our experiments are extracted from two corpora which differ with respect to size and text type. The base sets also differ with respect to syntactic homogeneity and grammatical correctness. Both candidate sets have been manually inspected for TPs.</Paragraph> <Paragraph position="1"> The first set comprises bigrams of adjacent, lemmatized AdjN pairs extracted from a small corpus of freely available German law texts.3 Due to the extraction strategy, the data are homogeneous and grammatically correct, i.e., there is (almost) always a grammatical dependency between adjacent adjectives and nouns in running text. Two human annotators independently marked candidate pairs perceived as &quot;typical&quot; combinations, including idioms ((die) hohe See, 'the high seas'), legal terms (üble Nachrede, 'slander'), and proper names (Rotes Kreuz, 'Red Cross'). Candidates accepted by either one of the annotators were considered TPs.</Paragraph> <Paragraph position="2"> The second set consists of PNV triples extracted from an 8 million word portion of the Frankfurter Rundschau Corpus4, in which part-of-speech tags and minimal PPs were identified.5 The PNV triples were selected automatically such that the preposition and the noun are constituents of the same PP, and the PP and the verb co-occur within a sentence. Only main verbs were considered, and full forms were reduced to base forms.6 The PNV data are partially inhomogeneous and not fully grammatically correct, because they include combinations with no grammatical relation between PN and V. PNV collocations were manually annotated.
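The AdjN extraction strategy described above (adjacent adjective-noun pairs from lemmatized, POS-tagged running text) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline; the STTS-style tag names (ADJA, NN) and the toy token list are assumptions of the example.

```python
from collections import Counter

def extract_adjn_pairs(tagged):
    """Count adjacent adjective-noun lemma pairs in a (lemma, POS) stream.

    Hypothetical sketch: ADJA/NN are STTS-style tags; the paper's actual
    extraction pipeline is not reproduced here.
    """
    pairs = Counter()
    for (lem1, pos1), (lem2, pos2) in zip(tagged, tagged[1:]):
        if pos1 == "ADJA" and pos2 == "NN":
            pairs[(lem1, lem2)] += 1
    return pairs

# Toy lemmatized, tagged text (invented for illustration).
tokens = [("rot", "ADJA"), ("Kreuz", "NN"), ("die", "ART"),
          ("hoch", "ADJA"), ("See", "NN"), ("rot", "ADJA"), ("Kreuz", "NN")]
print(extract_adjn_pairs(tokens))
```

Because adjacent AdjN pairs in German running text almost always stand in a grammatical dependency, a simple adjacency filter of this kind already yields homogeneous, grammatically correct candidates.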
The criteria used for the distinction between collocations and arbitrary word combinations are: there is a grammatical relation between the verb and the PP, and the triple can be interpreted as a support verb construction and/or a metaphoric or idiomatic reading is available, e.g.: zur Verfügung stellen (at_the availability put, 'make available'), am Herzen liegen (at the heart lie, 'have at heart').7 [Table 1: proportion of true collocations in the base sets: AdjN 15.84%, PNV 6.41%]</Paragraph> <Paragraph position="4"> General statistics for the AdjN and PNV base sets are given in Table 1. Manual annotation was performed for AdjN pairs with frequency f ≥ 2 and PNV triples with f ≥ 3 only (see section 5 for a discussion of the excluded low-frequency candidates).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Setup </SectionTitle> <Paragraph position="0"> After extraction of the base data and manual identification of TPs, the AMs are applied, resulting in an ordered candidate list for each measure (henceforth significance list, SL). The order indicates the degree of collocativity. Multiple candidates with identical scores are listed in random order. This is necessary, in particular, when co-occurrence frequency is used as an association measure.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 n-Best Lists </SectionTitle> <Paragraph position="0"> In this approach, the set of the n highest ranked word combinations is evaluated for each measure, and the proportion of TPs among this n-best list (the precision) is computed. Another measure of goodness is the proportion of TPs in the base data that are also contained in the n-best list (the recall). While precision measures the quality of the n-best lists produced, recall measures their coverage, i.e., how many of all true collocations in the corpus were identified.
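The procedure just described (apply an AM to each candidate, sort the candidates into an SL, then compute n-best precision and recall) can be sketched as below. This is a hedged illustration, not the authors' code: the AM shown is a Dunning-style log-likelihood (G²) over the 2x2 contingency table, and the (score, TP-flag) candidate representation is an assumption of the example.

```python
import math

def log_likelihood(f12, f1, f2, N):
    """Dunning-style G^2 score. f12: pair frequency, f1/f2: marginal
    frequencies of the two words, N: total number of pair tokens."""
    # Observed counts of the 2x2 contingency table and their expected
    # values under independence (E = row total * column total / N).
    table = [
        (f12,               f1 * f2 / N),
        (f1 - f12,          f1 * (N - f2) / N),
        (f2 - f12,          (N - f1) * f2 / N),
        (N - f1 - f2 + f12, (N - f1) * (N - f2) / N),
    ]
    return 2 * sum(o * math.log(o / e) for o, e in table if o > 0)

def n_best_precision_recall(candidates, n):
    """candidates: list of (score, is_tp); higher score = more collocational."""
    sl = sorted(candidates, key=lambda c: c[0], reverse=True)  # significance list
    tps_in_n_best = sum(is_tp for _, is_tp in sl[:n])
    total_tps = sum(is_tp for _, is_tp in sl)
    return tps_in_n_best / n, tps_in_n_best / total_tps
```

Precision is the proportion of TPs within the n-best list; recall relates those TPs to all TPs in the base data, i.e., it measures coverage.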
The most problematic aspect here is that conclusions drawn from n-best lists for a single (and often small) value of n are only snapshots and likely to be misleading.</Paragraph> <Paragraph position="1"> For instance, considering the set of AdjN base data with f ≥ 2, we might arrive at the following results (Table 2 gives the precision values of the n highest ranked word combinations for several values of n): As expected from the results of other studies (e.g. Lezius (1999)), the precision of MI is significantly lower than that of log-likelihood,8 (Footnote 8: MI systematically overestimates the collocativity of low-frequency pairs, cf. section 4.3.)</Paragraph> <Paragraph position="2"> whereas the t-test competes with log-likelihood, especially for larger values of n. Frequency leads to clearly better results than MI and χ².</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Precision and Recall Graphs </SectionTitle> <Paragraph position="0"> For a clearer picture, however, larger portions of the SLs need to be examined. A well-suited means for comparing the goodness of different AMs are the precision and recall graphs obtained by stepwise processing of the complete SLs (Figures 1 to 4). The x-axis represents the percentage of data processed in the respective SL, while the y-axis represents the precision (or recall) values achieved. For instance, the precision values for</Paragraph> <Paragraph position="2"> the n-best lists for the AdjN data can be read from the y-axis in Figure 1 at the corresponding positions on the x-axis. The horizontal line in the figures marks the percentage of true collocations in the base set. This value corresponds to the expected precision value for random selection, and provides a baseline for the interpretation of the precision curves. General findings from the precision graphs are: (i) It is only useful to consider the first halves of the SLs, as the measures approximate each other afterwards.
(ii) Precision of log-likelihood, χ², t-test and frequency strongly decreases in the first part of the SLs, whereas precision of MI remains almost constant (cf. Figure 1) or even increases slightly (cf. Figure 2). (iii) The identification results are unstable for the first few percent of the data, with log-likelihood, t-test and frequency stabilizing earlier than MI and χ², and the PNV data stabilizing earlier than the AdjN data. This instability is caused by &quot;random fluctuations&quot;, i.e., whether a particular TP ends up on rank n (and thus increases the precision of the n-best list) or on rank n+1. The n-best lists for AMs with low precision values (MI, χ²) contain a particularly small number of TPs. Therefore, they are more susceptible to random variation, which illustrates that evaluation based on a small number of n-best candidate pairs cannot be reliable.</Paragraph> <Paragraph position="3"> With respect to the recall curves (Figures 3 and 4), we find: (i) Examination of 50% of the data in the SLs leads to identification of between 75% (AdjN) and 80% (PNV) of the TPs. (ii) For the first 40% of the SLs, MI and χ² lead to the worst results, with χ² outperforming MI.</Paragraph> <Paragraph position="4"> Examining the precision and recall graphs in more detail, we find that for the AdjN data (Figure 1), log-likelihood and t-test lead to the best results, with log-likelihood giving an overall better result than the t-test. The picture differs slightly for the PNV data (Figure 2). Here the t-test outperforms log-likelihood, and even the precision gained by frequency is better than or at least comparable to log-likelihood. These pairings (log-likelihood and t-test for AdjN, t-test and frequency for PNV) are also visible in the recall curves (Figures 3 and 4).
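The stepwise evaluation behind the precision and recall graphs can be sketched as follows (an illustrative reimplementation under assumptions, not the original evaluation code): walk through the complete SL, best-ranked first, and record precision and recall at each percentage of the list, together with the random-selection baseline.

```python
def precision_recall_curves(sl_is_tp, steps=100):
    """sl_is_tp: TP flags of the significance list, best-ranked first.

    Returns the random-selection baseline and a list of
    (step, precision, recall) points; with steps=100, each step
    corresponds to one percent of the SL.
    """
    total = len(sl_is_tp)
    total_tps = sum(sl_is_tp)
    baseline = total_tps / total  # expected precision of random selection
    curves = []
    for s in range(1, steps + 1):
        n = max(1, round(total * s / steps))  # n-best cutoff for this step
        tps = sum(sl_is_tp[:n])
        curves.append((s, tps / n, tps / total_tps))
    return baseline, curves
```

By construction, the precision curve ends at the baseline and the recall curve ends at 100%, which is why the interesting differences between measures appear in the first part of the SLs.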
Moreover, for the PNV data the t-test leads to a recall of over 60% when approx.</Paragraph> <Paragraph position="5"> 20% of the SL has been considered.</Paragraph> <Paragraph position="6"> In the figures above, there are a number of positions on the x-axis where the precision and recall values of different measures are almost identical. This shows that a simple n-best approach will often produce misleading results. For instance, if we just look at the first part of the SLs for the PNV data, we might conclude that the t-test and frequency measures are equally well suited for the extraction of PNV collocations.</Paragraph> <Paragraph position="7"> However, the full curves in Figures 2 and 4 show that the t-test is consistently better than frequency.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Frequency Strata </SectionTitle> <Paragraph position="0"> While we have previously considered data from a broad frequency range (i.e., frequencies f ≥ 2 for AdjN and f ≥ 3 for PNV), we will now split up the candidate sets into high-frequency and low-frequency occurrences. This procedure allows us to assess the performance of AMs within different frequency strata. For instance, there is a widely held belief that MI and χ² are inferior to other measures because they overestimate the collocativity of low-frequency candidates (cf. the remarks on the χ² measure in (Dunning, 1993)).</Paragraph> <Paragraph position="1"> One might thus expect MI and χ² to yield much better results for higher frequencies.</Paragraph> <Paragraph position="2"> We have divided the AdjN data into two samples with f ≥ 5 (high frequencies) and 2 ≤ f < 5 (low frequencies), because the number of data in the base sample is quite small.
As there are enough PNV data, we used a higher threshold and selected samples with f ≥ 10 (high frequencies) and f = 3, ..., 9 (low frequencies).</Paragraph> <Paragraph position="3"> High Frequencies Considering our high-frequency AdjN data (Figure 5), we find that all precision curves decline as more of the data in the SLs is examined. Especially for MI, this is markedly different from the results obtained before. As the full curves show, log-likelihood is obviously the best measure. It is followed by t-test, χ², frequency and MI, in this order. Frequency and MI approximate each other when 50% of the data in the SLs are examined. In the remaining part of the lists, MI yields better results than frequency and is practically identical to the best-performing measures.</Paragraph> <Paragraph position="4"> Surprisingly, the precision curves of χ² and in particular MI increase over the first 60% of the SLs for high-frequency PNV data, whereas the curves for t-test, log-likelihood, and frequency have the usual downward slope (see Figure 6).</Paragraph> <Paragraph position="5"> Log-likelihood achieves precision values above 50% for the first 10% of the list, but is outperformed by the t-test afterwards. Looking at the first 40% of the data, there is a big gap between the good measures (t-test, log-likelihood, and frequency) and the weak measures (χ² and MI). In the second half of the data in the SLs, however, there is virtually no difference between MI, χ², and the other measures, with the exception of mere co-occurrence frequency.</Paragraph> <Paragraph position="6"> Summing up, the t-test (with a few exceptions around the first 5% of the data in the SLs) leads to the overall best precision results for high-frequency PNV data.
Log-likelihood is second best but achieves the best results for high-frequency AdjN data.</Paragraph> <Paragraph position="7"> Low Frequencies There is little difference between the AMs for low-frequency data, except for co-occurrence frequency, which leads to worse results than all other measures.</Paragraph> <Paragraph position="8"> For AdjN data, the AMs at best lead to an improvement of factor 3 compared to random selection (when only a small initial part of the SL is examined, log-likelihood achieves precision values above 30%). Log-likelihood is the overall best measure for identifying AdjN collocations, except for x-coordinates between 15% and 20%, where the t-test outperforms log-likelihood.</Paragraph> <Paragraph position="9"> For PNV data, the curves of all measures (except for frequency) are nearly identical. Their precision values are not significantly different (footnote 10) from the baseline obtained by random selection.</Paragraph> <Paragraph position="10"> In contrast to our expectation stated at the beginning of this section, the performance of MI and χ² relative to the other AMs is not better for high-frequency data than for low-frequency data.</Paragraph> <Paragraph position="11"> Instead, the poor performance observed in section 4.2 is explained by the considerably higher baseline precision of the high-frequency data (cf. Figures 5 to 8): unlike the n-best lists for &quot;frequency-sensitive&quot; measures such as log-likelihood, those of MI and χ² contain a large proportion of low-frequency candidates.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Hapaxlegomena and Double Occurrences </SectionTitle> <Paragraph position="0"> As the frequency distribution of word combinations in texts is characterised by a large number of rare events, low-frequency data are a serious challenge for AMs. One way to deal with low-frequency candidates is the introduction of cut-off thresholds.
This is a widely used strategy, and it is motivated by the fact that it is in general highly problematic to draw conclusions from low-frequency data with statistical methods (cf.</Paragraph> <Paragraph position="1"> Weeber et al. (2000) and Figure 8). A practical reason for cutting off low-frequency data is the need to reduce the amount of manual work when the complete data set has to be evaluated, which is a precondition for the exact calculation of recall and for plotting precision curves.</Paragraph> <Paragraph position="2"> The major drawback of an approach where all low-frequency candidates are excluded is that a large part of the data is lost for collocation extraction. In our data, for instance, 80% of the full set of PNV data and 58% of the AdjN data are hapaxes. Thus it is important to know how many (and which) true collocations there are among the excluded low-frequency candidates.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Statistical Estimation of TPs among Low-Frequency Data </SectionTitle> <Paragraph position="0"> In this section, we estimate the number of collocations in the data excluded from our experiments (i.e., AdjN pairs with f = 1 and PNV triples with f = 1, 2). Because of the large number of candidates in those sets (6 435 for AdjN, 279 880 for PNV), manual inspection of the entire data is impractical. (Footnote 10: According to the χ² test as described in section 6.)</Paragraph> <Paragraph position="1"> Therefore, we use random samples from the candidate sets to obtain estimates for the proportion p of true collocations among the low-frequency data. We randomly selected 965 items (15%) from the AdjN hapaxes, and 983 items (approx. 0.35%) from the low-frequency PNV triples.
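The exact one-tailed binomial test described below in this section can be sketched as follows. The function names are hypothetical, and the inputs (31 TPs out of 965 sampled AdjN hapaxes, 6 TPs out of 983 sampled PNV triples) are the sample results reported in the next paragraph.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if p <= 0:
        return 1.0
    if p >= 1:
        return 0.0 if k < n else 1.0
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log(1 - p)
        for i in range(k + 1)
    ]
    return sum(math.exp(t) for t in log_terms)

def upper_bound(k, n, alpha=0.01):
    """Smallest p0 with P(X <= k | p0) <= alpha, i.e., a one-tailed
    99% upper confidence bound on the TP proportion (for alpha=0.01)."""
    lo, hi = k / n, 1.0
    for _ in range(60):  # bisection: the tail probability decreases in p0
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) <= alpha:
            hi = mid
        else:
            lo = mid
    return hi

# AdjN hapax sample: 31 TPs of 965 (paper: p < 5% at 99% confidence).
print(upper_bound(31, 965))
# PNV low-frequency sample: 6 TPs of 983 (paper: p < 1.5% at 99% confidence).
print(upper_bound(6, 983))
```

The bound can be found by bisection because the lower-tail probability P(X ≤ k) decreases monotonically as the assumed proportion p0 grows.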
Manual examination of the samples yielded 31 TPs for AdjN (a proportion of 3.2%) and 6 TPs for PNV (0.6%).</Paragraph> <Paragraph position="2"> Considering the low proportion of collocations in the samples, we must expect highly skewed frequency distributions (where p is very small), which are problematic for standard statistical tests. In order to obtain reliable estimates, we have used an exact test based on the following model: assuming a proportion p of TPs in the full low-frequency data (AdjN or PNV), the number of TPs in a random sample of size n is described by a binomially distributed random variable X with parameter p.11 Consequently, the probability of finding k or fewer TPs in the sample is P(X ≤ k). We apply a one-tailed statistical test based on the probabilities P(X ≤ k) to our samples in order to obtain an upper estimate for the actual proportion of collocations among the low-frequency data: the estimate p < p0 is accepted at a given significance level α if P(X ≤ k), computed under the assumption p = p0, is below α. In the case of the AdjN data (k = 31, n = 965), we find p < 5% at a confidence level of 99%. Thus, there should be at most 320 TPs among the AdjN candidates with f = 1. Compared to the 737 TPs identified in the AdjN data with f ≥ 2, our decision to exclude the hapaxlegomena was well justified. The proportion of TPs in the PNV sample (k = 6,</Paragraph> <Paragraph position="4"> n = 983) was much lower, and we find that p < 1.5% at the same confidence level of 99%. However, due to the very large number of low-frequency candidates, there may be as many as 4 200 collocations in the PNV data with f = 1, 2, more than 4 times the number identified in our experiment.</Paragraph> <Paragraph position="5"> It is imaginable, then, that one of the AMs [footnote 11: To be precise, the binomial distribution is itself an approximation of the exact hypergeometric probabilities, cf. Pedersen (1996);
this approximation is sufficiently accurate as long as the sample size n is small compared to the size of the base set, i.e., the number of low-frequency candidates] might succeed in extracting a substantial number of collocations from the low-frequency PNV data. Figure 9 shows precision curves for the 10 000 highest ranked word combinations from each SL for PNV combinations with f = 1, 2 (the vertical lines correspond to n-best lists).</Paragraph> <Paragraph position="7"> In order to reduce the amount of manual work, the precision values for each AM are based on a 10% random sample from the 10 000 highest ranked candidates. We have applied the statistical test described above to obtain confidence intervals for the true precision values of the best-performing AM (frequency), given our 10% sample. The upper and lower bounds of the 95% confidence intervals are shown as thin lines. Even the highest precision estimates fall well below the 6.41% precision baseline of the PNV data with f ≥ 3. Again, we conclude that the exclusion of low-frequency candidates was well justified.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Significance Testing </SectionTitle> <Paragraph position="0"> We have assessed the significance of differences between AMs using the well-known χ² test as described in (Krenn, 2000).12 The thin lines in Figure 10 delimit 95% confidence intervals around the best-performing measure for the AdjN data with f ≥ 2 (log-likelihood).</Paragraph> <Paragraph position="1"> There is no significant difference between log-likelihood and t-test. Only for large n-best lists does frequency perform marginally significantly worse than log-likelihood.
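As a rough sketch of such a significance test (the actual procedure follows Krenn (2000) and may differ in detail), a standard 2x2 χ² test can compare the TP counts in the n-best lists of two measures:

```python
def chi_squared_2x2(tp_a, n_a, tp_b, n_b):
    """Chi-squared statistic for the 2x2 table (TPs vs. FPs) of two
    n-best lists; an illustrative stand-in for the test of Krenn (2000)."""
    fp_a, fp_b = n_a - tp_a, n_b - tp_b
    n = n_a + n_b
    # Shortcut formula: chi^2 = n * (ad - bc)^2 / (row and column products).
    det = tp_a * fp_b - tp_b * fp_a
    denom = n_a * n_b * (tp_a + tp_b) * (fp_a + fp_b)
    return n * det * det / denom

# 3.84 is the 95% critical value of the chi-squared distribution with 1 df,
# so this hypothetical comparison (50 vs. 30 TPs in 100-best lists) counts
# as a significant difference at the 5% level.
print(chi_squared_2x2(50, 100, 30, 100) > 3.84)
```

Identical TP proportions give a statistic of 0, so the test only flags measure pairs whose n-best precision values differ by more than random variation would explain.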
For the PNV data (not shown), the t-test is significantly better than log-likelihood, but the difference between frequency and the t-test is at best marginally significant.</Paragraph> <Paragraph position="2"> (Footnote 12: See (Krenn and Evert, 2001) for a short discussion of the applicability of this test.)</Paragraph> </Section> </Paper>