<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-3001">
  <Title>Extracting the Lowest-Frequency Words</Title>
  <Section position="2" start_page="0" end_page="308" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The research reported here arose from an attempt to determine the conditions under which optimal recall and precision are obtained for the extraction of terms related to side effects of drugs in medical abstracts. We used the standard technique of defining a window around a seed term, side-effect in our case, and selected as potentially relevant terms those words that appeared more often in these windows than expected under chance conditions.</Paragraph>
    <Paragraph position="1"> Our original question concerned the extent to which recall and precision are influenced by the size of the window. It turns out, however, that a preliminary question needs to be answered first, namely, how to gauge the significance of the large effect of the lowest-frequency words on recall, precision, and the number of words extracted as potentially relevant terms.</Paragraph>
    <Paragraph position="2">  Frequency distribution of medical expert word types. Panel (a) shows the number of side-effect-related word types as judged by a medical expert (Nexpert) as a function of the first 23 frequency classes. Panel (b) shows the proportion of expert types/total corpus types (Ntotal) for the first 23 frequency classes. The horizontal dashed line indicates the mean proportion of 0.0619.</Paragraph>
    <Paragraph position="3"> It is common practice in information retrieval to discard the lowest-frequency words a priori as nonsignificant (Rijsbergen 1979). In Smadja's collocation algorithm Xtract, the lowest-frequency words are effectively discarded as well (Smadja 1993). Church and Hanks (1990) use mutual information to identify collocations, a method they claim is reasonably effective for words with a frequency of not less than five. A frequency threshold of five seems quite low. Unfortunately, even this lower frequency threshold of five is too high for the extraction of side-effect-related terms from our medical abstracts. To see this, consider the left panel of Figure 1, which plots the number of side-effect-related words in our corpus of abstracts as judged by a medical expert, as a function of word-frequency class. The side-effect-related words with a frequency of less than five account for 295 of a total of 432 expert words (68.3%). The right panel of Figure 1 shows that the first 23 word-frequency classes are characterized by, on average, the same proportion of side-effect-related words. The a priori assumption of Rijsbergen (1979) that the lowest-frequency words are nonsignificant is not warranted for our data, and, we suspect not for many other data sets as well.</Paragraph>
    <Paragraph position="4"> The recent literature has seen some discussion of the appropriate statistical methods for analyzing the contingency tables that contain the counts of how a word is distributed inside and outside the windows around a seed term. Dunning (1993) has called attention to the log-likelihood ratio, G 2, as appropriate for the analysis of such contingency tables, especially when such contingency tables concern very low frequency words. Pedersen (1996) and Pedersen, Kayaalp, and Bruce (1996) follow up Dunning's suggestion that Fisher's exact test might be even more appropriate for such contingency tables.</Paragraph>
    <Paragraph position="5"> We have therefore investigated for the full range of word frequencies whether there is an optimal window size with respect to recall and the number of significant words extracted using both the log-likelihood ratio and Fisher's exact test. In Section 2, we will show that indeed there seems to be an optimal window size for both statistical tests. However, a recurrent pattern of local optima calls this conclusion into question. Upon closer inspection, this recurrent pattern appears at fixed ratios of the number of words inside the window to the number of words outside the window (complement).</Paragraph>
    <Paragraph position="6">  Weeber, Vos, and Baayen Extracting the Lowest-Frequency Words In Section 3, we will relate the recurrent patterns of local optima at fixed window-complement ratios (henceforth W/C-ratios) to the distributions of the lowest-frequency words over window and complement. We will call attention to the critical effect of the choice of W/C-ratios on the significance of the lowest-frequency words.</Paragraph>
    <Paragraph position="7"> As the improvement in the extraction of side-effect terms from medical abstracts, as gauged by the F-measure, which combines recall and precision (Rijsbergen 1979), is small, we also applied the same approach to the extraction of Dutch verb-particle combinations from a newspaper corpus. In Section 4, we report substantially better results for this more lexical extraction task, which is subject to the same statistical behavior of the lowest-frequency words.</Paragraph>
    <Paragraph position="8"> In the last section, we will discuss the consequences of our findings for the optimization of word-based extraction systems and collocation research with respect to the lowest-frequency words.</Paragraph>
    <Paragraph position="9"> 2. An Optimal Window Size for Medical Abstracts? The MEDLINE bibliographic database contains a large number of abstracts of scientific journal papers discussing medical and drug-related research. Typically, abstracts discussing medical drugs mention the side effects of these drugs briefly. Information on side effects is potentially relevant for finding new applications for existing drugs (Rikken and Vos 1995). We are therefore interested in any terms related to the side effects of drugs.</Paragraph>
    <Paragraph position="10"> Before proceeding, it may be useful to clarify the way in which the present research differs from standard research on collocations. In the latter kind of research, there is no a priori knowledge of which combinations of words are true collocations. Moreover, the most salient collocations generally are found at the top of a list ranked according to measures for surprise or association, such as G 2 or mutual information (Manning and Sch~itze 1999). The large numbers of word combinations with significant but low values for these measures are often of less interest. Low-frequency words are predominant among these kinds of collocations. In our research, we likewise find many low-frequency terms for side effects with low ranks in medical abstracts. The relatively well-known side effects that are mentioned frequently can be captured by examining the top ranks in the lists of extracted words. At the same time, the rarely mentioned side-effect terms are no less important, and in post marketing surveillance the extraction of such side-effect terms may be crucial for the acceptance or rejection of new medicines.</Paragraph>
    <Paragraph position="11"> Is reliable automatic extraction of both low- and high-frequency side-effect terms from MEDLINE abstracts feasible? To answer this question, we explored the efficacy of a standard collocation-based term extraction method that extracts those words that appear more frequently in the immediate neighborhood of a given seed term than might be expected under chance conditions.</Paragraph>
    <Paragraph position="12"> We compiled two corpora on the side effects of the cardiovascular drugs captopril and enalapril from MEDLINE abstracts. The first corpus contains all abstracts mentioning captopril and the word side. The second corpus contains all abstracts mentioning captopril and at least one of the compounds side-effect, side effect, side-effects, and side effects. Thus, the second corpus is a subset of the first. The first corpus is comprised of 118,675 tokens and 7,678 types; the second corpus 103,603 tokens and 6,582 types. A medical expert marked 432 of the latter word types as side-effect-related terms. The left panel of Figure 1 summarizes the head of the frequency distribution of these terms in the larger corpus. Note that most side-effect-related terms have a frequency lower  Computational Linguistics Volume 26, Number 3 Table 1 General 2x2 contingency table. A = frequency of the target in the window corpus, B = frequency of the target in the complement corpus, W = total number of words in the window, C = total number of words in the complement. Corpus size N = W + C.</Paragraph>
    <Paragraph position="13"> window complement frequency of target A B sum frequency of other words W - A C - B</Paragraph>
    <Paragraph position="15"> than five. What we need, then, is an extraction method that is sensitive enough to select such very low frequency terms.</Paragraph>
    <Paragraph position="16"> In the collocation-based method studied here, the neighborhood of a given seed term is defined in terms of a window around the seed term. We constructed windows around all seed terms in the corpus, leading to a window corpus and a complement corpus. The window corpus contains all words that appear within a given window size of the seed term. For instance, with a window size of 10, any word appearing from five words before the seed to five words after the seed as well as the seed itself is included in the window corpus. The word tokens not in the window corpus comprise the complement corpus. Any type in the window corpus is a potential side-effect-related term. For any such target type, we tabulate its distribution in window and complement corpora in a contingency table like Table 1.</Paragraph>
    <Paragraph position="17"> Given W and C, we need to know whether the frequency of the target in the window corpus, A, is high enough to warrant extraction. Typically, given the marginal B and distribution of the contingency table, a target is extracted for which wA--~A &gt; ~-2-~, for which the tabulated distribution is nonhomogeneous according to tests such as G 2 and Fisher's exact test for a given cMevel.</Paragraph>
    <Paragraph position="18"> In this approach, the window size is a crucial variable. At small window sizes, many potentially relevant terms fail to appear in the window corpus. However, at large window sizes, many irrelevant words are found in the window corpus and may be extracted spuriously.</Paragraph>
    <Paragraph position="19"> To see to what extent window size may affect the results of the extraction procedure, consider the solid lines in panels (a) and (b) of Figure 2. The left panel shows the results for recall when we use the log-likelihood ratio, G 2, the right panel the results for Fisher's exact test. We define recall as the proportion of the number of side-effect words extracted and the total number of side-effect words available in the window.</Paragraph>
    <Paragraph position="20"> For both statistical tests, recall seems to be optimal at window size 2. However, at this window size, the number of words extracted is very small. This can be seen in panels (c) and (d). Considered jointly, panels (a) and (c) suggest an optimal window size of 24 for our larger corpus (corpus 1), as recall is still high, and the number of significant words is maximal. When Fisher's exact test is used instead of G 2, panels (b) and (d) suggest 42 as the optimal size.</Paragraph>
    <Paragraph position="21"> The dashed lines in panels (a) to (d) show the corresponding results for our smaller corpus (corpus 2). Unsurprisingly, the general pattern for this subcorpus is quite similar, although the drops in recall and the number of significant words, Nsig, occur at somewhat smaller window sizes.</Paragraph>
    <Paragraph position="22"> Interestingly, we can synchronize the curves for both corpora by plotting recall and the number of significant items, Nsig, against the window-complement ratio (W/C).</Paragraph>
    <Paragraph position="23"> This is shown in panels (e) and (f). These panels suggest not an optimal window size  Weeber, Vos, and Baayen Extracting the Lowest-Frequency Words  size. Panel (b) shows recall values for Fisher's exact test. Panel (c) shows the total number of significant words (Nsig) as a function of the window size for G 2. Panel (d) shows the same as (c) but for Fisher's exact test. Panel (e), G 2, and (f), Fisher's exact test, also show the total number of significant words, but as a function of the W/C-ratio; the ratio of the number of words in the window corpus to the number of words in the complement corpus.</Paragraph>
    <Paragraph position="24"> but an optimal W/C-ratio (0.17 for G 2 and 0.29 for Fisher's exact test). Although we now seem to have shown that recall and Nsig depend on the choice of window size, the sudden drops in recall and Nsig and the reoccurrence of such drops at various W/C-ratios is a source of worry, not only for G 2 results, but also for the results based  on Fisher's exact test. A further source of worry is the fact that the two tests diverge considerably with respect to the optimal W/C-ratio.</Paragraph>
    <Paragraph position="25"> 3. Contingency Tables and the Lowest-Frequency Words  Before we can have any confidence in the optimality of a given W/C-ratio, we should understand why the saw-tooth-shaped patterns of Nsig arise. Both the log-likelihood ratio (G 2) and Fisher's exact test compute the significance of contingency tables similar to Table 1. So why is it that the left panels in Figure 2 differ from the right panels? G 2 has a 2-distribution as N --* cx~. This convergence is not guaranteed for low expected frequencies and sparse tables, which renders use of G 2 problematic for our lowest-frequency words in that it may suggest words to be more remarkable than they  Computational Linguistics Volume 26, Number 3 Table 2 Contingency tables for hapax legomena, dis legomena, and tris legomena.</Paragraph>
    <Paragraph position="26"> W = number of words in window corpus; C = number of words in complement corpus. Total corpus size: N = W + C.</Paragraph>
    <Paragraph position="27"> (a): 1 0 (b): 2 0 (c): 1 1</Paragraph>
    <Paragraph position="29"> really are. Fisher's exact test, on the other hand, does not use an approximation to a probability distribution but computes the exact hypergeometric distribution given the marginal totals of the contingency table. While Fisher's exact test is suitable for the analysis of sparse tables, it is inherently conservative because it regards the marginal totals not as stochastic variables but as fixed boundary conditions. Consequently, this test is likely to reject words that are in fact remarkably distributed in the contingency table. The difference in behavior of the two tests is clearly visible in panels (c) and (d) of Figure 2: the number of significant words (Nsig) according to G 2 is roughly twice as large as that according to Fisher's exact test.</Paragraph>
    <Paragraph position="30"> When a hapax legomenon 1, a word with frequency 1, occurs in the window corpus, we use contingency table (a) as shown in Table 2. For dis legomena, words with a frequency of 2, that appear at least once in the window corpus, we obtain the two contingency tables (b) and (c). The interesting contingency tables for tris legomena are tables (d) to (f). These six tables are relevant for 63.8% of the side-effect-related terms as judged by our medical expert.</Paragraph>
    <Paragraph position="31"> How do changes in the W/C-ratio affect G 2 and Fisher's exact test, when applied to contingency tables (a) to (f)? In other words, how does the choice of the window size affect whether a low-frequency word is judged to be a significant term, for fixed A and B (e.g., A = 1 and B = 0 for a hapax legomenon)? First, consider contingency tables with B = 0, for instance tables (a), (b), and (d).</Paragraph>
    <Paragraph position="32"> For small A, (A ~ W, C), it is easily seen (see the appendix) that the critical W/C-ratio based on the log-likelihood ratio is:</Paragraph>
    <Paragraph position="34"> with X the X 2 value corresponding to a given s-level with 1 degree of freedom. For A = 1 and c~ -- 0.05, X = 3.84, the critical W/C-ratio equals 0.1718. This is exactly the W/C-ratio in panel (e) in Figure 2 at which the first and largest drop in the number of significant words occurs. Up to this ratio, any hapax legomenon appearing in the window corpus is judged to be a significant term. For W/C &gt; 0.1718, no hapax legomenon will be extracted.</Paragraph>
    <Paragraph position="35"> Fisher's exact test is far more conservative. For this test, the critical W/C-ratio is 1 The term hapax legomenon (literally 'read once') goes back to classical studies and was originally used to refer to the words used once only in the works of a given author, e.g., Homer. By analogy, dis legomenon and tris legomenon have come into use to refer to words occurring only twice or three times.  Weeber, Vos, and Baayen Extracting the Lowest-Frequency Words Table 3 Critical W/C-ratios where sparse and skewed contingency tables lose significance. Equations 1 and 2 provide the ratios for the B = 0 cases. The other ratios are obtained by simulations.</Paragraph>
    <Paragraph position="36"> distribution G 2 Fisher A-B</Paragraph>
    <Paragraph position="38"> (see the appendix for details):</Paragraph>
    <Paragraph position="40"> where P is the s-level. For A -- 1 and P = 0.05, the critical W/C-ratio for a hapax legomenon equals 0.0526. In panel (f) of Figure 2, we observe the first drop in the number of significant words at precisely this W/C-ratio. For very small W/C-ratios, any hapax legomenon in the window corpus is also judged to be significant according to Fisher's exact test. Compared to G 2, Fisher's exact test rejects hapax legomena as significant at much smaller W/C-ratios. Note that when W/C -- 0.05/0.95 = 0.0526, i.e., when the window corpus is exactly 1/20 of the total corpus, the probability that a hapax legomenon appears in the window corpus equals 0.05. Our conclusion is that, with the W/C-ratio as the only determinant of significance, the windowing method is not powerful enough to distinguish between relevant and irrelevant hapax legomena.</Paragraph>
    <Paragraph position="41"> In other words, hapax legomena should be removed from consideration a priori.</Paragraph>
    <Paragraph position="42"> For dis legomena that appear exclusively in the window corpus, the critical ratios are 0.6204 for G 2, corresponding to the second major drop in panel (e) of Figure 2, and 0.2880 for Fisher's exact test, corresponding to the severe drop following the maximum of Nsig in panel (f). The third major drop in this panel corresponds to the critical W/C-ratio for tris legomena occurring three times in the window corpus.</Paragraph>
    <Paragraph position="43"> For contingency tables with B &gt; 0; A &gt; B; A, B &lt;~ W, C, critical W/C-ratios are not easy to capture analytically. We therefore carried out a simulation study for W + C = 100,000. For fixed A and B and a given s-level, we calculated the critical W/C-ratio by iterative approximation. Results are summarized in Table 3.</Paragraph>
    <Paragraph position="44"> When we highlight these critical ratios in Figure 2 by means of vertical dashed lines, we obtain Figure 3. Panels (a) to (d) correspond to the curves for corpus 2 in the first four panels of Figure 2. For the log-likelihood ratio, we observe that both the major and minor drops in recall and the number of significant words (Nsig) occur at the W/Cratios where different distributions of the lowest-frequency words lose significance. For Fisher's exact test, we observe exactly the same pattern. Panels (e) and (f) show the number of significant words for a pseudorandomized version of corpus 2 where we used the same tokens but randomized the order of their appearance. Although the number of significant words is lower, the saw-tooth-shaped pattern with the sudden drops at fixed ratios reemerges.</Paragraph>
    <Paragraph position="45"> We conclude that W and C are the prime determinants of both recall and the number of significant words. At first sight, Fisher's test is clearly preferable to the  Results of word extraction procedure (a = 0.05) with A-B distributions. Panels (a), log-likelihood ratio, G 2, and (b), Fisher's exact test, show the recall results of the extraction procedure for corpus 2. Panels (c) and (d) show the total number of significant words (Nsig), again for G 2 and Fisher's exact test, respectively (see also Figure 2). Panels (e) and (f) show the results for a randomized corpus for G 2 and Fisher's exact test. The numbers above the panels indicate the A-B distribution of the contingency tables in Table 2.</Paragraph>
    <Paragraph position="46"> log-likelihood ratio because the extreme saw-tooth-shaped pattern is substantially reduced. However, the use of Fisher's exact test does not eliminate the effect of the choice of window and complement size on the number of significant words and recall. At specific W/C-ratios, nonnegligible numbers of words with the lowest frequency of occurrence suddenly lose significance. Moreover, in our discussion thus far, we have not taken extraction precision into account nor the trade-off between precision and recall.</Paragraph>
    <Paragraph position="47"> For the assessment of overall extraction results, we turn to the F-measure (Rijsbergen 1979), a measure that assigns equal weights to precision (P) and recall (R):</Paragraph>
    <Paragraph position="49"> Figure 4 plots precision, recall, and F as a function of the W/C-ratio. The common trade-off between recall and precision is clearly present for the smaller window sizes, with the F-measure providing a kind of average.</Paragraph>
    <Paragraph position="50"> Thus far, we have applied a common collocation extraction technique to a semantic association task. Actual extraction performance is low: F is maximally 0.17. To gauge  Weeber, Vos, and Baayen Extracting the Lowest-Frequency Words</Paragraph>
    <Paragraph position="52"> F, recall, and precision as a function of the W/C-ratio. Recall (R, dashed line), F (solid line), and precision (P, dotted line) using G 2 (left panel) and Fisher's exact test (right panel) for our second corpus plotted as a function of the W/C-ratio.</Paragraph>
    <Paragraph position="53"> whether better results can be obtained with the present techniques, we examined the extraction of Dutch verb-particle combinations.</Paragraph>
  </Section>
class="xml-element"></Paper>