<?xml version="1.0" standalone="yes"?> <Paper uid="J90-1003"> <Title>X and Y Separation Relation Word x Word y Mean Variance</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 PRACTICAL APPLICATIONS </SectionTitle> <Paragraph position="0"> The proposed statistical description has a large number of potentially important applications, including: (a) constraining the language model both for speech recognition and optical character recognition (OCR), (b) providing disambiguation cues for parsing highly ambiguous syntactic structures such as noun compounds, conjunctions, and prepositional phrases, (c) retrieving texts from large databases (e.g. newspapers, patents), (d) enhancing the productivity of computational linguists in compiling lexicons of lexicosynWctic facts, and (e) enhancing the productivity of lexicographers in identifying normal and conventional usage.</Paragraph> <Paragraph position="1"> Consider the optical character recognizer (OCR) application. Suppose that we have an OCR device as in Kahan et al. (1987), and it has assigned about equal probability to having recognized farm and form, where the context is either: (1) federal credit or (2) some of.</Paragraph> <Paragraph position="2"> The proposed association measure can make use of the fact that farm is much more likely in the first context and form is much more likely in the second to resolve the ambiguity.</Paragraph> <Paragraph position="3"> Note that alternative disambiguation methods based on syntactic constraints such as part of speech are unlikely to help in this case since both form and farm are commonly used as nouns.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 WORD ASSOCIATION AND PSYCHOLINGUISTICS </SectionTitle> <Paragraph position="0"> Word association norms are well known to be an important factor in psycholinguistic research, especially in the area of lexical retrieval. Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.</Paragraph> <Paragraph position="1"> Some results and implications are summarized from reaction-time experiments in which subjects either (a) classified successive strings of letters as words and nonwords, or (b) pronounced the strings. Both types of response to words (e.g. BUTTER) were consistently faster when preceded by associated words (e.g. BREAD) rather than unassociated words (e.g. NURSE) (Meyer et al. 1975, p. 98) Much of this psycholinguistic research is based on empirical estimates of word association norms as in Palermo and Jenkins (1964), perhaps the most influential study of its kind, though extremely small and somewhat dated. This study measured 200 words by asking a few thousand subjects to write down a word after each of the 200 words to be measured. Results are reported in tabular form, indicating which words were written down, and by how many subjects, factored by grade level and sex. The word doctor, for example, is reported on pp. 98-100 to be most often associated with nurse, followed by sick, health, medicine, hospital, man, sickness, lawyer, and about 70 more words.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 AN INFORMATION THEORETIC MEASURE </SectionTitle> <Paragraph position="0"> We propose an alternative measure, the association ratio, for measuring word association norms, based on the information theoretic concept of mutual information. 
The proposed measure is more objective and less costly than the subjective method employed in Palermo and Jenkins (1964).</Paragraph> <Paragraph position="1"> The association ratio can be scaled up to provide robust estimates of word association norms for a large portion of the language. Using the association ratio measure, the five words most associated with doctor are, in order: dentists, nurses, treating, treat, and hospitals.</Paragraph> <Paragraph position="2"> What is &quot;mutual information&quot;? According to Fano (1961), if two points (words), x and y, have probabilities P(x) and P(y), then their mutual information, I(x,y), is defined to be</Paragraph> <Paragraph position="3"> I(x,y) = log2 [ P(x,y) / ( P(x) P(y) ) ]</Paragraph> <Paragraph position="4"> Informally, mutual information compares the probability of observing x and y together (the joint probability) with the probabilities of observing x and y independently (chance). If there is a genuine association between x and y, then the joint probability P(x,y) will be much larger than chance P(x) P(y), and consequently I(x,y) >> 0. If there is no interesting relationship between x and y, then P(x,y) ~ P(x) P(y), and thus I(x,y) ~ 0. If x and y are in complementary distribution, then P(x,y) will be much less than P(x) P(y), and I(x,y) << 0.</Paragraph> <Paragraph position="6"> In our application, word probabilities P(x) and P(y) are estimated by counting the number of observations of x and y in a corpus, f(x) and f(y), and normalizing by N, the size of the corpus. (Our examples use a number of different corpora with different sizes: 15 million words for the 1987 AP corpus, 36 million words for the 1988 AP corpus, and 8.6 million tokens for the tagged corpus.) Joint probabilities, P(x,y), are estimated by counting the number of times that x is followed by y in a window of w words, fw(x,y), and normalizing by N.</Paragraph> <Paragraph position="7"> The window size parameter allows us to look at different scales. Smaller window sizes will identify fixed expressions (idioms such as bread and butter) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales.</Paragraph> <Paragraph position="8"> Table 1 may help show the contrast. 2 In fixed expressions, such as bread and butter and drink and drive, the words of interest are separated by a fixed number of words and there is very little variance. In the 1988 AP corpus, it was found that the two words are always exactly two words apart whenever they are found near each other (within five words). That is, the mean separation is two, and the variance is zero.</Paragraph> <Paragraph position="9"> Compounds also have very fixed word order (little variance), but the average separation is closer to one word rather than two. In contrast, relations such as man/woman are less fixed, as indicated by a larger variance in their separation. (The nearly zero value for the mean separation for man/woman indicates the words appear about equally often in either order.) Lexical relations come in several varieties. There are some like refraining from that are fairly fixed, others such as coming from that may be separated by an argument, and still others like keeping from that are almost certain to be separated by an argument. The ideal window size is different in each case.
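To make the counting and normalization just described concrete, the following sketch computes association ratios from raw counts. It is a hypothetical illustration, not the authors' original code: the function name, the whitespace-token input, and the exact window convention (y drawn from the w - 1 words following x) are assumptions; the window size and the small-count cutoff are the parameters discussed in the text that follows.

import math
from collections import Counter

def association_ratio(tokens, w=5, min_count=5):
    # Association ratio: I(x, y) = log2( P(x, y) / (P(x) P(y)) ),
    # where P(x, y) is estimated from fw(x, y), the number of times
    # x is followed by y within a window of w words.
    N = len(tokens)
    f = Counter(tokens)            # f(x): word frequencies
    fw = Counter()                 # fw(x, y): x precedes y in the window
    for i, x in enumerate(tokens):
        for y in tokens[i + 1:i + w]:
            fw[(x, y)] += 1
    scores = {}
    for (x, y), c in fw.items():
        if c < min_count:          # unstable when counts are very small
            continue
        scores[(x, y)] = math.log2((c / N) / ((f[x] / N) * (f[y] / N)))
    return scores

Note that fw(x, y) as counted here encodes linear precedence, so the scores are not symmetric, a point taken up below.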
For the remainder of this paper, the window size, w, will be set to five words as a compromise; this setting is large enough to show some of the constraints between verbs and arguments, but not so large that it would wash out constraints that make use of strict adjacency. 3 Since the association ratio becomes unstable when the counts are very small, we will not discuss word pairs with f(x,y) ≤ 5. An improvement would make use of t-scores, and throw out pairs that were not significant. Unfortunately, this requires an estimate of the variance of f(x,y), which goes beyond the scope of this paper. For the remainder of this paper, we will adopt this simple but arbitrary threshold, and ignore pairs with small counts.</Paragraph> <Paragraph position="10"> Technically, the association ratio is different from mutual information in two respects. First, joint probabilities are supposed to be symmetric: P(x,y) = P(y,x), and thus, mutual information is also symmetric: I(x,y) = I(y,x). However, the association ratio is not symmetric, since f(x,y) encodes linear precedence. (Recall that f(x,y) denotes the number of times that word x appears before y in the window of w words, not the number of times the two words appear in either order.) Although we could fix this problem by redefining f(x,y) to be symmetric (by averaging the matrix with its transpose), we have decided not to do so, since order information appears to be very interesting. Notice the asymmetry in the pairs in Table 2 (computed from 44 million words of 1988 AP text), illustrating a wide variety of biases ranging from sexism to syntax.</Paragraph> <Paragraph position="11"> Second, one might expect f(x,y) ≤ f(x) and f(x,y) ≤ f(y), but the way we have been counting, this needn't be the case if x and y happen to appear several times in the window. For example, given the sentence, &quot;Library workers were prohibited from saving books from this heap of ruins,&quot; which appeared in an AP story on April 1, 1988, f(prohibited) = 1 and f(prohibited, from) = 2. This problem can be fixed by dividing f(x,y) by w - 1 (which has the consequence of subtracting log2(w - 1) = 2 from our association ratio scores). This adjustment has the additional benefit of assuring that Σ f(x,y) = Σ f(x) = Σ f(y) = N.</Paragraph> <Paragraph position="13">
Table 2. Asymmetry (1988 AP corpus).
Word x     Word y    f(x,y)   f(y,x)
doctors    nurses        99       10
man        woman        256       56
doctors    lawyers       29       19
bread      butter        15        1
save       life         129       11
save       money        187       11
save       from         176       18
supposed   to          1188       25
</Paragraph> <Paragraph position="14"> When I(x,y) is large, the association ratio produces very credible results not unlike those reported in Palermo and Jenkins (1964), as illustrated in Table 3. In contrast, when I(x,y) ~ 0, the pairs are less interesting. (As a very rough rule of thumb, we have observed that pairs with I(x,y) > 3 tend to be interesting, and pairs with smaller I(x,y) are generally not. One can make this statement precise by calibrating the measure with subjective measures. Alternatively, one could make estimates of the variance and then make statements about confidence levels, e.g. with 95% confidence, P(x,y) > P(x) P(y).) If I(x,y) << 0, we would predict that x and y are in complementary distribution. However, we are rarely able to observe I(x,y) << 0 because our corpora are too small (and our measurement techniques are too crude). Suppose, for example, that both x and y appear about 10 times per million words of text. Then, P(x) = P(y) = 10^-5 and chance is P(x) P(y) = 10^-10.
Thus, to say that I(x,y) is much less than 0, we need to say that P(x,y) is much less than 10^-10, a statement that is hard to make with much confidence given the size of presently available corpora. In fact, we cannot (easily) observe a probability less than 1/N ~ 10^-7, and therefore it is hard to know if I(x,y) is much less than chance or not, unless chance is very large. (In fact, the pair a ... doctors in Table 3 appears significantly less often than chance. But to justify this statement, we need to compensate for the window size (which shifts the score downward by 2.0, e.g. from 0.96 down to -1.04), and we need to estimate the standard deviation, using a method such as Good (1953).) 4</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 LEXICO-SYNTACTIC REGULARITIES </SectionTitle> <Paragraph position="0"> Although the psycholinguistic literature documents the significance of noun/noun word associations such as doctor/nurse in considerable detail, relatively little is said about associations among verbs, function words, adjectives, and other non-nouns. In addition to identifying semantic relations of the doctor/nurse variety, we believe the association ratio can also be used to search for interesting lexico-syntactic relationships between verbs and typical arguments/adjuncts. The proposed association ratio can be viewed as a formalization of Sinclair's argument: How common are the phrasal verbs with set? Set is particularly rich in making combinations with words like about, in, up, out, on, off, and these words are themselves very common. How likely is set off to occur? Both are frequent words [set occurs approximately 250 times in a million words and off occurs approximately 556 times in a million words] ... [T]he question we are asking can be roughly rephrased as follows: how likely is off to occur immediately after set? ... This is 0.00025 x 0.00055 [P(x) P(y)], which gives us the tiny figure of 0.0000001375 ... The assumption behind this calculation is that the words are distributed at random in a text [at chance, in our terminology]. It is obvious to a linguist that this is not so, and a rough measure of how much set and off attract each other is to compare the probability with what actually happens ... Set off occurs nearly 70 times in the 7.3 million word corpus [P(x,y) = 70/(7.3 x 10^6) >> P(x) P(y)]. That is enough to show its main patterning and it suggests that in currently-held corpora there will be found sufficient evidence for the description of a substantial collection of phrases ...</Paragraph> <Paragraph position="1"> (Sinclair 1987c, pp. 151-152).</Paragraph> <Paragraph position="2"> Using Sinclair's estimates P(set) ~ 250 x 10^-6, P(off) ~ 556 x 10^-6, and P(set, off) ~ 70/(7.3 x 10^6), we would estimate the mutual information to be I(set; off) = log2 [ P(set, off) / ( P(set) P(off) ) ] ~ 6.1.</Paragraph> <Paragraph position="4"> Using the AP corpus estimates, we would compute the mutual information to be I(set; off) ~ 6.2.</Paragraph> <Paragraph position="5"> In this example, at least, the values seem to be fairly comparable across corpora. In other examples, we will see some differences due to sampling. Sinclair's corpus is a fairly balanced sample of (mainly British) text; the AP corpus is an unbalanced sample of American journalese.</Paragraph> <Paragraph position="6"> This association between set and off is relatively strong; the joint probability is more than 2^6 = 64 times larger than chance.
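A quick check of the arithmetic behind these figures, using only the probabilities quoted from Sinclair's passage (a sketch; the variable names are ours):

import math

p_set = 250e-6            # P(set): roughly 250 occurrences per million words
p_off = 556e-6            # P(off): roughly 556 occurrences per million words
p_set_off = 70 / 7.3e6    # P(set, off): about 70 occurrences in 7.3 million words

print(round(math.log2(p_set_off / (p_set * p_off)), 1))   # -> 6.1

The joint probability is about 2^6.1, roughly 69 times chance, which is why the association is described as more than 2^6 = 64 times larger than chance.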
The other particles that Sinclair mentions have association ratios that can be seen in Table 4.</Paragraph> <Paragraph position="7"> The first three, set up, set off, and set out, are clearly associated; the last three are not so clear. As Sinclair suggests, the approach is well suited for identifying the phrasal verbs, at least in certain cases.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 PREPROCESSING WITH A PART OF SPEECH TAGGER </SectionTitle> <Paragraph position="0"> Phrasal verbs involving the preposition to raise an interesting problem because of the possible confusion with the infinitive marker to. We have found that if we first tag every word in the corpus with a part of speech using a method such as Church (1988), and then measure associations between tagged words, we can identify interesting contrasts between verbs associated with a following preposition to/in and verbs associated with a following infinitive marker to/to. (Part of speech notation is borrowed from Francis and Kucera (1982); in = preposition; to = infinitive marker; vb = bare verb; vbg = verb + ing; vbd = verb + ed; vbz = verb + s; vbn = verb + en.) The association ratio identifies quite a number of verbs associated in an interesting way with to; restricting our attention to pairs with a score of 3.0 or more, there are 768 verbs associated with the preposition to/in and 551 verbs with the infinitive marker to/to. The ten verbs found to be most associated before to/in are:</Paragraph> <Paragraph position="2"> Thus, we see there is considerable leverage to be gained by preprocessing the corpus and manipulating the inventory of tokens.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 PREPROCESSING WITH A PARSER </SectionTitle> <Paragraph position="0"> Hindle (Church et al. 1989) has found it helpful to preprocess the input with the Fidditch parser (Hindle 1983a, 1983b) to identify associations between verbs and arguments, and postulate semantic classes for nouns on this basis. Hindle's method is able to find some very interesting associations, as Tables 5 and 6 demonstrate.</Paragraph> <Paragraph position="1"> After running his parser over the 1988 AP corpus (44 million words), Hindle found N = 4,112,943 subject/verb/object (SVO) triples. The mutual information between a verb and its object was computed from these 4 million triples by counting how often the verb and its object were found in the same triple and dividing by chance. Thus, for example, disconnect/V and telephone/O have a joint probability of 7/N. In this case, chance is 84/N x 481/N because there are 84 SVO triples with the verb disconnect, and 481 SVO triples with the object telephone. The mutual information is log2 7N/(84 x 481) = 9.48. Similarly, the mutual information for drink/V ... beer/O is 9.9 = log2 29N/(660 x 195). (drink/V and beer/O are found in 660 and 195 of these triples, respectively.)</Paragraph> <Paragraph position="2"> This application of Hindle's parser illustrates a second example of preprocessing the input to highlight certain constraints of interest. For measuring syntactic constraints, it may be useful to include some part of speech information and to exclude much of the internal structure of noun phrases.
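As a minimal sketch of the triple-based computation described just above (hypothetical variable names; the counts are those reported for the 1988 AP parse):

import math

N = 4_112_943        # subject/verb/object triples found by the parser
f_vo = 7             # triples with verb = disconnect and object = telephone
f_v = 84             # triples with verb = disconnect
f_o = 481            # triples with object = telephone

mi = math.log2((f_vo / N) / ((f_v / N) * (f_o / N)))   # = log2( 7N / (84 * 481) )
print(round(mi, 2))  # -> 9.48, the disconnect/telephone value quoted above

Substituting 29, 660, and 195 reproduces the drink/beer figure of 9.9.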
For other purposes, it may be helpful to tag items and/or phrases with semantic labels such as *person*, *place*, *time*, *body part*, *bad*, and so on.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 APPLICATIONS IN LEXICOGRAPHY </SectionTitle> <Paragraph position="0"> Large machine-readable corpora are only just now becoming available to lexicographers. Up to now, lexicographers have been reliant either on citations collected by human readers, which introduced an element of selectivity and so, inevitably, distortion (rare words and uses were collected but common uses of common words were not), or on small corpora of only a million words or so, which are reliably informative for only the most common uses of the few most frequent words of English. (A million-word corpus such as the Brown Corpus is reliable, roughly, for only some uses of only some of the forms of around 4000 dictionary entries.</Paragraph> <Paragraph position="1"> But standard dictionaries typically contain twenty times this number of entries.) The computational tools available for studying machine-readable corpora are at present still rather primitive. These are concordancing programs (see Figure 1), which are basically KWIC (key word in context; Aho et al. 1988) indexes with additional features such as the ability to extend the context, sort leftward as well as rightward, and so on. There is very little interactive software. In a typical situation in the lexicography of the 1980s, a lexicographer is given the concordances for a word, marks up the printout with colored pens to identify the salient senses, and then writes syntactic descriptions and definitions.</Paragraph> <Paragraph position="2"> Although this technology is a great improvement on using human readers to collect boxes of citation index cards (the method Murray used in constructing The Oxford English Dictionary a century ago), it works well only if there are no more than a few dozen concordance lines for a word, and only two or three main sense divisions. In analyzing a complex word such as take, save, or from, the lexicographer is trying to pick out significant patterns and subtle distinctions that are buried in literally thousands of concordance lines: pages and pages of computer printout. The unaided human mind simply cannot discover all the significant patterns, let alone group them and rank them in order of importance.</Paragraph> <Paragraph position="3"> Figure 1. Short Sample of the Concordance to &quot;save&quot; from the AP 1987 Corpus.</Paragraph>
<Paragraph position="9"> The AP 1987 concordance to save is many pages long; there are 666 lines for the base form alone, and many more for the inflected forms saved, saves, saving, and savings. In the discussion that follows, we shall, for the sake of simplicity, not analyze the inflected forms and we shall only look at the patterns to the right of save (see Table 7).</Paragraph> <Paragraph position="10"> It is hard to know what is important in such a concordance and what is not. For example, although it is easy to see from the concordance selection in Figure 1 that the word &quot;to&quot; often comes before &quot;save&quot; and the word &quot;the&quot; often comes after &quot;save,&quot; it is hard to say from examination of a concordance alone whether either or both of these co-occurrences have any significance.</Paragraph> <Paragraph position="11"> Two examples will illustrate how the association ratio measure helps make the analysis both quicker and more accurate.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 8.1 EXAMPLE 1: &quot;SAVE ... FROM&quot; </SectionTitle> <Paragraph position="0"> The association ratios in Table 7 show that association norms apply to function words as well as content words. For example, one of the words significantly associated with save is from. Many dictionaries, for example Webster's Ninth New Collegiate Dictionary (Merriam Webster), make no explicit mention of from in the entry for save, although British learners' dictionaries do make specific mention of from in connection with save. These learners' dictionaries pay more attention to language structure and collocation than do American collegiate dictionaries, and lexicographers trained in the British tradition are often fairly skilled at spotting these generalizations. However, teasing out such facts and distinguishing true intuitions from false intuitions takes a lot of time and hard work, and there is a high probability of inconsistencies and omissions.</Paragraph> <Paragraph position="1"> Which other verbs typically associate with from, and where does save rank in such a list?
The association ratio identified 1530 words that are associated with from; 911 of them were tagged as verbs. The first 100 verbs are: refrain/vb, gleaned/vbn, stems/vbz, stemmed/vbd, stemming/vbg, ranging/vbg, stemmed/vbn, ranged/vbn, derived/vbn, ranged/vbd, extort/vb, graduated/vbd, barred/vbn, benefiting/vbg, benefitted/vbn, benefited/vbn, excused/vbd, arising/vbg, range/vb, exempts/vbz, suffers/vbz, exempting/vbg, benefited/vbd, prevented/vbd (7.0), seeping/vbg, barred/vbd, prevents/vbz, suffering/vbg, excluded/vbn, marks/vbz, profiting/vbg, recovering/vbg, discharged/vbn, rebounding/vbg, vary/vb, exempted/vbn, separate/vb, banished/vbn, withdrawing/vbg, ferry/vb, prevented/vbn, profit/vb, bar/vb, excused/vbn, bars/vbz, benefit/vb, emerges/vbz, emerge/vb, varies/vbz, differ/vb, removed/vbn, exempt/vb, expelled/vbn, withdraw/vb, stem/vb, separated/vbn, judging/vbg, adapted/vbn, escaping/vbg, inherited/vbn, differed/vbd, emerged/vbd, withheld/vbd, leaked/vbn, strip/vb, resulting/vbg, discourage/vb, prevent/vb, withdrew/vbd, prohibits/vbz, borrowing/vbg, preventing/vbg, prohibit/vb, resulted/vbd (6.0), preclude/vb, divert/vb, distinguish/vb, pulled/vbn, fell/vbn, varied/vbn, emerging/vbg, suffer/vb, prohibiting/vbg, extract/vb, subtract/vb, recover/vb, paralyzed/vbn, stole/vbd, departing/vbg, escaped/vbn, prohibited/vbn, forbid/vb, evacuated/vbn, reap/vb, barring/vbg, removing/vbg, stolen/vbn, receives/vbz.</Paragraph> <Paragraph position="2"> Save ... from is a good example for illustrating the advantages of the association ratio. Save is ranked 319th in this list, indicating that the association is modest: strong enough to be important (21 times more likely than chance), but not so strong that it would pop out at us in a concordance, or that it would be one of the first things to come to mind.</Paragraph> <Paragraph position="3"> If the dictionary is going to list save ... from, then, for consistency's sake, it ought to consider listing all of the more important associations as well. Of the 27 bare verbs (tagged 'vb') in the list above, all but seven are listed in Collins Cobuild English Language Dictionary as occurring with from. However, this dictionary does not note that vary, ferry, strip, divert, forbid, and reap occur with from. If the Cobuild lexicographers had had access to the proposed measure, they could possibly have obtained better coverage at less cost.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 8.2 EXAMPLE 2: IDENTIFYING SEMANTIC CLASSES </SectionTitle> <Paragraph position="0"> Having established the relative importance of save ... from, and having noted that the two words are rarely adjacent, we would now like to speed up the labor-intensive task of categorizing the concordance lines. Ideally, we would like to develop a set of semi-automatic tools that would help a lexicographer produce something like Figure 2, which provides an annotated summary of the 65 concordance lines for save ... from. 5 The save ... from pattern occurs in about 10% of the 666 concordance lines for save.</Paragraph> <Paragraph position="2"> Traditionally, semantic categories have been only vaguely recognized, and to date little effort has been devoted to a systematic classification of a large corpus.
Lexicographers have tended to use concordances impressionistically; semantic theorists, AI-ers, and others have concentrated on a few interesting examples, e.g. bachelor, and have not given much thought to how the results might be scaled up.</Paragraph> <Paragraph position="3">
Figure 2. Some AP 1987 Concordance Lines to &quot;save ... from,&quot; Roughly Sorted into Categories.
save X from Y (65 concordance lines)
1 save PERSON from Y (23 concordance lines)
1.1 save PERSON from BAD (19 concordance lines)
1.2 save PERSON from (BAD) LOC(ATION) (4 concordance lines)
2 save INST(ITUTION) from (ECON) BAD (27 concordance lines)
3 save ANIMAL from DESTRUCT(ION) (5 concordance lines)
UNCLASSIFIED (10 concordance lines)
</Paragraph> <Paragraph position="4"> With this concern in mind, it seems reasonable to ask how well these 65 lines for save ... from fit in with all other uses of save. A laborious concordance analysis was undertaken to answer this question. When it was nearing completion, we noticed that the tags that we were inventing to capture the generalizations could in most cases have been suggested by looking at the lexical items listed in the association ratio table for save. For example, we had failed to notice the significance of time adverbials in our analysis of save, and no dictionary records this. Yet it should be clear from the association ratio table above that annually and month 6 are commonly found with save. More detailed inspection shows that the time adverbials correlate interestingly with just one group of save objects, namely those tagged [MONEY].
The AP wire is full of discussions of saving $1.2 billion per month; computational lexicography should measure and record such patterns if they are general, even when traditional dictionaries do not.</Paragraph> <Paragraph position="6"> As another example illustrating how the association ratio tables would have helped us analyze the save concordance lines, we found ourselves contemplating the semantic tag ENV(IRONMENT) to analyze lines such as: the trend to save the forests[ENV], it's our turn to save the lake[ENV], joined a fight to save their forests[ENV], can we get busy to save the planet[ENV]? If we had looked at the association ratio tables before labeling the 65 lines for save ... from, we might have noticed the very large value for save ... forests, suggesting that there may be an important pattern here. In fact, this pattern probably subsumes most of the occurrences of the &quot;save [ANIMAL]&quot; pattern noticed in Figure 2. Thus, these tables do not provide semantic tags, but they provide a powerful set of suggestions to the lexicographer for what needs to be accounted for in choosing a set of semantic tags.</Paragraph> <Paragraph position="7"> It may be that everything said here about save and other words is true only of 1987 American journalese. Intuitively, however, many of the patterns discovered seem to be good candidates for conventions of general English. A future step would be to examine other more balanced corpora and test how well the patterns hold up.</Paragraph> </Section> </Section> </Paper>