Automatic Suggestion of Significant Terms for a Predefined Topic 
Joe Zhou and Pete Dapkus
LEXIS-NEXIS, a Division of Reed Elsevier, Inc. 
9555 Springboro Pike 
Miamisburg, OH 45342 
{joez,peted}@lexis-nexis.com
ABSTRACT 
This paper presents a preliminary experiment in automatically suggesting significant terms for a 
predefined topic. The general method is to compare a topically focused sample created around 
the predefined topic with a larger and more general base sample. A set of statistical measures 
is used to identify significant word units in both samples. Identification of single word terms is
based on the notion of word intervals. Two-word terms are identified through the computation of 
mutual information, and the extension of mutual information assists in capturing multi-word 
terms. Once significant terms of all these three types are identified, a comparison algorithm is 
applied to differentiate terms across the two data samples. If significant changes in the values of 
certain statistical variables are detected, associated terms are selected as being topic-oriented
and included in a suggested list. To check the quality of the suggested terms, we compare them 
against terms manually determined by the domain expert. Though overlaps vary, we find that the 
automatic suggestion provides more terms that are useful for describing the predefined topic.
1. INTRODUCTION 
As the amount of on-line text grows, the use of text analysis techniques to
access information from electronic sources has become more popular and, at the same time, 
more difficult. Currently, the effectiveness of such techniques is evaluated not only on how easily 
they can be applied to text sources to extract information and represent it in a systematic format 
(Walker 1983), but also on whether they can be applied to large text corpora of several tens of 
thousands of words.
One of the applications of text analysis is to identify and extract significant terminology from 
running text. Choueka (1988), for example, describes an experiment for locating interesting 
collocational expressions from large textual databases. A collocational expression, as Choueka 
defines it, is "sequences of words whose unambiguous meaning cannot be derived from that of
their components". Other representative collocation research can be found in Church and Hanks 
(1990) and Smadja (1993). Though all statistically-based, their definitions of collocations are 
different from one another. Unlike Choueka (1988), Church and Hanks (1990) identify as 
collocations both interrupted and uninterrupted sequences of words. Unlike Church and Hanks 
(1990), Smadja (1993) goes beyond the "two-word" limitation and deals with "collocations of 
arbitrary length". 
The primary goal of collocation research is to build a comprehensive lexicographic toolkit, or to 
assist automatic language generation applications. Therefore, the focus is on the extraction of all 
interesting word patterns without distinction of domain specificity. Identifying domain-specific
terminology is another research effort. Gierl and Frost (1992) describe their approach to
extracting terminological knowledge from medical texts. Following Church and Hanks (1990), 
they use mutual information to select significant two-word patterns, but, at the same time, a 
lexical inductive process is incorporated which, as they claim, can improve the collection of 
domain-specific terms. Justeson and Katz (1993) introduce an algorithm by which technical 
terms in running text can be identified. Prior to the development of their algorithm, they 
performed a thorough study on the linguistic properties of technical terminology. They report that, 
structurally, technical terms make heavy use of noun compounds. In technical terminology, word 
constituents are limited to adjectives, nouns and occasionally prepositions. Verbs, adverbs, or 
conjunctions are extremely rare. At the discourse level, technical terms tend to be repetitive. With 
these observations in mind, they developed an algorithm which has proved to be effective and 
domain independent. 
In this paper, a preliminary experiment is presented in automatically suggesting significant terms 
for a predefined topic. The general method is to compare a topic focused sample based on the 
predefined topic with a larger and more general base sample. A set of statistical measures is
used to identify significant word units in both samples. Identification of single word terms is 
based on the notion of word intervals. Two-word terms are identified through the computation of 
mutual information, and an extension of mutual information assists in capturing multi-word terms. 
Once significant terms of all these three types are identified, a comparison algorithm is applied to 
differentiate terms across the two samples. If significant changes in the values of certain 
statistical variables are detected, associated terms are selected from the focused sample as 
being topic-oriented and included in a suggested list. 
To check the quality of the suggested terms, we compare them against terms manually 
determined by a domain expert. Though the numbers of matches vary, we find that our automatic 
suggestion process provides more terms (than the manual process) that are useful for describing 
the predefined topic. 
2. METHODOLOGY 
2.1 Manual versus Automatic Term Suggestion 
To manually select significant terms for a predefined topic, the domain expert first creates a topic
focused sample from one specific source or a combination of sources. Then, he or she reads the 
documents, providing a relevance judgment (i.e. a reader-assigned score) to each document. By 
carefully examining relevant documents in the focused sample, a list of terms that are deemed to 
be significant for the definition of the topic is identified. In many cases, it is possible that the 
domain expert would introduce some terms based on his or her own professional knowledge 
about the topic. These terms may be highly prominent for the topic, yet may not necessarily 
occur in the focused sample. 
For automatic suggestion of topical terms, initial attempts were made using the sample 
documents the domain expert created. The results were not impressive. The statistical 
information generated from the sample documents was not rich enough to support any
discriminative judgment. Our experience showed that, to draw terms that are reflective of a given
topic, a much larger and more general base sample is required. Such a base sample should be 
randomly sampled from the same source as the focused sample and it should contain an array of 
different topics. Once the baseline statistics are generated from both data collections, a 
meaningful comparison could spot terms that occur with unusual frequency in the focused 
sample. These terms would constitute good candidates for topically sensitive terminological units 
(Steier and Belew 1994). 
2.2 Focused Sample and Base Sample 
For our experiments of automatic term suggestion, we selected a predefined topic called 
"European Politics and Business". The focused sample was originally created by the domain 
expert using 1988 United Press International (UPI) data. Table 1 presents statistical information
about this dataset. After reading each of the relevant documents found in the focused sample, 
the domain expert manually determined 347 topical terms. Table 2 provides the statistical 
breakdown of these terms. 
Table 1: Focused and Base Samples
Data File        Source/Name                        Size (bytes)   Unique Words
Focused Sample   Sample from 1988 UPI               1,015,200      12,065 (5,045*)
Base Sample      Sample from 1987, 1988, 1989 UPI   27,322,598     73,583 (33,114*)
* only words which occur 3 times or more were used in the experiments
Table 2: Predefined Topic and its Manually Determined Topical Terms
Predefined Topic               one-word   two-word   multi-word   total terms
European Politics & Business   276        36         35           347
Since the focused sample was drawn from the source of 1988 UPI, the construction of its 
corresponding base sample was also initiated from the same source of the same year. Our 
experiments demonstrated that, in order to obtain a random assortment of topics to be included 
in the base sample, it may be meaningful to sample documents from the time period before and
after the focused documents. Therefore, the final base sample was created by randomly drawing 
documents from the years of 1987, 1988 and 1989. The size of this dataset is about 27 times
larger than the focused sample (see Table 1).
Though the ratio between the focused and base samples was arbitrary, in order to generate 
meaningful statistics, we felt that the base sample should be at least 20 times larger in size than 
the focused sample. (For the sake of discussion, hereafter, we may sometimes refer to the 
focused sample as "focused" and the base sample as "base".) 
2.3 Experimental Procedure 
The general method we adopted is as follows. First, we identified statistically significant terms 
from both samples. Next, a comparison algorithm was applied to these two sets of terms to 
single out those that were common to both samples, yet whose patterns of occurrences differed 
between these two samples. Finally, we analyzed and presented this set of terms as content 
oriented candidates for the predefined topic, in this case "European Politics and Business".
The terms suggested are split into three categories: single word terms, two-word terms and
multi-word terms (or phrases). The following three sections describe in detail the methods for
generating each of the three categories. 
2.4 Suggesting Single Word Terms 
Automatically suggesting single word terms as being topically oriented has been most 
challenging. Our experiments indicated that the "first-order" statistics, probability and entropy
alone, are not sufficient for gathering information about the topicality of a word in running text. 
The information in both measurements is essentially equivalent since entropy is just the log 
inverse of probability. 
We found that the "second-order" statistics, such as variance or standard deviation of term
frequencies across documents, provide greater insight into topicality. We selected the interval 
between the occurrences of a word as the basis for analysis. Our intuitions led us to believe that 
topical single words should appear more frequently and more regularly, i.e. at approximately 
even intervals, in the focused sample than in the base sample. The focused sample represents, 
more or less, a topical sublanguage set while the base sample represents a general language set. Unlike
probability and entropy statistics which yield average scores for the whole document, the use of 
interval makes it possible to get an "instantaneous" measure at any location in the document. 
More specifically, an interval can be measured "instantaneously" at any point in the text between 
the occurrences of a particular word. Though using interval alone might still not be sufficient for 
identifying word topicality, it allowed us to measure the variance which would help identify words
that were always changing in their rate of occurrences. 
Thus, three scores were generated for each word: the mean log interval, the standard deviation 
of the mean log interval, and the normalized standard deviation of the mean log interval. The use 
of a log scale for these measurements is to minimize the effect of unduly large variations in 
words with long mean intervals. The normalized standard deviation is produced by simply 
dividing the raw standard deviation by the mean log interval. In most cases, raw standard 
deviation is found to be larger for words having long mean intervals. In order to compare the 
standard deviations across words of different intervals, we found this normalization process quite 
useful. 
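
The three scores described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes that intervals are measured in token positions, that logarithms are taken base 2, and that the population standard deviation is used, none of which is specified in the text. The function name log_interval_stats and the min_count parameter are hypothetical.

```python
import math
from collections import defaultdict


def log_interval_stats(tokens, min_count=3):
    """Return {word: (mean_log_interval, raw_sd, normalized_sd)}.

    Assumptions (not stated in the paper): intervals are distances in
    token positions, logs are base 2, and the SD is the population SD.
    """
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    stats = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue  # too rare to yield meaningful statistics
        # Log of the gap between each pair of consecutive occurrences.
        logs = [math.log2(b - a) for a, b in zip(pos, pos[1:])]
        mean = sum(logs) / len(logs)
        raw_sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / len(logs))
        # Normalized SD: raw SD divided by the mean log interval.
        norm_sd = raw_sd / mean if mean > 0 else 0.0
        stats[word] = (mean, raw_sd, norm_sd)
    return stats
```

For a word occurring at positions 0, 2 and 10, the intervals are 2 and 8, so the log intervals are 1 and 3, giving a mean of 2, a raw standard deviation of 1 and a normalized standard deviation of 0.5.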
After scores were generated for all the words in both the focused sample and the base sample, 
score comparisons between the two samples were carried out in two ways: comparing the 
intervals and comparing the standard deviations. 
To compare the intervals, the "base" mean log interval was subtracted from the "focused" mean
log interval and divided by the raw standard deviation from the base sample. The result
represents the change of mean log intervals. More explicitly, it yields the number of standard
deviations by which the "focused" mean log interval differs from the "base" mean log interval. The
more negative the value, the more significant the change, and the more prominent the word
would appear in the focused sample. 
To compare the standard deviations, the normalized "base" standard deviation was subtracted
from the normalized "focused" standard deviation. The difference indicates how the word is
distributed in the focused sample. The more negative the value is, the more "bursty" the word's
distribution, and the more likely it is content oriented since "content words tend to appear in
'bursts'" (Church and Mercer 1993).
If a single word term is found in both data samples and it receives negative scores from both 
interval and standard deviation comparisons, it would be included in the suggested list as being 
topically oriented.
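
The two comparisons and the selection rule can be sketched as below. The function names and the tuple layout (mean log interval, raw SD, normalized SD) are assumptions made for illustration; only the arithmetic follows the description above.

```python
def change_scores(focused, base):
    """Return (interval_change, burstiness_change) for one word.

    `focused` and `base` are (mean_log_interval, raw_sd, normalized_sd)
    tuples; this layout is an assumption of the sketch.
    """
    f_mean, _f_sd, f_norm = focused
    b_mean, b_sd, b_norm = base
    # Change of mean log interval, in units of the base raw SD.
    interval_change = (f_mean - b_mean) / b_sd
    # Difference of the normalized standard deviations.
    burstiness_change = f_norm - b_norm
    return interval_change, burstiness_change


def suggest_single_words(focused_stats, base_stats):
    """Keep words found in both samples whose two scores are both negative."""
    suggested = []
    for word, f in focused_stats.items():
        b = base_stats.get(word)
        if b is None or b[1] == 0:
            continue  # absent from the base sample, or no spread to divide by
        d_interval, d_burst = change_scores(f, b)
        if d_interval < 0 and d_burst < 0:
            suggested.append(word)
    return suggested
```

Using the values reported for "portugal" in Table 4 (focused mean log interval 13.75, base mean log interval 16.86, base raw SD 3.83, normalized SDs 0.26 and 0.23), the first score is (13.75 - 16.86) / 3.83, about -0.81, and the second is 0.26 - 0.23 = 0.03. Because only one of the two scores is negative, the word is not selected.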
2.5 Suggesting Two-Word Terms 
The method for suggesting two-word terms turned out to be much simpler than that for single
word terms though the same techniques are equally applicable. Here, the traditional mutual 
information score was used. As stated in Church, et al. (1991) and elsewhere, the mutual 
information measurement can be expressed as: 
    I(w_1 w_2) = \log_2 \frac{p(w_1 w_2)}{p(w_1)\, p(w_2)}
where p(w_1 w_2) is the frequency in the data collection of the two-word compound (w_1, w_2), and
p(w_1) and p(w_2) are the frequencies of the word constituents. A high mutual information score
indicates that the individual probabilities are low while the two words occur together frequently.
Two steps led to our automatic suggestion of topic-oriented two-word terms. First, the mutual 
information score was computed for each pair of words that occur in each of the two samples. To 
capture topicality, we were only interested in pairs of words with high mutual information scores. 
Therefore, any pair which contained "closed-class" words, such as determiners, prepositions,
auxiliaries, or single letters, digit numbers, or overly common verbs like "give", "take", etc., was
excluded. Such an exclusion not only helped obtain pairs of words with high mutual information
scores, but also sped up computation significantly. A threshold value was also set such that if 
any two-word unit occurred less than 3 times in the sample or received a mutual information 
score lower than 6.0, it was eliminated and would not participate in the next comparison 
measurement. 
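
A minimal sketch of this first step is given below. It assumes that probabilities are estimated as relative frequencies over the token count of the sample and that the logarithm is base 2; the stop list CLOSED_CLASS is a small illustrative stand-in, not the list used in the experiments, and the function name bigram_mi is hypothetical.

```python
import math
from collections import Counter

# Illustrative stand-in for the closed-class and overly common words.
CLOSED_CLASS = {"the", "a", "an", "of", "in", "to", "is", "was", "give", "take"}


def bigram_mi(tokens, min_count=3, min_mi=6.0, stop=CLOSED_CLASS):
    """Return {(w1, w2): MI} for adjacent pairs that pass the filters."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue  # occurs fewer than 3 times
        if any(w in stop or len(w) == 1 or w.isdigit() for w in (w1, w2)):
            continue  # closed-class word, single letter, or digit string
        # log2( p(w1 w2) / (p(w1) p(w2)) ) with p estimated as count / n
        mi = math.log2(c12 * n / (unigrams[w1] * unigrams[w2]))
        if mi >= min_mi:
            scores[(w1, w2)] = mi
    return scores
```

Skipping excluded pairs before the logarithm is taken reflects the observation that the exclusion also reduces computation.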
With the mutual information scores in hand, a "delta" score was generated by subtracting the
"base" mutual information score from the "focused" mutual information score. Topically
prominent two-word terms normally have lower scores in the focused sample that is "keyed" to
their topic. This is because the constituent words are distributed in a wider range of contexts. The
probability of them occurring separately increases relative to the probability of them occurring 
together (Steier and Belew 1994). Therefore, the more negative the "delta" score, the more 
topically sensitive the two-word term is. 
If a two-word term occurs in both data samples and receives a negative "delta" score, it would be 
included in the suggested list as being topically oriented.
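
The "delta" comparison can be sketched as follows, assuming the mutual information scores of each sample are held in dictionaries keyed by word pair. The function names are hypothetical.

```python
def delta_scores(focused_mi, base_mi):
    """Focused MI minus base MI, for pairs present in both samples."""
    shared = focused_mi.keys() & base_mi.keys()
    return {pair: focused_mi[pair] - base_mi[pair] for pair in shared}


def suggest_two_word_terms(focused_mi, base_mi):
    """Pairs with a negative delta, most negative first."""
    deltas = delta_scores(focused_mi, base_mi)
    negative = [pair for pair, d in deltas.items() if d < 0]
    return sorted(negative, key=deltas.get)
```

For instance, a pair scoring 7.0 in the focused sample and 9.5 in the base sample (hypothetical values) has a delta of -2.5 and would be suggested, whereas a pair whose score rises in the focused sample would not.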
2.6 Suggesting Multi-Word Terms 
When automatically suggesting content two-word terms, we looked at the mutual information 
scores for adjacent words. For multi-word terms, the mutual information score was calculated for 
non-adjacent words. Our intuitions led us to believe that if there is a significant statistical linkage, 
i.e. a high mutual information score, between such a pair of words, it is highly possible that they 
belong to a larger linguistic component. 
Our first step was to compute mutual information scores for a word unit separated by a distance 
of two (i.e. having one unspecified word separating them). Two criteria apply when selecting
"interesting" word units. First, their mutual information score must be 10 or greater. Second, following
the observations of Steier and Belew (1994), we only selected pairs which received a
lower mutual information score in the focused sample than in the base sample.
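
A sketch of this first step follows. It assumes relative-frequency probability estimates and base-2 logarithms, and it reads the threshold of 10 as applying to the score in the focused sample, which the text does not state explicitly. The function names are hypothetical.

```python
import math
from collections import Counter


def gapped_mi(tokens, gap=2):
    """MI for every (w1, w2) pair where w2 occurs `gap` positions after w1."""
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[gap:]))
    return {
        (w1, w2): math.log2(c12 * n / (unigrams[w1] * unigrams[w2]))
        for (w1, w2), c12 in pairs.items()
    }


def select_interesting_units(focused_mi, base_mi, min_mi=10.0):
    """Apply both criteria: a high score, and a lower score in focused than base.

    Assumption: the min_mi threshold is checked against the focused score.
    """
    return [
        pair
        for pair, f in focused_mi.items()
        if f >= min_mi and pair in base_mi and f < base_mi[pair]
    ]
```

With gap=2, the token sequence "short range nuclear missiles" yields the pairs (short, nuclear) and (range, missiles), each with one unspecified word in between.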
Once an "interesting" word unit of distance two was selected, a concordance was built of all 
sentences containing that word unit. These sentences were compared for matching text. If a 
string of text was found to include that word unit and, at the same time, occur most frequently in
the concordance, its leading and trailing "closed-set" words (if any) were chopped off. The
remaining text string was presented as a suggested multi-word term.
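
The concordance step is described only briefly, so the following is one possible reading rather than the procedure actually used: every token window (up to a hypothetical max_len) that covers the distance-two unit is counted across the concordance, the most frequent window is kept (the longest one on ties), and closed-set words are trimmed from both ends. The CLOSED_SET list and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter

# Illustrative stand-in for the closed-set word list.
CLOSED_SET = {"the", "a", "an", "of", "in", "and", "to", "for", "on"}


def suggest_phrase(sentences, w1, w2, max_len=6, closed=CLOSED_SET):
    """Suggest a multi-word term around the unit `w1 _ w2` (one word between)."""
    counts = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i in range(len(toks) - 2):
            if toks[i] == w1 and toks[i + 2] == w2:
                # Count every window of at most max_len tokens covering i..i+2.
                for start in range(max(0, i + 3 - max_len), i + 1):
                    for end in range(i + 3, min(len(toks), start + max_len) + 1):
                        counts[tuple(toks[start:end])] += 1
    if not counts:
        return None
    # Most frequent window; prefer the longest among equally frequent ones.
    best = list(max(counts, key=lambda s: (counts[s], len(s))))
    while best and best[0] in closed:
        best.pop(0)   # chop leading closed-set words
    while best and best[-1] in closed:
        best.pop()    # chop trailing closed-set words
    return " ".join(best)
```

Under this reading, two sentences that both contain "the bank of england" followed by different words would yield the window "the bank of england", which is trimmed to "bank of england".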
3. RESULTS and DISCUSSION 
3.1 Suggested Single Terms 
The focused sample drawn from the 1988 UPI data contains 12,065 unique words. Among them, 
5,045 are frequent enough (occurring 3 times or more) to calculate statistics for our experiments 
(refer to Table 1). The comparison algorithm identified 2,010 suggested terms based on the fact 
that they received negative scores for both "change of mean log interval" and "distribution 
burstiness" comparisons. These negative scores indicate that these single word terms have 
shorter intervals and more regular occurrences in the focused sample. 
We compared the suggested list against the single word terms manually selected by the domain 
expert. The results are summarized in Table 3. 
Table 3: Statistics of the Suggested Single Word Terms
            Comparison of Suggested and Manual Terms
total       total    not         no                                 percent
suggested   manual   possible*   statistics*   possible*   hits     included
2,010       276      129         91            56          42       75%
* not possible: terms not existing in the focused sample
* no statistics: terms which have less than 3 occurrences in the focused sample
* possible: targeted terms
Of the 276 topical single terms determined by the domain expert, 129 terms do not exist in the 
focused sample. As explained earlier, these are the terms intellectually introduced by the domain 
expert. Almost half of these terms are geographical names in Europe, such as 
albania, albertville, andorra, barcelona, belarus, belorus, bosnia, byelorussia, chancellors, 
comecon, cp, croatia, erm, eurocurrency, eurofed, europeanization, europeanwide, 
europeenne, europewide, gaullist, gaullists, gilbraltar, greenland, guernsey, kazakhstan,
kirghizia, kirgizia, kyrgystan, kzakhstan, labour, liechtenstein, moldavia, moldova, monaco, 
nc, nib, nicosia, nuuk, pentagonale, reunify, reykjavik, salzburg, sicily, slovenia, svalbard, 
tadzhikistan, tajikistan, tajikstan, tirana, tirane, tories, torshavn, turkmenia, turkmenistan, 
uk, ussr, uzbekistan, vaduz, valletta, weu 
Of the remaining 147 actually occurring terms, 91 are not frequent enough to be included in our 
experiments. They occur in the focused sample two times or less. Again, some of them are 
geographical names in Europe. 
amsterdam, athens, azerbaijan, bulgaria, estonia, euro, eurodollar, eurodollars, georgia,
hamburg, holland, iceland, jersey, latvia, liberals, lithuania, naples, oecd, prague, reunified, 
rome, russia, serbia, sofia, tory, ukraine, unification 
These non-existent and under-represented terms left us with a maximum of 56 terms we could
catch in the suggested terms list. Of these, 42 were caught, an inclusion rate of 75% (see
Appendix for details).
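
The arithmetic behind these figures can be checked with a short sketch; the function name coverage is hypothetical, and the inputs are the counts reported above for single word terms.

```python
def coverage(total_manual, not_in_sample, too_infrequent, hits):
    """Return (possible, percent_included) for one term category."""
    # Terms that occur in the focused sample often enough to be caught.
    possible = total_manual - not_in_sample - too_infrequent
    percent_included = round(100 * hits / possible)
    return possible, percent_included


# Single word terms: 276 manual, 129 absent, 91 too infrequent, 42 hits.
possible, percent = coverage(276, 129, 91, 42)
```

Here possible is 276 - 129 - 91 = 56, and 42 of 56 is 75%.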
Further analysis of the missing 14 terms reveals that they were not found in the suggested list 
due to the statistical constraints we established for our experiments. As shown in Table 4, 13 of 
these terms received negative scores either for "change of mean log interval" or for "distribution 
burstiness", but not for both. We believe that their inclusion is possible since they represent what
we would call "border-line" suggested terms. 
Table 4: "Missed" single word terms
single-word term   dgt1   dgt2    dgt3   dgt4    dgt5   dgt6   dgt7    dgt8
portugal           10     13.75   0.26   16.86   0.23   3.83   -0.81   0.03
europeans          23     12.55   0.35   15.64   0.32   4.98   -0.62   0.03
eec                3      15.49   0.39   19.28   0.32   6.21   -0.61   0.06
luxembourg         12     13.49   0.42   17.06   0.36   6.07   -0.59   0.07
copenhagen         3      15.49   0.47   18.54   0.34   6.23   -0.49   0.14
cyprus             6      14.49   0.44   18.28   0.43   7.89   -0.48   0.01
yugoslavia         12     13.49   0.47   15.33   0.37   5.66   -0.32   0.10
finland            10     13.75   0.51   15.52   0.46   7.19   -0.25   0.05
kgb                5      14.75   0.57   16.41   0.44   7.29   -0.23   0.13
sweden             13     13.38   0.48   14.26   0.44   6.33   -0.14   0.03
turkey             11     13.62   0.53   14.47   0.50   7.25   -0.12   0.03
czechoslovakia     9      13.91   0.09   13.70   0.46   6.29   0.03    -0.36
switzerland        9      13.91   0.21   13.81   0.47   6.48   0.01    -0.26
Statistics Measurements (dgt = digit)
dgt1: number of occurrences (in the focused sample)
dgt2: mean log interval (in the focused sample)
dgt3: normalized SD of mean log interval (in the focused sample)
dgt4: mean log interval (in the base sample)
dgt5: normalized SD of mean log interval (in the base sample)
dgt6: raw SD of mean log interval (in the base sample)
dgt7: (dgt2 - dgt4) / dgt6
dgt8: (dgt3 - dgt5)
Admittedly, the suggested list, with a total of 2,010 terms, is a fairly large one. It obviously
contains terms that are not topic oriented. We followed the observations made by Justeson and 
Katz (1993) and introduced a "post-editing" process. As a result, the list was reduced to 886
terms. Basically, we removed from the original list all the "closed-set" words such as determiners,
prepositions, auxiliaries, conjunctions, single letters, etc., as well as other less semantically 
laden words such as adverbs and verbs. 
3.2 Suggested Two-Word Terms 
Among 512 "interesting" two-word terms, 170 receive negative "delta" scores. These terms
were presented in our suggested two-word terms list (see Appendix for details).
A total of 36 topical terms were manually determined based on the UPI focused sample. Of this 
number, only 26 actually occur in the focused sample, which means that 10 terms were introduced
independent of the source material. Among these 26 terms, 6 were too infrequent to generate
meaningful statistics though the mutual information scores are high (see Table 5). Five terms, i.e.
E C, U K, the Channel, the Continent, and the Wall, failed to participate in statistical screening
because they contain "closed-set" words, i.e. single letters and the determiner the.
Table 5: "No statistics" two-word terms
two-word term          digit1   digit2
monte carlo            1        13.61674723
social democrats       1        9.58432575
coalition government   1        7.59034954
supreme soviet         1        5.06985277
downing street         1        11.75425075
socialist party        2        6.36709503
Statistical measurements
digit1: frequency (in the focused sample)
digit2: mutual information score
Of the remaining 15 catchable two-word terms, 8 are included in the suggested list. Table 6
summarizes the statistics of the suggested two-word terms. 
Table 6: Statistics of the Suggested Two-Word Terms
            Comparison of Suggested and Manual Terms
total       total    not         no                                 percent
suggested   manual   possible*   statistics*   possible*   hits     included
170         36       10          11            15          8        53%
* not possible: terms not existing in the focused sample
* no statistics: terms which have less than 3 occurrences in the focused sample
* possible: targeted terms
Further screening revealed that 3 manually selected two-word terms (i.e. cold war, common 
market, and North Sea) were actually captured in the 512 "interesting" list. They were not 
included in the suggested list because they did not receive negative "delta" scores. The 
suggested list fails to include 4 manually selected two-word terms because their mutual 
information scores go up. Typically, the mutual information scores of content-oriented two-word
terms are expected to go down within the topically related subset of documents. This might be
caused by the individual word probabilities. To use Steier and Belew's terms (Steier and Belew
1994), these pairs appear more "opaque", meaning that their constituent words are more probable
individually than when they are combined in the focused sample. Table 7 lists these 4 two-word
terms appearing in both
samples. 
Table 7: "Missed" two-word terms
Sample      two-word term       frequency   MI score
"base"      atlantic alliance   11          8.80256520
"focused"   atlantic alliance   4           9.36193333
"base"      cold war            54          8.04486800
"focused"   cold war            11          9.97241419
"base"      common market       26          6.86310460
"focused"   common market       17          7.84030540
"base"      united kingdom      49          7.55353160
"focused"   united kingdom      25          7.80705217
Our suggested two-word terms list (see the Appendix) contains quite a number of useful 
additional terms about the targeted predefined topic "European Politics and Business". The 
following are some examples: 
US-European relations/politics: 
armed forces, diplomatic relations, nuclear missiles, nuclear weapons, trade barriers 
European Business: 
bilateral trade, economic reform, market integration, private enterprise, private investment
Notable European entities: 
banca commerciale, berlin wall, british spies, swiss francs, brussels belgium
Heads of state: 
felipe gonzalez, francois mitterrand, mikhail gorbachev 
3.3 Suggested Multi-Word Terms 
A total of 97 multi-word terms were extracted from the focused sample for inclusion in the 
suggested list (see Appendix). Admittedly, some of them are simply sentence fragments instead 
of real phrases. 
Of the 35 multi-word terms manually selected by the domain expert, 26 actually occur in the 
focused sample. As with the single word and two-word terms, the other 9 multi-word terms are 
simply intellectual introductions from the domain expert. Of the 26 terms, 22 occur frequently
enough to generate meaningful statistics. Out of these 22 catchable terms, only 5 are included in 
the suggested list. Table 8 presents the statistical summary. 
Table 8: Statistics of the Suggested Multi-Word Terms
            Comparison of Suggested and Manual Terms
total       total    not         no                                 percent
suggested   manual   possible*   statistics*   possible*   hits     included
97          35       9           4             22          5        23%
* not possible: terms not existing in the focused sample
* no statistics: terms which have less than 3 occurrences in the focused sample
* possible: targeted terms
One possible explanation for not being able to match more manual selections is that most of the 
two-word terms that could have been used to detect these phrases consist of two common 
words, such as house, lords, fund, system. These two-word terms typically generate fairly low 
mutual information scores since the constituent words occur frequently by themselves. 
It is important to point out that the suggested list does contain a number of useful multi-word 
terms that are related to the targeted predefined topic "European Politics and Business". For
example, 
US-European relations/politics: 
short range nuclear missiles, tactical nuclear weapons, conventional arms reduction, multi 
party system 
European Business: 
gross national product, higher interest rates and inflation, Bank of England, North Sea Oil 
Notable European entities: 
predominantly Catholic Irish Republic, three British hostages, World War II, Roman
Catholic Church 
Heads of state or notable dignitaries: 
Secretary of State James Baker, Secretary of State George Shultz, French President 
Francois Mitterrand, West German Chancellor Helmut Kohl, Soviet leader Mikhail 
Gorbachev, Soviet Foreign Minister Eduard Shevardnadze 
4. CONCLUSION 
This paper presents a preliminary experiment in identifying significant terminological units from 
running text. By comparing a focused sample randomly drawn for a predefined topic against a 
larger and more general base sample, we can automatically suggest topic-oriented terms based 
on the detection of significant changes in some statistical measurements. Our experiment on one 
predefined topic demonstrated that, compared to the manual selection of the topical terms, our 
suggested lists do contain more useful terms that can be used to describe the topic. We also
found that the method is efficient enough for applications to very large textual corpora. Our next 
step is to further refine the methods by carrying out more experiments across different topics. We 
mentioned a number of times that our methods were developed based on our intuitive 
assumptions or hypotheses. More experiments on more topics will show whether we can obtain
positive and consistent results. 
Identification of significant terms from running text can be very useful in building intelligent 
information management systems. Terms identified are good candidates for key word indexing of 
electronic sources. Topic specificity can assist in grouping or clustering on-line documents. For 
an information retrieval system, terms identified for a pre-determined subject can be used to 
develop specialized libraries or files for targeted user groups. Our experiment demonstrated that 
the methods described can identify various personal names, organization entities and other proper
names. Those special text tokens are important for constructing text extraction systems. 
ACKNOWLEDGEMENTS 
This research was done while the second author worked at LEXIS-NEXIS during the summer of
1994. The authors would like to thank Dan Pliske, Mark Wasson and Rob Keefer for helpful 
comments on this paper, and Rita Freese for proofreading the final draft. The authors also
benefited from numerous conversations with Ken Church at Bell Labs.

REFERENCES 
K. Church and P. Hanks. Word association norms, mutual information and lexicography. 
Computational Linguistics, 16(1), March 1990.
K. Church and R. Mercer. Introduction to the special issue in computational linguistics using 
large corpora. Computational Linguistics, 19(1), March 1993.
K. Church, et al. Using statistics in lexical analysis. In U. Zernik, editor, Lexical Acquisition:
Exploring On-line Resources to Build a Lexicon, Lawrence Erlbaum Associates, 1991.
Y. Choueka. Looking for needles in a haystack. In proceedings, RIAO, Conference on
User-Oriented Content-Based Text and Image Handling. Cambridge, MA. 1988.
C. Gierl and D. Frost. Identification of domain-specific terminology by combining mutual
information and lexical induction. In B. Neumann, editor, 10th European Conference on
Artificial Intelligence. 1992. 
S. Justeson and S. Katz. Technical terminology: some linguistic properties and an algorithm for
identification in text. Research Report. IBM Research Division, T. J. Watson Research 
Center. 1993. 
F. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), March 
1993. 
A. Steier and R. Belew. Exploring phrases: a statistical analysis of topical language. Technical 
Report. University of California - San Diego. 1994. 
D. Walker. Text analysis. In proceedings. Conference on Applied Natural Language Processing. 
1983.
