A Corpus-based Approach to Automatic Compound Extraction

Keh-Yih Su                        Ming-Wen Wu                      Jing-Shin Chang
Dept. of Electrical Engineering   Behavior Design Corporation      Dept. of Electrical Engineering
National Tsing-Hua University     No. 28, 2F, R&D Road II          National Tsing-Hua University
Hsinchu, Taiwan 30043, R.O.C.     Science-Based Industrial Park    Hsinchu, Taiwan 30043, R.O.C.
kysu@bdc.com.tw                   Hsinchu, Taiwan 30077, R.O.C.    shin@hera.ee.nthu.edu.tw
                                  mingwen@bdc.com.tw
Abstract 
An automatic compound retrieval method is pro- 
posed to extract compounds within a text mes- 
sage. It uses n-gram mutual information, relative 
frequency count and parts of speech as the features 
for compound extraction. The problem is mod- 
eled as a two-class classification problem based 
on the distributional characteristics of n-gram to- 
kens in the compound and the non-compound clus- 
ters. The recall and precision using the proposed 
approach are 96.2% and 48.2% for bigram com- 
pounds and 96.6% and 39.6% for trigram com- 
pounds for a testing corpus of 49,314 words. A 
significant cutdown in processing time has been 
observed. 
Introduction 
In technical manuals, technical compounds 
[Levi 1978] are very common. Therefore, the qual-
ity of their translations greatly affects the per- 
formance of a machine translation system. If a 
compound is not in the dictionary, it would be 
translated incorrectly in many cases; the reason
is that many compounds are not compositional, which
means that the translation of a compound is not 
the composite of the respective translations of the 
individual words \[Chen and Su 1988\]. For exam- 
ple, the translation of 'green house' into Chinese 
is not the composite of the Chinese ~anslations of 
'green' and 'house'. Under such circumstances, 
the number of parsing ambiguities will also in- 
crease due to the large number of possible parts 
of speech combinations for the individual words. 
It will then reduce the accuracy rate in disam- 
biguation and also increase translation time. 
In practical operations, a computer-translated 
manual is usually concurrently processed by sev-
eral posteditors; thus, to maintain the consistency 
of translated terminologies among different poste- 
ditors is very important, because terminological 
consistency is a major advantage of machine trans-
lation over human translation. If all the termi- 
nologies can be entered into the dictionary before 
translation, the consistency can be automatically 
maintained, the translation quality can be greatly 
improved, and lots of postediting time and consis- 
tency maintenance cost can be saved. 
Since compounds are rather productive and 
new compounds are created from day to day, it 
is impossible to exhaustively store all compounds 
in a dictionary. Also, it is too costly and time-
consuming for people to inspect the manual for
compound candidates and update the dictio-
nary beforehand. Therefore, it is important that 
the compounds be found and entered into the dic- 
tionary before translation without much human 
effort; an automatic and quantitative tool for ex- 
tracting compounds from the text is thus urgently
required.
Several compound extracting approaches have 
been proposed in the literature [Bourigault 1992,
Calzolari and Bindi 1990]. Traditional rule-based
systems encode sets of rules to ex-
tract likely compounds from the text. However, a 
lot of compounds obtained with such approaches 
may not be desirable since they are not assigned 
objective preferences. Thus, it is not clear how 
likely one candidate is considered a compound. 
In LEXTER, for example, a text corpus is ana- 
lyzed and parsed to produce a list of likely ter- 
minological units to be validated by an expert 
[Bourigault 1992]. While it allows the test to be
done very quickly due to the use of simple anal- 
ysis and parsing rules, instead of complete syn- 
tactic analysis, it does not suggest quantitatively 
to what extent a unit is considered a terminology 
and how often such a unit is used in real text. It 
might therefore extract many inappropriate termi- 
nologies with a high false alarm rate. In another statis-
tical approach by [Calzolari and Bindi 1990], the
association ratio of a word pair and the disper- 
sion of the second word are used to decide if it
is a fixed phrase (a compound). The drawback is
that it does not take the number of occurrences
of the word pair into account; therefore, it is not
known if the word pair is commonly or rarely used.
Since there is no performance evaluation reported 
in both frameworks, it is not clear how well they 
work. 
A previous framework by [Wu and Su 1993]
shows that the mutual information measure and 
the relative frequency information are discrimi- 
native for extracting highly associated and fre- 
quently encountered n-grams as compounds. How-
ever, many non-compound n-grams, like 'is a', 
which have high mutual information and high rel- 
ative frequency of occurrence, are also recognized
as compounds. Such n-grams can be rejected if 
syntactic constraints are applied. In this paper, 
we thus incorporate parts of speech of the words 
as a third feature for compound extraction. An 
automatic compound retrieval method combining 
the joint features of n-gram mutual information, 
relative frequency count and parts of speech is pro- 
posed. A likelihood ratio test method, designed 
for a two-class classification task, is used to check 
whether an n-gram is a compound. Those n-grams 
that pass the test are then listed in the order of 
significance for the lexicographers to build these 
entries into the dictionary. It is found that, by 
incorporating parts of speech information, both 
the recall and precision for compound extraction
are improved. The simulation results show that the
proposed approach works well. A significant cut- 
down of the postediting time has been observed 
when using this tool in an MT system, and the 
translation quality is greatly improved. 
A Two Cluster Classification Model 
for Compound Extraction 
The first step to extract compounds is to find 
the candidate list for compounds. According to 
our experience in machine translation, most com- 
pounds are of length 2 or 3. Hence, only bigram
and trigram compounds are of interest to us. The
corpus is first processed by a morphological ana- 
lyzer to normalize every word into its stem form, 
instead of surface form, to reduce the number of
possible alternatives. Then, the corpus is scanned 
from left to right with the window sizes 2 and 3. 
The lists of bigrams and trigrams thus acquired 
then form the lists of compound candidates of in- 
terest. Since the part of speech pattern for the n- 
grams (n=2 or 3) is used as a compound extraction 
feature, the text is tagged by a discrimination ori- 
ented probabilistic lexical tagger [Lin et al. 1992].
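To make the candidate-generation step concrete, here is a minimal sketch in Python (ours, not the paper's code; the stem-normalized input is assumed to come from the morphological analyzer and the tagger of [Lin et al. 1992], which are not reproduced here):

```python
def candidate_ngrams(stems):
    """Scan a stem-normalized corpus left to right with window
    sizes 2 and 3 to collect bigram and trigram candidates."""
    bigrams = list(zip(stems, stems[1:]))
    trigrams = list(zip(stems, stems[1:], stems[2:]))
    return bigrams, trigrams

# e.g., candidate_ngrams(["open", "the", "dialog", "box"])
# yields bigrams including ("dialog", "box")
```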
The n-gram candidates are associated with a 
number of features so that they can be judged as 
being compound or non-compound. In particular, 
we use the mutual information among the words 
in an n-gram, the relative frequency count of the 
n-gram, and the part of speech patterns associated 
with the word n-grams for the extraction task. 
Such features form an 'observation vector' $\vec{x}$ (to be
described later) in the feature space for an input 
n-gram. Given the input features, we can model 
the compound extraction problem as a two-class 
classification problem, in which an n-gram is ei- 
ther classified as a compound or a non-compound, 
using a likelihood ratio $\lambda$ for decision making:

$$\lambda = \frac{P(\vec{x} \mid M_c) \times P(M_c)}{P(\vec{x} \mid M_{nc}) \times P(M_{nc})}$$
where $M_c$ stands for the event that 'the n-gram
is produced by a compound model', $M_{nc}$ stands
for the alternative event that 'the n-gram is pro-
duced by a non-compound model', and $\vec{x}$ is the
observation associated with the n-gram consisting
of the joint features of mutual information, rela-
tive frequency and part of speech patterns. The
test is a kind of likelihood ratio test commonly
used in statistics [Papoulis 1990]. If $\lambda > 1$, it is
more likely that the n-gram belongs to the com-
pound cluster. Otherwise, it is assigned to the
non-compound cluster. Alternatively, we could
use the logarithmic likelihood ratio $\ln \lambda$ for testing:
if $\ln \lambda > 0$, the n-gram is considered a compound;
it is, otherwise, considered a non-compound.
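As a minimal sketch of this decision rule (illustrative code, not the paper's implementation; the density functions and priors are placeholders for the models estimated in the following sections):

```python
import math

def classify_ngram(x, p_c, p_nc, prior_c, prior_nc):
    """Two-class likelihood ratio test for one n-gram observation x.

    p_c / p_nc: density of x under the compound / non-compound model.
    prior_c / prior_nc: P(Mc) and P(Mnc).
    Returns the label and ln(lambda)."""
    log_lambda = (math.log(p_c(x) * prior_c)
                  - math.log(p_nc(x) * prior_nc))
    return ("compound" if log_lambda > 0 else "non-compound"), log_lambda
```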
Features for Compound Retrieval 
The statistics of mutual information among the 
words in the n-grams, the relative frequency count 
for each n-gram and the transition probabilities 
of the parts of speech of the words are adopted 
as the discriminative features for classification as 
described in the following subsections. 
Mutual Information Mutual information is a 
measure of word association. It compares the 
probability of a group of words to occur together 
(joint probability) to their probabilities of occur- 
ring independently. The bigram mutual informa- 
tion is known as [Church and Hanks 1990]:

$$I(x; y) = \log_2 \frac{P(x, y)}{P(x) \times P(y)}$$
where x and y are two words in the corpus, and 
I(x;y) is the mutual information of these two 
words (in this order). The mutual information of 
a trigram is defined as \[Su et al. 1991\]: 
PD(X,y,z) I(x; y; z) = 
log 2 Pz(x, y, z) 
where PD(X,y,z) -- P(x,y,z) is the probability 
for x, y and z to occur jointly (Dependently), and 
Pi(x, y, z) is the probability for x, y and z to oc- 
cur by chance (Independently), i.e., Pz(x, y, z) =_ 
P(x) x P(y) x P(z)+P(x) x P(y, z)+P(x, y) x P(z). 
In general, $I(\cdot) \gg 0$ implies that the words in the
n-gram are strongly associated. Otherwise, their
appearance may be simply by chance.
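For illustration, both measures can be estimated from raw corpus counts roughly as follows (a sketch using simple maximum-likelihood estimates; all function and variable names are ours):

```python
import math
from collections import Counter

def mutual_information(words):
    """Return MI tables for all bigrams and trigrams in a word list."""
    n = len(words)
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    p = lambda w: uni[w] / n                 # unigram probability
    p2 = lambda x, y: bi[(x, y)] / (n - 1)   # bigram joint probability

    mi2 = {(x, y): math.log2(p2(x, y) / (p(x) * p(y)))
           for (x, y) in bi}
    mi3 = {}
    for (x, y, z), f in tri.items():
        p_d = f / (n - 2)                    # joint (Dependent) probability
        p_i = (p(x) * p(y) * p(z)            # by-chance (Independent) prob.
               + p(x) * p2(y, z) + p2(x, y) * p(z))
        mi3[(x, y, z)] = math.log2(p_d / p_i)
    return mi2, mi3
```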
Relative Frequency Count The relative fre- 
quency count for the $i$-th n-gram is defined as:

$$RFC_i = \frac{f_i}{K}$$

where $f_i$ is the total number of occurrences of the
$i$-th n-gram in the corpus, and $K$ is the average
number of occurrences of all the entries. In other
words, $f_i$ is normalized with respect to $K$ to get
the relative frequency. Intuitively, a frequently en- 
countered word n-gram is more likely to be a com- 
pound than a rarely used n-gram. Furthermore, it 
may not be worth the cost of entering the compound
into the dictionary if it occurs very few times. The 
relative frequency count is therefore used as a fea- 
ture for compound extraction. 
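A matching sketch for this feature (again with our own names; K is the average occurrence count over all distinct entries, as defined above):

```python
from collections import Counter

def relative_frequency_counts(ngrams):
    """RFC_i = f_i / K, with K the mean frequency over all entries."""
    freq = Counter(ngrams)
    K = sum(freq.values()) / len(freq)   # average number of occurrences
    return {gram: f / K for gram, f in freq.items()}
```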
Using both the mutual information and rel- 
ative frequency count as the extraction features 
is desirable since using either of these two fea- 
tures alone cannot provide enough information for 
compound finding. By using relative frequency 
count alone, it is likely to choose an n-gram
with a high relative frequency count but low as-
sociation (mutual information) among the words
comprising the n-gram. For example, if P(x)
and P(y) are very large, they may cause a large
P(x, y) even though the two words are not related. How-
ever, $P(x, y)/(P(x) \times P(y))$ would be small in this
case.
On the other hand, by using mutual informa- 
tion alone it may be highly unreliable if P(x) and 
P(y) are too small. An n-gram may have high 
mutual information not because the words within 
it are highly correlated but due to a large estima- 
tion error. Actually, the relative frequency count 
and mutual information supplement each other. 
A group of words of both high relative frequency 
and mutual information is most likely to be com- 
posed of words which are highly correlated, and 
very commonly used. Hence, such an n-gram is a 
preferred compound candidate. 
The distribution statistics of the training cor- 
pus, excluding those n-grams that appear only 
once or twice, are shown in Tables 1 and 2 (MI: mu-
tual information, RFC: relative frequency count, 
cc: correlation coefficient, sd: standard devia- 
tion). Note that the means of mutual informa- 
tion and relative frequency count of the compound 
cluster are, in general, larger than those in the 
non-compound cluster. The only exception is the 
means of relative frequencies for trigrams. Since 
almost 86.5% of the non-compound trigrams oc- 
cur only once or twice, which are not considered 
in estimation, the average number of occurrences
of such trigrams is smaller, and hence a larger
relative frequency than the compound cluster, in
which only about 30.6% are excluded from consid-
eration.

|         | no. of tokens | mean of MI | sd of MI | mean of RFC | sd of RFC | covariance | cc     |
| bigram  | 862           | 7.49       | 3.08     | 2.43        | 3.18      | -0.71      | -0.072 |
| trigram | 245           | 7.88       | 2.51     | 2.92        | 2.18      | -0.41      | -0.074 |

Table 1: Distribution statistics of compounds

|         | no. of tokens | mean of MI | sd of MI | mean of RFC | sd of RFC | covariance | cc     |
| bigram  | --            | --         | --       | 2.28        | 3.50      | -0.45      | -0.051 |
| trigram | 8057          | 3.55       | 2.24     | 3.14        | 2.99      | -0.33      | -0.049 |

Table 2: Distribution statistics of non-compounds
Note also that mutual information and rel- 
ative frequency count are almost uncorrelated in 
both clusters since the correlation coefficients are 
close to 0. Therefore, it is appropriate to take 
the mutual information measure and relative fre- 
quency count as two supplementary features for 
compound extraction. 
Parts of Speech Part of speech is a very impor- 
tant feature for extracting compounds. In most 
cases, the part of speech pattern of a compound has
the form [noun, noun] or [adjective, noun] (for bigrams)
or [noun, noun, noun], [noun, preposition, noun]
or [adjective, noun, noun] (for trigrams). There-
fore, n-gram entries which violate such syntactic 
constraints should be filtered out even with high 
mutual information and relative frequency count. 
The precision rate of compound extraction will 
then be greatly improved. 
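As a rough illustration of this syntactic constraint (in the proposed model the POS evidence actually enters the likelihood ratio rather than acting as a hard filter; the pattern sets below are only the examples named above, not an exhaustive list from the paper):

```python
# Hypothetical POS pattern filter: keep only n-grams whose tag
# sequences match typical compound patterns.
BIGRAM_PATTERNS = {("noun", "noun"), ("adjective", "noun")}
TRIGRAM_PATTERNS = {
    ("noun", "noun", "noun"),
    ("noun", "preposition", "noun"),
    ("adjective", "noun", "noun"),
}

def passes_pos_filter(tags):
    """tags: tuple of POS tags for a bigram or trigram candidate."""
    if len(tags) == 2:
        return tags in BIGRAM_PATTERNS
    if len(tags) == 3:
        return tags in TRIGRAM_PATTERNS
    return False
```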
Parameter Estimation and 
Smoothing 
The parameters for the compound model $M_c$ and
non-compound model $M_{nc}$ can be evaluated from
a training corpus that is tagged with parts of 
speech and normalized into stem forms. The cor- 
pus is divided into two parts, one as the training 
corpus, and the other as the testing set. The n- 
grams in the training corpus are further divided 
into two clusters. The compound cluster com- 
prises the n-grams already in a compound dictio- 
nary, and the non-compound cluster consists of the 
n-grams which are not in the dictionary. How- 
ever, n-grams that occur only once or twice are 
excluded from consideration because such n-grams 
rarely introduce inconsistency and the estimates
of their mutual information and relative frequency
are highly unreliable.
Since each n-gram may have different part
of speech (POS) patterns $L_i$ in a corpus (e.g.,
$L_i$ = [n n] for a bigram), the mutual information
and relative frequency counts will be estimated for
each of such POS patterns. Furthermore, a partic-
ular POS pattern for an n-gram may have several
types of contextual POS's surrounding it. For ex-
ample, a left context of 'adj' category and a right
context of 'n' together with the above example
POS pattern can form an extended POS pattern,
such as $L_{ij}$ = [adj (n n) n], for the n-gram. By
considering all these features, the numerator fac- 
tor for the log-likelihood ratio test is simplified in 
the following way to make parameter estimation 
feasible: 
$$P(\vec{x} \mid M_c) \times P(M_c) \approx \left[ \prod_{i=1}^{n} P(M_{L_i}, R_{L_i} \mid M_c) \prod_{j=1}^{n_i} P(L_{ij} \mid M_c) \right] \times P(M_c)$$
where $n$ is the number of POS patterns occurring
in the text for the n-gram, $n_i$ is the number of
extended POS patterns corresponding to the $i$-th
POS pattern $L_i$, $L_{ij}$ is the $j$-th extended POS pat-
tern for $L_i$, and $M_{L_i}$ and $R_{L_i}$ represent the means
of the mutual information and relative frequency
count, respectively, for n-grams with POS pattern
$L_i$. The denominator factor for the non-compound
cluster can be evaluated in the same way.
For simplicity, a subscript c (/nc) is used
for the parameters of the compound (/non-
compound) model, e.g., $P(\vec{x} \mid M_c) \equiv P_c(\vec{x})$. As-
sume that $M_{L_i}$ and $R_{L_i}$ are of Gaussian distribu-
tion; then the bivariate probability density func-
tion $P_c(M_{L_i}, R_{L_i})$ for $M_{L_i}$ and $R_{L_i}$ can be evalu-
ated from their estimated means and standard de-
viations [Papoulis 1990]. Further simplification of
the factor $P_c(L_{ij})$ is also possible. Take a bigram
for example, and assume that the probability den-
sity function depends only on the part of speech
pattern of the bigram $(C_1, C_2)$ (in this order), one
left context POS $C_0$ and one right lookahead POS
$C_3$; the above formula can then be decomposed as:
$$P(L_{ij} \mid M_c) = P_c(C_0, C_1, C_2, C_3) \approx P_c(C_3 \mid C_2) \times P_c(C_2 \mid C_1) \times P_c(C_1 \mid C_0) \times P_c(C_0)$$
A similar formulation for trigrams with one left
context POS and one right context POS, i.e.,
$P_c(C_0, C_1, C_2, C_3, C_4)$, can be derived in the
same way.
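A sketch of this decomposition for the bigram case (the probability tables are assumed to have been estimated from the tagged training corpus; all names are ours):

```python
def pos_factor_bigram(c0, c1, c2, c3, p_trans, p_tag):
    """Pc(C0, C1, C2, C3) ~ Pc(C3|C2) * Pc(C2|C1) * Pc(C1|C0) * Pc(C0).

    p_trans[(b, a)] holds the transition probability Pc(b | a);
    p_tag[a] holds the marginal tag probability Pc(a)."""
    return (p_trans[(c3, c2)] * p_trans[(c2, c1)]
            * p_trans[(c1, c0)] * p_tag[c0])
```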
The n-gram entries with frequency count $\leq 2$
are excluded from consideration before estimating
parameters, because they introduce few inconsis-
tency problems and may introduce large estimation
error. After the distribution statistics of the two 
clusters are first estimated, we calculate the means 
and standard deviations of the mutual informa- 
tion and relative frequency counts. The entries 
with outlier values (outside the range of 3 stan- 
dard deviations of the mean) are discarded for es- 
timating a robust set of parameters. The factors,
like $P_c(C_2 \mid C_1)$, are smoothed by adding a flatten-
ing constant 1/2 [Fienberg and Holland 1972] to
the frequency counts before the probability is es- 
timated. 
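The two estimation safeguards, outlier trimming and flattening-constant smoothing, might look as follows (a sketch; the 3-standard-deviation cut and the constant 1/2 are as stated above, everything else is ours):

```python
import statistics

def drop_outliers(values):
    """Discard values farther than 3 standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) <= 3 * sd]

def smoothed_transition(pair_counts, context_counts, tagset):
    """P(C2|C1) with a flattening constant of 1/2 added to each count."""
    return {(c1, c2): (pair_counts.get((c1, c2), 0) + 0.5)
                      / (context_counts.get(c1, 0) + 0.5 * len(tagset))
            for c1 in context_counts for c2 in tagset}
```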
Simulation Results 
After all the required parameters are estimated, 
both for the compound and non-compound clus- 
ters, each input text is tagged with appropriate 
parts of speech, and the log-likelihood function 
$\ln \lambda$ for each word n-gram is evaluated. If it turns
out that $\ln \lambda$ is greater than zero, then the n-gram
is included in the compound list. The entries in
the compound list are later sorted in descend-
ing order of $\lambda$ for use by the lexicographers.
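Glue code for this final step might be sketched as (ours; `log_lambda` is any scorer implementing the test described earlier):

```python
def extract_compounds(candidates, log_lambda):
    """Keep candidates with ln(lambda) > 0 and sort them in
    descending order of significance for the lexicographers."""
    accepted = [(g, log_lambda(g)) for g in candidates]
    accepted = [(g, s) for g, s in accepted if s > 0]
    return sorted(accepted, key=lambda gs: gs[1], reverse=True)
```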
The training set consists of 12,971 sentences 
(192,440 words), and the testing set has 3,243 
sentences (49,314 words) from computer manu- 
als. In total, there are 2,517 distinct bigrams and
1,774 trigrams in the testing set, excluding n-
grams which occur no more than twice.
The performance of the extraction approach for 
bigrams and trigrams is shown in Tables 3 and 4.
The recall and precision for the bigrams are 96.2% 
and 48.2%, respectively, and they become 96.6% 
and 39.6% for the trigrams. The high recall rates 
show that most compounds can be captured in the
candidate list with the proposed approach. The
precision rates, on the other hand, indicate that a 
real compound can be found approximately every 
2 or 3 entries in the candidate list. The method 
therefore provides substantial help for updating 
the dictionary with little human effort.
Note that the testing set precision of bigrams
is a little higher than that of the training set. This sit-
uation is unusual; it is due to the deletion of the
low frequency n-grams from consideration. For in-
stance, the number of compounds in the testing set
occupies only a very small portion (about 2.8%)
after low frequency bigrams are deleted from con-
sideration. The precision for the testing set is there-
fore higher than for the training set.
To make a better trade-off between the preci-
sion rate and recall, we could adjust the threshold
for $\ln \lambda$. For instance, when $\ln \lambda = -4$ is used
for separating the two clusters, the recall will be
raised with a lower precision. On the contrary, by
raising the threshold for $\ln \lambda$ to positive numbers,
the precision will be raised at the cost of a lower
recall.
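The effect of moving the threshold can be checked with a simple sweep (a sketch assuming gold-standard labels are available for a held-out set; all names are ours):

```python
def precision_recall(scores, gold, threshold=0.0):
    """scores: n-gram -> ln(lambda); gold: set of true compounds."""
    predicted = {g for g, s in scores.items() if s > threshold}
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# A threshold of -4 trades precision for recall;
# a positive threshold does the reverse.
```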
|                    | training set | testing set |
| recall rate (%)    | 97.7         | 96.2        |
| precision rate (%) | 44.5         | 48.2        |

Table 3: Performance for bigrams
|                    | training set | testing set |
| recall rate (%)    | 97.6         | 96.6        |
| precision rate (%) | 40.2         | 39.6        |

Table 4: Performance for trigrams
Table 5 shows the first five bigrams and tri-
grams with the largest $\lambda$ for the testing set.
Among them, all five bigrams and four out of five 
trigrams are plausible compounds. 
| bigram        | trigram                |
| dialog box    | Word User's guide      |
| mail label    | Microsoft Word User's  |
| main document | Template option button |
| data file     | new document base      |
| File menu     | File Name box          |

Table 5: The first five bigrams and trigrams
with the largest $\lambda$ for the testing set.
Concluding Remarks 
In machine translation systems, information of 
the source compounds should be available before 
any translation process can begin. However, since 
compounds are very productive, new compounds 
are created from day to day. It is obviously im- 
possible to build a dictionary to contain all com- 
pounds. To guarantee correct parsing and transla- 
tion, new compounds must be extracted from the 
input text and entered into the dictionary. How- 
ever, it is too costly and time-consuming for hu-
mans to inspect the entire text to find the com-
pounds. Therefore, an automatic method to ex- 
tract compounds from the input text is required. 
The method proposed in this paper uses mu- 
tual information, relative frequency count and 
part of speech as the features for discriminating 
compounds and non-compounds. The compound 
extraction problem is formulated as a two cluster 
classification problem in which an n-gram is as- 
signed to one of the two clusters using the like-
lihood ratio test. With this method, the time
for updating missing compounds can be greatly 
reduced, and the consistency between different 
posteditors can be maintained automatically. The 
testing set performance for the bigram compounds 
is 96.2% recall rate and 48.2% precision rate. For 
trigrams, the recall and precision are 96.6% and 
39.6%, respectively. 

References 
[Bourigault 1992] D. Bourigault, 1992. "Surface Grammar Analysis for the Extraction of Terminological Noun Phrases," In Proceedings of COLING-92, vol. 4, pp. 977-981, 14th International Conference on Computational Linguistics, Nantes, France, Aug. 23-28, 1992.

[Calzolari and Bindi 1990] N. Calzolari and R. Bindi, 1990. "Acquisition of Lexical Information from a Large Textual Italian Corpus," In Proceedings of COLING-90, vol. 3, pp. 54-59, 13th International Conference on Computational Linguistics, Helsinki, Finland, Aug. 20-25, 1990.

[Chen and Su 1988] S.-C. Chen and K.-Y. Su, 1988. "The Processing of English Compound and Complex Words in an English-Chinese Machine Translation System," In Proceedings of ROCLING I, Nantou, Taiwan, pp. 87-98, Oct. 21-23, 1988.

[Church and Hanks 1990] K. W. Church and P. Hanks, 1990. "Word Association Norms, Mutual Information, and Lexicography," Computational Linguistics, vol. 16, pp. 22-29, Mar. 1990.

[Fienberg and Holland 1972] S. E. Fienberg and P. W. Holland, 1972. "On the Choice of Flattening Constants for Estimating Multinomial Probabilities," Journal of Multivariate Analysis, vol. 2, pp. 127-134, 1972.

[Levi 1978] J. N. Levi, 1978. The Syntax and Semantics of Complex Nominals, Academic Press, Inc., New York, NY, USA, 1978.

[Lin et al. 1992] Y.-C. Lin, T.-H. Chiang and K.-Y. Su, 1992. "Discrimination Oriented Probabilistic Tagging," In Proceedings of ROCLING V, Taipei, Taiwan, pp. 85-96, Sep. 18-20, 1992.

[Papoulis 1990] A. Papoulis, 1990. Probability & Statistics, Prentice Hall, Inc., Englewood Cliffs, NJ, USA, 1990.

[Su et al. 1991] K.-Y. Su, Y.-L. Hsu and C. Saillard, 1991. "Constructing a Phrase Structure Grammar by Incorporating Linguistic Knowledge and Statistical Log-Likelihood Ratio," In Proceedings of ROCLING IV, Kenting, Taiwan, pp. 257-275, Aug. 18-20, 1991.

[Wu and Su 1993] Ming-Wen Wu and Keh-Yih Su, 1993. "Corpus-based Automatic Compound Extraction with Mutual Information and Relative Frequency Count," In Proceedings of ROCLING VI, Nantou, Taiwan, pp. 207-216, Sep. 2-4, 1993.
