Learning New Compositions from Given Ones 
Ji Donghong 
Dept. of Computer Science 
Tsinghua University 
Beijing, 100084 
P. R. China 
jdh@s1000e.cs.tsinghua.edu.cn
He Jun 
Dept. of Computer Science 
Harbin Institute of Technology
hj@pact518.hit.edu.cn
Huang Changning 
Dept. of Computer Science 
Tsinghua University 
Beijing, 100084 
P. R. China 
hcn@mail.tsinghua.edu.cn 
Abstract 
In this paper, we study the problem of learning new compositions of words from given ones with a specific syntactic structure, e.g., A-N or V-N structures. We first cluster words according to the given compositions, then construct a cluster-based compositional frame for each word cluster, which contains both the new and the given compositions relevant to the words in the cluster. In contrast to other methods, we do not pre-define the number of clusters, and we formalize the problem of clustering words as a non-linear optimization problem, in which the environments of words are specified in terms of the word clusters to be determined rather than their neighboring words. To solve the problem, we use a cooperative evolution strategy to design an evolutionary algorithm.
Ji Donghong, He Jun and Huang Changning (1997) Learning New Compositions from Given Ones. In T.M. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL, pp. 25-32.
© 1997 Association for Computational Linguistics

1 Introduction
Word compositions have long been a concern in lexicography (Benson et al. 1986; Miller et al. 1995), and, as a specific kind of lexical knowledge, they have been shown to play an important role in many areas of natural language processing, e.g., parsing, generation, lexicon building, word sense disambiguation, and information retrieval (e.g., Abney 1989, 1990; Benson et al. 1986; Yarowsky 1995; Church and Hanks 1989; Church, Gale, Hanks, and Hindle 1989). But because of the huge number of words, it is impossible to list all compositions between words by hand in dictionaries. This raises an urgent problem: how can word compositions be acquired automatically?

In general, word compositions fall into two categories: free compositions and bound compositions, i.e., collocations. Free compositions are those in which words can be replaced by other similar ones, while in bound compositions words cannot be replaced freely (Benson 1990). Free compositions are predictable, i.e., their reasonableness can be determined from the syntactic and semantic properties of the words in them, whereas bound compositions are not predictable, i.e., their reasonableness cannot be derived from the syntactic and semantic properties of the words in them (Smadja 1993).

With the availability of large-scale corpora, the automatic acquisition of word compositions, especially word collocations, has been studied extensively (e.g., Choueka et al. 1988; Church and Hanks 1989; Smadja 1993). The key to these methods is to use statistical measures, e.g., frequencies or mutual information, to quantify the compositional strength between words. These methods are more appropriate for retrieving bound compositions than free ones, because in free compositions words are related to each other more loosely, so mutual information and other statistical measures may fail to distinguish reasonable compositions from unreasonable ones.

In this paper, we approach the automatic acquisition of free compositions from a different starting point. Although we cannot list all free compositions, we can select some typical ones, such as those specified in some dictionaries (e.g., Benson 1986; Zhang et al. 1994). Given the properties of free compositions, we can reasonably suppose that the selected compositions provide strong clues for others. Furthermore, we suppose that words can be classified into clusters whose members are similar in compositional ability, which can be characterized as the set of words that can be combined with them to form meaningful phrases. Thus any given composition, although it literally specifies a relation between two words, suggests a relation between two clusters. So for each word (or cluster), there exist some word clusters such that the word (or the words in the cluster) can combine with the words in those clusters, and only with them, to form meaningful phrases. We call the set of these clusters the compositional frame of the word (or the cluster).

A seemingly plausible method for determining compositional frames is to use the pre-defined semantic classes in some thesauri (e.g., Miller et al. 1993; Mei et al. 1996). The rationale is the assumption that if one word can be combined with another to form a meaningful phrase, then words similar to them in meaning can also be combined with each other. But it has been shown that similarity in meaning does not correspond to similarity in compositional ability (Zhu 1982), so adopting semantic classes to construct compositional frames will result in considerable redundancy.

An alternative to semantic classes is word clusters based on distributional environment (Brown et al., 1992), which in general refers to the words surrounding a given word (e.g., Hatzivassiloglou et al., 1993; Pereira et al., 1993), or the classes of those words (Bensch et al., 1995), or more complex statistical measures (Dagan et al., 1993). Given the properties of the clusters in compositional frames, the clusters should be based on the environment, which here, however, is restricted to the given compositions. Because the given compositions are listed by hand, statistical measures cannot be used to form the environment; the remaining choices are the surrounding words or their classes.
Pereira et al. (1993) put forward a method to cluster nouns in V-N compositions, taking the verbs that can combine with a noun as its environment. Although its goal is to deal with data sparseness, the method itself suffers from that problem. One strategy for alleviating its effects is to cluster nouns and verbs simultaneously. But then word clustering becomes a bootstrapping problem, i.e., a non-linear one: the environment itself is also to be determined. Bensch et al. (1995) proposed a definite method to deal with a generalized version of the non-linear problem, but it suffers from local optima.
In this paper, we focus on A-N compositions in Chinese, and explore the problem of learning new compositions from given ones. To cope with the sparseness problem, we take adjective clusters as the environment of nouns, and noun clusters as the environment of adjectives. To avoid locally optimal solutions, we propose a cooperative evolutionary strategy. The method uses no knowledge specific to the A-N structure, and can be applied to other structures.
The remainder of the paper is organized as fol- 
lows: in section 2, we give a formal description of the 
problem. In section 3, we discuss a kind of coopera- 
tive evolution strategy to deal with the problem. In 
section 4, we explore the problem of parameter es- 
timation. In section 5, we present our experiments 
and the results as well as their evaluation. In section 
6, we give some conclusions and discuss future work. 
2 Problem Setting 
Given an adjective set and a noun set, suppose that for each noun some adjectives are listed as its compositional instances. Our goal is to learn new reasonable compositions from these instances. To do so, we cluster nouns and adjectives simultaneously and build a compositional frame for each noun.
Suppose $A$ is the set of adjectives and $N$ is the set of nouns. For any $a \in A$, let $f(a) \subseteq N$ be the instance set of $a$, i.e., the set of nouns in $N$ which can be combined with $a$, and for any $n \in N$, let $g(n) \subseteq A$ be the instance set of $n$, i.e., the set of adjectives in $A$ which can be combined with $n$. We first give some formal definitions:
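Definition 1 partition
Suppose $U$ is a non-empty finite set. We call $\langle U_1, U_2, \ldots, U_k \rangle$ a partition of $U$ if:
i) for any $U_i$ and $U_j$ with $i \neq j$, $U_i \cap U_j = \emptyset$;
ii) $U = \bigcup_{1 \le l \le k} U_l$.
We call each $U_i$ a cluster of $U$.
Suppose $U = \langle A_1, A_2, \ldots, A_p \rangle$ is a partition of $A$, $V = \langle N_1, N_2, \ldots, N_q \rangle$ is a partition of $N$, and $f$ and $g$ are defined as above. For any $N_i$, let
$$g(N_i) = \{A_j : \exists n \in N_i,\ A_j \cap g(n) \neq \emptyset\},$$
and for any $n \in N_k$, let
$$\delta_{\langle U,V \rangle}(n) = \left| \{a : \exists A_j,\ A_j \in g(N_k),\ a \in A_j\} - g(n) \right|.$$
Intuitively, $\delta_{\langle U,V \rangle}(n)$ is the number of new instances relevant to $n$.
We define the general learning amount as follows:
Definition 2 learning amount
$$\delta_{\langle U,V \rangle} = \sum_{n \in N} \delta_{\langle U,V \rangle}(n)$$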
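As a small illustration of the definitions above, the following sketch computes the learning amount for a toy example. The dictionary `g`, the two partitions, and the function names `frame` and `learning_amount` are hypothetical and chosen only for illustration; partitions are represented as lists of sets.

```python
def frame(noun_cluster, adj_partition, g):
    """g(N_k): the adjective clusters that share an adjective with g(n) for some n in N_k."""
    return [A for A in adj_partition if any(A & g[n] for n in noun_cluster)]


def learning_amount(adj_partition, noun_partition, g):
    """Sum, over all nouns, of the adjectives in the noun's frame that are not given instances."""
    total = 0
    for noun_cluster in noun_partition:
        covered = set().union(*frame(noun_cluster, adj_partition, g))
        total += sum(len(covered - g[n]) for n in noun_cluster)
    return total


# Toy data (hypothetical): the given adjective instances of each noun.
g = {"tea": {"strong", "hot"}, "coffee": {"strong"}, "wind": {"strong", "cold"}}
U = [{"strong"}, {"hot", "cold"}]      # a partition of the adjectives
V = [{"tea", "coffee"}, {"wind"}]      # a partition of the nouns

print(learning_amount(U, V, g))
```

With singleton clusters on both sides, every frame covers exactly the given instances, so the learning amount is 0; coarser partitions add new compositions.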
Based on the partitions of both nouns and adjec- 
tives, we can define the distance between nouns and 
that between adjectives. 
Definition 3 distance between words
For any $a \in A$, let $f_V(a) = \{N_i : 1 \le i \le q,\ N_i \cap f(a) \neq \emptyset\}$, and for any $n \in N$, let $g_U(n) = \{A_i : 1 \le i \le p,\ A_i \cap g(n) \neq \emptyset\}$. For any two nouns $n_1$ and $n_2$ and any two adjectives $a_1$ and $a_2$, we define the distances between them respectively as follows:
i) $$dis_U(n_1, n_2) = 1 - \frac{|g_U(n_1) \cap g_U(n_2)|}{|g_U(n_1) \cup g_U(n_2)|}$$
ii) $$dis_V(a_1, a_2) = 1 - \frac{|f_V(a_1) \cap f_V(a_2)|}{|f_V(a_1) \cup f_V(a_2)|}$$
According to the distances between words, we can 
define the distances between word sets. 
Definition 4 distance between word sets
Given any two adjective sets $X_1, X_2 \subseteq A$ and any two noun sets $Y_1, Y_2 \subseteq N$, their distances are:
i) $$dis_V(X_1, X_2) = \max_{a_1 \in X_1,\ a_2 \in X_2} \{dis_V(a_1, a_2)\}$$
ii) $$dis_U(Y_1, Y_2) = \max_{n_1 \in Y_1,\ n_2 \in Y_2} \{dis_U(n_1, n_2)\}$$
Intuitively, the distance between two word sets is the biggest distance between a word in one set and a word in the other.
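The following sketch illustrates the non-linear noun distance of Definition 3 and the set distance of Definition 4 on toy data. All names and data are hypothetical; returning 0.0 when both environments are empty is an assumption, since the text does not define that case.

```python
def cluster_env(n, adj_partition, g):
    """g_U(n): indices of the adjective clusters that contain an instance of n."""
    return {i for i, A in enumerate(adj_partition) if A & g[n]}


def noun_distance(n1, n2, adj_partition, g):
    """dis_U(n1, n2): one minus the overlap ratio of the two cluster environments."""
    e1 = cluster_env(n1, adj_partition, g)
    e2 = cluster_env(n2, adj_partition, g)
    union = e1 | e2
    if not union:
        return 0.0  # assumption: this case is not defined in the text
    return 1 - len(e1 & e2) / len(union)


def noun_set_distance(Y1, Y2, adj_partition, g):
    """dis_U(Y1, Y2): the biggest distance between a noun in Y1 and a noun in Y2."""
    return max(noun_distance(n1, n2, adj_partition, g) for n1 in Y1 for n2 in Y2)


# Toy data (hypothetical).
g = {"tea": {"strong", "hot"}, "coffee": {"strong"}, "wind": {"strong", "cold"}}
U = [{"strong"}, {"hot", "cold"}]

print(noun_distance("tea", "wind", U, g))
print(noun_distance("tea", "coffee", U, g))
print(noun_set_distance({"tea", "coffee"}, {"wind"}, U, g))
```

In this toy example "tea" and "wind" share only one of their three adjectives, but because "hot" and "cold" fall into the same adjective cluster, their cluster-based environments coincide. This is how cluster-based environments can soften the effect of sparse instances.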
We formalize the problem of clustering nouns and adjectives simultaneously as a constrained optimization problem.
(1) Determine a partition $U = \langle A_1, A_2, \ldots, A_p \rangle$ of $A$ and a partition $V = \langle N_1, N_2, \ldots, N_q \rangle$ of $N$, where $p, q > 0$, which satisfy i) and ii) and minimize $\delta_{\langle U,V \rangle}$:
i) for any $a_1, a_2 \in A_i$, $1 \le i \le p$, $dis_V(a_1, a_2) < t_1$; for any $A_i$ and $A_j$, $1 \le i \neq j \le p$, $dis_V(A_i, A_j) \ge t_1$;
ii) for any $n_1, n_2 \in N_i$, $1 \le i \le q$, $dis_U(n_1, n_2) < t_2$; for any $N_i$ and $N_j$, $1 \le i \neq j \le q$, $dis_U(N_i, N_j) \ge t_2$.
Intuitively, conditions i) and ii) keep the distances between words within a cluster small and those between different clusters large, and minimizing $\delta_{\langle U,V \rangle}$ means minimizing the distances between the words within clusters.
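The two constraints in (1) have the same shape for adjectives and nouns, so a single check can be sketched for both. The helper below is hypothetical: it takes any partition, any pairwise distance function and a threshold, and the toy distances are made up for illustration.

```python
from itertools import combinations


def satisfies_condition(partition, dist, t):
    """Check one constraint of (1): distances within a cluster are below t,
    and the set distance (maximum pairwise distance) between clusters is at least t."""
    within = all(dist(x, y) < t
                 for cluster in partition
                 for x, y in combinations(sorted(cluster), 2))
    between = all(max(dist(x, y) for x in c1 for y in c2) >= t
                  for c1, c2 in combinations(partition, 2))
    return within and between


# Toy pairwise distances (hypothetical values).
pairwise = {frozenset("ab"): 0.2, frozenset("ac"): 0.9, frozenset("bc"): 0.8}


def dist(x, y):
    return pairwise[frozenset((x, y))]


print(satisfies_condition([{"a", "b"}, {"c"}], dist, 0.5))
print(satisfies_condition([{"a", "b", "c"}], dist, 0.5))
print(satisfies_condition([{"a"}, {"b"}, {"c"}], dist, 0.5))
```

The second partition fails because two words in the same cluster are too far apart; the third fails because two separate clusters are closer than the threshold and would have to be merged.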
In fact, $(U, V)$ can be seen as an abstraction model over the given compositions, and $t_1$ and $t_2$ as its abstraction degree. Consider two special cases. One is $t_1 = t_2 = 0$, i.e., the lowest abstraction degree: each noun forms a cluster by itself and each adjective forms a cluster by itself, so no new compositions are learned. The other is $t_1 = t_2 = 1$, the highest abstraction degree: a possible result is that all nouns form one cluster and all adjectives form one cluster, so all possible compositions, reasonable or unreasonable, are learned. We therefore need to estimate appropriate values for the two parameters in order to make an appropriate abstraction over the given compositions, i.e., to make the compositional frames contain as many reasonable compositions and as few unreasonable ones as possible.
3 Cooperative Evolution 
Evolutionary algorithms have been applied in many areas of AI since their introduction (Davis et al., 1991; Holland 1992). Recently, cooperative evolution, a new and powerful learning strategy, has gained much attention for solving complex non-linear problems. In this section, we discuss how to deal with problem (1) using this strategy.
Given the interaction between adjective clusters and noun clusters, we adopt the following cooperative strategy: after establishing the preliminary solutions, for each preliminary solution we optimize the partition of $N$ based on the partition of $A$, then optimize the partition of $A$ based on the partition of $N$, and so on, until the given conditions are satisfied.
3.1 Preliminary Solutions 
When determining the preliminary population, we also cluster nouns and adjectives. However, we take the environment of a noun to be the set of all adjectives which occur with it in the given compositions, and that of an adjective to be the set of all nouns which occur with it in the given compositions. In contrast to (1), this is a linear clustering problem.
Suppose $a_1, a_2 \in A$ and $f$ is defined as above. We define the linear distance between them as (2):
(2) $$dis(a_1, a_2) = 1 - \frac{|f(a_1) \cap f(a_2)|}{|f(a_1) \cup f(a_2)|}$$
Similarly, we can define the linear distance between nouns, $dis(n_1, n_2)$, based on $g$. In contrast, we call the distances in Definition 3 non-linear distances.
According to the linear distances between adjectives, we can determine a preliminary partition of $A$: randomly select an adjective and put it into an empty set $X$, then scan the other adjectives in $A$; for any adjective in $A - X$, if its distances from the adjectives in $X$ are all smaller than $t_1$, put it into $X$. Finally $X$ forms a preliminary cluster. Similarly, we can build another preliminary cluster in $A - X$, and so on, until we get a set of preliminary clusters, which is a partition of $A$. Different orders of scanning the adjectives give different preliminary partitions of $A$. Similarly, we can determine the preliminary partitions of $N$ based on the linear distances between nouns. A partition of $A$ and a partition of $N$ together form a preliminary solution of (1), and all possible preliminary solutions form the population of preliminary solutions, which we also call the population of 0th generation solutions.
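The construction of one preliminary partition can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the toy instance sets `f`, and the use of a seeded `random.Random` to fix the scan order are all hypothetical choices, and treating two empty environments as distance 0.0 is an assumption.

```python
import random


def linear_distance(x, y, env):
    """Linear distance (2): one minus the overlap ratio of the two instance sets."""
    union = env[x] | env[y]
    if not union:
        return 0.0  # assumption: this case is not defined in the text
    return 1 - len(env[x] & env[y]) / len(union)


def preliminary_partition(words, env, t, rng):
    """Greedy clustering: seed a cluster with one word, then add every remaining
    word whose distance to all current members is below t; repeat on the rest."""
    remaining = list(words)
    rng.shuffle(remaining)  # the scan order determines which partition is produced
    partition = []
    while remaining:
        cluster = [remaining.pop(0)]
        for w in list(remaining):
            if all(linear_distance(w, x, env) < t for x in cluster):
                cluster.append(w)
                remaining.remove(w)
        partition.append(frozenset(cluster))
    return partition


# Toy instance sets f(a) for four adjectives (hypothetical data).
f = {"strong": {"tea", "coffee", "wind"}, "weak": {"tea", "coffee"},
     "hot": {"tea"}, "cold": {"wind"}}

partition = preliminary_partition(sorted(f), f, 0.4, random.Random(0))
print(partition)
```

Calling the function with different random generators yields different members of the preliminary population; with a threshold of 0 every word stays in its own cluster.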
3.2 Evolution Operation 
In general, an evolution operation consists of recombination, mutation and selection. Recombination combines two solutions in a generation to form a solution belonging to the next generation. Suppose $\langle U_1^{(i)}, V_1^{(i)} \rangle$ and $\langle U_2^{(i)}, V_2^{(i)} \rangle$ are two $i$th generation solutions, where $U_1^{(i)}$ and $U_2^{(i)}$ are two partitions of $A$, and $V_1^{(i)}$ and $V_2^{(i)}$ are two partitions of $N$; then $\langle U_1^{(i)}, V_2^{(i)} \rangle$ and $\langle U_2^{(i)}, V_1^{(i)} \rangle$ form two possible $(i+1)$th generation solutions.
Mutation improves the fitness of a solution in a generation, so that it evolves into a new solution belonging to the next generation. Suppose $\langle U^{(i)}, V^{(i)} \rangle$ is an $i$th generation solution, where $U^{(i)} = \langle A_1, A_2, \ldots, A_p \rangle$ and $V^{(i)} = \langle N_1, N_2, \ldots, N_q \rangle$ are partitions of $A$ and $N$ respectively. Mutation aims at optimizing $V^{(i)}$ into $V^{(i+1)}$ based on $U^{(i)}$ so that $V^{(i+1)}$ satisfies condition ii) in (1), or at optimizing $U^{(i)}$ into $U^{(i+1)}$ based on $V^{(i)}$ so that $U^{(i+1)}$ satisfies condition i) in (1), and then at moving words across clusters to minimize $\delta_{\langle U,V \rangle}$.
We design three steps for the mutation operation: splitting, merging and moving. The first two are intended to make the partitions satisfy the conditions in (1), and the third to minimize $\delta_{\langle U,V \rangle}$. In the following, we take the evolution of $V^{(i)}$ into $V^{(i+1)}$ as an example to demonstrate the three steps.
Splitting procedure. For any $N_k$, $1 \le k \le q$, if there exist $n_1, n_2 \in N_k$ such that $dis_{U^{(i)}}(n_1, n_2) \ge t_2$, then split $N_k$ into two subsets $X$ and $Y$. The procedure is as follows:
i) Put $n_1$ into $X$ and $n_2$ into $Y$,
ii) Select the noun in $N_k - (X \cup Y)$ whose distance from $n_1$ is the smallest, and put it into $X$,
iii) Select the noun in $N_k - (X \cup Y)$ whose distance from $n_2$ is the smallest, and put it into $Y$,
iv) Repeat ii) and iii) until $X \cup Y = N_k$.
For $X$ (or $Y$), if there exist $n_1, n_2 \in X$ (or $Y$) such that $dis_{U^{(i)}}(n_1, n_2) \ge t_2$, we can use the above procedure to split it into smaller sets. Obviously, by repeating the procedure we can split any $N_k$ in $V^{(i)}$ into several subsets which satisfy condition ii) in (1).
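The splitting steps above can be sketched as a recursive function. The sketch is hypothetical in its names and data: `dist` stands in for the non-linear distance, and the toy distance below is simply the gap between made-up positions on a line.

```python
def split_cluster(cluster, dist, t):
    """Split a cluster until every pair of members is closer than t."""
    members = sorted(cluster)
    violating = next(((x, y) for i, x in enumerate(members)
                      for y in members[i + 1:] if dist(x, y) >= t), None)
    if violating is None:
        return [frozenset(members)]
    n1, n2 = violating
    X, Y = [n1], [n2]                      # step i)
    rest = [w for w in members if w not in (n1, n2)]
    while rest:                            # step iv): repeat ii) and iii)
        nearest = min(rest, key=lambda w: dist(w, n1))
        X.append(nearest)                  # step ii)
        rest.remove(nearest)
        if rest:
            nearest = min(rest, key=lambda w: dist(w, n2))
            Y.append(nearest)              # step iii)
            rest.remove(nearest)
    # X and Y are both strictly smaller than the input, so the recursion terminates.
    return split_cluster(X, dist, t) + split_cluster(Y, dist, t)


# Toy stand-in for the distance: positions on a line (hypothetical data).
position = {"a": 0.0, "b": 0.1, "c": 0.9, "d": 1.0}


def dist(x, y):
    return abs(position[x] - position[y])


print(split_cluster({"a", "b", "c", "d"}, dist, 0.5))
```

In the method itself the distance depends on the adjective partition, which stays fixed while the noun partition is mutated, so a fixed `dist` callable is a reasonable stand-in here.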
Merging procedure. If there exist $N_j$ and $N_k$, where $1 \le j, k \le q$ and $j \neq k$, such that $dis_{U^{(i)}}(N_j, N_k) < t_2$, then merge them into a new cluster.
It is easy to prove that $U^{(i)}$ and $V^{(i)}$ will meet conditions i) and ii) in (1) respectively after the splitting and merging procedures.
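A minimal sketch of the merging step follows, again with a hypothetical `dist` callable and made-up positions standing in for the non-linear distance. Because the set distance is the maximum pairwise distance, merging two clusters whose set distance is below the threshold keeps every pair inside the merged cluster below the threshold.

```python
def set_distance(c1, c2, dist):
    """Distance between two clusters: the biggest pairwise distance across them."""
    return max(dist(x, y) for x in c1 for y in c2)


def merge_clusters(partition, dist, t):
    """Repeatedly merge any two clusters whose set distance is below t."""
    clusters = [set(c) for c in partition]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if set_distance(clusters[i], clusters[j], dist) < t:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return [frozenset(c) for c in clusters]


# Toy stand-in for the distance: positions on a line (hypothetical data).
position = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 1.0}


def dist(x, y):
    return abs(position[x] - position[y])


print(merge_clusters([{"a"}, {"b"}, {"c"}, {"d"}], dist, 0.5))
```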
Moving procedure. We call moving $n$ from $N_j$ to $N_k$, where $1 \le j \neq k \le q$, a word move, denoted $(n, N_j, N_k)$, if condition ii) remains satisfied. The procedure is as follows:
i) Select a word move $(n, N_j, N_k)$ which minimizes $\delta_{\langle U,V \rangle}$,
ii) Move $n$ from $N_j$ to $N_k$,
iii) Repeat i) and ii) until there are no word moves which reduce $\delta_{\langle U,V \rangle}$.
After the three steps, $U^{(i)}$ and $V^{(i)}$ evolve into $U^{(i+1)}$ and $V^{(i+1)}$ respectively.
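The moving step can be sketched as a greedy loop over candidate word moves. This is only an illustration under stated assumptions: all names and the toy data are hypothetical, every noun is assumed to have at least one given instance, and moves that would empty a cluster are skipped as a simplification the text does not discuss.

```python
from itertools import combinations


def cluster_env(n, U, g):
    return {i for i, A in enumerate(U) if A & g[n]}


def noun_distance(n1, n2, U, g):
    e1, e2 = cluster_env(n1, U, g), cluster_env(n2, U, g)
    return 1 - len(e1 & e2) / len(e1 | e2)  # assumes every noun has an instance


def condition_ii(V, U, g, t2):
    within = all(noun_distance(x, y, U, g) < t2
                 for N in V for x, y in combinations(sorted(N), 2))
    between = all(max(noun_distance(x, y, U, g) for x in N1 for y in N2) >= t2
                  for N1, N2 in combinations(V, 2))
    return within and between


def learning_amount(U, V, g):
    total = 0
    for N in V:
        covered = set().union(*(A for A in U if any(A & g[n] for n in N)))
        total += sum(len(covered - g[n]) for n in N)
    return total


def move_words(U, V, g, t2):
    """Apply the best admissible word move until no move reduces the learning amount."""
    V = [set(N) for N in V]
    while True:
        best, best_amount = None, learning_amount(U, V, g)
        for j in range(len(V)):
            if len(V[j]) < 2:
                continue  # simplification: never empty a cluster
            for k in range(len(V)):
                if j == k:
                    continue
                for n in sorted(V[j]):
                    candidate = [set(N) for N in V]
                    candidate[j].remove(n)
                    candidate[k].add(n)
                    if condition_ii(candidate, U, g, t2):
                        amount = learning_amount(U, candidate, g)
                        if amount < best_amount:
                            best, best_amount = candidate, amount
        if best is None:
            return V
        V = best


# Toy data (hypothetical): five singleton adjective clusters and three nouns.
U = [{"a1"}, {"a2"}, {"a3"}, {"a4"}, {"a5"}]
g = {"x1": {"a1", "a2", "a3", "a5"}, "x2": {"a2", "a3"}, "y1": {"a2", "a3", "a4"}}
V = [{"x1", "x2"}, {"y1"}]

result = move_words(U, V, g, 0.55)
print(result, learning_amount(U, result, g))
```

In this toy example moving "x2" next to "y1" keeps condition ii) satisfied and lowers the learning amount, after which no further move helps.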
The selection operation selects solutions from the population of a given generation according to their fitness. We define the fitness of a solution as its learning amount.
We use $J_i$ to denote the set of $i$th generation solutions. $H(i, i+1)$, as in (3), specifies the similarity between the $i$th generation solutions and the $(i+1)$th generation solutions.
(3) $$H(i, i+1) = \frac{\min\{\delta_{(U^{(i+1)}, V^{(i+1)})} : (U^{(i+1)}, V^{(i+1)}) \in J_{i+1}\}}{\min\{\delta_{(U^{(i)}, V^{(i)})} : (U^{(i)}, V^{(i)}) \in J_i\}}$$
Let $t_3$ be a threshold for $H(i, i+1)$. The general evolutionary algorithm is as follows:
Procedure Clustering(A, N, f, g);
begin
  i) Build the preliminary solution population $I_0$,
  ii) Determine the 0th generation solution set $J_0$ from $I_0$ according to their fitness,
  iii) Determine $I_{i+1}$ based on $J_i$:
    a) Recombination: if $(U_1^{(i)}, V_1^{(i)}), (U_2^{(i)}, V_2^{(i)}) \in J_i$, then $(U_1^{(i)}, V_2^{(i)}), (U_2^{(i)}, V_1^{(i)}) \in I_{i+1}$,
    b) Mutation: if $(U^{(i)}, V^{(i)}) \in J_i$, then $(U^{(i)}, V^{(i+1)}), (U^{(i+1)}, V^{(i)}) \in I_{i+1}$,
  iv) Determine $J_{i+1}$ from $I_{i+1}$ according to their fitness,
  v) If $H(i, i+1) > t_3$, then exit, otherwise goto iii),
end
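The control flow of the procedure can be sketched generically. This skeleton is an illustration only: the parameter `keep` (how many solutions selection retains), the cap `max_generations`, and the guard for a learning amount of 0 are assumptions not stated in the text, and the toy solutions are pairs of integers rather than real partitions.

```python
from itertools import product


def evolve(population, mutate, fitness, t3=0.95, keep=2, max_generations=100):
    """Skeleton of the evolutionary loop; a lower fitness (learning amount) is better."""
    J = sorted(population, key=fitness)[:keep]        # selection for generation 0
    for _ in range(max_generations):
        # Recombination: pair the first component of one selected solution
        # with the second component of another.
        I = [(U1, V2) for (U1, _), (_, V2) in product(J, repeat=2)]
        # Mutation: every selected solution contributes its mutated variants.
        for solution in J:
            I.extend(mutate(solution))
        J_next = sorted(I, key=fitness)[:keep]        # selection
        best_old, best_new = fitness(J[0]), fitness(J_next[0])
        J = J_next
        # Stop when H(i, i+1) exceeds t3, i.e., the best value has almost stopped improving.
        if best_old == 0 or best_new / best_old > t3:
            break
    return J[0]


# Toy stand-ins (hypothetical): a solution is a pair of integers and its
# "learning amount" is their sum; mutation lowers one component by one.
def fitness(solution):
    return solution[0] + solution[1]


def mutate(solution):
    u, v = solution
    return [(max(u - 1, 0), v), (u, max(v - 1, 0))]


print(evolve([(3, 4), (5, 2)], mutate, fitness))
```

In a real run, `mutate` would apply the splitting, merging and moving procedures to one partition while holding the other fixed, and `fitness` would be the learning amount of the pair of partitions.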
After determining the clusters of adjectives and nouns, we can construct the compositional frame for each noun cluster or each noun. In fact, for each noun cluster $N_i$, $g(N_i) = \{A_j : \exists n \in N_i,\ A_j \cap g(n) \neq \emptyset\}$ is just its compositional frame, and for any noun in $N_i$, $g(N_i)$ is also its compositional frame. Similarly, for each adjective (or adjective cluster), we can also determine its compositional frame.
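Once the frames are built, the new compositions can be read off directly. The sketch below lists them for toy data; the function name, the partitions and the instance sets are hypothetical.

```python
def new_compositions(U, V, g):
    """Return the (adjective, noun) pairs licensed by the frames but not given."""
    learned = set()
    for noun_cluster in V:
        # Compositional frame g(N_i): adjective clusters touching g(n) for some n in N_i.
        frame = [A for A in U if any(A & g[n] for n in noun_cluster)]
        adjectives = set().union(*frame)
        for n in noun_cluster:
            learned |= {(a, n) for a in adjectives - g[n]}
    return learned


# Toy data (hypothetical).
g = {"tea": {"strong", "hot"}, "coffee": {"strong"}, "wind": {"strong", "cold"}}
U = [{"strong"}, {"hot", "cold"}]
V = [{"tea", "coffee"}, {"wind"}]

print(sorted(new_compositions(U, V, g)))
```

The number of pairs returned equals the learning amount of the pair of partitions, which is why coarse partitions tend to produce unreasonable pairs alongside the reasonable ones.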
4 Parameter Estimation 
The parameters $t_1$ and $t_2$ in (1) are the thresholds for the distances between the clusters of $A$ and $N$ respectively. If they are too big, the established frame will contain more unreasonable compositions; if they are too small, many reasonable compositions may not be included in the frame. Thus, we should determine appropriate values for $t_1$ and $t_2$, which make the frame contain as many reasonable compositions and as few unreasonable ones as possible.
Suppose $F_i$ is the compositional frame of $N_i$, and let $F = \langle F_1, F_2, \ldots, F_q \rangle$. For any $F_i$, let $A_{F_i} = \{a : \exists X \in F_i,\ a \in X\}$. Intuitively, $A_{F_i}$ is the set of adjectives learned as the compositional instances of the nouns in $N_i$. For any $n \in N_i$, we use $A_n$ to denote the set of all adjectives which can in fact modify $n$ to form a meaningful phrase. We now define the deficiency rate and the redundancy rate of $F$. For convenience, we use $\delta_F$ to represent $\delta_{\langle U,V \rangle}$.
Definition 5 Deficiency rate $\alpha_F$
$$\alpha_F = \frac{\sum_{1 \le i \le q} \sum_{n \in N_i} |A_n - A_{F_i}|}{\sum_{n \in N} |A_n|}$$
Intuitively, $\alpha_F$ is the ratio of the reasonable compositions which are not learned to all the reasonable ones.
Definition 6 Redundancy rate $\beta_F$
$$\beta_F = \frac{\sum_{1 \le i \le q} \sum_{n \in N_i} |A_{F_i} - A_n|}{\delta_F}$$
Intuitively, $\beta_F$ is the ratio of the unreasonable compositions which are learned to all the learned ones.
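The two rates can be computed as in the sketch below. It assumes that the deficiency rate is normalized by the total number of reasonable compositions and the redundancy rate by the learning amount, as the intuitive descriptions state; the function name, argument layout and toy data are hypothetical, and returning 0.0 for the redundancy rate when nothing new is learned is an assumption.

```python
def rates(V, learned, truth, g):
    """learned[i] is the adjective set of the frame of cluster V[i], truth[n] the
    adjectives that really combine with n, and g[n] the given instances of n."""
    pairs = [(i, n) for i, N in enumerate(V) for n in N]
    missed = sum(len(truth[n] - learned[i]) for i, n in pairs)
    reasonable = sum(len(truth[n]) for _, n in pairs)
    wrong = sum(len(learned[i] - truth[n]) for i, n in pairs)
    delta = sum(len(learned[i] - g[n]) for i, n in pairs)   # learning amount
    alpha = missed / reasonable
    beta = wrong / delta if delta else 0.0  # assumption: 0 when nothing is learned
    return alpha, beta


# Toy data (hypothetical).
V = [{"tea", "coffee"}]
learned = [{"strong", "hot", "cold"}]
g = {"tea": {"strong", "hot"}, "coffee": {"strong"}}
truth = {"tea": {"strong", "hot", "cold", "weak"}, "coffee": {"strong", "hot", "weak"}}

alpha, beta = rates(V, learned, truth, g)
print(alpha, beta)
```

Here two of the seven reasonable compositions are missed (both with "weak"), and one of the three newly learned compositions is unreasonable ("cold" with "coffee").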
So the problem of estimating $t_1$ and $t_2$ can be formalized as (5):
(5) Find $t_1$ and $t_2$ which make $\alpha_F = 0$ and $\beta_F = 0$.
But (5) may have no solution, because its constraints are too strong. On one hand, the sparseness of instances may prevent $\alpha_F$ from reaching 0 even if $t_1$ and $t_2$ are close to 1; on the other hand, the differences between words may prevent $\beta_F$ from reaching 0 even if $t_1$ and $t_2$ are close to 0. So we need to weaken (5).
In fact, both $\alpha_F$ and $\beta_F$ can be seen as functions of $t_1$ and $t_2$, denoted $\alpha_F(t_1, t_2)$ and $\beta_F(t_1, t_2)$ respectively. Given values for $t_1$ and $t_2$, we can compute $\alpha_F$ and $\beta_F$. Although there may exist no values $(t_1', t_2')$ for $(t_1, t_2)$ such that $\alpha_F(t_1', t_2') = \beta_F(t_1', t_2') = 0$, as $t_1$ and $t_2$ increase, $\alpha_F$ tends to decrease, while $\beta_F$ tends to increase. So we can weaken (5) as (6).
(6) Find values $(t_1', t_2')$ for $(t_1, t_2)$ which maximize (7).
(7) $$\frac{\sum_{(t_1, t_2) \in \Gamma_1(t_1', t_2')} \alpha_F(t_1, t_2)}{|\Gamma_1(t_1', t_2')|} - \frac{\sum_{(t_1, t_2) \in \Gamma_2(t_1', t_2')} \alpha_F(t_1, t_2)}{|\Gamma_2(t_1', t_2')|}$$
where $\Gamma_1(t_1', t_2') = \{(t_1, t_2) : 0 \le t_1 \le t_1',\ 0 \le t_2 \le t_2'\}$ and $\Gamma_2(t_1', t_2') = \{(t_1, t_2) : t_1' < t_1 \le 1,\ t_2' < t_2 \le 1\}$.
Intuitively, if we see the area $([0, 1]; [0, 1])$ as a sample space for $t_1$ and $t_2$, then $\Gamma_1(t_1', t_2')$ and $\Gamma_2(t_1', t_2')$ are its sub-areas. The former part of (7) is the mean deficiency rate of the points in $\Gamma_1(t_1', t_2')$, and the latter part of (7) is the mean deficiency rate of the points in $\Gamma_2(t_1', t_2')$. Maximizing (7) means maximizing its former part while minimizing its latter part. So weakening (5) into (6) amounts to finding a point $(t_1', t_2')$ such that the mean deficiency rate of the sample points in $\Gamma_2(t_1', t_2')$ tends to be very low, rather than finding a point $(t_1', t_2')$ whose deficiency rate is 0.
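The search for the thresholds can be sketched as a grid search over sampled points. The sketch assumes that the objective is the mean deficiency rate over the lower sub-area minus the mean over the upper sub-area, as the description of the former and latter parts suggests; the function names, the grid of 11 by 11 points indexed by integers, and the made-up deficiency values are all hypothetical. Candidates on the upper border are skipped because the upper sub-area would be empty there.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def objective(alpha, i, j, steps=10):
    """Score of the candidate (i / steps, j / steps), assuming a difference of means."""
    lower = [alpha[(a, b)] for a in range(0, i + 1) for b in range(0, j + 1)]
    upper = [alpha[(a, b)] for a in range(i + 1, steps + 1)
             for b in range(j + 1, steps + 1)]
    return mean(lower) - mean(upper)


def best_thresholds(alpha, steps=10):
    """Return the candidate thresholds with the highest score on the grid."""
    candidates = [(i, j) for i in range(steps) for j in range(steps)]
    i, j = max(candidates, key=lambda c: objective(alpha, c[0], c[1], steps))
    return i / steps, j / steps


# Made-up deficiency rates: high while either threshold is small, low afterwards.
alpha = {(a, b): (0.8 if a <= 3 or b <= 3 else 0.1)
         for a in range(11) for b in range(11)}

print(objective(alpha, 3, 3))
print(best_thresholds(alpha))
```

A step-shaped surface like this toy one produces ties between several candidates, so the particular pair returned depends on the iteration order; a real deficiency surface would normally single out one point.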
5 Experiment Results and Evaluation
We randomly select 30 nouns and 43 adjectives, and retrieve 164 compositions (see Appendix I) between them from Xiandai Hanyu Cihai (Zhang et al. 1994), a word composition dictionary of Chinese. After checking by hand, we get 342 reasonable compositions (see Appendix I), among which 177 are missing from the dictionary. So the sufficiency rate (denoted as $\gamma$) of these given compositions is 47.9%.
We select 0.95 as the value of $t_3$, and let $t_1 = 0.0, 0.1, 0.2, \ldots, 1.0$ and $t_2 = 0.0, 0.1, 0.2, \ldots, 1.0$ respectively, which gives 121 groups of values for $\alpha_F$ and $\beta_F$. Fig.1 and Fig.2 show the distributions of $\alpha_F$ and $\beta_F$ respectively.

[Figure 1: The distribution of $\alpha_F$. Axes: $t_1$, $t_2$, deficiency rate.]

[Figure 2: The distribution of $\beta_F$. Axes: $t_2$, redundancy rate (%).]

$\gamma$ (%)   $t_1$   $t_2$   $\alpha_F$ (%)   $\beta_F$ (%)
32.5           0.5     0.6     13.2             34.5
47.9           0.4     0.4     15.4             26.4
58.2           0.4     0.4     10.3             15.4
72.5           0.3     0.3     9.5              7.6
Table 1: The relation between $\gamma$, $t_1$, $t_2$, $\alpha_F$ and $\beta_F$.

$\gamma$ (%)   mean $\alpha_F$ (%)   $e_1$ (%)   mean $\beta_F$ (%)   $e_2$ (%)
58.2           11.2                  8.3         17.5                 10.8
72.5           7.4                   4.1         8.7                  5.4
Table 2: The relation between $\gamma$, mean $\alpha_F$ and mean $\beta_F$; $e_1$ and $e_2$ are the mean errors.

Over the sampled values of $t_1$ and $t_2$, we found that (7) attains its biggest value when $t_1 = 0.4$ and $t_2 = 0.4$, so we select 0.4 as the appropriate value for both $t_1$ and $t_2$.
The result is listed in Appendix II. From Fig.1 and Fig.2, we can see that when $t_1 = 0.4$ and $t_2 = 0.4$, both $\alpha_F$ and $\beta_F$ take small values. As the two parameters increase, $\alpha_F$ decreases slowly while $\beta_F$ increases sharply, which shows that the learning of new compositions from the given ones has reached its limit at this point: further reasonable compositions can be learned only at the cost of sharply raising the redundancy rate.
From Fig.1, we can see that $\alpha_F$ generally decreases as $t_1$ and $t_2$ increase. This is because increasing the thresholds of the distances between clusters raises the abstraction degree of the model, so more reasonable compositions are learned. On the other hand, we can see from Fig.2 that when $t_1 \ge 0.4$ and $t_2 \ge 0.4$, $\beta_F$ roughly increases as $t_1$ and $t_2$ increase, but when $t_1 < 0.4$ or $t_2 < 0.4$, $\beta_F$ changes in a more irregular manner. This is because when $t_1 < 0.4$ or $t_2 < 0.4$, increasing $t_1$ and $t_2$ may lead to many more reasonable compositions and far fewer unreasonable ones being learned, which reduces $\beta_F$; otherwise $\beta_F$ increases. When $t_1 \ge 0.4$ and $t_2 \ge 0.4$, most reasonable compositions have already been learned, so further increases in $t_1$ and $t_2$ tend to add unreasonable compositions, and $\beta_F$ roughly increases.
To explore the relation between $\gamma$, $\alpha_F$ and $\beta_F$, we remove or add given compositions, then estimate $t_1$ and $t_2$, and compute $\alpha_F$ and $\beta_F$. Their correspondence is listed in Table 1.
From Table 1, we can see that as $\gamma$ increases, the estimated values for $t_1$ and $t_2$ decrease, and $\beta_F$ also decreases. This shows that if fewer compositions are given, we should select bigger values for the two parameters in order to learn as many reasonable compositions as possible, which, however, leads to an undesirable increase in $\beta_F$. If more compositions are given, we only need to select smaller values for the two parameters to learn as many reasonable compositions as possible.
We select another 10 groups of adjectives and nouns, each containing 20 adjectives and 20 nouns. Among the 10 groups, 5 have a sufficiency rate of about 58.2%, and the other 5 a sufficiency rate of about 72.5%. We let $t_1 = 0.4$ and $t_2 = 0.4$ for the former 5 groups, and $t_1 = 0.3$ and $t_2 = 0.3$ for the latter 5 groups, to further examine the relation between $\gamma$, $\alpha_F$ and $\beta_F$ with the values of the two parameters fixed.
Table 2 shows that for given compositions with a fixed sufficiency rate, there exist similar values for the parameters which keep $\alpha_F$ and $\beta_F$ low, and that if enough compositions are given, the mean errors of $\alpha_F$ and $\beta_F$ are lower. So, given a large number of adjectives and nouns to be clustered, we can extract a small sample to estimate appropriate values for the two parameters, and then apply them to the original task.
6 Conclusions and Future work 
In this paper, we study the problem of learning new 
word compositions from given ones by establishing 
compositional frames between words. Although we 
focus on A-N structure of Chinese, the method uses 
no structure-specific or language-specific knowledge, 
and can be applied to other syntactic structures, and 
other languages. 
There are three key points in our method. First, we formalize the problem of clustering adjectives and nouns based on their given compositions as a non-linear optimization problem, in which we take noun clusters as the environment of adjectives and adjective clusters as the environment of nouns. Second, we design an evolutionary algorithm to determine its optimal solutions. Finally, we do not pre-define the number of clusters; instead, it is determined automatically by the algorithm.
Although the effects of the sparseness problem are alleviated compared with traditional methods, sparseness is still the main problem affecting the learning results. If enough typical compositions are given, the results are very promising. So important future work is to collect as many typical compositions as possible from dictionaries and corpora as the foundation for our algorithms.
At present, we focus on learning compositional frames from given compositions with a single syntactic structure. In the future, we may take several structures into consideration when clustering words, and use the clusters to construct more complex frames. For example, we may consider A-N and V-N structures together, and build the frames for them simultaneously.
At present, we use sample points to estimate appropriate values for the parameters, and it seems that we cannot determine very accurate values, because the computational cost grows as the number of sample points increases. Future work includes modeling the sample points and their values with a continuous function, and estimating the parameters based on that function.

References 
Abney, S. 1989. Parsing by Chunks. In C. Tenny (ed.), The MIT Parsing Volume. MIT Press.
Abney, S. 1990. Rapid Incremental Parsing with Repair. In Proceedings of Waterloo Conference on Electronic Text Research.
Bensch, P.A. and W.J. Savitch. 1995. An Occurrence-Based Model of Word Categorization. Annals of Mathematics and Artificial Intelligence, 14:1-16.
Benson, M., E. Benson, and R. Ilson. 1986. The Lexicographic Description of English. John Benjamins.
Benson, M. 1986. The BBI Combinatory Dictionary of English: A Guide to Word Combinations. John Benjamins.
Benson, M. 1990. Collocations and General-Purpose Dictionaries. International Journal of Lexicography, 3(1): 23-35.
Choueka, Y., T. Klein, and E. Neuwitz. 1983. Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal of Literary and Linguistic Computing, 4: 34-38.
Church, K. and P. Hanks. 1989. Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 76-83.
Church, K., W. Gale, P. Hanks, and D. Hindle. 1989. Parsing, Word Associations and Typical Predicate-Argument Relations. In Proceedings of the International Workshop on Parsing Technologies, Carnegie Mellon University, Pittsburgh, PA, 103-112.
Davis, L. et al. 1991. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold.
Hatzivassiloglou, V. and K.R. McKeown. 1993. Towards the Automatic Identification of Adjectival Scales: Clustering of Adjectives According to Meaning. In Proceedings of the 31st Annual Meeting of the ACL, Columbus, Ohio, USA.
Holland, J.H. 1992. Adaptation in Natural and Artificial Systems, 2nd edition. Cambridge, Massachusetts: MIT Press.
Lin, X.G. et al. 1994. Xiandai Hanyu Cihai. Renmin Zhongguo Press (in Chinese).
Mei, J.J. et al. 1983. TongYiCi CiLin (A Chinese Thesaurus). Shanghai Cishu Press, Shanghai.
Miller, G.A., R. Beckwith, C. Fellbaum, D. Gross, and K.J. Miller. 1993. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography (second edition).
Pereira, F., N. Tishby, and L. Lee. 1993. Distributional Clustering of English Words. In Proceedings of the 31st Annual Meeting of the ACL, Columbus, Ohio, USA.
Smadja, F. 1993. Retrieving Collocations from Text: Xtract. Computational Linguistics, 19(1).
Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts.
Zhu, D.X. 1982. Lectures in Grammar. Shanghai Education Press (in Chinese).
