<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1004">
  <Title>Learning New Compositions from Given Ones</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Problem Setting
</SectionTitle>
    <Paragraph position="0"> Given an adjective set and a noun set, suppose for each noun, some adjectives are listed as its compositional instances. Our goal is to learn new reasonable compositions from the instances. To do so, we cluster nouns and adjectives simultaneously and build a compositional frame for each noun.</Paragraph>
    <Paragraph position="1"> Suppose A is the set of adjectives, N is the set of nouns, for any a E A, let f(a) C N be the instance set of a, i.e., the set of nouns in N which can be combined with a, and for any n E N, let g(n) C A be the instance set of n, i.e., the set of adjectives in A which can be combined with n. We first give some formal definitions in the following: Definition 1 partition Suppose U is a non-empty finite set, we call &lt; U1, U2, ..., Uk &gt; a partition of U, if: i) for any Ui, and Uj, i C/ j, Ui M Uj = ii) U = Ul&lt;t&lt;kUl We call Ui a cluster of U.</Paragraph>
    <Paragraph position="2"> Suppose U --&lt; A1,A2,...,Ap &gt; is a partition of A, V ~&lt; N1,N2,...,Nq &gt; is a partition of N, f and g are defined as above, for any N/, let g(N ) = {&amp; : n # C/), and for any n, let ,f&lt;U,V&gt;(n ) =l {a : 3At,Al E g(Nk),a E Ajl} -g(n) I, where n E Nk. Intuitively, 5&lt;Uv&gt;(n) is the number of the new instances relevant with n. We define the general learning amount as the following: null Definition 2 learning amount hEN Based on the partitions of both nouns and adjectives, we can define the distance between nouns and that between adjectives.</Paragraph>
    <Paragraph position="3"> Definition 3 distance between words for anya EA, let fv(a) = {Ni : 1&lt;i &lt; q, Ni M f(a) ~ ~}, for any n E N, let g~= {Ai : 1 &lt; i &lt; p, Ai Mg(n) C/ C/), for any two nouns nl and ha, any two adjectives al and a2, we define the distances between them respectively as the following: Ji, He and Huang 26 Learning New Compositions</Paragraph>
    <Paragraph position="5"> According to the distances between words, we can define the distances between word sets.</Paragraph>
    <Paragraph position="6"> Definition 4 distance between word sets Given any two adjective sets X1, X2 C A, any two noun sets Y1, Y2 C N, their distances are:</Paragraph>
    <Paragraph position="8"> Intuitively, the distance between word sets refer to the biggest distance between words respectively in the two sets.</Paragraph>
    <Paragraph position="9"> We formalize the problem of clustering nouns and adjectives simultaneously as an optimization problem with some constraints.</Paragraph>
    <Paragraph position="10"> (1)To determine a partition U =&lt; A1,A2,...,Ap &gt; of A, and a partition V =&lt; N1,N2,...,Nq &gt; of N, where p,q &gt; O, which satisfies i) and ii), and minimize ~&lt;e,v&gt;&amp;quot; i) for any al, a2 E Ai, 1 &lt; i &lt; p, disg(al, as) &lt; tl; for Ai and Aj, 1 &lt; i # j &lt; p, disv(Ai,Aj) &gt; tl; ii) for any nl,n2 E Ni,1 &lt; i &lt; q, disg(nl,n2) &lt; t2; for Ni and Ny, 1 _&lt; i C/ j _&lt; p, disg(Ni, Nj) k t2; Intuitively, the conditions i) and ii) make the distances between words within clusters smaller, and those between different clusters bigger, and to minimize 6 ~ ~_ means to minimize the distances between the words within clusters.</Paragraph>
    <Paragraph position="11"> In fact, (U, V) can be seen as an abstraction model over given compositions, and tl, t2 can be seen as its abstraction degree. Consider the two special case: one is tl = t2 = 0, i.e., the abstract degree is the lowest, when the result is that one noun forms a cluster and on adjective forms a cluster, which means that no new compositions are learned. The other is tl = t2 = 1, the abstract degree is the highest, when a possible result is that all nouns form a cluster and all adjectives form a cluster, which means that all possible compositions, reasonable or unreasonable, are learned. So we need estimate appropriate values for the two parameters, in order to make an appropriate abstraction over given compositions, i.e., make the compositional frames contain as many reasonable compositions as possible, and as few unreasonable ones as possible.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Cooperative Evolution
</SectionTitle>
    <Paragraph position="0"> Since the beginning of evolutionary algorithms, they have been applied in many areas in AI(Davis et al., 1991; Holland 1994). Recently, as a new and powerful learning strategy, cooperative evolution has gained much attention in solving complex non-linear problem. In this section, we discuss how to deal with the problem (1) based on the strategy.</Paragraph>
    <Paragraph position="1"> According to the interaction between adjective clusters and noun clusters, we adopt such a cooperative strategy: after establishing the preliminary solutions, for any preliminary solution, we optimize N's partition based on A's partition, then we optimize A's partition based on N's partition, and so on, until the given conditions are satisfied.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Preliminary Solutions
</SectionTitle>
      <Paragraph position="0"> When determining the preliminary population, we also cluster nouns and adjectives respectively. However, we see the environment of a noun as the set of all adjectives which occur with it in given compositions, and that of an adjective as the set of all the nouns which occur with it in given compositions.</Paragraph>
      <Paragraph position="1"> Compared with (1), the problem is a linear clustering one.</Paragraph>
      <Paragraph position="2"> Suppose al,a2 E A, f is defined as above, we define the linear distance between them as (2): (2) dis(a1 a2) -- 1 - I/(ax)nl(a2)l ' \[f(ax)Of(a2)l Similarly, we can define the linear distance between nouns dis(nl,n2) based on g. In contrast, we call the distances in definition 3 non-linear distances. null According to the linear distances between adjectives, we can determine a preliminary partition of N: randomly select an adjective and put it into an empty set X, then scan the other adjectives in A, for any adjective in A - X, if its distances from the adjectives in X are all smaller than tl, then put it into X, finally X forms a preliminary cluster. Similarly, we can build another preliminary cluster in (A- X).</Paragraph>
      <Paragraph position="3"> So on, we can get a set of preliminary clusters, which is just a partition of A. According to the different order in which we scan the adjectives, we can get different preliminary partitions of A. Similarly, we can determine the preliminary partitions of N based on the linear distances between nouns. A partition of A and a partition of N forms a preliminary solution of (1), and all possible preliminary solutions forms the Ji, He and Huang 27 Learning New Compositions population of preliminary solutions, which we also call the population of Oth generation solutions.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Evolution Operation
</SectionTitle>
      <Paragraph position="0"> In general, evolution operation consists of recombination, mutation and selection. Recombination makes two solutions in a generation combine with each other to form a solution belonging to next generation. Suppose &lt; U~ i), Vi(')&gt; and &lt; U~ i), V2(') &gt; are two ith generation solutions, where U~ t) and U~ i) are two partitions of A, V? i) and V2 (i) are two partitions of N, then &lt; U~ '), V2 (i) &gt; and &lt; U2 (i), V1 (i) &gt; forms two possible (i+l)th generation solutions.</Paragraph>
      <Paragraph position="1"> Mutation makes a solution in a generation improve its fitness, and evolve into a new one belonging to next generation. Suppose &lt; U (i), U (i) &gt; is a ith generation solution, where U (i) =&lt; A1, A2, ..., Ap &gt;, V (i) =&lt; N1,N2,...,Nq &gt; are partitions of A and N respectively, the mutation is aimed at optimizing V(0 into V (t+l) based on U (t), and makes V (t+l) satisfy the condition ii) in (1), or optimizing U (t) into U(t+l) based on V (0, and makes U (l+1) satisfy the condition i) in (1), then moving words across clusters to minimize d&lt;u,v&gt;&amp;quot; We design three steps for mutation operation: splitting, merging and moving, the former two are intended for the partitions to satisfy the conditions in (1), and the third intended to minimize (f&lt;U,v &gt; .</Paragraph>
      <Paragraph position="2"> In the following, we take the evolution of V (t+l) as an example to demonstrate the three steps.</Paragraph>
      <Paragraph position="3"> Splitting Procedure. For any Nk, 1 _&lt; k _&lt;, if there exist hi,n2 * Nk, such that disv(,+~)(nl,n2 ) _&gt; t2, then splitting Nk into two subsets X and Y. The procedure is given as the following: i) Put nl into X, n2 into Y, ii) Select the noun in (Nk -- (X U Y)) whose distance from nl is the smallest, and put it into X, iii) Select the noun in (Nk -- (X t_J Y)) whose distance from n2 is the smallest, and put it into Y, iv) Repeat ii) and iii), until X t3 Y = Nk.</Paragraph>
      <Paragraph position="4"> For X (or Y), if there exist nl,n2 * X (or Y), disv(o &gt;_ t2, then we can make use of the above procedure to split it into more smaller sets. Obviously, we can split any Nk in V(0 into several subsets which satisfy the condition ii) in (1) by repeating the procedure.</Paragraph>
      <Paragraph position="5"> Merging procedure. If there exist Nj and Nk, where 1 _&lt; j,k _&lt; q, such that disu(~)(Nt,Nk ) &lt; t2, then merging them into a new cluster.</Paragraph>
      <Paragraph position="6"> It is easy to prove that U (t) and V(0 will meet the condition i) and ii) in (1) respectively, after splitting and merging procedure.</Paragraph>
      <Paragraph position="7"> Moving procedure. We call moving n from Nj to Nk a word move, where 1 &lt; j C/ k &lt; q, denoted as (n, Nj, Nk), if the condition (ii) remains satisfied. The procedure is as the following:  i) Select a word move (n, Nj, Na) which minimizes ~&lt;U,V&gt; ' ii) Move n from Nj to Nk, iii) Repeat i) and ii) until there are no word moves  which reduce 6&lt;u,v&gt;&amp;quot; After the three steps, U (i) and V (i) evolve into U (i+U and V (i+D respectively.</Paragraph>
      <Paragraph position="8"> Selection operation selects the solutions among those in the population of certain generation according to their fitness. We define the fitness of a solution as its learning amount.</Paragraph>
      <Paragraph position="9"> We use Ji to denote the set of i$h generation solutions, H(i, i + 1), as in (3), specifies the similarity between ith generation solutions and (i + 1)th generation solutions.</Paragraph>
      <Paragraph position="11"> Let t3 be a threshold for H(i, i + 1), the following is the general evolutionary algorithm:  Procedure Clustering(A, N, f, g); begin i) Build preliminary solution population I0, ii) Determine 0th generation solution set J0 according to their fitness, iii) Determine/i+1 based on Ji: a) Recombination: if (U~ i), Vff)),</Paragraph>
      <Paragraph position="13"> After determining the clusters of adjectives and nouns, we can construct the compositional frame for each noun cluster or each noun. In fact, for each noun cluster Ni,g(N~) = {Aj : 3n E Ni,Aj Ng(n) 7PS C/) is just its compositional frame, and for any noun in N/, g(Ni) is also its compositional frame. Similarly, for each adjective (or adjective cluster), we can also determine its compositional frame.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Parameter Estimation
</SectionTitle>
    <Paragraph position="0"> The parameters tl and t2 in (1) are the thresholds for the distances between the clusters of A and N re-Ji, He and Huang 28 Learning New Compositions spectively. If they are too big, the established frame will contain more unreasonable compositions, on the other hand, if they are too small, many reasonable compositions may not be included in the frame. Thus, we should determine appropriate values for t~ and t2, which makes the fame contain as many reasonable compositions as possible, meanwhile as few unreasonable ones as possible.</Paragraph>
    <Paragraph position="1"> Suppose Fi is the compositional frame of Ni, let</Paragraph>
    <Paragraph position="3"> adjectives learned as the compositional instances of the noun in Ni. For any n ~ N~, we use An to denote the set of all the adjectives which in fact can modify n to form a meaningful phrase, we now define deficiency rate and redundancy rate of F. For convenience, we use (iF to represent 5(U, V).</Paragraph>
    <Paragraph position="4"> Definition 5 Deficiency rate o~F El&lt;i&lt;q EneN, \[ A~ - ARe \[ Intuitively, aF refers to the ratio between the reasonable compositions which are not learned and all the reasonable ones.</Paragraph>
    <Paragraph position="5"> Definition 6 Redundancy rate fir</Paragraph>
    <Paragraph position="7"> Intuitively, fie refers to the ratio between unreasonable compositions which are learned and all the learned ones.</Paragraph>
    <Paragraph position="8"> So the problem of estimating tl and t2 can be formalized as (5): (5) to find tl and t2, which makes av = 0, and flF=0.</Paragraph>
    <Paragraph position="9"> But, (5) may exists no solutions, because its constraints are two strong, on one hand, the sparseness of instances may cause ~F not to get 0 value, even if tl and t~ close to 1, on the other hand, the difference between words may cause fir not to get 0 value, even if tl and t2 close to 0. So we need to weaken (5). In fact, both O~F and flF can be seen as the functions of tl and t2, denoted as o~f(tl,t2) and l~F(tl, tu) respectively. Given some values for tl and t2, we can compute aF and fiR. Although there may exist no values (t~,t~) for (tl,t2), such that ! ! aF(t~,t~) = flF(tx,t2) = 0, but with t~ and t2 increasing, off tends to decrease, while fiE tends to increase. So we can weaken (5) as (6).</Paragraph>
    <Paragraph position="10">  (6) to find tl and t2, which maximizes (7). (7)</Paragraph>
    <Paragraph position="12"> Intuitively, if we see the area (\[0, 1\]; \[0, 1\]) as a sample space for tl and t2, Fl(t~,t~) and F2(t~,t~) are its sub-areas. So the former part of (7) is the ! ! mean deficiency rate of the points in Fl(tl, tz), and the latter part of (7) is the mean deficiency rate of the points in F2(t~,t~). To maximize (7) means to maximize its former part, while to minimize its latter part. So our weakening (5) into (6) lies in finding a point (t~,t~), such that the mean deficiency rate of the sample points in F2(t~,t~) tends to be very low, rather than finding a point (t~,t~), such that its deficiency rate is 0.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiment Results and Evaluation
</SectionTitle>
      <Paragraph position="0"> We randomly select 30 nouns and 43 adjectives, and retrieve 164 compositions(see Appendix I) between them from Xiandai Hanyu Cihai (Zhang et al. 1994), a word composition dictionary of Chinese.</Paragraph>
      <Paragraph position="1"> After checking by hand, we get 342 reasonable compositions (see Appendix I), among which 177 ones are neglected in the dictionary. So the sufficiency rate (denoted as 7) of these given compositions is 47.9%.</Paragraph>
      <Paragraph position="2"> We select 0.95 as the value of t3, and let tl = 0.0, 0.1,0.2, ..., 1.0, t2 = 0.0, 0.1, 0.2, ..., 1.0 respectively, we get 121 groups of values for O~F and fiR. Fig.1 and Fig.2 demonstrate the distribution of aF and ~3F respectively.</Paragraph>
      <Paragraph position="3">  For any given tl, and t2,we found (7) get its biggest value when t I = 0.4 and t2 = 0.4, so we se-Ji, He and Huang 29 Learning New Compositions  BF, el and e~ is the mean error.</Paragraph>
      <Paragraph position="4"> lect 0.4 as the appropriate value for both tl and t2. The result is listed in Appendix II. From Fig.1 and Fig.2, we can see that when tl = 0.4 and t2 = 0.4, both c~F and BF get smaller values. With the two parameters increasing, aF decreases slowly, while BF increases severely, which demonstrates the fact that the learning of new compositions from the given ones has reached the limit at the point: the other reasonable compositions will be learned at a cost of severely raising the redundancy rate.</Paragraph>
      <Paragraph position="5"> From Fig.l, we can see that o~F generally increases as ~1 and t2 increase, this is because that to increase the thresholds of the distances between clusters means to raise the abstract degree of the model, then more reasonable compositions will be learned. On the other hand, we can see from Fig.2 that when tl _&gt; 0.4, t2 &gt;_ 0.4, fiR roughly increases as ~1 and ~2 increase, but when tz &lt; 0.4, or t2 &lt; 0.4, fir changes in a more confused manner. This is because that when tl &lt; 0.4, or ~2 &lt; 0.4, it may be the case that much more reasonable compositions and much less unreasonable ones are learned, with tl and t2 increasing, which may result in fiR's reduction, otherwise fir will increase, but when tz &gt;_ 0.4, t2 &gt; 0.4, most reasonable compositions have been learned, so it tend to be the case that more unreasonable compositions will be learned as tl and t2 increase, thus fir increases in a rough way.</Paragraph>
      <Paragraph position="6"> To explore the relation between % aF and fiE, we reduce or add the given compositions, then estimate Q and t2, and compute aRE and fiR. Their correspondence is listed in Table 1.</Paragraph>
      <Paragraph position="7"> From Table 1, we can see that as 7 increases, the estimated values for tl and t2 will decrease, and BE will also decrease. This demonstrates that if given less compositions, we should select bigger values for the two parameters in order to learn as many reason-Ji, He and Huang able compositions as possible, however, which will lead to non-expectable increase in fly. If given more compositions, we only need to select smaller values for the two parameters to learn as many reasonable compositions as possible.</Paragraph>
      <Paragraph position="8"> We select other 10 groups of adjectives and nouns, each group contains 20 adjectives and 20 nouns.</Paragraph>
      <Paragraph position="9"> Among the 10 groups, 5 groups hold a sufficiency rate about 58.2%, the other 5 groups a sufficiency rate about 72.5%. We let ~1 -~ 0.4 and t2 = 0.4 for the former 5 groups, and let tl = 0.3 and t2 = 0.3 for the latter 5 groups respectively to further consider the relation between 7, o~F and fiR, with the values for the two parameters fixed.</Paragraph>
      <Paragraph position="10"> Table 2 demonstrates that for any given compositions with fixed sufficiency rate, there exist close values for the parameters, which make c~F and fir maintain lower values, and if given enough compositions, the mean errors of O~FF and fie will be lower. So if given a large number of adjectives and nouns to be clustered, we can extract a small sample to estimate the appropriate values for the two parameters, and then apply them into the original tasks.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Conclusions and Future work
</SectionTitle>
    <Paragraph position="0"> In this paper, we study the problem of learning new word compositions from given ones by establishing compositional frames between words. Although we focus on A-N structure of Chinese, the method uses no structure-specific or language-specific knowledge, and can be applied to other syntactic structures, and other languages.</Paragraph>
    <Paragraph position="1"> There are three points key to our method. First, we formalize the problem of clustering adjectives and nouns based on their given compositions as a non-linear optimization one, in which we take noun clusters as the environment of adjectives, and adjective</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
30 Learning New Compositions
P
</SectionTitle>
    <Paragraph position="0"> clusters as the environment of nouns. Second, we design an evolutionary algorithm to determine its optimal solutions. Finally, we don't pre-define the number of the clusters, instead it is automatically determined by the algorithm.</Paragraph>
    <Paragraph position="1"> Although the effects of the sparseness problem can be alleviated compared with that in traditional methods, it is still the main problem to influence the learning results. If given enough and typical compositions, the result is very promising. So important future work is to get as many typical compositions as possible from dictionaries and corpus as the foundation of our algorithms.</Paragraph>
    <Paragraph position="2"> At present, we focus on the problem of learning compositional frames from the given compositions with a single syntactic structure. In future, we may take into consideration several structures to cluster words, and use the clusters to construct more complex frames. For example, we may consider both A-N and V-N structures in the meantime, and build the frames for them simultaneously.</Paragraph>
    <Paragraph position="3"> Now we make use of sample points to estimate appropriate values for the parameters, which seems that we cannot determine very accurate values due to the computational costs with sample points increasing. Future work includes how to model the sample points and their values using a continuous function, and estimate the parameters based on the function.</Paragraph>
  </Section>
class="xml-element"></Paper>