<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0610"> <Title>Retrieving Collocations From Korean Text</Title> <Section position="4" start_page="71" end_page="73" type="metho"> <SectionTitle> 3 Input Format </SectionTitle> <Paragraph position="0"> In this section, we discuss an input form relevant to Korean language structure and linguistic contents which would work well on an effimfinterrupted collocation is a sequence of words* To ~tvoid confusion of terms, we call a sequence of two words as ~ 'a(ljacent bigram' and a sequence of n words as a * ad?accnt n-gram ~.</Paragraph> <Paragraph position="1"> cient statistics. Korean is one of agglutinative languages as well as a propositional language.</Paragraph> <Paragraph position="2"> An elementary node being called as 'eojeol' is generally composed of a content word and function words. Namely, a word in English corresponds to a couple of morphemes in Korean.</Paragraph> <Paragraph position="3"> A key feature of Korean is that hmction words, such as propositions, endings, copula, auxiliary verbs, and particles, are highly developed as independent morphemes, while they are represented as word order or inflections in English. Functional morphemes determine grammatical relations, tense, modal, and aspect.</Paragraph> <Paragraph position="4"> In Korean, there are lots of multiple function words in a rigid forms. They can be viewed as collocations. For this reason, our system is designed at the morphological level. A set of twelve part of speech tags, { N, J, V, P, D, E, T, O, C, A, S, X } 3 was considered.</Paragraph> <Paragraph position="5"> Another feature is a free word order. Since the words of a collocation appear in text with the flexible ways, sufficient samples are required to compute accurate probabilities. We allow positional information to vary by using an interrupted bigram model.</Paragraph> <Paragraph position="6"> The basic input can be represented in (1). An object k means a pair of morphemes (mi,mk) and mk corresponds to one of all possible morphemes, being able to co-occur with mi. A variable j indicates the j-th position. Xij denotes the frequency of mk that occurs at the j-th position before mi.</Paragraph> <Paragraph position="8"> morpheme as a base morpheme, the range of window is from -1 to -10. This distance constraint is for the characteristic of SOV language. If a bigram includes an adverb morpheme, a larger window, from -20 to 10 is used because the components often appear widely separated from each other on text. In other cases, we considered the range from -5 to +5. This distant constraints are for an efficient statistics. An input data is transformed to a property matrix, T(Xi) as (2) that is a two dimensional</Paragraph> <Paragraph position="10"> ~rr~:~y of k object.s, k = 1,2,...,n, on four vari~t,|)les, V Frequency ~ VCondensation , V Randomness , ~md Vcorrelatio n.</Paragraph> <Paragraph position="12"> (2) ~\]~) continue explanations, we begin by mentioning the 'Xtrgct' tool by Smadja (Smadja, 1993). Our input form was designed in a similar manner with 'Xtract'. Smadja assumed that the components of a collocation should appear together in a relatively rigid way because of syntactic constraint. Namely, a bigram pair (mi, 'rnk), where mk Occurs at one(or several) spe(:ific position around mi, would be a meaningful bigrams for collocations. The rigid word order is related with the variance of frequency distribution of (mi, ink). 
<Paragraph position="12"> To continue the explanation, we begin with the 'Xtract' tool of Smadja (Smadja, 1993). Our input form was designed in a manner similar to 'Xtract'. Smadja assumed that the components of a collocation appear together in a relatively rigid way because of syntactic constraints; that is, a bigram pair (m_i, m_k), where m_k occurs at one (or several) specific positions around m_i, would be a meaningful bigram for collocations. This rigid word order is reflected in the variance of the frequency distribution of (m_i, m_k). 'Xtract' extracts the pairs whose variances are over a threshold and pulls out their interesting positions by standardizing the frequency distributions. Unfortunately, this approach for English has several limitations when applied to Korean structure (we ported Smadja's Xtract tool to a Korean version), for the following reasons: 1. For free-word-order languages such as Korean, words are widely distributed in text, so that positional variance leads to over-filtering of useful bigrams. Figure 1 shows that no extracted pair contains randomly distributed morphemes such as function words or nouns. This indicates that very few pairs are produced when 'Xtract' is applied to Korean.</Paragraph>
<Paragraph position="13"> 2. Suppose that a meaningful bigram (m_i, m_k) prefers a position p_j. Then the number of concordances for the conditional probability P(m_i, m_k | p_j) would be small, especially in a free-word-order language. As shown in Table 1, the model produced a lot of long, meaningless n-grams when compiling into n-grams. The precision of the Korean version of Xtract was estimated to be 40.4%.</Paragraph>
<Paragraph position="14"> 3. Bigrams eliminated in the previous stage can appear again in n-gram collocations. When compiling, the model keeps only the words occupying a position with a probability greater than a given threshold in the concordances of (m_i, m_k, p_j). As one might imagine, the first stage could therefore be useless.</Paragraph>
<Paragraph position="15"> As stated above, in Korean the effect of position on collocations needs to be treated in a more complex way. Korean collocations can be divided into four types: 'idiom', 'semantic collocation' (where the combination of words is governed by the selectional restrictions of a predicate, noun, or adverb), 'syntactic collocation' (whose word order is more free than that of idioms), and 'morphological collocation' (which corresponds to a multiple function word and appears as an adjacent word group). Idioms and morphological collocations appear in text with rigid form and word order, but the others appear in flexible ways. To cover these more flexible collocations, we adopt an interrupted bigram model and suggest several statistics that fit the characteristics of Korean.</Paragraph> </Section>
<Section position="5" start_page="73" end_page="74" type="metho"> <SectionTitle> 4 Algorithm </SectionTitle>
<Paragraph position="0"> This section describes how the properties are represented as numerical values and how meaningful objects are retrieved. In the first stage, we extract meaningful interrupted bigrams based on four properties. Next, the meaningful bigrams are extended into n-gram collocations using an α-compatibility relation.</Paragraph>
<Paragraph position="1"> We found empirically that a Weibull distribution, (3), provides a close approximation of the frequency distribution of bigrams.</Paragraph>
<Paragraph position="6"> (Table 1 examples: long n-grams mixing literal morphemes such as 'everyone', 'object case', 'take', 'drink', 'even though', 'not', and 'and' with POS classes such as Noun, Verb, and Ending.)</Paragraph>
<Paragraph position="10"> Thus, there are many pairs with low frequency, which makes it hard to obtain reliable statistics. We eliminated such pairs using the median m, that is, a value such that P{X >= m} >= 1/2 for the frequency distribution F. If the median is less than 3, we take 3 as the median.</Paragraph>
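As a rough illustration of the pruning step just described, the sketch below drops pairs whose total frequency falls below the median of the observed pair totals, flooring the median at 3. The data layout follows the counting sketch above and is an assumption, not the paper's exact procedure.

# Illustrative sketch of the median-based pruning described above: pairs whose
# total frequency is below the median of all pair totals are discarded, and
# the median is floored at 3 as stated in the text.
from statistics import median

def prune_low_frequency(counts):
    """counts: {(m_i, m_k): {position: freq}} as built by the earlier sketch."""
    totals = {pair: sum(pos.values()) for pair, pos in counts.items()}
    m = max(3, median(totals.values()))      # floor the median at 3
    return {pair: pos for pair, pos in counts.items() if totals[pair] >= m}

# toy usage
counts = {('마시', '물'): {-2: 5}, ('마시', '나'): {-4: 1}}
print(prune_low_frequency(counts))           # keeps only the frequent pair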
<Paragraph position="11"> Any quantity that depends not on unknown parameters of the population distribution but only on the sample is called a statistic. We regard four statistics relating to properties of collocations as variables. Before further explanation, consider S_mi, the sample space of m_i, as in Table 2, whose cardinality |S_mi| is n. Let an object be (m_i, m_k), let its frequency distribution be f_ik1, f_ik2, ..., f_ik10, and let f_ik+ be the sum of the f_ikp over p = 1, ..., 10. Suppose that POS(m_i) is 'J' and POS(m_k) is 'P'.</Paragraph>
<Section position="1" start_page="73" end_page="74" type="sub_section"> <SectionTitle> 4.1 Properties </SectionTitle>
<Paragraph position="0"> The properties we consider are primarily concerned with the frequency and positional information of word pairs. As we have emphasized, the correlation between position and collocation is very complicated in Korean.</Paragraph>
<Paragraph position="1"> According to Breidt, MI or T-score thresholds work satisfactorily as a filter for the extraction of collocations, but they filter out at least half of the actual collocations (Breidt, 1993). In general, assumed properties cannot fully account for collocations. Therefore, in order to reduce the loss of information, a combination of the observed variables is better than filtering. We defined four variables for the properties of collocations as follows; a short computational sketch of all four appears after this subsection.</Paragraph>
<Paragraph position="2"> 1. V_f: According to Benson's definition, a collocation is a recurrent word combination (Benson et al., 1986). We agree with this view in that a word pair of high frequency can serve as a collocation. The V_f statistic of an object (m_i, m_k) is represented as (4). Here, the standardization demands attention.</Paragraph>
<Paragraph position="3"> The mean and standard deviation are calculated over the 'JP' set to which the object belongs.</Paragraph>
<Paragraph position="5"> 2. V_c: Intuitively, two words that prefer specific positions must be related to each other. We sought to capture this idea while allowing for flexibility of word order. For this, the concept of convergence at each position was employed. In a free-word-order language, a meaningful pair can occur in text at, say, either distance two or distance three. Consider two input vectors x = (0,1,0,0,0,1,0,0,1,0) and y = (0,0,0,1,1,1,0,0,0,0). They have the same variance, but y would be more meaningful than x, because y can be interpreted as (0,0,0,0,3,0,0,0,0,0) within the free-word-order framework. Therefore, a spatial mask (1/2, 1, 1/2) was devised for convergence at each position. The condensation value m_ikp at the p-th position is calculated as m_ikp = (1/2) f_ik(p-1) + f_ikp + (1/2) f_ik(p+1).</Paragraph>
<Paragraph position="7"> The m_ikp is computed from the neighborhoods located at the borders of the p-th position. The maximum of m_ikp / f_ik+ is likely to represent the condensation of (m_i, m_k), but it is still deficient. Intuitively, (0,1,1,1,0,3,2,0,0,0) should be less condensed than (0,0,3,0,0,3,2,0,0,0). Therefore, n' was designed as a penalty factor: V_c = max_{p=1,2,...,10} m_ikp / (sqrt(n') * f_ik+) (5), where n' is the number of positions m such that f_ikm != 0 for 1 <= m <= 10; it is inversely proportional to the condensation. The square root is used to prevent an excessive influence of n'.</Paragraph>
<Paragraph position="8"> 3. V_r: We were motivated by the idea that if a pair is randomly distributed with respect to position, then it is not meaningful. Especially in the case of function words, they are likely to be randomly distributed around a given morpheme, whereas the distributions of meaningful pairs are not random, as shown in Figure 1. The randomness statistic measures how far the given distribution is from a uniform distribution. In (6), f̄_ik means the expected number of occurrences of (m_i, m_k) at each position on the assumption that the pair occurs randomly over the positions. |f_ikp - f̄_ik| / f̄_ik can be viewed as an error rate at each position p under this assumption. A large difference between the expected number and the actually observed frequency means that the distribution is not random. One might think this concept is the same as variance; however, note the denominator. This calculation is somewhat better than the variance, which depends on frequency.</Paragraph>
<Paragraph position="3"> 4. V_cr: To be a meaningful bigram, a pair should be syntactically valid. Our view is that if the frequency distribution of a pair follows the overall frequency distribution of the POS relation set to which the pair belongs, then the pair is syntactically valid. To verify this idea, we depict the overall frequency distributions of some POS relations in Figure 2. It shows the frequency distributions of pairs composed of a postposition and a predicate morpheme. It is quite interesting that all objects have a similar form of frequency distribution: they have sharp peaks at the first and third positions. Clearly, this illustrates that a postposition has a high probability of appearing at the first or third position before a predicate. We conclude from this that pairs keeping the overall frequency structure are syntactically valid. We used the correlation coefficient to measure this structural similarity. For a pair (m_i, m_k), we compute the correlation between (f_ik1, f_ik2, ..., f_ik10) and the overall positional distribution of its POS relation set. Let x and y be two vectors whose components are mean-corrected, x_i - x̄ for x and y_i - ȳ for y. The correlation between the two variables is straightforward if x and y are standardized by dividing each of their elements by the standard deviations σ_x and σ_y, respectively. Let x* be x/σ_x and y* be y/σ_y; then the correlation between x and y, V_cr, can be represented as follows.</Paragraph>
<Paragraph position="5"> The ranks of the bigrams under the four measures are summarized in Figure 3. It shows that each of the measures meets our expectations.</Paragraph> </Section>
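Since equations (4) through (7) are only partially recoverable in this copy, the following sketch reconstructs the four properties from the prose: V_f as the pair total standardized within its POS-relation set, V_c as the best mask-smoothed position count penalized by sqrt(n'), V_r as the average relative deviation from a uniform expectation, and V_cr as the Pearson correlation with the POS set's overall distribution. Function names and exact formulas are assumptions, not the authors' code.

# Illustrative reconstruction (not the authors' exact equations) of the four
# properties described in section 4.1, computed from a pair's positional
# frequency vector f = (f_ik1, ..., f_ik10).
import math
from statistics import mean, pstdev

def v_frequency(f_total, pos_set_totals):
    """V_f: the pair's total frequency standardized within its POS-relation set
    (e.g. the 'JP' set of postposition-predicate pairs)."""
    mu, sigma = mean(pos_set_totals), pstdev(pos_set_totals)
    return (f_total - mu) / sigma if sigma > 0 else 0.0

def v_condensation(f):
    """V_c: best (1/2, 1, 1/2)-masked position count, penalized by sqrt(n')."""
    f_total = sum(f)
    n_prime = sum(1 for v in f if v > 0)        # number of occupied positions
    if f_total == 0:
        return 0.0
    def masked(p):
        left = f[p - 1] / 2 if p > 0 else 0.0
        right = f[p + 1] / 2 if p < len(f) - 1 else 0.0
        return left + f[p] + right
    return max(masked(p) for p in range(len(f))) / (math.sqrt(n_prime) * f_total)

def v_randomness(f):
    """V_r: average relative deviation from the uniform expectation f_bar."""
    f_bar = sum(f) / len(f)
    if f_bar == 0:
        return 0.0
    return sum(abs(fp - f_bar) for fp in f) / (len(f) * f_bar)

def v_correlation(f, g):
    """V_cr: Pearson correlation between the pair's distribution f and the
    overall positional distribution g of its POS relation set."""
    n = len(f)
    fx = [v - sum(f) / n for v in f]            # mean-corrected vectors
    gy = [v - sum(g) / n for v in g]
    sx = math.sqrt(sum(v * v for v in fx) / n)  # population standard deviations
    sy = math.sqrt(sum(v * v for v in gy) / n)
    if sx == 0 or sy == 0:
        return 0.0
    return sum(a * b for a, b in zip(fx, gy)) / (n * sx * sy)

# toy check: y is more condensed than x, as argued in the text
x = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
y = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
print(v_condensation(x) < v_condensation(y))    # True
print(round(v_correlation(y, [0, 0, 1, 4, 5, 3, 1, 0, 0, 0]), 2))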
<Section position="1" start_page="1600" end_page="1600" type="sub_section"> <SectionTitle> 4.2 Evaluation Function </SectionTitle>
<Paragraph position="0"> In this section, we analyze the correlations among the four measures we defined and explain how to build an evaluation function for extracting meaningful bigrams. Table 3 shows the correlations among the given measures V_f, V_c, V_r, and V_cr. This shows that the defined measures have redundant parts. We can say that if a measure has high correlations with the others, then it has a redundant part to be eliminated. Since we do not know which factors are effective in determining useful bigrams, the concept of weights is more reliable than filtering. We therefore constructed an evaluation function which reflects the correlations between the measures.</Paragraph>
<Paragraph position="1"> First of all, we standardized the four measures.</Paragraph>
<Paragraph position="2"> Standardization adjusts the value range of each measure according to its variability. The degree of relationship between measure1 and measure2 is obtained as C_measure1,measure2 = {corr(measure1, measure2)}^+, where x^+ = x if x > 0 and x^+ = 0 otherwise. The evaluation function is built from these degrees of relationship.</Paragraph>
<Paragraph position="4"> Here, the constant a (≈ 0.845) is a compensation coefficient. The minimum value of C_r, C_c, and C_cr is obtained when the correlations among V_r, V_c, and V_cr are 1. On the contrary, the maximum value of C_r, C_c, and C_cr is 1, which is reached when C_{Vf,Vr} = C_{Vf,Vc} = C_{Vf,Vcr} = 0 and all correlations among V_r, V_c, and V_cr are 0. In other words, as the coefficients C_r, C_c, and C_cr get closer to 1, the correlations between the measures decrease.</Paragraph>
<Paragraph position="7"> As shown in (8) and (9), we take V_f to be the primary factor of collocations. Each coefficient C indicates how much the corresponding property is reflected in the evaluation. For example, in the case of C_r, the term weighted by a is the portion of randomness that is related to the property of condensation; therefore, one minus that term corresponds to the remainder after subtracting this portion from randomness.</Paragraph>
<Paragraph position="8"> The threshold for the evaluation was set by testing. A threshold of 0.5 gave good results, but for noun morphemes a high value over 0.9 was required. The pairs whose evaluation values are greater than the threshold are selected as meaningful bigrams.</Paragraph> </Section>
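Equations (7) through (9) do not survive in this copy, so the sketch below only illustrates the general scheme described above: the measures are assumed to be standardized already, V_f acts as the primary factor, and each remaining measure is down-weighted by a coefficient that shrinks as its positive correlations with the other measures grow, with a = 0.845 taken from the text. The combination is an assumption, not the paper's formula.

# Illustrative sketch only: the exact evaluation function (equations (7)-(9))
# is not recoverable here, so the combination below is an assumed stand-in
# that follows the described behaviour (V_f primary, other measures weighted
# down in proportion to their positive correlations, a = 0.845).
def positive_part(x):
    return x if x > 0 else 0.0

def coefficient(corrs, a=0.845):
    """Down-weight a measure according to its positive correlations."""
    return 1.0 - a * sum(positive_part(c) for c in corrs) / len(corrs)

def evaluate(vf, vc, vr, vcr, corr):
    """vf..vcr: standardized measures; corr: pairwise correlations by key."""
    cc = coefficient([corr['f,c'], corr['c,r'], corr['c,cr']])
    cr = coefficient([corr['f,r'], corr['c,r'], corr['r,cr']])
    ccr = coefficient([corr['f,cr'], corr['c,cr'], corr['r,cr']])
    return vf + cc * vc + cr * vr + ccr * vcr     # V_f as the primary factor

# toy usage: compare the score against a threshold such as 0.5
corr = {'f,c': 0.3, 'f,r': 0.2, 'f,cr': 0.1, 'c,r': 0.4, 'c,cr': 0.2, 'r,cr': 0.5}
print(evaluate(1.2, 0.8, 0.5, 0.9, corr) > 0.5)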
<Section position="2" start_page="1600" end_page="1600" type="sub_section"> <SectionTitle> 4.3 Extending to n-grams </SectionTitle>
<Paragraph position="0"> The meaningful bigrams selected in the previous step are extended into n-gram collocations. At the final step, the longest members among all α-covers are obtained as n-gram collocations by eliminating substrings. Here, n-gram collocations mean interrupted collocations as well as adjacent n-grams.</Paragraph>
<Paragraph position="1"> We regard cohesive clusters of the meaningful bigrams as n-gram collocations, on the assumption that the members of a collocation have a high degree of cohesion (Kjellmer, 1995). To find cohesive clusters, a fuzzy compatibility relation R is applied. R on X x X, where X is the set of all meaningful bigrams that contain a base morpheme m_i, represents a cohesive relation, and the partitions of the set X obtained by R correspond to n-gram collocations. In short, our problem has shifted to clustering the set X.</Paragraph>
<Paragraph position="2"> One reason to employ the fuzzy concept is that the equivalence sets defined by such a relation may be more desirable.</Paragraph>
<Paragraph position="3"> A fuzzy compatibility relation R(X, X) is represented as a matrix by a membership function.</Paragraph>
<Paragraph position="4"> The membership function of a fuzzy set A in X is denoted by μ_A : X -> [0, 1] and maps elements of a given set X into real numbers in [0, 1]. These two membership functions μ_A were used to define the cohesive relation as follows.</Paragraph>
<Paragraph position="6"> Let |x| and |y| be the frequencies of the concordances that contain the bigram pairs x and y, respectively. |x ∧ y| means how often the two pairs x and y co-occur in the same concordances under the distance constraint. (10) is a relative entropy measure and (11) is the dice coefficient. These measures capture the lexical relation underlying the cohesion degrees.</Paragraph>
<Paragraph position="7"> To obtain the equivalence sets, it is very important to identify the properties of the relation R we defined. A relation that is reflexive, symmetric, and transitive is called an equivalence relation or similarity relation. In our case, the fuzzy cohesive relation R is certainly reflexive and symmetric. If R(x, z) >= max_{y in X} min[R(x, y), R(y, z)] is satisfied for all (x, z) in X x X, then R is transitive. Generally, the transitive closure is used for checking transitivity. The transitive closure of a relation is defined as the smallest fuzzy relation that is transitive, contains the relation itself, and has the fewest possible members.</Paragraph>
<Paragraph position="8"> Given a relation S(X, X), its max-min transitive closure S_T(X, X) can be calculated by the following algorithm consisting of three steps: 1. S' = S ∪ (S o S), where o is the max-min composition operator. 2. If S' != S, set S = S' and go to Step 1. 3. Stop: S' = S_T.</Paragraph>
<Paragraph position="10"> If the above algorithm terminates after the first iteration when applied to R, then R satisfies transitivity. To verify its transitivity, the above algorithm was employed. As a result, R did not satisfy transitivity. This means that an element of X can belong to multiple classes under R, which shows that the relation R is appropriate for describing collocations. A fuzzy binary relation R(X, X) that is reflexive and symmetric is called a fuzzy compatibility relation and is usually referred to as a quasi-equivalence relation. When R is a fuzzy compatibility relation, compatibility classes are defined in terms of a specified membership degree α. An α-compatibility class is a subset A of X such that μ(x, y) >= α for all x, y in A, and the family consisting of the compatibility classes is called an α-cover of X with respect to R for the specified membership degree α. An α-cover forms partitions of X, and an element of X can belong to multiple α-compatibility classes. Here, we accepted α-covers at the 0.20 α-level for the dice coefficient and at 0.30 for relative entropy.</Paragraph>
<Paragraph position="11"> One might ask why we did not apply all bigrams directly to this stage, skipping the previous stage. We hope to deal with this comparison in a later paper.</Paragraph>
</Section> </Section> </Paper>
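As a closing illustration of the clustering step in section 4.3: memberships are computed here with the dice coefficient over shared concordances, and α-compatibility classes (maximal subsets whose pairwise membership is at least α) are read off the resulting fuzzy relation. The greedy class search, the function names, and the toy data are illustrative assumptions; only the dice measure and the 0.20 α-level come from the text.

# Hypothetical sketch of the alpha-cover clustering described in section 4.3.
from itertools import combinations

def dice(conc_x, conc_y):
    """conc_*: sets of concordance (sentence) ids containing each bigram."""
    if not conc_x or not conc_y:
        return 0.0
    return 2 * len(conc_x & conc_y) / (len(conc_x) + len(conc_y))

def alpha_cover(bigrams, concordances, alpha=0.20):
    """Return maximal alpha-compatibility classes of the fuzzy cohesion relation."""
    mu = {(x, y): dice(concordances[x], concordances[y])
          for x, y in combinations(bigrams, 2)}
    def compatible(x, y):
        return x == y or mu.get((x, y), mu.get((y, x), 0.0)) >= alpha
    classes = []
    # greedy growth of classes; a bigram may belong to several classes
    for seed in bigrams:
        cls = {seed}
        for b in bigrams:
            if all(compatible(b, c) for c in cls):
                cls.add(b)
        if cls not in classes:
            classes.append(cls)
    # keep only maximal classes (drop those contained in a larger one)
    return [c for c in classes if not any(c < d for d in classes)]

# toy usage: three bigrams identified by the concordances they occur in
conc = {'b1': {1, 2, 3}, 'b2': {2, 3, 4}, 'b3': {9}}
print(alpha_cover(['b1', 'b2', 'b3'], conc))   # [{'b1', 'b2'}, {'b3'}]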