<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0610">
  <Title>Retrieving Collocations From Korean Text</Title>
  <Section position="7" start_page="1600" end_page="1600" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We performed the evaluation experiments on 328,859 sentences (8.5 million morphemes) from the Yonsei balanced corpora. For the test, 250 morphemes with frequency ≥ 150 were selected.</Paragraph>
    <Paragraph position="1"> These morphemes formed 8,064 pairs, of which 773 were extracted as meaningful bigrams. In the second stage, 3,490 disjoint α-compatibility classes corresponding to lexically cohesive clusters were generated. From these classes, 698 longest n-gram collocations were extracted by eliminating the fragments that can be subsumed in longer classes.</Paragraph>
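    The elimination of subsumed fragments described above can be sketched as follows. This is a minimal illustration only, assuming each class is represented as an unordered set of morphemes and that "subsumed" means being a proper subset of another class; the function name and the toy classes are hypothetical and are not taken from the experiments.

```python
def longest_classes(classes):
    """Keep only the classes that are not proper subsets of another class."""
    # Deduplicate first so that equal classes do not eliminate each other.
    unique = {frozenset(c) for c in classes}
    kept = []
    for c in unique:
        # A fragment is subsumed when a different class contains all of its members.
        subsumed = any(c != other and c.issubset(other) for other in unique)
        if not subsumed:
            kept.append(c)
    # Sort for a deterministic result: larger classes first, then alphabetically.
    return sorted(kept, key=lambda c: (-len(c), sorted(c)))


# Hypothetical toy classes (placeholders, not data from the paper).
classes = [
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"D", "E"},
]
result = longest_classes(classes)
print(result)
```

    In this toy input, the two-element fragments contained in the three-element class are dropped, while the unrelated two-element class is kept.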
    <Paragraph position="2"> The precision was 86.3% for the extracted meaningful bigrams and 92% for the n-gram collocations. Depending on the application, either the α-covers or the longest n-grams can be taken as n-gram collocations.</Paragraph>
    <Paragraph position="3"> Unfortunately, since there is no existing database of collocations for evaluation, precision and recall are not easy to compute; we computed the precision values by hand. As a different approach to Korean collocations, (Lee et al., 1996) extracted interrupted bigrams using several filtering conditions, and at least 90% of their results were adjacent bigrams of length 1. From this comparison, we may conclude that our approach is more flexible in dealing with Korean word order.</Paragraph>
    <Paragraph position="4"> Figure 3 displays the changes of rank according to the measures we considered. It shows that, in contrast to other models, the properties were effective in retrieving collocations that contain pairs of morphemes with relatively low frequency. Since the ranks of bigrams under the four measures matched our expectation, the precision would improve if we could devise a more adequate evaluation function. Table 4 shows some of the meaningful bigrams obtained for 'o}.&gt;\] (not)'. Korean has a great many expressions relating to negative sentences, and their components occur separated in various ways. When evaluating meaningful bigrams, the coefficients for the evaluation function are as follows: Cr ~ 0.432, C/(: v 0.490, C/cr ~ 0.371 in the case of 'ol-q(not)'. This means that the influence of the three other measures is 1.284 times more than that of the frequency measure in the 'JP' POS relation.</Paragraph>
    <Paragraph position="5"> We will illustrate all steps with a word, '~'(wear). The results of the first stage, the meaningful bigrams of '4_! '(wear), are shown in Figure 4. In the second stage, we calculated the membership grades of the inputs using the Dice measure and the relative entropy measure. As Figure 4 shows, the Dice measure looks unsatisfactory in such cases as the pair '(~(object case), ~o} (much))'. Although the common frequency, 3, is relatively high from the perspective of the lower-frequency word, 'Nol'(much), the Dice value is low.</Paragraph>
    <Paragraph position="6"> Thus, we also tested relative entropy based on the probability of low frequency. The two measures produce similar results if all values in the level set of R are considered instead of a specific value of α, but the entropy measure produces better results.</Paragraph>
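    The weakness of the Dice measure noted above can be seen in a small numerical sketch. It assumes the standard Dice coefficient, 2 f(x,y) / (f(x) + f(y)); the individual word frequencies used here are hypothetical, since only the common frequency of 3 is stated in the text.

```python
def dice(pair_freq, freq_x, freq_y):
    """Standard Dice coefficient computed from raw frequencies."""
    return 2.0 * pair_freq / (freq_x + freq_y)


# Hypothetical counts: a rare word (5 occurrences) co-occurring 3 times
# with a very frequent word (2000 occurrences).
score_unbalanced = dice(3, 5, 2000)

# The same co-occurrence count when both words are equally rare.
score_balanced = dice(3, 5, 5)

# Share of the rare word's occurrences that fall inside the pair.
share_of_rare = 3 / 5

print(score_unbalanced, score_balanced, share_of_rare)
```

    Under these assumed counts, the pair covers most occurrences of the rare word, yet the Dice value is pulled toward zero by the frequent partner. This is the kind of case in which a measure based on the lower-frequency word may be more informative.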
    <Paragraph position="7"> Figures 4 and 5 show all α-compatibility classes and the longest n-gram collocations of '~'(wear). Through our method, various kinds of collocations were extracted. In Figure 4, the order of components of a oe is by concordances.</Paragraph>
  </Section>
</Paper>