Retrieving Collocations From Korean Text

2 Related Works

In determining properties of collocations, most corpus-based approaches have accepted that the words of a collocation have a particular statistical distribution (Cruse, 1986). Although previous approaches have shown good results in retrieving collocations and many properties have been identified, they depend heavily on the frequency factor.

(Choueka et al., 1983) proposed an algorithm for retrieving only uninterrupted collocations,² since they assumed that a collocation is a sequence of adjacent words that frequently appear together. (Church and Hanks, 1989) defined a collocation as a pair of correlated words and used mutual information to evaluate the lexical correlations of word pairs of length two. They retrieved interrupted as well as uninterrupted word pairs. (Haruno et al., 1996) constructed collocations by combining adjacent n-grams with high mutual information values.

(Breidt, 1993) was motivated by the fact that mutual information cannot give realistic figures for low frequencies, and used the t-score as a significance test for V-N combinations.

Martin noted that a span of 5 words on the left and right sides captures 95% of significant collocations in English (Martin, 1983). Based on this assumption, (Smadja, 1993) stored all bigrams of words along with their relative position p (-5 ≤ p ≤ 5). He evaluated the lexical strength of a word pair using a 'z-score' and the variance of its position distribution using a 'spread' measure. He defined a collocation as an arbitrary, domain-dependent, recurrent, and cohesive lexical cluster.

(Nagao and Mori, 1994) developed an algorithm for calculating adjacent n-grams up to an arbitrarily large n. However, it was hard to find an efficient n, and many fragments were obtained. In Korean, statistics based on adjacent n-grams are not sufficient to capture various types of collocations. (Shimohata et al., 1997) employed an entropy value to filter out fragments of the adjacent n-gram model. They evaluated disorderedness with the distribution of adjacent words preceding and following a string; strings with a high entropy value were accepted as collocations. This disorderedness measure is efficient for eliminating fragments but cannot handle interrupted collocations. In general, previous studies on collocations have dealt with restricted types and depend on filtering measures from a lexical point of view.

² Bigrams and n-grams can be either adjacent morphemes or morphemes separated by an arbitrary number of other words.
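As a rough illustration of the mutual information measure attributed to (Church and Hanks, 1989) above, the following sketch scores adjacent word pairs by pointwise mutual information. The tokenisation, the frequency cutoff, and the function name are assumptions made for the example, not details taken from the cited work.

    import math
    from collections import Counter

    def pmi_scores(tokens, min_count=5):
        # Pointwise mutual information for adjacent word pairs:
        #   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        # estimated from raw unigram and bigram counts; rare pairs are dropped
        # because PMI is unreliable at low frequencies.
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (x, y), c in bigrams.items():
            if c < min_count:
                continue
            p_xy = c / (n - 1)
            p_x = unigrams[x] / n
            p_y = unigrams[y] / n
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
        return scores

High-scoring pairs are candidate collocations; in practice a frequency cutoff or a significance test such as the t-score of (Breidt, 1993) is applied, since the raw measure overstates the association of rare pairs.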
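The 'z-score' and 'spread' statistics mentioned for (Smadja, 1993) can likewise be sketched. The exact normalisation and thresholds used in that work differ, so the helper below, including its name and arguments, is only an assumed illustration of combining frequency strength with positional variance.

    import statistics

    def strength_and_spread(position_counts, peer_frequencies):
        # position_counts: co-occurrence counts of one word pair at each relative
        # position p in -5..5 (p != 0), e.g. {-2: 1, 1: 14, 2: 3}.
        # peer_frequencies: total co-occurrence frequencies of all pairs sharing
        # the same base word, used to standardise this pair's frequency.
        freq = sum(position_counts.values())
        mean = statistics.mean(peer_frequencies)
        stdev = statistics.pstdev(peer_frequencies) or 1.0
        strength = (freq - mean) / stdev  # standardised frequency ("z-score")
        spread = statistics.pvariance(list(position_counts.values()))  # positional variance
        return strength, spread

Under this formulation a peaked positional histogram gives a large spread (the pair prefers specific relative positions), while a flat histogram gives a small one, which is the intuition behind filtering on both statistics.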
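Finally, the entropy-based filtering of (Shimohata et al., 1997) can be approximated as below. Computing the entropy of the words adjacent to a candidate string is the idea described above, but the concrete scanning code and parameter names are assumptions for illustration, not the authors' implementation.

    import math
    from collections import Counter

    def adjacent_entropy(corpus, candidate, side="right"):
        # Entropy of the distribution of words immediately preceding or following
        # every occurrence of `candidate` (a list of tokens) in `corpus`.
        # High entropy: many different neighbours, so the string is likely a
        # self-contained unit; low entropy: the string is probably a fragment
        # of a longer expression.
        n = len(candidate)
        neighbours = Counter()
        for i in range(len(corpus) - n + 1):
            if corpus[i:i + n] == candidate:
                j = i + n if side == "right" else i - 1
                if 0 <= j < len(corpus):
                    neighbours[corpus[j]] += 1
        total = sum(neighbours.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in neighbours.values())

Candidate strings whose left and right entropies both exceed a threshold would be kept; as noted above, this removes fragments of longer n-grams but says nothing about interrupted collocations.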