<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1049"> <Title>HYPOTHESIZING WORD ASSOCIATION FROM UNTAGGED TEXT</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Using mutual information to measure word association has become popular since [Church and Hanks, 1990] defined the word association ratio as the mutual information between two words. Word association ratios are a promising tool for lexicography, but the method seems to have at least two limitations: 1) much data involving low frequency words or word pairs cannot be used, and 2) generalization of word usage still depends entirely on lexicographers.</Paragraph> <Paragraph position="1"> In this paper, we propose an alternative (or extended) method for suggesting word associations using Chi-square statistics, which can be viewed as an approximation to mutual information. Rather than considering the significance of joint frequencies of word pairs, as [Church and Hanks, 1990] did, our algorithm uses joint frequencies of pairs of word groups. The algorithm employs a hill-climbing search for a pair of word groups that co-occurs significantly often.</Paragraph> <Paragraph position="2"> The benefits of this new approach are: 1) that we can consider even low frequency words and word pairs, and 2) that word groups or word associations can be generated automatically, that is, word associations are hypothesized automatically and can later be reviewed by a lexicographer.</Paragraph> <Paragraph position="3"> 3) A further benefit is that word associations can be used in parsing and understanding natural language, as well as in natural language generation [Smadja and McKeown, 1990].</Paragraph> <Paragraph position="4"> Our method proved to be 87% accurate in hypothesizing word associations for unobserved combinations of words in Japanese text, where accuracy was tested by human verification of a random sample of hypothesized word pairs. 
We extracted 14,407 observations of word co-occurrences, involving 3,195 nouns and 4,365 verb/argument pairs. From these we hypothesized 7,050 word associations. The corpus size was 280,000 words. We would like to apply the same approach to English.</Paragraph> </Section> </Paper>