
<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1147">
  <Title>Fast Computation of Lexical Affinity Models</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Modeling term co-occurrence is important for many natural language applications, such as topic segmentation (Ferret, 2002), query expansion (Vechtomova et al., 2003), machine translation (Tanaka, 2002), language modeling (Dagan et al., 1999; Yuret, 1998), and term weighting (Hisamitsu and Niwa, 2002). For these applications, we are interested in terms that co-occur in close proximity more often than expected by chance, for example, a2 &amp;quot;NEW&amp;quot;,&amp;quot;YORK&amp;quot;a3 , a2 &amp;quot;ACCURATE&amp;quot;,&amp;quot;EXACT&amp;quot;a3 and a2 &amp;quot;GASOLINE&amp;quot;,&amp;quot;CRUDE&amp;quot;a3 . These pairs of terms represent distinct lexical-semantic phenomena, and as consequence the terms have an affinity for each other. Examples of such affinities include synonyms (Terra and Clarke, 2003), verb similarities (Resnik and Diab, 2000) and word associations (Rapp, 2002).</Paragraph>
    <Paragraph position="1"> Ideally, a language model would capture the patterns of co-occurrences representing the affinity between terms. Unfortunately, statistical models used to capture language characteristics often do not take contextual information into account. Many models incorporating contextual information use only a select group of content words and the end product is a model for sequences of adjacent words (Rosenfeld, 1996; Beeferman et al., 1997; Niesler and Woodland, 1997).</Paragraph>
    <Paragraph position="2"> Practical problems exist when modeling text statistically, since we require a reasonably sized corpus in order to overcome sparseness problems, but at the same time we face the difficulty of scaling our algorithms to larger corpora (Rosenfeld, 2000). Attempts to scale language models to large corpora, in particular to the Web, have often used general-purpose search engines to generate term statistics (Berger and Miller, 1998; Zhu and Rosenfeld, 2001). However, many researchers are recognizing the limitations of relying on the statistics provided by commercial search engines (Zhu and Rosenfeld, 2001; Keller and Lapata, 2003). ACL 2004 features a workshop devoted to the problem of scaling human language technologies to terabyte-scale corpora.</Paragraph>
    <Paragraph position="3"> Another approach to capturing lexical affinity is through the use of similarity measures (Lee, 2001; Terra and Clarke, 2003). Turney (2001) used statistics supplied by the Altavista search engine to compute word similarity measures, solving a set of synonym questions taken from a series of practice exams for TOEFL (Test of English as a Foreign Language). While demonstrating the value of Web data for this application, that work was limited by the types of queries that the search engine supported.</Paragraph>
    <Paragraph position="4"> Terra and Clarke (2003) extended Turney's work, computing different similarity measures over a local collection of Web data using a custom search system. By gaining better control over search semantics, they were able to vary the techniques used to estimate term co-occurrence frequencies and achieved improved performance on the same question set in a smaller corpus. The choice of the term co-occurrence frequency estimates had a bigger impact on the results than the actual choice of similarity measure. For example, in the case of the pointwise mutual information measure (PMI), values for a4a6a5a8a7a10a9a11a13a12 are best estimated by counting the number of times the terms a7 and a11 appear together within 10-30 words. This experience suggests that the empirical distribution of distances between adjacent terms may represent a valuable tool for assessing term affinity. In this paper, we present an novel algorithm for computing these distributions over large corpora and compare them with the expected distribution under an independence assumption. null In section 2, we present an independence model and a parametric affinity model, used to capture term co-occurrence with support for distance information. In section 3 we describe our algorithm for computing lexical affinity over large corpora. Using this algorithm, affinity may be computed between terms consisting of individual words or phrases. Experiments and examples in the paper were generated by applying this algorithm to a terabyte of Web data.</Paragraph>
    <Paragraph position="5"> We discuss practical applications of our framework in section 4, which also provides validation of the approach.</Paragraph>
  </Section>
class="xml-element"></Paper>