File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-3243_intro.xml

Size: 2,633 bytes

Last Modified: 2025-10-06 14:02:52

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3243">
  <Title>On Log-Likelihood-Ratios and the Significance of Rare Events</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
X
</SectionTitle>
    <Paragraph position="0"> for dealing with rare events, Agresti (1990, p. 246) cites studies showing &quot;X² is valid with smaller sample sizes and more sparse tables than G²&quot; [...] can be unreliable when expected frequencies of less than 5 are involved, depending on circumstances. The problem of rare events invariably arises whenever we deal with individual words because of the Zipfian phenomenon that, typically, no matter how large a corpus one has, most of the distinct words in it will occur only a small number of times. For example, in 500,000 English sentences sampled from the Canadian Hansards data supplied for the bilingual word alignment workshop held at HLT-NAACL 2003 (Mihalcea and Pedersen, 2003), there are 52,921 distinct word types, of which 60.5% occur five or fewer times, and 32.8% occur only once. The G² statistic has been most often used in NLP as a measure of the strength of association between words, but when we consider pairs of words, the sparse data problem becomes even worse. If we look at the 500,000 French sentences corresponding to the English sentences described above, we find 19,460,068 English-French word pairs that occur in aligned sentences more often than would be expected by chance, given their monolingual frequencies. Of these, 87.9% occur together five or fewer times, and 62.4% occur together only once. Moreover, if we look at the expected number of occurrences of these word pairs (which is the criterion used for determining the applicability of the X² significance tests), we find that 93.2% would be expected by chance to have fewer than five occurrences. Pedersen et al. (1996) report similar proportions for monolingual bigrams in the ACL/DCI Wall Street Journal corpus. Any statistical measure that is unreliable for expected frequencies of less than 5 would be totally unusable with such data.
[Footnotes: Dunning did not use the name G², but this appears to be its preferred name among statisticians (e.g., Agresti, 1990). Following Agresti, we use X² to denote the test statistic and χ² to denote the distribution it approximates.]</Paragraph>
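The applicability criterion mentioned above (expected frequencies of at least 5) can be checked directly from a contingency table. The following is a minimal sketch, not code from the paper: it computes expected counts under independence, Pearson's X², and G² with the standard textbook formulas. The function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence (row total * column total / N).

    Assumes a rectangular table of non-negative counts with a positive grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table):
    """Pearson's X^2: sum over cells of (observed - expected)^2 / expected.

    Assumes every row and column total is positive, so no expected count is zero.
    """
    exp = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))


def g2(table):
    """Log-likelihood-ratio statistic G^2: 2 * sum of observed * ln(observed / expected).

    Cells with an observed count of zero contribute nothing to the sum.
    """
    exp = expected_counts(table)
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(table, exp)
                     for o, e in zip(obs_row, exp_row)
                     if o > 0)


def has_sparse_cell(table, threshold=5.0):
    """True if any expected count falls below the threshold (default 5)."""
    return not all(e >= threshold for row in expected_counts(table) for e in row)


if __name__ == "__main__":
    # Hypothetical 2x2 table for one word pair in 500,000 aligned sentence pairs:
    # rows = English word present / absent, columns = French word present / absent.
    pair = [[1, 1], [1, 499997]]
    print("X2:", pearson_x2(pair))
    print("G2:", g2(pair))
    print("any expected count below 5:", has_sparse_cell(pair))
```

With these invented counts, each word occurs only twice overall, so the expected number of co-occurrences is far below 5. This is the situation the paragraph describes for the large majority of word pairs.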
    <Paragraph position="1"> 2 How to Estimate Significance for Rare Events
A wide variety of statistics have been used to measure strength of word association. In one paper alone (Inkpen and Hirst, 2002), pointwise mutual information, the Dice coefficient, X²</Paragraph>
  </Section>
</Paper>