
<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-1001">
  <Title>Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Kenneth W. Church
AT&amp;T Labs--Research
</SectionTitle>
    <Paragraph position="0"> Bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. Suffix arrays (Manber and Myers 1990) were /irst introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length N. To compute frequencies over all N(N + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. In this way, a prohibitive computation over substrings is reduced to a manageable computation over classes.</Paragraph>
    <Paragraph position="1"> This paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (dr)for all n-grams in two large corpora, an English corpus of 50 million words of Wall Street Journal and a Japanese corpus of 216 million characters of Mainichi Shimbun.</Paragraph>
    <Paragraph position="2"> The second half of the paper uses these frequencies to find &amp;quot;interesting&amp;quot; substrings. Lexicographers have been interested in n-grams with high mutual information (MI) where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. Residual inverse document frequency (RIDF) compares document frequency to another model of chance where terms with a particular term frequency are distributed randomly throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption) whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents). The combination of both MI and RIDF is better than either by itself in a Japanese word extraction task.</Paragraph>
  </Section>
class="xml-element"></Paper>