File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/j01-1001_intro.xml

Size: 65,680 bytes

Last Modified: 2025-10-06 14:01:03

<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-1001">
  <Title>Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus</Title>
  <Section position="2" start_page="0" end_page="24" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> We will use suffix arrays (Manber and Myers 1990) to compute a number of type/token statistics of interest, including term frequency and document frequency, for all n-grams in large corpora. Type/token statistics model the corpus as a sequence of N tokens (characters, words, terms, n-grams, etc.) drawn from a vocabulary of V types. Different tokenizing rules will be used for different corpora and for different applications. In this work, the English text is tokenized into a sequence of English words delimited by white space and the Japanese text is tokenized into a sequence of Japanese characters (typically one or two bytes each).</Paragraph>
    <Paragraph position="1"> Term frequency (tf) is the standard notion of frequency in corpus-based natural language processing (NLP); it counts the number of times that a type (term/word/n-gram) appears in a corpus. Document frequency (df) is borrowed from the information retrieval literature (Sparck Jones 1972); it counts the number of documents that contain a type at least once. Term frequency is an integer between 0 and N; document frequency is an integer between 0 and D, the number of documents in the corpus.</Paragraph>
    <Paragraph position="2"> The statistics, tf and df, and functions of these statistics such as mutual information (MI) and inverse document frequency (IDF), are usually computed over short n-grams such as unigrams, bigrams, and trigrams (substrings of 1-3 tokens) (Charniak 1993; Jelinek 1997). This paper will show how to work with much longer n-grams, including million-grams and even billion-grams.</Paragraph>
    <Paragraph position="3"> In corpus-based NLP, term frequencies are often converted into probabilities, using the maximum likelihood estimator (MLE), the Good-Turing method (Katz 1987), or Deleted Interpolation (Jelinek 1997, Chapter 15). These probabilities are used in noisy channel applications such as speech recognition to distinguish more likely sequences from less likely sequences, reducing the search space (perplexity) for the acoustic recognizer. In information retrieval, document frequencies are converted into inverse document frequency (IDF), which plays an important role in term weighting (Sparck Jones 1972).</Paragraph>
    <Paragraph position="4"> IDF(t) = -log_2 (df(t) / D). IDF(t) can be interpreted as the number of bits of information the system is given if it is told that the document in question contains the term t. Rare terms contribute more bits than common terms.</Paragraph>
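    <Paragraph> For concreteness, IDF can be computed directly from df(t) and D. The following minimal C sketch is illustrative only (the function name idf and its arguments are not part of the code described in this paper):
#include &lt;math.h&gt;

/* IDF(t) = -log2(df(t)/D): the number of bits of information gained
   by learning that a document contains the term t. */
double idf(double df, double D) {
    return -log2(df / D);
}
</Paragraph>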
    <Paragraph position="5"> Mutual information (MI) and residual IDF (RIDF) (Church and Gale 1995) both compare tf and df to what would be expected by chance, using two different notions of chance. MI compares the term frequency of an n-gram to what would be expected if the parts combined independently, whereas RIDF compares the document frequency of a term with what would be expected if a term with a given term frequency were randomly distributed throughout the collection. MI tends to pick out phrases with noncompositional semantics (which often violate the independence assumption) whereas RIDF tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents).</Paragraph>
    <Paragraph position="6"> Assuming a random distribution of a term (Poisson model), the probability p_θ(k) that a document will have exactly k instances of the term is:</Paragraph>
    <Paragraph position="8"> p_θ(k) = (θ^k e^(-θ)) / k!, where θ = np, n is the average length of a document, and p is the occurrence probability of the term. That is, θ = (N/D)(tf/N) = tf/D. Residual IDF is defined as the following formula: Residual IDF = observed IDF - predicted IDF = -log_2(df/D) - (-log_2(1 - e^(-θ))) = -log_2(df/D) + log_2(1 - e^(-θ)). The rest of the paper is divided into two sections. Section 2 describes the algorithms and the code that were used to compute term frequencies and document frequencies for all substrings in two large corpora, an English corpus of 50 million words of the Wall Street Journal, and a Japanese corpus of 216 million characters of the Mainichi Shimbun.</Paragraph>
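    <Paragraph> Residual IDF is equally simple to compute once tf, df, and D are known. A minimal C sketch under the Poisson model above (the function name ridf is illustrative, not code from this paper):
#include &lt;math.h&gt;

/* Residual IDF = observed IDF - predicted IDF.
   theta = tf/D is the expected number of instances per document;
   1 - exp(-theta) is the Poisson probability that a document
   contains the term at least once. */
double ridf(double tf, double df, double D) {
    double theta = tf / D;
    return -log2(df / D) + log2(1.0 - exp(-theta));
}
</Paragraph>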
    <Paragraph position="9"> Section 3 uses these frequencies to find &amp;quot;interesting&amp;quot; substrings, where what counts as &amp;quot;interesting&amp;quot; depends on the application. MI finds phrases of interest to lexicography, general vocabulary whose distribution is far from random combination of the parts, whereas RIDF picks out technical terminology, names, and keywords that are useful for information retrieval, whose distribution over documents is far from uniform or Poisson. These observations may be particularly useful for a Japanese word extraction task. Sequences of characters that are high in both MI and RIDF are more likely to be words than sequences that are high in just one, which are more likely than sequences that are high in neither.</Paragraph>
    <Paragraph position="10">  2. Computing tf and df for All Substrings</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Suffix Arrays
</SectionTitle>
      <Paragraph position="0"> This section will introduce an algorithm based on suffix arrays for computing tf and df and many functions of these quantities for all substrings in a corpus in O(NlogN) time, even though there are N(N + 1)/2 such substrings in a corpus of size N. The algorithm groups the N(N + 1)/2 substrings into at most 2N - 1 equivalence classes.</Paragraph>
      <Paragraph position="1"> By grouping substrings in this way, many of the statistics of interest can be computed over the relatively small number of classes, which is manageable, rather than over the quadratic number of substrings, which would be prohibitive.</Paragraph>
      <Paragraph position="2"> The suffix array data structure (Manber and Myers 1990) was introduced as a database indexing technique. Suffix arrays can be viewed as a compact representation of suffix trees (McCreight 1976; Ukkonen 1995), a data structure that has been extensively studied over the last thirty years. See Gusfield (1997) for a comprehensive introduction to suffix trees. Hui (1992) shows how to compute df for all substrings using generalized suffix trees. The major advantage of suffix arrays over suffix trees is space. The space requirements for suffix trees (but not for suffix arrays) grow with alphabet size: O(N|Σ|) space, where |Σ| is the alphabet size. The dependency on alphabet size is a serious issue for Japanese. Manber and Myers (1990) reported that suffix arrays are an order of magnitude more efficient in space than suffix trees, even in the case of relatively small alphabet size (|Σ| = 96). The advantages of suffix arrays over suffix trees become much more significant for larger alphabets such as Japanese characters (and English words).</Paragraph>
      <Paragraph position="3"> The suffix array data structure makes it convenient to compute the frequency and location of a substring (n-gram) in a long sequence (corpus). The early work was motivated by biological applications such as matching of DNA sequences. Suffix arrays are closely related to PAT arrays (Gonnet, Baeza-Yates, and Snider 1992), which were motivated in part by a project at the University of Waterloo to distribute the Oxford English Dictionary with indexes on CD-ROM. PAT arrays have also been motivated by applications in information retrieval. A similar data structure to suffix arrays was proposed by Nagao and Mori (1994) for processing Japanese text.</Paragraph>
      <Paragraph position="4"> The alphabet sizes vary considerably in each of these cases. DNA has a relatively small alphabet of just 4 characters, whereas Japanese has a relatively large alphabet of more than 5,000 characters. Methods such as suffix arrays and PAT arrays scale naturally over alphabet size. In the experimental section (Section 3) using the Wall Street Journal corpus, the suffix array is applied to a large corpus of English text, where the alphabet is assumed to be the set of all English words, an unbounded set. It is sometimes assumed that larger alphabets are more challenging than smaller ones, but ironically, it can be just the reverse because there is often an inverse relationship between the size of the alphabet and the length of meaningful or interesting substrings. For expository convenience, this section will use the letters of the alphabet, a-z, to denote tokens.</Paragraph>
      <Paragraph position="5"> This section starts by reviewing the construction of suffix arrays and how they have been used to compute the frequency and locations of a substring in a sequence.</Paragraph>
      <Paragraph position="6"> We will then show how these methods can be applied to find not only the frequency of a particular substring but also the frequency of all substrings. Finally, the methods are generalized to compute document frequencies as well as term frequencies.</Paragraph>
      <Paragraph position="7"> A suffix array, s, is an array of all N suffixes, sorted alphabetically. A suffix, s\[i\], also known as a semi-infinite string, is a string that starts at position i in the corpus and continues to the end of the corpus. In practical implementations, it is typically denoted by a four-byte integer, i. In this way, a small (constant) amount of space is used to represent a very long substring, which one might have thought would require N space.</Paragraph>
      <Paragraph position="8"> A substring, sub(i,j), is a prefix of a suffix. That is, sub(i,j), is the first j characters of the suffix s\[i\]. The corpus contains N(N + 1)/2 substrings.</Paragraph>
      <Paragraph position="9"> The algorithm, suffix_array, presented below takes a corpus and its length N as input, and outputs the suffix array, s.</Paragraph>
      <Paragraph position="10"> suffix_array ← function(corpus, N){ Initialize s to be a vector of integers from 0 to N - 1.</Paragraph>
      <Paragraph position="11"> Let each integer denote a suffix starting at s\[i\] in the corpus.</Paragraph>
      <Paragraph position="12"> Sort s so that the suffixes are in alphabetical order.</Paragraph>
      <Paragraph position="13"> Return s. } The C program below implements this algorithm.</Paragraph>
      <Paragraph position="14"> #include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

char *corpus;

/* Compare two suffixes, identified by their starting positions. */
int suffix_compare(const void *a, const void *b) {
    return strcmp(corpus + *(int *)a, corpus + *(int *)b);
}

int *suffix_array(int n) {
    int i, *s = (int *)malloc(n * sizeof(int));
    for (i = 0; i &lt; n; i++) s[i] = i;          /* initialize */
    qsort(s, n, sizeof(int), suffix_compare);  /* sort */
    return s;
}
Figures 1 and 2 illustrate a simple example where the corpus (&amp;quot;to_be_or_not_to_be&amp;quot;) consists of N = 18 characters (19 bytes): 13 alphabetic characters plus 5 spaces (and 1 null termination). The C program (above) starts by allocating memory for the suffix array (18 integers of 4 bytes each). The suffix array is initialized to the integers from 0 to 17. Finally, the suffix array is sorted into alphabetical order. The suffix array after initialization is shown in Figure 1. The suffix array after sorting is shown in Figure 2. As mentioned above, suffix arrays were designed to make it easy to compute the frequency (tf) and locations of a substring (n-gram or term) in a sequence (corpus). Given a substring or term, t, a binary search is used to find the first and last suffix that start with t. Let s\[i\] be the first such suffix and s\[j\] be the last such suffix. Then tf(t) = j - i + 1 and the term is located at positions {s\[i\], s\[i + 1\], ..., s\[j\]}, and only these positions.</Paragraph>
      <Paragraph position="15"> Figure 2 shows how this procedure can be used to compute the frequency and locations of the term &amp;quot;to_be&amp;quot; in the corpus &amp;quot;to_be_or_not_to_be&amp;quot;. As illustrated in the figure, s\[i = 16\] is the first suffix to start with the term &amp;quot;to_be&amp;quot; and s\[j = 17\] is the last suffix to start with this term.</Paragraph>
      <Paragraph position="17"> Figure 1 Illustration of a suffix array, s, that has just been initialized and not yet sorted (input corpus: &amp;quot;to_be_or_not_to_be&amp;quot;). Each element in the suffix array, s\[i\], is an integer denoting a suffix or a semi-infinite string, starting at position i in the corpus and extending to the end of the corpus.</Paragraph>
      <Paragraph position="19"> Figure 2 Illustration of the suffix array in Figure 1 after sorting. The integers in s are sorted so that the semi-infinite strings are now in alphabetical order.</Paragraph>
      <Paragraph position="20"> Consequently, tf(&amp;quot;to_be&amp;quot;) = 17 - 16 + 1 = 2. Moreover, the term appears at positions(&amp;quot;to_be&amp;quot;) = {s\[16\], s\[17\]} = {13, 0}, and only these positions. Similarly, the substring &amp;quot;to&amp;quot; has the same tf and positions, as do the substrings &amp;quot;to_&amp;quot; and &amp;quot;to_b&amp;quot;. Although there may be N(N + 1)/2 ways to pick i and j, it will turn out that we need only consider 2N - 1 of them when computing tf for all substrings.</Paragraph>
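      <Paragraph> The binary search itself is straightforward. The following minimal C sketch is illustrative (it assumes the globals corpus and a sorted suffix array s of length n, as produced by the code above; the helper names are hypothetical and do not appear in the appendices):
#include &lt;string.h&gt;

extern char *corpus;   /* the text, as in the suffix_array code above */
extern int  *s;        /* the sorted suffix array                     */

/* Compare term t (first m characters) with the prefix of suffix s[x]. */
static int cmp(char *t, int m, int x) { return strncmp(t, corpus + s[x], m); }

/* tf(t): the number of suffixes that start with t, found by two binary
   searches for the first and last such suffix. Returns 0 if t is absent. */
int term_frequency(char *t, int n) {
    int m = (int)strlen(t), lo, hi, first, last;
    lo = 0; hi = n - 1;
    while (lo &lt; hi) {                  /* leftmost suffix not less than t */
        int mid = (lo + hi) / 2;
        if (cmp(t, m, mid) &gt; 0) lo = mid + 1; else hi = mid;
    }
    if (cmp(t, m, lo) != 0) return 0;  /* t does not occur in the corpus */
    first = lo;
    hi = n - 1;
    while (lo &lt; hi) {                  /* rightmost suffix starting with t */
        int mid = (lo + hi + 1) / 2;
        if (cmp(t, m, mid) &lt; 0) hi = mid - 1; else lo = mid;
    }
    last = lo;
    return last - first + 1;
}
</Paragraph>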
      <Paragraph position="21"> Nagao and Mori (1994) ran this procedure quite successfully on a large corpus of Japanese text. They report that it takes O(NlogN) time, assuming that the sort step performs O(N log N) comparisons, and that each comparison takes constant time.</Paragraph>
      <Paragraph position="22"> While these are often reasonable assumptions, we have found that if the corpus contains long repeated substrings (e.g., duplicated articles), as our English corpus does (Paul and Baker 1992), then the sort can consume quadratic time, since each comparison can take order N time. Like Nagao and Mori (1994), we were also able to apply this procedure quite successfully to our Japanese corpus, but for the English corpus, after 50 hours of CPU time, we gave up and turned to Manber and Myers's (1990) algorithm, which took only two hours. Manber and Myers's algorithm uses some clever, but difficult to describe, techniques to achieve O(N log N) time, even for a corpus with long repeated substrings. For a corpus that would otherwise consume quadratic time, the Manber and Myers algorithm is well worth the effort, but otherwise, the procedure described above is simpler, and can even be a bit faster.</Paragraph>
      <Paragraph position="23"> The &amp;quot;to_be_or_not_to_be&amp;quot; example used the standard English alphabet (one byte per character). As mentioned above, suffix arrays can be generalized in a straightforward way to work with larger alphabets such as Japanese (typically two bytes per character).</Paragraph>
      <Paragraph position="24"> In the experimental section (Section 3), we use an open-ended set of English words as the alphabet. Each token (English word) is represented as a four-byte pointer into a symbol table (dictionary). The corpus &amp;quot;to_be_or_not_to_be&amp;quot;, for example, is tokenized into six tokens: &amp;quot;to&amp;quot;, &amp;quot;be&amp;quot;, &amp;quot;or&amp;quot;, &amp;quot;not&amp;quot;, &amp;quot;to&amp;quot;, and &amp;quot;be&amp;quot;, where each token is represented as a four-byte pointer into a dictionary.</Paragraph>
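      <Paragraph> A minimal C sketch of such a symbol table is given below for illustration (a linear scan is adequate only for small vocabularies; the names word_id, dict, and ntypes are hypothetical and are not the code used in the experiments):
#include &lt;string.h&gt;

#define MAXTYPES 1000000
static char *dict[MAXTYPES];   /* symbol table: word id to word string */
static int  ntypes = 0;        /* number of distinct words seen so far */

/* Return the integer id for a word, adding it to the table if it is new.
   Each token in the corpus is then represented by this four-byte id.   */
int word_id(char *word) {
    int i;
    for (i = 0; i &lt; ntypes; i++)
        if (strcmp(dict[i], word) == 0) return i;
    dict[ntypes] = strdup(word);
    return ntypes++;
}
</Paragraph>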
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Longest Common Prefixes (LCPs)
</SectionTitle>
      <Paragraph position="0"> Algorithms from the suffix array literature make use of an auxiliary array for storing LCPs (longest common prefixes). The lcp array contains N + 1 integers. Each element, lcp\[i\], indicates the length of the common prefix between s\[i - 1\] and s\[i\]. We pad the lcp vector with zeros (lcp\[0\] = lcp\[N\] = 0) to simplify the discussion. The padding avoids the need to test for certain end conditions.</Paragraph>
      <Paragraph position="1"> Figure 3 shows the lcp vector for the suffix array of &amp;quot;to_be_or_not_to_be&amp;quot;. For example, since s\[10\] and s\[11\] both start with the substring &amp;quot;o_be&amp;quot;, lcp\[11\] is set to 4, the length of the longest common prefix. Manber and Myers (1990) use the lcp vector in their O(P + log N) algorithm for computing the frequency and location of a substring of length P in a sequence of length N. They showed that the lcp vector can be computed in O(N log N) time. These algorithms are much faster than the obvious straightforward implementation when the corpus contains long repeated substrings, though for many corpora, the complications required to avoid quadratic behavior are unnecessary.</Paragraph>
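      <Paragraph> For illustration, the lcp vector can also be computed naively by comparing adjacent suffixes in the sorted order. This simple approach is quadratic in the worst case (exactly the long-repeated-substring situation that Manber and Myers's algorithm avoids), but it is adequate for many corpora. A minimal C sketch (the names are hypothetical):
#include &lt;stdlib.h&gt;

/* Length of the common prefix of two NUL-terminated strings. */
static int common_prefix(char *a, char *b) {
    int k = 0;
    while (a[k] != '\0') {
        if (a[k] != b[k]) break;
        k++;
    }
    return k;
}

/* lcp[0..n]: lcp[i] is the length of the common prefix of the suffixes
   s[i-1] and s[i]; lcp[0] and lcp[n] are padded with zeros.            */
int *compute_lcp(char *corpus, int *s, int n) {
    int i, *lcp = (int *)malloc((n + 1) * sizeof(int));
    lcp[0] = lcp[n] = 0;
    for (i = 1; i &lt; n; i++)
        lcp[i] = common_prefix(corpus + s[i - 1], corpus + s[i]);
    return lcp;
}
</Paragraph>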
    </Section>
    <Section position="3" start_page="0" end_page="7" type="sub_section">
      <SectionTitle>
2.3 Classes of Substrings
</SectionTitle>
      <Paragraph position="0"> Thus far we have seen how to compute tf for a single n-gram, but how do we compute tf and df for all n-grams? As mentioned above, the N(N + 1)/2 substrings will be clustered into a relatively small number of classes, and then the statistics will be computed over the classes rather than over the substrings, which would be prohibitive. The reduction of the computation over substrings to a computation over classes is made possible by four properties.</Paragraph>
      <Paragraph position="1"> Properties 1-2: all substrings in a class have the same statistics (at least for the statistics of interest, namely tf and df), Property 3: the set of all substrings is partitioned by the classes, and Property 4: there are many fewer classes (order N) than substrings (order N^2).</Paragraph>
      <Paragraph position="2"> Classes are defined in terms of intervals. Let (i,j) be an interval on the suffix array, s\[i\], s\[i + 1\], ..., s\[j\]. Class((i,j)) is the set of substrings that start every suffix within the interval and no suffix outside the interval. It follows from this construction that all substrings in a class have tf = j - i + 1.</Paragraph>
      <Paragraph position="3"> The set of substrings in a class can be constructed from the lcp vector: class((i,j)) = {s\[i\]m | max(lcp\[i\], lcp\[j + 1\]) &lt; m &lt;= min(lcp\[i + 1\], lcp\[i + 2\], ..., lcp\[j\])}, where s\[i\]m denotes the first m characters of the suffix s\[i\]. We will refer to lcp\[i\] and lcp\[j + 1\] as the bounding lcps of the interval, and lcp\[i + 1\], lcp\[i + 2\], ..., lcp\[j\] as the interior lcps.</Paragraph>
      <Paragraph position="5"> Figure 3 The longest common prefix (lcp) vector is a vector of N + 1 integers; lcp\[i\] denotes the length of the common prefix between the suffix s\[i - 1\] and the suffix s\[i\]. Thus, for example, s\[10\] and s\[11\] share a common prefix of four characters, and therefore lcp\[11\] = 4. The common prefix is highlighted by a dotted line in the suffix array. The suffix array is the same as in the previous figure.</Paragraph>
      <Paragraph position="6"> Figure 4 Six suffixes are copied from Figure 3, s\[9\]-s\[14\], along with eight of their lcp-delimited intervals. Vertical lines denote lcps; the gray area denotes the endpoints of substrings in class(&lt;10,11&gt;). Two of the lcp-delimited intervals are nontrivial (tf &gt; 1), and six are trivial (tf = 1). Intervals are associated with classes, sets of substrings. These substrings start every suffix within the interval and no suffix outside the interval. All of the substrings within a class have the same term frequency (and document frequency).</Paragraph>
      <Paragraph position="8"> Using the abbreviations LBL (longest bounding lcp) and SIL (shortest interior lcp), the formula above can be rewritten as class((i,j)) = {s\[i\]m | LBL((i,j)) &lt; m &lt;= SIL((i,j))}, where LBL((i,j)) = max(lcp\[i\], lcp\[j + 1\]) and SIL((i,j)) = min(lcp\[i + 1\], lcp\[i + 2\], ..., lcp\[j\]).</Paragraph>
      <Paragraph position="10"> By construction, the class will be empty unless there is some room for m between the LBL and SIL. We say that an interval is lcp-delimited when this room exists (that is, LBL &lt; SIL). Except for trivial intervals where tf = 1 (see below), classes are nonempty iff the interval is lcp-delimited. Moreover, the number of substrings in a nontrivial class depends on the amount of room between the LBL and the SIL. That is, |class((i,j))| = SIL((i,j)) - LBL((i,j)).</Paragraph>
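      <Paragraph> In code, the size of a class reduces to two scans over the lcp vector. A minimal C sketch for a nontrivial interval (assuming an lcp vector like the one sketched above; the function name is hypothetical):
/* |class(&lt;i,j&gt;)| for a nontrivial interval i &lt; j: the number of
   substrings that start every suffix in the interval and no suffix
   outside it. Returns 0 when the interval is not lcp-delimited.     */
int class_size(int *lcp, int i, int j) {
    int LBL, SIL, k;
    LBL = lcp[i];                              /* longest bounding lcp  */
    if (lcp[j + 1] &gt; LBL) LBL = lcp[j + 1];
    SIL = lcp[i + 1];                          /* shortest interior lcp */
    for (k = i + 2; k &lt;= j; k++)
        if (lcp[k] &lt; SIL) SIL = lcp[k];
    if (SIL &lt;= LBL) return 0;
    return SIL - LBL;
}
</Paragraph>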
      <Paragraph position="11"> Figure 4 shows eight examples of lcp-delimited intervals. The top part of the figure highlights the interval (10,11) with dashed horizontal lines. Solid vertical lines denote bounding lcps, and thin vertical lines denote interior lcps (there is only one interior lcp in this case). The interval (10, 11) is lcp-delimited because the bounding lcps, lcp\[10\] = 0 and lcp\[12\] = 1, are smaller than the interior lcp, lcp\[11\] = 4. That is, the LBL (= 1) is less than the SIL (= 4). Thus there is room for m between the LBL of s\[10\]m and the SIL of s\[10\]m. The endpoints m between LBL and SIL are highlighted in gray. The class is nonempty. Its size depends on the width of the gray area: class((10, 11)) = {s\[10\]m | 1 &lt; m &lt;= 4} = {&amp;quot;o_&amp;quot;, &amp;quot;o_b&amp;quot;, &amp;quot;o_be&amp;quot;}. These substrings have the same tf: tf = j - i + 1 = 11 - 10 + 1 = 2. Each of these substrings occurs exactly twice in the corpus.</Paragraph>
      <Paragraph position="12"> Every substring in the class starts every suffix in the interval (10,11), and no suffix outside (10,11). In particular, the substring &amp;quot;o&amp;quot; is excluded from the class, because it is shared by suffixes outside the interval, namely s\[12\] and s\[13\]. The longer substring, &amp;quot;o_be_&amp;quot;, is excluded from the class because it is not shared by s\[10\], a suffix within the interval.</Paragraph>
      <Paragraph position="13"> We call an interval trivial if the interval starts and ends at the same place: (i, i).</Paragraph>
      <Paragraph position="14"> The remaining six intervals mentioned in Figure 4 are trivial intervals. We call the class of a trivial interval a trivial class. As in the nontrivial case, the class contains all (and only) the substrings that start every suffix within the interval and no suffix outside the interval. We can express the class of a trivial interval, class((i, i)), as {s\[i\]m | LBL &lt; m &lt;= SIL}. The trivial case is the same as the nontrivial case, except that the SIL of a trivial interval is defined to be infinite. As a result, trivial classes are usually quite large, because they contain all prefixes of s\[i\] that are longer than the LBL. They cover all (and only) the substrings with tf = 1, typically the bulk of the N(N + 1)/2 substrings in a corpus. The trivial class of the interval (11,11), for example, contains 13 substrings: &amp;quot;o_be_&amp;quot;, &amp;quot;o_be_o&amp;quot;, &amp;quot;o_be_or&amp;quot;, and so on. Of course, there are some exceptions: the trivial class, class((10, 10)), in Figure 4, for example, is very small (= empty set).</Paragraph>
      <Paragraph position="15"> Not every interval is lcp-delimited. The interval (11, 12), for example, is not lcp-delimited because there is no room for m of s\[11\]m between the LBL (= 4) and the SIL (= 1). When the interval is not lcp-delimited, the class is empty. There are no substrings starting all the suffixes within the interval (11,12), and not starting any suffix outside the interval.</Paragraph>
      <Paragraph position="16"> It is possible for lcp-delimited intervals to be nested, as in the case of (10,11) and (10, 13). We say that one interval (u, v) is nested within another (i,j) if i &lt;= u &lt;= v &lt;= j (and (i,j) ≠ (u, v)). Nested intervals have distinct SILs and disjoint classes. (Two classes are disjoint if the corresponding sets of substrings are disjoint.) 2 The substrings in the class of the nested interval, (u, v), are longer than the substrings in the class of the outer interval, (i,j).</Paragraph>
      <Paragraph position="17"> Although it is possible for lcp-delimited intervals to be nested, it is not possible for lcp-delimited intervals to overlap. We say that one nontrivial interval (a, b) overlaps another nontrivial interval (c, d) if a &lt; c &lt; b &lt; d. If two intervals overlap, then at least one of the intervals is not lcp-delimited and has an empty class. If an interval (a, b) is lcp-delimited, an overlapped interval (c, d) is not lcp-delimited. Because a bounding lcp of (a, b) must be within (c, d) and an interior lcp of (a, b) must be a bounding lcp of (c, d), SIL((c,d)) &lt;= LBL((a,b)) &lt;= SIL((a,b)) &lt;= LBL((c,d)). That is, the overlapped interval (c, d) is not lcp-delimited. The fact that lcp-delimited intervals are nested and do not overlap will turn out to be convenient for enumerating lcp-delimited intervals.</Paragraph>
      <Paragraph position="18"> 2 Because (u, v) is lcp-delimited, there must be a bounding lcp of (u, v) that is smaller than any lcp within (u, v). This bounding lcp must be within (i,j), and as a result, class((i,j)) and class((u, v)) must be disjoint.</Paragraph>
    </Section>
    <Section position="4" start_page="7" end_page="10" type="sub_section">
      <SectionTitle>
2.4 Four Properties
</SectionTitle>
      <Paragraph position="0"> As mentioned above, classes are constructed so that it is practical to reduce the computation of various statistics over substrings to a computation over classes. This sub-section will discuss four properties of classes that help make this reduction feasible.</Paragraph>
      <Paragraph position="1"> The first two properties are convenient because they allow us to associate tf and df with classes rather than with substrings. The substrings in a class all have the same tf value (property 1) and the same df value (property 2). That is, if s1 and s2 are two substrings in class((i,j)), then tf(s1) = tf(s2) and df(s1) = df(s2).</Paragraph>
      <Paragraph position="3"> Both of these properties follow straightforwardly from the construction of intervals.</Paragraph>
      <Paragraph position="4"> The value of tf is a simple function of the endpoints; the calculation of df is more complicated and will be discussed in Section 2.6. While tf and df treat each member of a class as equivalent, not all statistics do. Mutual information (MI) is an important counterexample; in most cases, MI(s1) ≠ MI(s2).</Paragraph>
      <Paragraph position="5"> The third property is convenient because it allows us to iterate over classes rather than substrings, without worrying about missing any of the substrings.</Paragraph>
      <Paragraph position="6"> Property 3: The classes partition the set of all substrings.</Paragraph>
      <Paragraph position="7"> There are two parts to this argument: every substring belongs to at most one class (property 3a), and every substring belongs to at least one class (property 3b).</Paragraph>
      <Paragraph position="8"> Demonstration of property 3a (proof by contradiction): Suppose there is a substring, s, that is a member of two distinct classes: class((i,j)) and class((u,v)). There are three possibilities: one interval precedes the other, they are properly nested, or they overlap. In all three cases, s cannot be a member of both classes. If one interval precedes the other, then there must be a bounding lcp between the two intervals which is shorter than s. And therefore, s cannot be in both classes. The nesting case was mentioned previously where it was noted that nested intervals have disjoint classes. The overlapping case was also discussed previously where it was noted that two overlapping intervals cannot both be lcp-delimited, and therefore at least one of the classes would have to be empty.</Paragraph>
      <Paragraph position="9"> Demonstration of property 3b (constructive argument): Let s be an arbitrary substring in the corpus. There will be at least one suffix in the suffix array that starts with s. Let i be the first such suffix and let j be the last such suffix. By construction, the interval (i,j) is lcp-delimited (LBL((i,j)) &lt; |s| and SIL((i,j)) &gt;= |s|), and therefore, s is an element of class((i,j)).</Paragraph>
      <Paragraph position="10"> Finally, as mentioned above, computing over classes is much more efficient than computing over the substrings themselves because there are many fewer classes (at most 2N - 1) than substrings (N(N + 1)/2).</Paragraph>
      <Paragraph position="11"> Property 4: There are at most N nonempty classes with tf = 1 and at most N - 1 nonempty classes with tf &gt; 1.</Paragraph>
      <Paragraph position="12"> The first clause is relatively straightforward. There are N trivial intervals (i, i). These are all and only the intervals with tf = 1. By construction, these intervals are lcp-delimited, though it is possible that a few of the classes could be empty.</Paragraph>
      <Paragraph position="13"> To argue the second clause, we make use of a uniqueness property: an lcp-delimited interval &lt;i,j&gt; can be uniquely determined by an SIL and a representative element k, where i &lt; k &lt;= j. For convenience, we will choose k such that SIL(&lt;i,j&gt;) = lcp\[k\], but we could have uniquely determined the lcp-delimited interval by choosing any k such that i &lt; k &lt;= j.</Paragraph>
      <Paragraph position="14"> The uniqueness property can be demonstrated using a proof by contradiction.</Paragraph>
      <Paragraph position="15"> Suppose there were two distinct lcp-delimited intervals, &lt;i,j&gt; and &lt;u, v&gt;, with the same representative k, where i &lt; k &lt;= j and u &lt; k &lt;= v. Since they share a common representative, k, one interval must be nested inside the other. But nested intervals have disjoint classes and different SILs.</Paragraph>
      <Paragraph position="16"> Given this uniqueness property, we can determine the N - 1 upper bound on the number of lcp-delimited intervals by considering the N - 1 elements in the lcp vector. Each of these elements, lcp\[k\], has the opportunity to become the SIL of an lcp-delimited interval &lt;i,j&gt; with a representative k. Thus there could be as many as N - 1 lcp-delimited intervals (though there could be fewer if some of the opportunities don't work out). Moreover, there cannot be any more intervals with tf &gt; 1, because if there were one, its SIL should have been in the lcp vector. (Note that this lcp counting argument does not count trivial intervals because their SILs \[= infinity\] are not in the lcp vector; the lcp vector contains integers less than N.) From property 4, it follows that there are at most N distinct values of RIDF. The N trivial intervals &lt;i, i&gt; have just one RIDF value since tf = df = 1 for these intervals. The other N - 1 intervals could have as many as another N - 1 RIDF values. Similar arguments hold for many other statistics that make use of tf and df, and treat all members of a class as equivalent.</Paragraph>
      <Paragraph position="17"> In summary, the four properties taken collectively make it practical to compute tf, df, and RIDF over a relatively small number of classes; it would be prohibitively expensive to compute these quantities directly over the N(N + 1)/2 substrings.</Paragraph>
    </Section>
    <Section position="5" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
2.5 Computing All Classes Using Suffix Arrays
</SectionTitle>
      <Paragraph position="0"> This subsection describes a single-pass procedure, print_LDIs, for computing tf for all LDIs (lcp-delimited intervals). Since lcp-delimited intervals are properly nested, the procedure is based on a push-down stack. The procedure outputs four quantities for each lcp-delimited interval, &lt;i,j&gt;. The four quantities are the two endpoints (i and j), the term frequency (tf), and a representative (k), such that i &lt; k &lt;= j and lcp\[k\] = SIL(&lt;i,j&gt;).</Paragraph>
      <Paragraph position="1"> This procedure will be described twice. The first implementation is expressed in a recursive form; the second implementation avoids recursion by implementing the stack explicitly.</Paragraph>
      <Paragraph position="2"> The recursive implementation is presented first, because it is simpler. The function print_LDIs is initially called with print_LDIs(0,0), which will cause the function to be called once for each value of k between 0 and N - 1. k is a representative in the range i &lt; k &lt;= j, where i and j are the endpoints of an interval. For each of the N values of k, a trivial LDI is reported at &lt;k, k&gt;. In addition, there could be up to N - 1 nontrivial intervals, where k is the representative and lcp\[k\] is the SIL. Recall that lcp-delimited intervals are uniquely determined by a representative k such that i &lt; k &lt;= j where SIL(&lt;i,j&gt;) = lcp\[k\]. Not all of these candidates will produce LDIs.</Paragraph>
      <Paragraph position="3"> The recursion searches for j's such that LBL(&lt;i,j&gt;) &lt;= SIL(&lt;i,j&gt;), but reports intervals at &lt;i,j&gt; only when the inequality is strict, that is, LBL(&lt;i,j&gt;) &lt; SIL(&lt;i,j&gt;). The program stack keeps track of the left and right edges of these intervals. While lcp\[k\] is monotonically increasing, the left edge is remembered on the stack, as print_LDIs is called recursively. The recursion unwinds when lcp\[j + 1\] &lt; lcp\[k\]. Figure 5 illustrates the function calls for computing the nontrivial lcp-delimited intervals in Figure 4. C code is provided in Appendix A.</Paragraph>
      <Paragraph position="6"> Figure 5 Trace of function calls for computing the nontrivial lcp-delimited intervals in Figure 4. In this trace, trivial intervals are omitted. Print_LDIs(i = x,k = y) represents a function call with arguments, i and k. Indentation represents the nest of recursive calls. Print_LDIs(i,k) searches the right edge, j, of the non-trivial lcp-delimited interval, &lt;i,j&gt;, whose SIL is lcp\[k\]. Each representative, k, value is given to the function print_LDIs just once (dotted arcs).</Paragraph>
      <Paragraph position="7"> print_LDIs ← function(i, k) { j ← k.</Paragraph>
      <Paragraph position="8"> Output a trivial lcp-delimited interval &lt;k, k&gt; with tf = 1.</Paragraph>
      <Paragraph position="9"> While lcp\[k\] &lt;= lcp\[j + 1\] and j + 1 &lt; N, do j ← print_LDIs(k, j + 1). Output an interval &lt;i,j&gt; with tf = j - i + 1 and rep = k, if it is lcp-delimited. Return j. } The second implementation (below) introduces its own explicit stack, a complication that turns out to be important in practice, especially for large corpora. C code is provided in Appendix B.</Paragraph>
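      <Paragraph> Before turning to the stack-based version, the recursive pseudocode above can be rendered directly in C. The following minimal sketch is illustrative (it is not the Appendix A code; it assumes the lcp vector and the corpus length N are available as globals and simply prints each interval):
#include &lt;stdio.h&gt;

extern int *lcp;   /* lcp[0..N], padded with zeros at both ends */
extern int  N;     /* corpus length                             */

/* Report the trivial interval &lt;k,k&gt; and every lcp-delimited interval
   whose representative is k; returns the right edge j of the interval
   whose SIL is lcp[k]. Initial call: print_LDIs(0, 0).                */
int print_LDIs(int i, int k) {
    int j = k;
    printf("trivial &lt;%d,%d&gt; tf=1\n", k, k);
    while (j + 1 &lt; N) {
        if (lcp[k] &gt; lcp[j + 1]) break;   /* loop while lcp[k] &lt;= lcp[j+1] */
        j = print_LDIs(k, j + 1);
    }
    /* report &lt;i,j&gt; only if lcp-delimited: LBL = max(lcp[i], lcp[j+1]) &lt; lcp[k] = SIL */
    if (lcp[i] &lt; lcp[k])
        if (lcp[j + 1] &lt; lcp[k])
            printf("interval &lt;%d,%d&gt; tf=%d rep=%d\n", i, j, j - i + 1, k);
    return j;
}
</Paragraph>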
      <Paragraph position="10"> print_LDIs_stack ← function(N){ stack_i ← an integer array for the stack of the left edges, i.</Paragraph>
      <Paragraph position="11"> stack_k ← an integer array for the stack of the representatives, k.</Paragraph>
      <Paragraph position="13"> Resulting non-trivial lcp-delimited intervals: ('rep' means a representative, k.)</Paragraph>
      <Paragraph position="15"> A suffix array for a corpus consisting of three documents. The special character $ denotes the end of a document. The procedure outputs a sequence of intervals with their term frequencies and document frequencies. These results are also presented for the nontrivial intervals.</Paragraph>
    </Section>
    <Section position="6" start_page="11" end_page="14" type="sub_section">
      <SectionTitle>
2.6 Computing df for All Classes
</SectionTitle>
      <Paragraph position="0"> Thus far we have seen how to compute term frequency, tf, for all substrings (n-grams) in a sequence (corpus). This section will extend the solution to compute document frequency, dr, as well as term frequency. The solution runs in O(NlogN) time and O(N) space. C code is provided in Appendix C.</Paragraph>
      <Paragraph position="1"> This section will use the running example shown in Figure 6, where the corpus is: &amp;quot;to_be$or$not_to_be$&amp;quot;. The corpus consists of three documents, &amp;quot;to_be$&amp;quot;, &amp;quot;or$&amp;quot;, and &amp;quot;not_to_be$&amp;quot;. The special character $ is used to denote the end of a document. The procedure outputs a sequence of intervals with their term frequencies and document frequencies. These results are also presented for the nontrivial intervals.</Paragraph>
      <Paragraph position="2"> The suffix array is computed using the same procedures discussed above. In addition to the suffix array and the lcp vector, Figure 6 introduces a new third table that is used to map from suffixes to document ids. This table of document ids will be used by the function get_docnum to map from suffixes to document ids. Suffixes are terminated in Figure 6 after the first end of document symbol, unlike before, where suffixes were terminated with the end of corpus symbol.</Paragraph>
      <Paragraph position="3"> A straightforward method for computing df for an interval is to enumerate the suffixes within the interval and then compute their document ids, remove duplicates, and return the number of distinct documents. Thus, for example, df(&amp;quot;o&amp;quot;) in Figure 6 can be computed by finding the corresponding interval, (11,14), where every suffix within the interval starts with &amp;quot;o&amp;quot; and no suffix outside the interval starts with &amp;quot;o&amp;quot;. Then we enumerate the suffixes within the interval {s\[11\], s\[12\], s\[13\], s\[14\]}, compute their document ids, {0, 2, 1, 2}, and remove duplicates. In the end we discover that df(&amp;quot;o&amp;quot;) = 3. That is, &amp;quot;o&amp;quot; appears in all three documents.</Paragraph>
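      <Paragraph> A minimal C sketch of this straightforward method (the function naive_df and the seen array are hypothetical; get_docnum maps a suffix to its document id, as in Figure 6):
#include &lt;stdlib.h&gt;

extern int *s;                  /* the suffix array        */
extern int  D;                  /* the number of documents */
extern int  get_docnum(int suffix);

/* Naive df for the interval &lt;i,j&gt;: enumerate the suffixes, map each to
   its document id, and count distinct ids with a seen array.           */
int naive_df(int i, int j) {
    int x, df = 0;
    char *seen = (char *)calloc(D, 1);
    for (x = i; x &lt;= j; x++) {
        int doc = get_docnum(s[x]);
        if (seen[doc] == 0) { seen[doc] = 1; df++; }
    }
    free(seen);
    return df;
}
</Paragraph>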
      <Paragraph position="4"> Unfortunately, this straightforward approach is almost certainly too slow. Some document ids will be computed multiple times, especially when suffixes appear in nested intervals. We take advantage of the nesting property of lcp-delimited intervals to compute all df's efficiently. The df of an lcp-delimited interval can be computed recursively in terms of its constituents (nested subintervals), thus avoiding unnecessary recomputation.</Paragraph>
      <Paragraph position="5"> The procedure print_LDIs_with_df presented below is similar to print_LDIs_stack but modified to compute df as well as tf. The stack keeps track of i and k, as before, but now the stack also keeps track of df.</Paragraph>
      <Paragraph position="6"> i, the left edge of an interval; k, the representative (SIL = lcp\[k\]); and df, partial results for df, counting documents seen thus far, minus duplicates.</Paragraph>
      <Paragraph position="7"> print_LDIs_with_df ← function(N){ stack_i ← an integer array for the stack of the left edges, i.</Paragraph>
      <Paragraph position="8"> stack_k ← an integer array for the stack of the representatives, k.</Paragraph>
      <Paragraph position="9"> stack_df ← an integer array for the stack of the df counter.</Paragraph>
      <Paragraph position="10"> doclink\[0..D - 1\] ← an integer array for the document links, initialized with -1.</Paragraph>
      <Paragraph position="11">  D = the number of documents.</Paragraph>
      <Paragraph position="12"> stack_i\[0\] ← 0.</Paragraph>
      <Paragraph position="13"> stack_k\[0\] ← 0.</Paragraph>
      <Paragraph position="14"> stack_df\[0\] ← 1.</Paragraph>
      <Paragraph position="15"> sp ← 1 (a stack pointer).</Paragraph>
      <Paragraph position="16">  (1) For j ← 0, 1, 2, ..., N - 1 do (2) (Output a trivial lcp-delimited interval &lt;j,j&gt; with tf = 1 and df = 1.) (3) doc ← get_docnum(s\[j\]) (4) if doclink\[doc\] ≠ -1, do (5) let x be the largest x such that doclink\[doc\] &gt;= stack_i\[x\].</Paragraph>
      <Paragraph position="17"> (6) stack_df\[x\] ← stack_df\[x\] - 1.</Paragraph>
      <Paragraph position="18"> (7) doclink\[doc\] ← j.</Paragraph>
      <Paragraph position="19"> (8) df ← 1.</Paragraph>
      <Paragraph position="20"> (9) While lcp\[j + 1\] &lt; lcp\[stack_k\[sp - 1\]\] do (10) df ← stack_df\[sp - 1\] + df.</Paragraph>
      <Paragraph position="21"> (11) Output a nontrivial interval (i,j) with tf = j - i + 1 and df, if it is lcp-delimited.</Paragraph>
      <Paragraph position="22"> (12) sp ← sp - 1.</Paragraph>
      <Paragraph position="23"> (13) stack_i\[sp\] ← stack_k\[sp - 1\].</Paragraph>
      <Paragraph position="24"> (14) stack_k\[sp\] ← j + 1.</Paragraph>
      <Paragraph position="25"> (15) stack_df\[sp\] ← df.</Paragraph>
      <Paragraph position="26">  Lines 5 and 6 take care of duplicate documents. The duplication processing makes use of doclink (an array of length D, the number of documents in the collection), which keeps track of which suffixes have been seen in which document. doclink is initialized with -1, indicating that no suffixes have been seen yet. As suffixes are processed, doclink is updated (on line 7) so that doclink\[d\] contains the most recently processed suffix in document d. As illustrated in Figure 7, when j = 16 (snapshot A), the most recently processed suffix in document 0 is s\[11\] (&amp;quot;o_be$&amp;quot;), the most recently processed suffix in document 1 is s\[15\] (&amp;quot;r$&amp;quot;), and the most recently processed suffix in document 2 is s\[16\] (&amp;quot;t_to_be$&amp;quot;). Thus, doclink\[0\] = 11, doclink\[1\] = 15, and doclink\[2\] = 16. After processing s\[17\] (&amp;quot;to_be$&amp;quot;), which is in document 0, doclink\[0\] is updated from 11 to 17, as shown in snapshot B of Figure 7.</Paragraph>
      <Paragraph position="27"> stack_df keeps track of document frequencies as suffixes are processed. The invariant is: stack_df\[x\] contains the document frequency for suffixes seen thus far starting at i = stack_i\[x\]. (x is a stack offset.) When a new suffix is processed, line 5 checks for double counting by searching for intervals on the stack (still being processed) that have suffixes in the same document as the current suffix. If there is any double counting, stack_df is decremented appropriately on line 6.</Paragraph>
      <Paragraph position="28"> There is an example of this decrementing in snapshot C of Figure 7, highlighted by the circle around the binding of df to 0 on the stack element: \[i = 0, k = 17, df = 0\]. Note that df was previously bound to 1 in snapshot B. The binding of df was decremented when processing s\[18\] because s\[18\] is in the same document as s\[16\]. This duplication was identified by line 5. The decrementing was performed by line 6. Intervals are processed in depth-first order, so that more deeply nested intervals are processed before less deeply nested intervals. In this way, double counting is only an issue for intervals higher on the stack. The most deeply nested intervals are trivial intervals. They are processed first. They have a df of 1 (line 8). For the remaining nontrivial intervals, stack_df contains the partial results for intervals in process. As the stack is popped, the df values are aggregated up to compute the df value for the outer intervals. The aggregation occurs on line 10 and the popping of the stack occurs on line 12. The aggregation step is illustrated in snapshots C and D of Figure 7 by the two arrows with the &amp;quot;+&amp;quot; combination symbol pointing at a value of df in an output statement.</Paragraph>
    </Section>
    <Section position="7" start_page="14" end_page="16" type="sub_section">
      <SectionTitle>
2.7 Class Arrays
</SectionTitle>
      <Paragraph position="0"> The classes identified by the previous calculation are stored in a data structure we call a class array, to make it relatively easy to look up the term frequency and document frequency for an arbitrary substring. The class array is a sorted list of five-tuples: (SIL, LBL, tf, df, longest suffix). The fifth element of the five-tuple is a canonical member of the class (the longest suffix). The five-tuples are sorted by the alphabetical order of the canonical members. In our C code implementation, classes are represented by five integers, one for each element in the five-tuple. Since there are N trivial classes and at most N - 1 nontrivial classes, the class array will require at most 10N - 5 integers.</Paragraph>
      <Paragraph position="1"> However, for many practical applications, the trivial classes can be omitted.</Paragraph>
      <Paragraph position="2"> Figure 8 shows an example of the nontrivial class array for the corpus: &amp;quot;to_be$or$not_to_be$&amp;quot;. The class array makes it relatively easy to determine that the substring &amp;quot;o&amp;quot; appears in all three documents. That is, df(&amp;quot;o&amp;quot;) = 3. We use a binary search to find that tuple c\[5\] is the relevant five-tuple for &amp;quot;o&amp;quot;. Having found the relevant tuple, it requires a simple record access to return the document frequency field.</Paragraph>
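      <Paragraph> A minimal C sketch of this lookup is given below for illustration (the struct layout and the function lookup_df are hypothetical renderings of the five-tuples, not the code described in this paper):
#include &lt;string.h&gt;

typedef struct {
    int  SIL, LBL, tf, df;
    char *suffix;               /* canonical member: the longest suffix */
} ClassTuple;

/* Return the df of substring t using the class array c[0..nclasses-1],
   sorted by the canonical member. If t occurs in the corpus but is not
   found here, it belongs to an omitted trivial class, so df = 1.       */
int lookup_df(ClassTuple *c, int nclasses, char *t) {
    int lo = 0, hi = nclasses, m = (int)strlen(t);
    while (lo &lt; hi) {           /* leftmost tuple whose member is not less than t */
        int mid = (lo + hi) / 2;
        if (strncmp(c[mid].suffix, t, m) &lt; 0) lo = mid + 1; else hi = mid;
    }
    /* scan the few tuples whose canonical member starts with t */
    while (lo &lt; nclasses) {
        if (strncmp(c[lo].suffix, t, m) != 0) break;
        if (m &gt; c[lo].LBL)
            if (m &lt;= c[lo].SIL) return c[lo].df;
        lo++;
    }
    return 1;
}
</Paragraph>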
      <Paragraph position="3"> Figure 7 Snapshots of the doclink array and the stack during the processing of print_LDIs_with_df on the corpus &amp;quot;to_be$or$not_to_be$&amp;quot;. The four snapshots A-D illustrate the state as j progresses from 16 to 18. Two nontrivial intervals are emitted while j is in this range: (17,18) and (16,18). The more deeply nested interval is emitted before the less deeply nested interval.</Paragraph>
    </Section>
    <Section position="8" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
3.1 RIDF and MI for English and Japanese
</SectionTitle>
      <Paragraph position="0"> We used the methods described above to compute df, tf, and RIDF for all substrings in two corpora of newspapers summarized in Table 1. MI was computed for the longest substring in each class. The entire computation took a few hours on a MIPS10000 with 16 Gbytes of main memory. The processing time was dominated by the calculation of the suffix array.</Paragraph>
      <Paragraph position="1"> The English collection consists of 50 million words (113 thousand articles) of the Wall Street Journal (distributed by the ACL/DCI) and the Japanese collection consists of 216 million characters (436 thousand articles) of the CD-Mainichi Shimbun from 1991-1995 (distributed in CD-ROM format). The English corpus was tokenized into words delimited by white space, whereas the Japanese corpus was tokenized into characters (typically two bytes each).</Paragraph>
      <Paragraph position="2"> Table 1 indicates that there are a large number of nontrivial classes in both corpora. The English corpus has more substrings per nontrivial class than the Japanese corpus. It has been noted elsewhere that the English corpus contains quite a few duplicated articles (Paul and Baker 1992). The duplicated articles could explain why there are so many substrings per nontrivial class in the English corpus when compared with the Japanese corpus.</Paragraph>
      <Paragraph position="3"> Table 1 Statistics of the English and Japanese corpora.</Paragraph>
      <Paragraph position="4"> Figure 9 The left panel plots MI as a function of the length of the n-gram; the right panel plots RIDF as a function of the length of the n-gram. Both panels were computed from the Japanese corpus. Note that while there is more dynamic range for shorter n-grams than for longer n-grams, there is plenty of dynamic range for n-grams well beyond bigrams and trigrams.
For subsequent processing, we excluded substrings with tf &lt; 10 to avoid noise, resulting in about 1.4 million classes (1.6 million substrings) for English and 10 million classes (15 million substrings) for Japanese. We computed RIDF and MI values for the longest substring in each of these 1.4 million English classes and 10 million Japanese classes. These values can be applied to the other substrings in these classes for RIDF, but not for MI. (As mentioned above, two substrings in the same class need not have the same MI value.)</Paragraph>
      <Paragraph position="6"> MI is generalized from bigrams to longer n-grams by comparing the frequency of xYz with what would be expected if xY and Yz combined independently: MI(xYz) = log_2 ( (tf(xYz) tf(Y)) / (tf(xY) tf(Yz)) ), where x and z are tokens, and Y and xYz are n-grams (sequences of tokens). When Y is the empty string, tf(Y) = N.</Paragraph>
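      <Paragraph> A minimal C sketch of this generalized MI, computed from the four term frequencies (the function mi is illustrative and assumes the formulation above; tf_Y should be set to N when Y is the empty string):
#include &lt;math.h&gt;

/* MI(xYz): compares tf(xYz) with what would be expected if xY and Yz
   combined independently, given the shared middle Y.                  */
double mi(double tf_xYz, double tf_xY, double tf_Yz, double tf_Y) {
    return log2((tf_xYz * tf_Y) / (tf_xY * tf_Yz));
}
</Paragraph>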
      <Paragraph position="7"> Figure 9 plots RIDF and MI values of 5,000 substrings randomly selected as a function of string length. In both cases, shorter substrings have more dynamic range. That is, RIDF and MI vary more for bigrams than million-grams. But there is considerable dynamic range for n-grams well beyond bigrams and trigrams.</Paragraph>
    </Section>
    <Section position="9" start_page="17" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Little Correlation between RIDF and MI
</SectionTitle>
      <Paragraph position="0"> We are interested in comparing and contrasting RIDF and MI. Figure 10 shows that RIDF and MI are largely independent. There is little if any correlation between the RIDF of a string and the MI of the same string. Panel (a) compares RIDF and MI for a sample of English word sequences from the WSJ corpus (excluding unigrams); panel (b) makes the same comparison but for Japanese phrases identified as keywords on the CD-ROM. In both cases, there are many substrings with a large RIDF value and a small MI, and vice versa.</Paragraph>
      <Paragraph position="2"> Figure 10 Both panels plot RIDF versus MI. Panel (a) plots RIDF and MI for a sample of English n-grams; panel (b) plots RIDF and MI for Japanese phrases identified as keywords on the CD-ROM. The right panel highlights the 10% highest RIDF and 10% lowest MI with a box, as well as the 10% lowest RIDF and the 10% highest MI. Arrows point to the boxes for clarity.</Paragraph>
      <Paragraph position="3"> We believe the two statistics are both useful but in different ways. Both pick out interesting n-grams, but n-grams with large MI are interesting in different ways from n-grams with large RIDF. Consider the English word sequences in Table 2, which all contain the word having. These sequences have large MI values and small RIDF values.</Paragraph>
      <Paragraph position="4"> In our collaboration with lexicographers, especially those working on dictionaries for learners, we have found considerable interest in statistics such as MI that pick out these kinds of phrases. Collocations can be quite challenging for nonnative speakers of the language. On the other hand, these kinds of phrases are not very good keywords for information retrieval.</Paragraph>
      <Paragraph position="5"> Table 2 English word sequences containing the word having. Note that these phrases have large MI and low RIDF. They tend to be more interesting for lexicography than information retrieval. The table is sorted by MI.</Paragraph>
      <Paragraph position="6"> Table 3 English word sequences containing the word Mr. (sorted by RIDF). The word sequences near the top of the list are better keywords than the sequences near the bottom of the list. None of them are of much interest to lexicography.</Paragraph>
      <Paragraph position="7">  Table 3 shows MI and RIDF values for a sample of word sequences containing the word Mr. The table is sorted by RIDF. The sequences near the top of the list are better keywords than the sequences further down. None of these sequences would be of much interest to a lexicographer (unless he or she were studying names). Many of the sequences have rather small MI values.</Paragraph>
      <Paragraph position="8"> Table 4 shows a few word sequences starting with the word the with large MI values. All of these sequences have high MI (by construction), but some are high in RIDF as well (labeled B), and some are not (labeled A). Most of the sequences are interesting in one way or another, but the A sequences are different from the B sequences. The A sequences would be of more interest to someone studying the grammar in the WSJ subdomain, whereas the B sequences would be of more interest to someone studying the terminology in this subdomain. The B sequences in Table 4 tend to pick out specific events in the news, if not specific stories. The phrase, the Basic Law, for example, picks out a pair of stories that discuss the event of the handover of Hong Kong to China, as illustrated in the concordance shown in Table 5.</Paragraph>
      <Paragraph position="9"> Table 6 shows a number of word sequences with high MI containing common prepositions. The high MI indicates an interesting association, but again most have low RIDF and are not particularly good keywords, though there are a few exceptions (Just for Men, a well-known brand name, has a high RIDF and is a good keyword). The Japanese substrings are similar to the English substrings. Substrings with high RIDF pick out specific documents (and/or events) and therefore tend to be relatively good keywords. Substrings with high MI have nonindependent distributions (if not noncompositional semantics), and are therefore likely to be interesting to a lexicographer or linguist. Substrings that are high in both are more likely to be meaningful units (words or phrases) than substrings that are high in just one or the other. Meaningless fragments tend to be low in both MI and RIDF. We grouped the Japanese classes into nine cells depending on whether the RIDF was in the top 10%, the bottom 10%, or in between, and whether the MI was in the top 10%, the bottom 10%, or in between. Substrings in the top 10% in both RIDF and MI tend to be meaningful words such as (in English translation) merger, stock certificate, dictionary, wireless, and so on. Substrings in the bottom 10% in both RIDF and MI tend to be meaningless fragments, or straightforward compositional combinations of words such as current regular-season game. Table 7 shows examples where MI and RIDF point in opposite directions (see highlighted rectangles in panel (b) of Figure 10).</Paragraph>
      <Paragraph position="10"> We have observed previously that MI is high for general vocabulary (words found in dictionary) and RIDF is high for names, technical terminology, and good keywords for information retrieval. Table 7 suggests an intriguing pattern. Japanese uses different character sets for general vocabulary and loan words. Words that are high in MI tend to use the general vocabulary character sets (hiragana and kanji) whereas words that are high in RIDF tend to use the loan word character sets (katakana and English). (There is an important exception, though, for names, which will be discussed in the next subsection.) The character sets largely reflect the history of the language. Japanese uses four character sets (Shibatani 1990). Typically, functional words of Japanese origin are written in hiragana. Words that were borrowed from Chinese many hundreds of years ago are written in kanji. Loan words borrowed more recently from Western languages are written in katakana. Truly foreign words are written in the English character set (also known as romaji). We were pleasantly surprised to discover that MI and RIDF were distinguishing substrings on the basis of these character set distinctions.</Paragraph>
      <Paragraph position="11">  Computational Linguistics Volume 27, Number 1 Table 5 Concordance of the phrase the Basic Law. Note that most of the instances of the Basic Law appear in just two stories, as indicated by the doc-id (the token-id of the first word in the document). token-id left context right context doc-id 2229521: line in the drafting of the Basic Law that will determine how Hon 2228648 2229902: s policy as expressed in the Basic Law - as Gov. Wilson's debut s 2228648 9746758: he U.S. Constitution and the Basic Law of the Federal Republic of 9746014 11824764: any changes must follow the Basic Law, Hong Kong's miniconstitut 11824269 33007637: sts a tentative draft of the Basic Law, and although this may be 33007425 33007720: the relationship between the Basic Law and the Chinese Constitutio 33007425 33007729: onstitution. Originally the Basic Law was to deal with this topic 33007425 33007945: wer of interpretation of the Basic Law shall be vested in the NPC 33007425 33007975: tation of a provision of the Basic Law, the courts of the HKSAR { 33007425 33008031: interpret provisions of the Basic Law. If a case involves the in 33007425 33008045: tation of a provision of the Basic Law concerning defense, foreig 33007425 33008115: etation of an article of the Basic Law regarding &amp;quot; defense, forei 33007425 33008205: nland representatives of the Basic Law Drafting Committee fear tha 33007425 33008398: e : Mainland drafters of the Basic Law simply do not appreciate th 33007425 33008488: pret all the articles of the Basic Law. While recognizing that th 33007425 33008506: y and power to interpret the Basic Law, it should irrevocably del 33007425 33008521: pret those provisions of the Basic Law within the scope of Hong Ko 33007425 33008545: r the tentative draft of the Basic Law, I cannot help but conclud 33007425 33008690: d of being guaranteed by the Basic Law, are being redefined out o 33007425 33008712: uncilor, is a member of the Basic Law Drafting Committee. 33007425 39020313: sts a tentative draft of the Basic Law, and although this may be 39020101 39020396: the relationship between the Basic Law and the Chinese Constitutio 39020101 39020405: onstitution. Originally the Basic Law was to deal with this topic 39020101 39020621: wer of interpretation of the Basic Law shall be vested in the NPC 39020101 39020651: tation of a provision of the Basic Law, the courts of the HKSAR { 39020101 39020707: interpret provisions of the Basic Law . If a case involves the in 39020101 39020721: tation of a provision of the Basic Law concerning defense, foreig 39020101 39020791: etation of an article of the Basic Law regarding &amp;quot; defense, forei 39020101 39020881: nland representatives of the Basic Law Drafting Committee fear tha 39020101 39021074: e : Mainland drafters of the Basic Law simply do not appreciate th 39020101 39021164: pret all the articles of the Basic Law. While recognizing that th 39020101 39021182: y and power to interpret the Basic Law, it should irrevocably del 39020101 39021197: pret those provisions of the Basic Law within the scope of Hong Ko 39020101 39021221: r the tentative draft of the Basic Law, I cannot help but conclud 39020101 39021366: d of being guaranteed by the Basic Law, are being redefined out o 39020101 39021388: uncilor, is a member of the Basic Law Drafting Committee. 39020101</Paragraph>
    </Section>
    <Section position="10" start_page="21" end_page="24" type="sub_section">
      <SectionTitle>
3.3 Names
</SectionTitle>
      <Paragraph position="0"> As mentioned above, names are an important exception to the rule that kanji (Chinese characters) are used for general vocabulary (words found in the dictionary) that were borrowed hundreds of years ago and katakana characters are used for more recent loan words (such as technical terminology). As illustrated in Table 7, kanji are also used for the names of Japanese people and katakana are used for the names of people from other countries.</Paragraph>
      <Paragraph position="1"> Names are quite different in English and Japanese. Figure 11 shows a striking contrast in the distributions of MI and RIDF values. MI has a more compact distribution in English than Japanese. Japanese names cluster into two groups, but English names do not.</Paragraph>
      <Paragraph position="2"> The names shown in Figure 11 were collected using a simple set of heuristics. For English, we selected substrings starting with the titles Mr., Ms., or Dr. For Japanese, we selected keywords (as identified by the CD-ROM) ending with the special character  (-shi), which is roughly the equivalent of the English titles Mr. and Ms. In both cases, phrases were required to have tf &gt; 10. 3 3 This procedure produced the interesting substring, Mr. From, where both words would normally appear on a stop list. This name has a large RIDF. (The MI, though, is small because the parts are so high in frequency.)  Examples of keywords with extreme values of RIDF and MI that point in opposite directions. The top half (high RIDF and low MI) tends to have more loan words, largely written in katakana and English. The bottom half (low RIDF and high MI) tends to have more general vocabulary, largely written in Chinese kanji.</Paragraph>
      <Paragraph position="3">  The English names have a sharp cutoff around MI = 7 due in large part to the title Mr. MI('Mr.', x) = log 2 ~N -- ldegg2 tf('Mr.',x)tf(x) ---- 7.4 - log 2 ~'tf(x)' Since log 2 ~tf(x) is a small positive number, typically 0-3, MI('Mr',x) &lt; 7.4. Names generally have RIDF values ranging from practically nothing (for common names like Jones) to extremely large values for excellent keywords. The Japanese names, however, cluster into two groups, those with RIDF above 0.5, and those with RIDF below 0.5. The separation above and below RIDF = 0.5, we believe, is a reflec- null Yamamoto and Church Term Frequency and Document Frequency for All Substrings tion of the well-known distinction between new information and given information in discourse structure. It is common in both English and Japanese, for the first mention of a name in a news article to describe the name in more detail than subsequent uses. In English, for example, terms like spokesman or spokeswoman and appositives are quite common for the first use of a name, and less so, for subsequent uses. In Japanese, the pattern appears to be even more rigid than in English. The first use will very often list the full name (first name plus last name), unlike subsequent uses, which almost always omit the first name. As a consequence, the last name exhibits a large range of RIDF values, as in English, but the full name will usually (90%) fall below the RIDF = 0.5 threshold. The MI values have a broader range as well, depending on the compositionality of the name.</Paragraph>
      <Paragraph position="4"> To summarize, RIDF and MI can be used to identify a number of interesting similarities and differences in the use of names. Names are interestingly different from general vocabulary. Many names are very good keywords and have large RIDE General vocabulary tends to have large MI. Although we observe this basic pattern over both English and Japanese, names bring up some interesting differences between the two languages such as the tendency for Japanese names to fall into two groups separated by the RIDF = 0.5 threshold.</Paragraph>
    </Section>
    <Section position="11" start_page="24" end_page="24" type="sub_section">
      <SectionTitle>
3.4 Word Extraction
</SectionTitle>
      <Paragraph position="0"> RIDF and MI may be useful for word extraction. In many languages such as Chinese, Japanese, and Thai, word extraction is not an easy task, because, unlike English, many of these languages do not use delimiters between words. Automatic word extraction can be applied to the task of dictionary maintenance. Since most NLP applications (including word segmentation for these languages) are dictionary based, word extraction is very important for these languages. Nagao and Mori (1994) and Nagata (1996) proposed n-gram methods for Japanese. Sproat and Shih (1990) found MI to be useful for word extraction in Chinese.</Paragraph>
      <Paragraph position="1"> We performed the following simple experiment to see if both MI and RIDF could be useful for word extraction in Japanese. We extracted four random samples of 100 substrings each. The four samples cover all four combinations of high and low RIDF and high and low MI, where high is defined to be in the top 10% and low is defined to be in the bottom 10%. Then we manually scored each sample substring using our own subjective judgment. Substrings were labeled &amp;quot;good&amp;quot; (the substring is a word), &amp;quot;bad&amp;quot; (the substring is not a word), or &amp;quot;gray&amp;quot; (the judge is not sure). The results are presented in Table 8, which shows that substrings with high scores in both dimensions are more likely to be words than substrings that score high in just one dimension.</Paragraph>
      <Paragraph position="2"> Substrings with low scores in both dimensions are very unlikely to be words. These results demonstrate plausibility for the use of multiple statistics. The approach could be combined with other methods in the literature such as Kita et al. (1994) to produce a more practical system. In any case, automatic word extraction is not an easy task for Japanese (Nagata 1996).</Paragraph>
    </Section>
  </Section>
</Paper>