<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-4003">
  <Title>Fast Approximate Search in Large Dictionaries</Title>
  <Section position="7" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, k).</Paragraph>
    <Paragraph position="1"> Remark 1 There is a direct relationship between the entries in column j of the dynamic programming table T</Paragraph>
  </Section>
  <Section position="8" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, T) and the set of active states of A</Paragraph>
    <Paragraph position="2"> (P, T) has the value h [?] k iff h is the exponent of the bottom-most active state in the ith column of A</Paragraph>
  </Section>
  <Section position="9" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, k). For example, in Figure 3, the set of active states of A</Paragraph>
  </Section>
  <Section position="10" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="11" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (chold, 2) for approximate search with pattern chold and distance bound k = 2. Active states after symbols t and h have been read are highlighted.  Nondeterministic automaton A(chold, 2) for testing Levenshtein distance with bound k = 2 for pattern chold. Triangular areas are highlighted. Dark states are active after symbols h and c have been read.</Paragraph>
    <Paragraph position="1"> The direct use of the nondeterministic automaton A</Paragraph>
  </Section>
  <Section position="12" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, k) for conducting approximate searches is inefficient. Furthermore, depending on the length m of the pattern and the error bound k, the explicit construction and storage of a deterministic</Paragraph>
  </Section>
  <Section position="13" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
version of A
AST
</SectionTitle>
    <Paragraph position="0"> (P, k) might be difficult or impossible. In practice, simulation of determinism via bit-parallel computation of sets of active states gives rise to efficient and flexible algorithms. See Navarro (2001) and Navarro and Raffinot (2002) for surveys of algorithms along this line.</Paragraph>
  </Section>
  <Section position="14" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
4. Testing Levenshtein Neighborhood with Universal Deterministic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
Levenshtein Automata
</SectionTitle>
      <Paragraph position="0"> In our approach, approximate search of a pattern P in a dictionary D is traced back to the problem of deciding whether the Levenshtein distance between P and an entry W of D exceeds a given bound k. A well-known method for solving this problem is based on a nondeterministic automaton A(P, k) similar to A</Paragraph>
    </Section>
  </Section>
  <Section position="15" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, k). A string W is accepted by A(P, k) iff d L (P, W) [?] k. The automaton A(P, k) does not have the initial S loop that is needed in A AST (P, k) to traverse the text. The automaton for pattern chold and distance bound k = 2 is shown in Figure 4. Columns of A(P, k) with numbers 0,..., m = |P |are defined as for A</Paragraph>
  </Section>
  <Section position="16" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
AST
</SectionTitle>
    <Paragraph position="0"> (P, k).InA(P, k), we use as final states all states q from which we can reach one of the states in column m using a (possibly empty) sequence of epsilon1-transitions. The reason for this modification--which obviously does not change the set of accepted words--will become apparent later.</Paragraph>
    <Paragraph position="1"> We now show that for fixed small error bounds k, the explicit computation of A(P, k), in a deterministic or nondeterministic variant, can be completely avoided. In our approach, pattern P and entry W = w  Mihov and Schulz Fast Approximate Search in Large Dictionaries appears to be impossible. For defining states, input vectors and transitions of A</Paragraph>
    <Paragraph position="3"> Let P denote a pattern of length m. The triangular area of a state p of A(P, k) consists of all states q of A(P, k) that can be reached from p using a (potentially empty) sequence of u upward transitions and, in addition, h [?] u horizontal or reverse (i.e., leftward) horizontal transitions. Let 0 [?] i [?] m.Bytriangular area i, we mean the triangular area of state i</Paragraph>
    <Paragraph position="5"> For example, in Figure 4, triangular areas 0,...,7 of A(chold, 2) are shown.</Paragraph>
    <Paragraph position="6"> In Remark 1, we pointed to the relationship between the entries in column i of table  , a subset of the triangular area i [?] 1); 2. on the characteristic vector vectorkh(w</Paragraph>
    <Paragraph position="8"> The following description of A  [?] (k) proceeds in three steps that introduce, in order, input vectors, states, and the transition function. States and transition function are described informally.</Paragraph>
    <Paragraph position="9"> 1. Input vectors. Basically we want to use the vectors kh(w</Paragraph>
    <Paragraph position="11"> ), which are of length [?] 2k + 1, as input for A [?] (k). For technical reasons, we introduce two modifications. First, in order to standardize the length of the characteristic vectors that are obtained for the initial symbols w  ,..., we define p</Paragraph>
    <Paragraph position="13"> In other words, we attach to P a new prefix with k symbols $. Here $ is a new symbol that does not occur in W. Second imagine that we get to triangular area i after reading the ith letter w i (cf. Remark 2). As long as i [?] m [?] k [?] 1, we know that we cannot reach a triangular area containing final states after reading w i . In order to encode  Computational Linguistics Volume 30, Number 4 this information in the input vectors, we enlarge the relevant subword of P for input</Paragraph>
    <Paragraph position="15"> is 2k+2; for i = m[?]k (respectively, m[?]k+1,..., m,..., m+k), the length of vectorkh</Paragraph>
    <Paragraph position="17"> m[?]i . Here the symbol  |denotes bitwise OR. Once we have obtained the values vectork(w i ), which are represented as arrays, the vectors vectorkh</Paragraph>
    <Paragraph position="19"> ) can be accessed in constant time.</Paragraph>
    <Paragraph position="20">  2. States. Henceforth, states of automata A(P, k) will be called positions. Recall that a position is given by a base number and an exponent e,0 [?] e [?] k represent null ing the error count. By a symbolic triangular area, we mean a triangular area in which &amp;quot;explicit&amp;quot; base numbers (like 1, 2,...) in positions are replaced by &amp;quot;symbolic&amp;quot; base numbers of a form described below. Two kinds of symbolic triangular areas are used. A unique &amp;quot;I-area&amp;quot; represents all triangular areas of automata A(P, k) that do not contain final positions. The &amp;quot;integer variable&amp;quot; I is used to abstract from possible base numbers i,0[?] i [?] m [?] k [?] 1. Furthermore, k + 1&amp;quot;M-areas&amp;quot; are used to represent triangular areas of automata A(P, k) that contain final positions. Variable M is meant to abstract from concrete values of m, which differ for distinct P. Symbolic base numbers are expressions of the form I, I + 1, I [?] 1, I + 2, I [?] 2... (Iareas) or M, M [?] 1, M [?] 2,... (M-areas). The elements of the symbolic areas, which are called symbolic positions, are symbolic base numbers together with exponents indicating an error count. Details should become clear in Example 2. The use of expressions such as (I + 2)  simply enables a convenient labeling of states of A [?] (k) (cf. Figure 6). Using this kind of labeling, it is easy to formulate a correspondence between derivations in automata A(P, k) and in A [?] (k) (cf. properties C1 and C2 discussed below).</Paragraph>
    <Paragraph position="21">  } is the start state. A special technique is used to reduce the number of states. Returning to automata of the form A(P, k), it is simple to see that triangular areas often contain positions p = g</Paragraph>
    <Paragraph position="23"> where p &amp;quot;subsumes&amp;quot; q in the following sense: If, for some fixed input rest U,it is possible to reach a final position of A(P, k) starting from q and consuming U, then we may also reach a final position starting from p using U. A corresponding notion of subsumption can be defined for symbolic positions. States of A [?] (k) are then defined as subsets of symbolic triangular areas that are free of subsumption in the sense that a symbolic position of a state is never subsumed by another position of the same state. Example 3 The states of automaton A [?] (1) are shown in Figure 6. As a result of the above reduction technique, the only state containing the symbolic position I</Paragraph>
    <Paragraph position="25"> denote the set of active positions of A(P, k) that are reached after reading the ith symbol w</Paragraph>
    <Paragraph position="27"> simplicity, we assume that in each set, all subsumed positions are erased. In A [?] (k) we have a parallel acceptance procedure in which we reach, say, state S</Paragraph>
    <Paragraph position="29"> ), where r = min{m, i + k + 1}, as above, for 1 [?] i [?] n.</Paragraph>
    <Paragraph position="30"> Transitions are defined in such a way that C1 and C2 hold: C1. For all parallel sets S</Paragraph>
    <Paragraph position="32"> (2) has 50 nonfinal states and 40 final states. The automaton A [?] (3) has 563 states. When we tried  Mihov and Schulz Fast Approximate Search in Large Dictionaries to minimize the automata A  (3), we found that these three automata are already minimal. However, we do not have a general proof that our construction always leads to minimal automata.</Paragraph>
  </Section>
  <Section position="17" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
5. Approximate Search in Dictionaries Using Universal Levenshtein Automata
</SectionTitle>
    <Paragraph position="0"> We now describe how to use the universal deterministic Levenshtein automaton A</Paragraph>
    <Paragraph position="2"> for approximate search for a pattern in a dictionary.</Paragraph>
    <Section position="1" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
5.1 Basic Correction Algorithm
</SectionTitle>
      <Paragraph position="0"> Let D denote the background dictionary, and let P = p</Paragraph>
      <Paragraph position="2"> denote a given pattern.</Paragraph>
      <Paragraph position="3"> Recall that we want to compute for some fixed bound k the set of all entries W [?] D such that d  Computational Linguistics Volume 30, Number 4 the longest word in the dictionary, then in general (e.g., for the empty input word), the algorithm will result in a complete traversal of A D . In practice, small bounds are used, and only a small portion of A D will be visited. For bound 0, the algorithm validates in time O(|P|) if the input pattern P is in the dictionary.</Paragraph>
    </Section>
    <Section position="2" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
5.2 Evaluation Results for Basic Correction Algorithm
</SectionTitle>
      <Paragraph position="0"> Experimental results were obtained using a Bulgarian lexicon (BL) with 965, 339 word entries (average length 10.23 symbols), a German dictionary (GL) with 3, 871, 605 entries (dominated by compound nouns, average length 18.74 symbols), and a &amp;quot;lexicon&amp;quot; (TL) containing 1, 200, 073 bibliographic titles from the Bavarian National Library (average length 47.64 symbols). The German dictionary and the title dictionary are nonpublic. They were provided to us by Franz Guenthner and the Bavarian National Library, respectively, for the tests we conducted. The following table summarizes the dictionary automaton statistics for the three dictionaries:  line. Let a garbled word W be given. In order to find all words from the dictionary within Levenshtein distance k, we can use two simple methods:  dictionary word V using the universal Levenshtein automaton can be estimated as 1 us (a crude approximation). When using Method 2, we need about 1 us for the dictionary lookup of a word with 10 symbols. Assume that the alphabet has 30 symbols. Given the input W, we have 639 strings within Levenshtein distance 1, about 400,000 strings within distance 2, and about 260,000,000 strings within distance 3. Assuming that the dictionary has 1,000,000 words, we get the following table of correction times:  lexicon, we used a Bulgarian word list containing randomly introduced errors. In each word, we introduced between zero and four randomly selected symbol substitutions,  Mihov and Schulz Fast Approximate Search in Large Dictionaries insertions, or deletions. The number of test words created for each length is shown in the following table:  Table 1 lists the results of the basic correction algorithm using BL and standard Levenshtein distance with bounds k = 1, 2, 3. Column 1 shows the length of the input words. Column 2 (CT1) describes the average time needed for the parallel traversal of the dictionary automaton and the universal Levenshtein automaton using Levenshtein distance 1. The time needed to output the correction candidates is always included; hence the column represents the total correction time. Column 3 (NC1) shows the average number of correction candidates (dictionary words within the given distance bound) per input word. (For k = 1, there are cases in which this number is below 1. This shows that for some of the test words, no candidates were returned: These words were too seriously corrupted for correction suggestions to be found within the given distance bound.) Similarly Columns 4 (CT2) and 6 (CT3) yield, respectively, the total correction times per word (averages) for distance bounds 2 and 3, and Columns 5 (NC2) and 7 (NC3) yield, respectively, the average number of correction candidates per word for distance bounds 2 and 3. Again, the time needed to output all corrections is included.</Paragraph>
      <Paragraph position="1"> 5.2.3 Correction with GL. To test the correction times when using the German lexicon, we again created a word list with randomly introduced errors. The number of test words of each particular length is shown in the following table: Length 1-14 15-24 25-34 35-44 45-54 55-64 sharp words 100,000 100,000 100,000 9,776 995 514 The average correction times and number of correction candidates for GL are summarized in Table 2, which has the same arrangement of columns (with corresponding interpretations) as Table 1.</Paragraph>
      <Paragraph position="2"> 5.2.4 Correction with TL. To test the correction times when using the title &amp;quot;lexicon,&amp;quot; we again created a word list with randomly introduced errors. The number of test words of each length is presented in the following table: Length 1-14 15-24 25-34 35-44 45-54 55-64 sharp words 91,767 244,449 215,094 163,425 121,665 80,765 Table 3 lists the results for correction with TL and standard Levenshtein distance with bounds k = 1, 2, 3. The arrangement of columns is the same as for Table 1, with corresponding interpretations.</Paragraph>
      <Paragraph position="3"> 5.2.5 Summary. For each of the three dictionaries, evaluation times strongly depend on the tolerated number of edit operations. When fixing a distance bound, the length of the input word does not have a significant influence. In many cases, correction works faster for long input words, because the number of correction candidates decreases. The large number of entries in GL leads to increased correction times.</Paragraph>
      <Paragraph position="4">  Mihov and Schulz Fast Approximate Search in Large Dictionaries</Paragraph>
    </Section>
  </Section>
  <Section position="18" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
6. Using Backwards Dictionaries for Filtering
</SectionTitle>
    <Paragraph position="0"> In the related area of pattern matching in strings, various filtering methods have been introduced that help to find portions of a given text in which an approximate match of a given pattern P is not possible. (See Navarro [2001] and Navarro and Raffinot [2002] for surveys). In this section, we show how one general method of this form (Wu and Manber 1992; Myers 1994; Baeza-Yates and Navarro 1999; Navarro and Baeza-Yates 1999) can be adapted to approximate search in a dictionary, improving the basic correction algorithm.</Paragraph>
    <Paragraph position="1"> For approximate text search, the crucial observation is the following: If the Levenshtein distance between a pattern P and a portion of text T prime does not exceed a given bound k, and if we cut P into k + 1 disjoint pieces P  }, which is much faster than approximate search for P.</Paragraph>
    <Paragraph position="2"> When finding one of the pieces P</Paragraph>
    <Paragraph position="4"> in the text, the full pattern P is searched for (returning now to approximate search) within a small neighborhood around the occurrence. Generalizations of this idea rely on the following lemma (Myers 1994; Baeza-Yates and  In our experiments, which were limited to distance bounds k = 1, 2, 3, we used the following three instances of the general idea. Let P denote an input pattern, and let W denote an entry of the dictionary D. Assume we cut P into two pieces, representing it in the form P = P  Computational Linguistics Volume 30, Number 4 In order to make use of these observations, we compute, given dictionary D, the back- null , as described in Section 5. The sequence of all transition labels of the actual paths in A D is stored as usual. Whenever we reach a pair of final states, the current sequence--which includes  ) [?] 2. Conversely, any dictionary word of this form is found using a search of this form. A closer look at the final states that are respectively reached in A  ) [?] 1. Conversely, any word in the dictionary of this form is found using a search of this form. A closer look at the final states that are reached in A  of the form described in cases (1a)-(1d).</Paragraph>
    <Section position="1" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
6.1 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> The following table summarizes the statistics of the automata for the three backwards dictionaries:  Note that the size of the backwards-dictionary automata is approximately the same as the size of the dictionary automata.</Paragraph>
      <Paragraph position="1"> Tables 4, 5, and 6 present the evaluation results for the backwards dictionary filtering method using dictionaries BL, GL, and TL, respectively. We have constructed additional automata for the backwards dictionaries.</Paragraph>
      <Paragraph position="2"> For the tests, we used the same lists of input words as in Section 5.2 in order to allow a direct comparison to the basic correction method. Dashes indicate that the correction times were too small to be measured with sufficient confidence in their level of precision. In columns 3, 5, and 7, we quantify the speedup factor, that is, the ratio of the time taken by the basic algorithm to that taken by the backwards-dictionary filtering method.</Paragraph>
    </Section>
    <Section position="2" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
6.2 Backwards-Dictionary Method for Levenshtein Distance with Transpositions
</SectionTitle>
      <Paragraph position="0"> Universal Levenshtein automata can also be constructed for the modified Levenshtein distance, in which character transpositions count as a primitive edit operation, along with insertions, deletions, and substitutions. This kind of distance is preferable when correcting typing errors. A generalization of the techniques presented by the authors (Schulz and Mihov 2002) for modified Levenshtein distances--using either transpositions or merges and splits as additional edit operations--has been described in Schulz and Mihov (2001). It is assumed that all edit operations are applied in parallel, which implies, for example, that insertions between transposed letters are not possible.</Paragraph>
      <Paragraph position="1"> If we want to apply the filtering method using backwards dictionaries for the modified Levenshtein distance d  Evaluation results using the backwards-dictionary filtering method, Bulgarian dictionary, and distance bounds k = 1, 2, 3. Times in milliseconds and speedup factors (ratio of times) with respect to basic algorithm.</Paragraph>
      <Paragraph position="2">  The cases for distance bounds k = 1, 2 are solved using similar extensions of the original subcase analysis. In each case, it is straightforward to realize a search procedure with subsearches corresponding to the new subcase analysis, using an ordinary dictionary and a backwards dictionary, generalizing the above ideas.</Paragraph>
      <Paragraph position="3">  proves correction times. The increase in speed depends both on the length of the input word and on the error bound. The method works particularly well for long input words. For GL, a drastic improvement can be observed for all subclasses. In contrast, for very short words of BL, only a modest improvement is obtained. When using BL and the modified Levenshtein distance d prime L with transpositions, the backwards-dictionary method improved the basic search method only for words of length [?] 9. For short words, a large number of repetitions of the same correction candidates was observed. The analysis of this problem is a point of future work.</Paragraph>
      <Paragraph position="4"> Variants of the backwards-dictionary method also can be used for the Levenshtein distance d primeprime L , in which insertions, deletions, substitutions, merges, and splits are treated as primitive edit operations. Here, the idea is to split the pattern at two neighboring positions, which doubles the number of subsearches. We did not evaluate this variant. 7. Using Dictionaries with Single Deletions for Filtering The final technique that we describe here is again an adaptation of a filtering method from pattern matching in strings (Muth and Manber 1996; Navarro and Raffinot 2002). When restricted to the error bound k = 1, this method is very efficient. It can be used only for finite dictionaries. Assume that the pattern P = p  Computational Linguistics Volume 30, Number 4 This fact can be used to compute m+1 derivatives of P that are compared with similar derivatives of a window T prime of length m that is slid over the text. A derivative of a word V can be V or a word that is obtained by deleting exactly one letter of V. Coincidence between derivatives of P and T prime can be used to detect approximate matches of P of the above form. For details we refer to Navarro and Raffinot (2002). In what follows we describe an adaptation of the method to approximate search of a pattern P in a dictionary D.</Paragraph>
      <Paragraph position="5"> Let i be an integer. With V [i] we denote the word that is obtained from a word V by deleting the ith symbol of V. For |V |&lt; i, we define V [i] = V.Byadictionary with output sets, we mean a list of strings W, each of which is associated with a set of output strings O(W). Each string W is called a key. Starting from the conventional dictionary D, we compute the following dictionaries with output sets D  (P, W) [?] 1. It should be noted that the output sets are not necessarily disjoint. For example, if P itself is a dictionary entry, then P [?] O</Paragraph>
      <Paragraph position="7"> all 1 [?] i [?]|P|.</Paragraph>
      <Paragraph position="8"> After we implemented the above procedure for approximate search, we found that a similar approach based on hashing had been described as early as 1981 in a technical report by Mor and Fraenkel (1981).</Paragraph>
    </Section>
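    <Paragraph position="2b"> A compact sketch of the scheme (ours): for brevity, the position-indexed dictionaries of the text are merged here into a single index, which is why a final verification with the plain distance is added; keeping one index per deletion position, as described above, makes that last filter unnecessary.</Paragraph>
    <Paragraph position="2c">
from collections import defaultdict

def lev(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_index(dictionary):
    # Keys: each word W and every W[i] (W with its ith symbol deleted);
    # the output set of a key records which entries produced it.
    index = defaultdict(set)
    for w in dictionary:
        index[w].add(w)
        for i in range(len(w)):
            index[w[:i] + w[i + 1:]].add(w)
    return index

def lookup1(pattern, index):
    # Compare the derivatives of P with the keys; the final check is
    # needed because one merged index also matches e.g. transposed pairs.
    raw = set(index.get(pattern, ()))
    for i in range(len(pattern)):
        raw |= index.get(pattern[:i] + pattern[i + 1:], set())
    return {w for w in raw if lev(pattern, w) <= 1}

index = build_index(["child", "chold", "cold", "hold"])
print(sorted(lookup1("chold", index)))   # ['child', 'chold', 'cold', 'hold']
</Paragraph>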
    <Section position="3" start_page="4333212" end_page="4333212" type="sub_section">
      <SectionTitle>
7.1 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> Table 8 presents the evaluation results for edit distance 1 using dictionaries with single deletions obtained from BL. The total size of the constructed single-deletion dictionary automata is 34.691 megabytes. The word lists used for tests are those described in Section 5.2. GL and TL are not considered here, since the complete system of subdictionaries needed turned out to be too large. For a small range of input words of length 3-6, filtering using dictionaries with single deletions behaves better than filtering using the backwards-dictionary method.</Paragraph>
    </Section>
  </Section>
  <Section position="19" start_page="4333212" end_page="4333212" type="metho">
    <SectionTitle>
8. Similarity Keys
</SectionTitle>
    <Paragraph position="0"> A well-known technique for improving lexical search not mentioned so far is the use of similarity keys. A similarity key is a mapping k that assigns to each word W a simplified representation k(W). Similarity keys are used to group dictionaries into classes  Mihov and Schulz Fast Approximate Search in Large Dictionaries of &amp;quot;similar&amp;quot; entries. Many concrete notions of &amp;quot;similarity&amp;quot; have been considered, depending on the application domain. Examples are phonetic similarity (e.g., SOUNDEX system; cf. Odell and Russell [1918, 1922] and Davidson [1962]), similarity in terms of word shape and geometric form (e.g., &amp;quot;envelope representation&amp;quot; [Sinha 1990; Anigbogu and Belaid 1995] ) or similarity under n-gram analysis (Angell, Freund, and Willett 1983; Owolabi and McGregor 1988). In order to search for a pattern P in the dictionary, the &amp;quot;code&amp;quot; k(P) is computed. The dictionary is organized in such a way that we may efficiently retrieve all regions containing entries with code (similar to) k(P). As a result, only small parts of the dictionary must be visited, which speeds up search. Many variants of this basic idea have been discussed in the literature (Kukich 1992; Zobel and Dart 1995; de Bertrand de Beuvron and Trigano 1995).</Paragraph>
    <Paragraph position="1"> In our own experiments we first considered the following simple idea. Given a similarity key k, each entry W of dictionary D is equipped with an additional prefix of the form k(W)&amp;. Here &amp; is a special symbol that marks the border between codes and original words. The enhanced dictionary  . We distinguish two phases in the backtracking process. In Phase 1, which ends when the special symbol &amp; is read, we compute an initial path of A ^ D in which the corresponding sequence of transition labels represents a code a such that d  are translated into characteristic vectors. In order to guarantee completeness of the method, the distance between codes of a pair of words should not exceed the distance between the words themselves.</Paragraph>
    <Paragraph position="2"> It is simple to see that in this method, the backtracking search is automatically restricted to the subset of all dictionary entries V such that d L (k(V),k(P)) [?] k. Unfortunately, despite this, the approach does not lead to reduced search times. A closer look at the structure of (conventional) dictionary automata A D for large dictionaries D shows that there exists an enormous number of distinct initial paths of A D of length 3-5. During the controlled traversal of A D , most of the search time is spent visiting paths of this initial &amp;quot;wall.&amp;quot; Clearly, most of these paths do not lead to any correction candidate. Unfortunately, however, these &amp;quot;blind&amp;quot; paths are recognized too late. Using the basic method described in Section 5, we have to overcome one single wall in A D for the whole dictionary. In contrast, when integrating similarity keys in the above form, we have to traverse a similar wall for the subdictionary D k(W) := {V [?] D  |k(V)= k(W)} for each code k(W) found in Phase 1. Even if the sets D k(W) are usually much smaller than D, the larger number of walls that are visited leads to increased traversal times.</Paragraph>
    <Paragraph position="3"> As an alternative, we tested a method in which we attached to each entry W of D all prefixes of the form a&amp;, where a represents a possible code such that d L (k(W),a) [?] k.</Paragraph>
    <Paragraph position="4"> Using a procedure similar to the one described above, we have to traverse only one wall in Phase 2. With this method, we obtained a reduction in search time. However, with this approach, enhanced dictionaries ^ D are typically much larger than original dictionaries D. Hence the method can be used only if both dictionary D and bound k are not too large and if the key is not too fine. Since the method is not more efficient than filtering using backwards dictionaries, evaluation results are not presented here.</Paragraph>
  </Section>
class="xml-element"></Paper>