<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-4003">
  <Title>Fast Approximate Search in Large Dictionaries</Title>
  <Section position="20" start_page="4333212" end_page="4333212" type="concl">
    <SectionTitle>
9. Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> In this article, we have shown how filtering methods can be used to improve finite-state techniques for approximate search in large dictionaries. As a central contribution we introduced a new correction method, filtering based on backwards dictionaries and partitioned input patterns. Though this method generally leads to very short correction times, we believe that correction times could possibly be improved further using refinements and variants of the method or introducing other filtering methods.</Paragraph>
    <Paragraph position="1"> There are, however, reasons to assume that we are not too far from a situation in which further algorithmic improvements become impossible for fundamental reasons.</Paragraph>
    <Paragraph position="2"> The following considerations show how an &amp;quot;optimal&amp;quot; correction time can be estimated that cannot be improved upon without altering the hardware or using faster access methods for automata.</Paragraph>
    <Paragraph position="3"> We used a simple backtracking procedure to realize a complete traversal of the dictionary automaton A D . During a first traversal, we counted the total number of visits to any state. Since A D is not a tree, states may be passed through several times during the complete traversal. Each such event counts as one visit to a state. The ratio of the number of visits to the total number of symbols in the list of words D gives the average number of visits per symbol, denoted v  smaller than 1 because of numerous prefixes of dictionary words that are shared in A D .</Paragraph>
    <Paragraph position="4"> We then used a second traversal of A D --not counting visits to states--to compute the total traversal time. The ratio of the total traversal time to the number of visits yields  can be used to estimate the optimal correction time for V. In fact, in order to achieve this correction time, we need an oracle that knows how to avoid any kind of useless backtracking. Each situation in which we proceeded on a dictionary path that does not lead to a correction candidate for V would require some extra time that is not included in the above calculation. From another point of view, the above idealized algorithm essentially just copies the correction candidates into a resulting destination. The time that is consumed is proportional to the sum of the length of the correction candidates.</Paragraph>
    <Paragraph position="5">  Mihov and Schulz Fast Approximate Search in Large Dictionaries For each of the three dictionaries, we estimated the optimal correction time for one class of input words. For BL we looked at input words of length 10. The average number of correction candidates for Levenshtein distance 3 is 121.73 (cf. Table 1). Assuming that the average length of correction candidates is 10, we obtain a total of 1, 217.3 symbols in the complete set of all correction candidates. Hence the optimal correction time is approximately 1217.3 * 0.0000918 ms * 0.1433 = 0.016 ms The actual correction time using filtering with the backwards-dictionary method is 0.827 ms, which is 52 times slower.</Paragraph>
    <Paragraph position="6"> For GL, we considered input words of length 15-24 and distance bound 3. We have on average 3.824 correction candidates of length 20, that is, 76.48 symbols. Hence the optimal correction time is approximately 76.48 * 0.0001078 ms * 0.3618 = 0.003 ms The actual correction time using filtering with the backwards-dictionary method is 0.601 ms, which is 200 times slower.</Paragraph>
    <Paragraph position="7"> For TL, we used input sequences of length 45-54 and again distance bound 3. We have on average 0.857 correction candidates of length 50, that is, 42.85 symbols. Hence the optimal correction time is approximately 42.85 * 0.0000865 ms * 0.7335 = 0.003 ms The actual correction time using filtering with the backwards-dictionary method is 0.759 ms, which is 253 times slower.</Paragraph>
    <Paragraph position="8"> These numbers coincide with our basic intuition that further algorithmic improvements are simpler for dictionaries with long entries. For example, variants of the backwards-dictionary method could be considered in which a finer subcase analysis is used to improve filtering.</Paragraph>
  </Section>
class="xml-element"></Paper>