<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-1002">
  <Title>Stoyan Mihov, Bulgarian Academy of Sciences</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Finite-state automata are used in a variety of applications, including aspects of natural language processing (NLP). They may store sets of words, with or without annotations such as the corresponding pronunciation, base form, or morphological categories. The main reasons for using finite-state automata in the NLP domain are that their representation of the set of words is compact, and that looking up a string in a dictionary represented by a finite-state automaton is very fast: the lookup time is proportional to the length of the string. Of particular interest to the NLP community are deterministic, acyclic, finite-state automata, which we call dictionaries.</Paragraph>
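As an illustration only, the following Python sketch shows why lookup time is proportional to the length of the string: a deterministic automaton follows exactly one transition per input symbol. The state representation (a dict with "final" and "next" keys) and the names new_state, build_trie, and lookup are assumptions made for this example and do not come from the paper. The automaton built here is a plain trie, which is deterministic and acyclic but not minimal.

```python
def new_state():
    # Hypothetical state layout: a final flag plus outgoing transitions.
    return {"final": False, "next": {}}


def build_trie(words):
    """Build a deterministic acyclic automaton (an unminimized trie)."""
    root = new_state()
    for word in words:
        state = root
        for symbol in word:
            if symbol not in state["next"]:
                state["next"][symbol] = new_state()
            state = state["next"][symbol]
        state["final"] = True
    return root


def lookup(root, word):
    """Follow one transition per symbol, so the cost grows with len(word)."""
    state = root
    for symbol in word:
        state = state["next"].get(symbol)
        if state is None:
            return False  # no transition on this symbol: word is not stored
    return state["final"]


dictionary = build_trie(["tap", "taps", "top", "tops"])
print(lookup(dictionary, "tops"))  # True
print(lookup(dictionary, "to"))    # False
```

The number of stored words does not appear in the loop of lookup; only the length of the queried string does.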
    <Paragraph position="1"> Dictionaries can be constructed in various ways; see Watson (1993a, 1995) for a taxonomy of (general) finite-state automata construction algorithms. A word is simply a finite sequence of symbols over some alphabet and we do not associate it with a meaning in this paper. A necessary and sufficient condition for any deterministic automaton to be acyclic is that it recognizes a finite set of words. The algorithms described here construct automata from such finite sets.</Paragraph>
    <Paragraph position="2"> The Myhill-Nerode theorem (see Hopcroft and Ullman [1979]) states that among the many deterministic automata that accept a given language, there is a unique automaton (excluding isomorphisms) that has a minimal number of states. This is called the minimal deterministic automaton of the language.</Paragraph>
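To make the notion of a minimal automaton concrete, the sketch below builds a trie for a small word set and then merges states that accept the same set of suffixes. This is a simple after-the-fact minimization written for illustration; it is not the incremental algorithm presented in this paper. The parallel-list representation and the names build_trie, minimize, and register are assumptions of the example.

```python
def build_trie(words):
    # State 0 is the start state. final[i] marks accepting states and
    # edges[i] maps each symbol to the target state of that transition.
    final = [False]
    edges = [{}]
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in edges[state]:
                final.append(False)
                edges.append({})
                edges[state][symbol] = len(edges) - 1
            state = edges[state][symbol]
        final[state] = True
    return final, edges


def minimize(final, edges):
    """Merge equivalent states of a trie, visiting children before parents."""
    register = {}  # signature -> state number in the minimal automaton

    def visit(state):
        # Two states are equivalent when they agree on finality and on
        # their transitions to (already merged) successor states.
        transitions = tuple(sorted(
            (symbol, visit(target)) for symbol, target in edges[state].items()
        ))
        signature = (final[state], transitions)
        if signature not in register:
            register[signature] = len(register)
        return register[signature]

    start = visit(0)
    return start, register


final, edges = build_trie(["tap", "taps", "top", "tops"])
start, register = minimize(final, edges)
print(len(edges))     # 8 states in the trie
print(len(register))  # 5 states after merging
```

In this example the states reached by "ta" and "to" are merged, as are those reached by "tap" and "top" and those reached by "taps" and "tops", because each pair accepts the same suffixes.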
    <Paragraph position="3"> The generalized algorithm presented in this paper has been independently developed by Jan Daciuk of the Technical University of Gdańsk, and by Richard Watson and Bruce Watson (then of the IST Technologies Research Group) at Ribbit Software Systems Inc. The specialized (to sorted input data) algorithm was independently developed by Jan Daciuk and by Stoyan Mihov of the Bulgarian Academy of Sciences. Author affiliations: * Department of Applied Informatics, Technical University of Gdańsk, Ul. G. Narutowicza 11/12, PL80-952 Gdańsk, Poland. E-mail: jandac@pg.gda.pl. † Linguistic Modelling Laboratory, LPDP, Bulgarian Academy of Sciences, Bulgaria. E-mail: stoyan@lml.bas.bg. ‡ Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa. E-mail: watson@cs.up.ac.za. § E-mail: watson@OpenFIRE.org. © 2000 Association for Computational Linguistics. Computational Linguistics, Volume 26, Number 1.</Paragraph>
    <Paragraph position="4"> Jan Daciuk has made his C++ implementations of the algorithms freely available for research purposes at www.pg.gda.pl/~jandac/fsa.html. Stoyan Mihov has implemented the (sorted input) algorithm in a Java package for minimal acyclic finite-state automata. This package forms the foundation of the Grammatical Web Server for Bulgarian (at origin2000.bas.bg) and implements operations on acyclic finite automata, such as union, intersection, and difference, as well as constructions for perfect hashing. Commercial C++ and Java implementations are available via www.OpenFIRE.org.</Paragraph>
    <Paragraph position="5"> The commercial implementations include several additional features such as a method to remove words from the dictionary (while maintaining minimality). The algorithms have been used for constructing dictionaries and transducers for spell-checking, morphological analysis, two-level morphology, restoration of diacritics, perfect hashing, and document indexing. The algorithms have also proven useful in numerous problems outside the field of NLP, such as DNA sequence matching and computer virus recognition.</Paragraph>
    <Paragraph position="6"> An earlier version of this paper, authored by Daciuk, Watson, and Watson, appeared at the International Workshop on Finite-state Methods in Natural Language Processing in 1998; see Daciuk, Watson, and Watson (1998).</Paragraph>
  </Section>
</Paper>