File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/j00-1005_abstr.xml
Size: 8,024 bytes
Last Modified: 2025-10-06 13:41:41
<?xml version="1.0" standalone="yes"?> <Paper uid="J00-1005"> <Title>Treatment of Epsilon Moves in Subset Construction</Title> <Section position="2" start_page="0" end_page="62" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Finite-State Language Processing </SectionTitle> <Paragraph position="0"> An important problem in computational linguistics is posed by the fact that the grammars typically hypothesized by linguists are unattractive from the point of view of computation. For instance, the number of steps required to analyze a sentence of n words is proportional to n³ for context-free grammars. For certain linguistically more attractive grammatical formalisms it can be shown that no upper bound on the number of steps required to find an analysis can be given. The human language user, however, seems to process in linear time; humans understand longer sentences with no noticeable delay. This implies that neither context-free grammars nor more powerful grammatical formalisms are likely models for human language processing. An important issue therefore is how the linearity of processing by humans can be accounted for.</Paragraph> <Paragraph position="1"> A potential solution to this problem concerns the possibility of approximating an underlying general and abstract grammar by techniques of a much simpler sort.</Paragraph> <Paragraph position="2"> The idea that a competence grammar might be approximated by finite-state means goes back to early work by Chomsky (Chomsky 1963, 1964). There are essentially three observations that motivate the view that the processing of natural language by humans is finite-state:
- humans have a finite (small, limited, fixed) amount of memory available for language processing;
- humans have problems with certain grammatical constructions, such as center-embedding, which are impossible to describe by finite-state means (Miller and Chomsky 1963);
- humans process natural language very efficiently (in linear time).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Finite-State Approximation and ε-Moves </SectionTitle> <Paragraph position="0"> In experimenting with finite-state approximation techniques for context-free and more powerful grammatical formalisms (such as the techniques presented in Black [1989], Pereira and Wright [1991, 1997], Rood [1996], Grimley-Evans [1997], Nederhof [1997, 1998], and Johnson [1998]), we have found that the resulting automata are often extremely large. Moreover, the automata contain many ε-moves (jumps). Finally, if such automata are determinized, the resulting automata are often smaller. It turns out that a straightforward implementation of the subset construction determinization algorithm performs badly for such inputs. In this paper we consider a number of variants of the subset construction algorithm that differ in their treatment of ε-moves.</Paragraph> <Paragraph position="1"> Although we have observed that finite-state approximation techniques typically yield automata with large numbers of ε-moves, this is obviously not a necessity. Instead of trying to improve upon determinization techniques for such automata, it might be more fruitful to improve the approximation techniques themselves so that more compact automata are produced. However, because research into finite-state approximation is still of an exploratory and experimental nature, it can be argued that more robust determinization algorithms still have a role to play: approximation techniques can be expected to be much easier to define and implement if the resulting automaton is allowed to be nondeterministic and to contain ε-moves.</Paragraph>
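To make the notion of ε-moves concrete, the following is a minimal illustrative sketch, in Python rather than the SICStus Prolog of the authors' FSA Utilities, of one possible representation of a nondeterministic automaton with ε-moves and of the computation of the states reachable via ε-moves (the ε-closure). The representation and all names are assumptions made for this example only, not part of the paper.

```python
# Illustrative sketch only: a toy encoding of a nondeterministic automaton
# with epsilon-moves, and the computation of epsilon-reachable states.
# Names (EPSILON, epsilon_closure) are invented for this example; the
# authors' actual implementation is the FSA Utilities in SICStus Prolog.

EPSILON = None  # label used here for epsilon-moves ("jumps")

# transitions: dict mapping (state, symbol) -> set of target states
transitions = {
    (0, EPSILON): {1, 2},
    (1, 'a'):     {3},
    (2, EPSILON): {3},
    (3, 'b'):     {0},
}

def epsilon_closure(states, transitions):
    """Return the set of states reachable from `states` using only
    epsilon-moves (including the given states themselves)."""
    closure = set(states)
    agenda = list(states)
    while agenda:
        q = agenda.pop()
        for target in transitions.get((q, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                agenda.append(target)
    return closure

print(epsilon_closure({0}, transitions))   # -> {0, 1, 2, 3}
```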
<Paragraph position="2"> Note furthermore that even if our primary motivation is finite-state approximation, the problem of determinizing finite-state automata with ε-moves may be relevant in other areas of language research as well.</Paragraph> </Section> <Section position="3" start_page="0" end_page="62" type="sub_section"> <SectionTitle> 1.3 Subset Construction and ε-Moves </SectionTitle> <Paragraph position="0"> The experiments were performed using the FSA Utilities. The FSA Utilities toolbox (van Noord 1997, 1999; Gerdemann and van Noord 1999; van Noord and Gerdemann 1999) is a collection of tools to manipulate regular expressions, finite-state automata, and finite-state transducers. Manipulations include determinization, minimization, composition, complementation, intersection, Kleene closure, etc. Various visualization tools are available to browse finite-state automata. The toolbox is implemented in SICStus Prolog, and is available free of charge under the GNU General Public License via anonymous ftp at ftp://ftp.let.rug.nl/pub/vannoord/Fsa/, and via the web at http://www.let.rug.nl/~vannoord/Fsa/. At the time of our initial experiments with finite-state approximation, an old version of the toolbox was used, which ran into memory problems for some of these automata. For this reason, the subset construction algorithm has been reimplemented, paying special attention to the treatment of ε-moves. Three variants of the subset construction algorithm are identified, which differ in the way ε-moves are treated:
per graph: The most obvious and straightforward approach is sequential in the following sense. First, an equivalent automaton without ε-moves is constructed for the input; to do this, the transitive closure of the graph consisting of all ε-moves is computed. Second, the resulting automaton is treated by a subset construction algorithm for ε-free automata. Different variants of per graph can be identified, depending on the implementation of the ε-removal step.</Paragraph> <Paragraph position="1"> per state: For each state that occurs in a subset produced during subset construction, compute the states that are reachable using ε-moves. The results of this computation can be memoized, or computed for each state in a preprocessing step. This is the approach mentioned briefly in Johnson and Wood (1997).
per subset: For each subset Q of states that arises during subset construction, compute Q' ⊇ Q, which extends Q with all states that are reachable from any member of Q using ε-moves. Such an algorithm is described in Aho, Sethi, and Ullman (1986).</Paragraph> <Paragraph position="2"> The motivation for this paper is the observation, based on practical experience, that the first approach turns out to be impractical for automata with very large numbers of ε-moves. An integration of the subset construction algorithm with the computation of ε-reachable states performs much better in practice for such automata.</Paragraph>
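The following is a minimal sketch of the integrated, "per subset" idea, again in Python rather than the Prolog of the FSA Utilities and with an invented representation: every subset Q that arises during subset construction is immediately extended to its ε-closure Q' ⊇ Q, so no separate ε-removal pass over the whole graph is needed. The per state variant differs only in computing (and memoizing) the ε-reachable states per individual state rather than per subset. This is a sketch under those assumptions, not the authors' implementation.

```python
# Illustrative sketch of the "per subset" variant: subset construction in
# which every subset is extended with its epsilon-reachable states as it
# arises. Names and representation are invented for this example.

EPSILON = None  # label used here for epsilon-moves

def epsilon_closure(states, transitions):
    """States reachable from `states` via epsilon-moves (including `states`).
    Repeated from the sketch above so that this block is self-contained."""
    closure, agenda = set(states), list(states)
    while agenda:
        q = agenda.pop()
        for target in transitions.get((q, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                agenda.append(target)
    return closure

def determinize(start, finals, transitions, alphabet):
    """Per-subset subset construction for an automaton with epsilon-moves.

    Each subset Q produced during the construction is replaced by its
    extension Q' with all epsilon-reachable states.
    Returns (start subset, transition dict over subsets, set of final subsets).
    """
    start_subset = frozenset(epsilon_closure({start}, transitions))
    det_trans, det_finals = {}, set()
    agenda, seen = [start_subset], {start_subset}
    while agenda:
        subset = agenda.pop()
        if subset & finals:
            det_finals.add(subset)
        for symbol in alphabet:
            # one step over `symbol`, then extend with epsilon-reachable states
            step = set()
            for q in subset:
                step |= transitions.get((q, symbol), set())
            if not step:
                continue
            target = frozenset(epsilon_closure(step, transitions))
            det_trans[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                agenda.append(target)
    return start_subset, det_trans, det_finals

# Toy automaton: accepts 'a' either directly (0 -a-> 2) or after a jump
# (0 -eps-> 1 -a-> 2). The start subset becomes {0, 1}.
toy = {(0, EPSILON): {1}, (0, 'a'): {2}, (1, 'a'): {2}}
print(determinize(0, finals={2}, transitions=toy, alphabet={'a'}))
```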
<Paragraph position="3"> Section 2 presents a short statement of the problem (how to determinize a given finite-state automaton) and a subset construction algorithm that solves this problem in the absence of ε-moves. Section 3 defines a number of subset construction algorithms that differ with respect to the treatment of ε-moves. Most aspects of the algorithms are not new and have been described elsewhere, and/or were incorporated in previous implementations; however, a comparison of the different algorithms had not been performed previously. We provide a comparison with respect to the size of the resulting deterministic automaton (in Section 3) and practical efficiency (in Section 4). Section 4 provides experimental results both for randomly generated automata and for automata generated by approximation algorithms. Our implementations of the various algorithms are also compared with AT&T's FSM utilities (Mohri, Pereira, and Riley 1998), to establish that the experimental differences we find between the algorithms are truly caused by differences in the algorithms, as opposed to accidental implementation details.</Paragraph> </Section> </Section> </Paper>