<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1306">
  <Title>Treatment of ε-Moves in Subset Construction</Title>
  <Section position="2" start_page="0" end_page="57" type="abstr">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In experimenting with finite-state approximation techniques for context-free and more powerful grammatical formalisms (such as the techniques presented in Pereira and Wright (1997), Nederhof (1997), and Evans (1997)), we have found that the resulting automata are often extremely large. Moreover, the automata contain many ε-moves (jumps). Finally, if such automata are determinised, the resulting automata are often smaller. It turns out that a straightforward implementation of the subset construction determinisation algorithm performs badly for such inputs.</Paragraph>
    <Paragraph position="1"> As a motivating example, consider the definite-clause grammar that has been developed for the OVIS2 Spoken Dialogue System. This grammar is described in detail in van Noord et al. (1997). After removing the feature constraints of this grammar, and after removing the sub-grammar for temporal expressions, the resulting context-free skeleton grammar was input to an implementation of the technique described in Nederhof (1997). (A later implementation by Nederhof (p.c.) avoids construction of the complete non-deterministic automaton by determinising and minimising subautomata before they are embedded into larger subautomata.) The resulting non-deterministic automaton (labelled zovis2 below) contains 89832 states, 80935 ε-moves, and 80400 transitions.</Paragraph>
    <Paragraph position="2"> The determinised automaton contains only 6541 states and 60781 transitions. Finally, the minimal automaton contains only 78 states and 526 transitions! Other grammars give rise to similar numbers. Thus, the approximation techniques yield particularly 'verbose' automata for relatively simple languages.</Paragraph>
    <Paragraph position="3"> The experiments were performed using the FSA Utilities toolkit (van Noord, 1997). At the time, an old version of the toolkit was used, which ran into memory problems for some of these automata. For this reason, the subset construction algorithm has been re-implemented, paying special attention to the treatment of ε-moves. Three variants of the subset construction algorithm are identified, which differ in the way ε-moves are treated:</Paragraph>
    <Paragraph position="5"> per graph: The most obvious and straightforward approach is sequential in the following sense. Firstly, an equivalent automaton without ε-moves is constructed for the input. In order to do this, the transitive closure of the graph consisting of all ε-moves is computed. Secondly, the resulting automaton is then treated by a subset construction algorithm for ε-free automata.</Paragraph>
    <Paragraph position="6"> per state: For each state which occurs in a subset produced during subset construction, compute the states which are reachable using ε-moves. The results of this computation can be memorised, or computed for each state in a preprocessing step. This is the approach mentioned briefly in Johnson and Wood (1997). per subset: For each subset Q of states which arises during subset construction, compute Q′ ⊇ Q, which extends Q with all states which are reachable from any member of Q using ε-moves. Such an algorithm is described in Aho, Sethi, and Ullman (1986). We extend this algorithm by memorising the ε-closure computation.</Paragraph>
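The per state and per subset variants can be illustrated with a minimal Python sketch. It is not the FSA Utilities implementation: the dictionary-based automaton encoding and the names make_closures and determinise are assumptions made for this example only. Both variants plug into the same subset construction and differ only in how the ε-closure of a subset is obtained and memorised.

```python
from functools import lru_cache


def make_closures(eps):
    """Build two epsilon-closure functions (names are hypothetical).

    eps maps a state to the set of states reachable by one epsilon-move.
    """

    def reach(start_states):
        # Depth-first search that follows epsilon-moves only.
        seen = set(start_states)
        stack = list(start_states)
        while stack:
            state = stack.pop()
            for target in eps.get(state, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return frozenset(seen)

    @lru_cache(maxsize=None)
    def closure_of_state(state):
        return reach([state])

    def per_state(subset):
        # Union of the memorised closures of the individual states.
        result = set()
        for state in subset:
            result |= closure_of_state(state)
        return frozenset(result)

    @lru_cache(maxsize=None)
    def per_subset(subset):
        # One search per distinct subset, memorised on the subset itself.
        return reach(subset)

    return per_state, per_subset


def determinise(start, finals, trans, closure):
    """Subset construction; closure adds the epsilon-reachable states.

    trans maps a state to a dict from symbol to a set of target states.
    """
    start_set = closure(frozenset([start]))
    dfa_trans = {}
    seen = {start_set}
    agenda = [start_set]
    while agenda:
        subset = agenda.pop()
        by_symbol = {}
        for state in subset:
            for symbol, targets in trans.get(state, {}).items():
                by_symbol.setdefault(symbol, set()).update(targets)
        for symbol, targets in by_symbol.items():
            target_set = closure(frozenset(targets))
            dfa_trans[(subset, symbol)] = target_set
            if target_set not in seen:
                seen.add(target_set)
                agenda.append(target_set)
    dfa_finals = {s for s in seen if not s.isdisjoint(finals)}
    return start_set, dfa_finals, dfa_trans


# Toy automaton (an assumption for illustration): 0 -eps-> 1 -eps-> 2 -a-> 3 -a-> 0
eps = {0: {1}, 1: {2}}
trans = {2: {"a": {3}}, 3: {"a": {0}}}
per_state, per_subset = make_closures(eps)
dfa_a = determinise(0, {3}, trans, per_state)
dfa_b = determinise(0, {3}, trans, per_subset)
print(dfa_a == dfa_b)  # both closure strategies yield the same automaton
```

The two closure functions return identical sets. The difference lies in what is cached: per_state caches one closure per input state and unions them for every subset, whereas per_subset caches one closure per distinct subset.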
    <Paragraph position="7"> The motivation for this paper is the experience that the first approach turns out to be impractical for automata with very large numbers of ε-moves. An integration of the subset construction algorithm with the computation of ε-reachable states performs much better in practice. The per subset algorithm almost always performs better than the per state approach. However, for automata with a low number of jumps, the per graph algorithm outperforms the others.</Paragraph>
    <Paragraph position="8"> In constructing an ε-free automaton, the number of transitions increases. Given the fact that the input automaton is already extremely large (compared to the simplicity of the language it defines), this is an undesirable situation. An equivalent ε-free automaton for the example given above has 2353781 transitions. The implementation of per subset is the only variant which succeeds in determinising the input automaton of this example.</Paragraph>
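The growth in transitions caused by the per graph approach can be seen on a small example. The following Python sketch is an illustration under stated assumptions, not the toolkit's code: the function name remove_epsilon_moves and the dictionary encoding are hypothetical, and the ε-removal shown is the textbook one that copies to each state the labelled transitions of every state in its ε-closure.

```python
def remove_epsilon_moves(states, finals, trans, eps):
    """Return (finals, transitions) of an equivalent epsilon-free automaton.

    trans maps a state to a dict from symbol to a set of target states;
    eps maps a state to the set of states reachable by one epsilon-move.
    """
    # Reflexive and transitive closure of the epsilon graph, one search per state.
    closure = {}
    for state in states:
        seen = {state}
        stack = [state]
        while stack:
            current = stack.pop()
            for target in eps.get(current, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        closure[state] = seen

    new_trans = set()
    new_finals = set()
    for state in states:
        for reached in closure[state]:
            # A state is final if some final state is epsilon-reachable from it.
            if reached in finals:
                new_finals.add(state)
            # Copy every labelled transition of an epsilon-reachable state.
            for symbol, targets in trans.get(reached, {}).items():
                for target in targets:
                    new_trans.add((state, symbol, target))
    return new_finals, new_trans


# Toy input (an assumption for illustration): a chain of 5 states linked by
# epsilon-moves, each state carrying a single self-loop labelled "a".
n = 5
states = range(n)
trans = {i: {"a": {i}} for i in states}
eps = {i: {i + 1} for i in range(n - 1)}
eps_free_finals, eps_free_trans = remove_epsilon_moves(states, {n - 1}, trans, eps)
print(len(eps_free_trans))  # 15 labelled transitions, up from 5 in the input
```

For this chain the number of labelled transitions grows from n to n(n+1)/2, which hints at why automata with very many ε-moves can become too large to handle once the ε-moves are removed up front.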
    <Paragraph position="9"> In the following section, some background information concerning the FSA Utilities toolbox is provided. Section 3 then presents a short statement of the problem (determinise a given finite-state automaton), and a subset construction algorithm which solves this problem in the absence of ε-moves. Section 4 identifies three variants of the subset construction algorithm which take ε-moves into account. Finally, Section 5 discusses some experiments in order to compare the three variants both on randomly generated automata and on automata generated by approximation algorithms.</Paragraph>
  </Section>
</Paper>