<?xml version="1.0" standalone="yes"?> <Paper uid="J00-1005"> <Title>Treatment of Epsilon Moves in Subset Construction</Title> <Section position="3" start_page="62" end_page="64" type="metho"> <SectionTitle> 2. Subset Construction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="62" end_page="64" type="sub_section"> <SectionTitle> 2.1 Problem Statement </SectionTitle> <Paragraph position="0"> Let a finite-state machine M be specified by a tuple (Q, Σ, δ, S, F), where Q is a finite set of states, Σ is a finite alphabet, and δ is a function from Q × (Σ ∪ {ε}) → 2^Q.</Paragraph> <Paragraph position="1"> Furthermore, S ⊆ Q is a set of start states and F ⊆ Q is a set of final states.3 Let ε-move be the relation {(q_i, q_j) | q_j ∈ δ(q_i, ε)}. ε-reachable is the reflexive and transitive closure of ε-move. Let ε-CLOSURE: 2^Q → 2^Q be the function defined as ε-CLOSURE(Q') = {q | q' ∈ Q', (q', q) ∈ ε-reachable}. Furthermore, we write ε-CLOSURE^-1(Q') for the set {q | q' ∈ Q', (q, q') ∈ ε-reachable}.</Paragraph> <Paragraph position="2"> 2 According to Derick Wood (p.c.), this approach has been implemented in several systems, including Howard Johnson's INR system.</Paragraph> <Paragraph position="3"> 3 Note that a set of start states is required, rather than a single start state. Many operations on automata can be defined somewhat more elegantly in this way (including the per graph^t variant discussed below). 
Obviously, for deterministic automata this set should be a singleton set.</Paragraph> <Paragraph position="4"> Computational Linguistics Volume 26, Number 1</Paragraph> <Paragraph position="6"> while there is an unmarked subset T ∈ States do</Paragraph> <Paragraph position="8"> if U ∉ States then add U unmarked to States; if U ∩ F ≠ ∅ then Finals := Finals ∪ {U} fi</Paragraph> <Paragraph position="10"> Figure 1: Subset construction algorithm.</Paragraph> <Paragraph position="11"> For any given finite-state automaton M = (Q, Σ, δ, S, F), there is an equivalent deterministic automaton M' = (2^Q, Σ, δ', {Q_0}, F'). F' is the set of all states in 2^Q containing a final state of M, i.e., the set of subsets {Q_i ∈ 2^Q | q ∈ Q_i, q ∈ F}. M' has a single start state Q_0, which is the epsilon closure of the start states of M, i.e., Q_0 = ε-CLOSURE(S). Finally, δ'({q_1, q_2, ..., q_i}, a) = ε-CLOSURE(δ(q_1, a) ∪ δ(q_2, a) ∪ ... ∪ δ(q_i, a)). An algorithm that computes M' for a given M need only take into account states in 2^Q that are reachable from the start state Q_0. This is the reason that for many input automata the algorithm does not need to treat all subsets of states (but note that there are automata for which all subsets are relevant, and hence exponential behavior cannot be avoided in general).</Paragraph> <Paragraph position="12"> Consider the subset construction algorithm in Figure 1. The algorithm maintains a set of subsets States. Each subset can be either marked or unmarked (to indicate whether the subset has been treated by the algorithm); the set of unmarked subsets is sometimes referred to as the agenda. The algorithm takes such an unmarked subset T and computes all transitions leaving T. This computation is performed by the function instructions and is called instruction computation by Johnson and Wood (1997).</Paragraph> <Paragraph position="13"> The function index_transitions constructs the function transitions: Q → 2^(Σ × 2^Q), which returns for a given state p the set of pairs (s, T) representing the transitions leaving p. Furthermore, the function merge takes such a set of pairs and merges all pairs with the same first element (by taking the union of the corresponding second elements). For example: merge({(a, {1,2,4}), (b, {2,4}), (a, {3,4}), (b, {5,6})}) = {(a, {1,2,3,4}), (b, {2,4,5,6})}. The procedure add is responsible for &quot;reachable-state-set maintenance,&quot; by ensuring that target subsets are added to the set of subsets if these subsets were not encountered before. Moreover, if such a new subset contains a final state, then this subset is added to the set of final states.</Paragraph> </Section> </Section> <Section position="4" start_page="64" end_page="67" type="metho"> <SectionTitle> 3. Variants for ε-Moves </SectionTitle> <Paragraph position="0"> The algorithm presented in the previous section does not treat ε-moves. In this section, possible extensions of the algorithm are identified to treat ε-moves.</Paragraph> <Section position="1" start_page="64" end_page="65" type="sub_section"> <SectionTitle> 3.1 Per Graph </SectionTitle> <Paragraph position="0"> In the per graph variant, two steps can be identified. In the first step, efree, an equivalent ε-free automaton is constructed. In the second step this ε-free automaton is determinized using the subset construction algorithm. The advantage of this approach is that the subset construction algorithm can remain simple because the input automaton is ε-free.</Paragraph> <Paragraph position="1"> An algorithm for efree is described for instance in Hopcroft and Ullman (1979, 26-27). The main ingredient of efree is the construction of the function ε-CLOSURE, which can be computed using a standard transitive closure algorithm for directed graphs: this algorithm is applied to the directed graph consisting of all ε-moves of M. 
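As a concrete illustration (a Python sketch with invented names, not the paper's Prolog implementation), the ε-CLOSURE function of Section 2.1 can be computed as a reachability traversal over the graph of ε-moves:

```python
def epsilon_closure(states, eps_moves):
    """Return ε-CLOSURE(states): all states ε-reachable from `states`.

    eps_moves maps a state to the set of states reachable by a single
    ε-move (the ε-move relation of Section 2.1).
    """
    closure = set(states)            # ε-reachable is reflexive
    agenda = list(states)
    while agenda:
        q = agenda.pop()
        for r in eps_moves.get(q, ()):
            if r not in closure:     # transitive step; each state visited once
                closure.add(r)
                agenda.append(r)
    return closure

# Example: 0 -ε-> 1 -ε-> 2; state 3 has no ε-moves.
eps = {0: {1}, 1: {2}}
assert epsilon_closure({0}, eps) == {0, 1, 2}
assert epsilon_closure({3}, eps) == {3}
```

Running this traversal once from every node of the ε-graph yields the full transitive closure used by the per graph variants.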
Such an algorithm can be found in several textbooks (see, for instance, Cormen, Leiserson, and Rivest [1990]).</Paragraph> <Paragraph position="2"> For a given finite-state automaton M = (Q, Σ, δ, S, F), efree computes M' = (Q, Σ, δ', S', F'), where S' = ε-CLOSURE(S), F' = ε-CLOSURE^-1(F), and δ'(p, a) = {q | q' ∈ δ(p', a), p' ∈ ε-CLOSURE({p}), q ∈ ε-CLOSURE({q'})}. Instead of using ε-CLOSURE on both the source and target side of a transition, efree can be optimized in two different ways by using ε-CLOSURE only on one side: efree^t: M' = (Q, Σ, δ', S', F), where S' = ε-CLOSURE(S), and δ'(p, a) = ε-CLOSURE(δ(p, a)).</Paragraph> <Paragraph position="3"> efree^s: M' = (Q, Σ, δ', S, F'), where F' = ε-CLOSURE^-1(F), and δ'(p, a) = {q | q ∈ δ(p', a), p' ∈ ε-CLOSURE({p})}.</Paragraph> <Paragraph position="4"> Although the variants appear very similar, there are some differences. Firstly, efree^t might introduce states that are not co-accessible: states from which no path exists to a final state; in contrast, efree^s might introduce states that are not accessible: states to which no path exists from a start state. A straightforward modification of both algorithms is possible to ensure that these states are not present in the output. (Figure 2: Illustration of the difference in size between two variants of efree. (1) is the input automaton. The result of efree^t is given in (2); (3) is the result of efree^s. (4) and (5) are the result of applying the subset construction to the result of efree^t and efree^s, respectively.)</Paragraph> <Paragraph position="5"> Thus efree^{t,c} ensures that all states in the resulting automaton are co-accessible; efree^{s,a} ensures that all states in the resulting automaton are accessible. 
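As an illustration of the target-side variant efree^t (a self-contained Python sketch under an assumed encoding, a dict from (state, symbol) pairs to target sets; these names are not from the paper), every transition's target set is replaced by its ε-closure, and the start set is closed as well:

```python
def epsilon_closure(states, eps_moves):
    """ε-CLOSURE over the ε-move relation (reflexive-transitive reachability)."""
    closure, agenda = set(states), list(states)
    while agenda:
        q = agenda.pop()
        for r in eps_moves.get(q, ()):
            if r not in closure:
                closure.add(r)
                agenda.append(r)
    return closure

def efree_t(delta, eps_moves, start, final):
    """Target-side ε-removal: δ'(p, a) = ε-CLOSURE(δ(p, a)), S' = ε-CLOSURE(S).

    delta maps (state, symbol) to a set of target states (non-ε transitions
    only); eps_moves is the ε-move relation. Final states are unchanged.
    """
    new_delta = {(p, a): epsilon_closure(targets, eps_moves)
                 for (p, a), targets in delta.items()}
    return new_delta, epsilon_closure(start, eps_moves), set(final)

# Example: 0 --a--> 1, 1 -ε-> 2, final = {2}.
nd, s, f = efree_t({(0, 'a'): {1}}, {1: {2}}, {0}, {2})
assert nd[(0, 'a')] == {1, 2}   # target side closed over the ε-move 1 -ε-> 2
assert s == {0}
```

The source-side variant efree^s is symmetric: it unions δ(p', a) over p' in the ε-closure of p instead, and closes the final states backwards.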
As a consequence, the size of the determinized machine is in general smaller if efree^{t,c} is employed, because states that were not co-accessible (in the input) are removed (this is therefore an additional benefit of efree^{t,c}); the fact that efree^{s,a} removes non-accessible states has no effect on the size of the determinized machine, because the subset construction algorithm already ensures accessibility anyway.</Paragraph> <Paragraph position="6"> Secondly, it turns out that applying efree^t in combination with the subset construction algorithm generally produces smaller automata than efree^s (even if we ignore the benefit of ensuring co-accessibility). An example is presented in Figure 2. The differences can be quite significant, as illustrated in Figure 3.</Paragraph> <Paragraph position="7"> Below we will write per graph^x to indicate the nonintegrated algorithm based on efree^x.</Paragraph> </Section> <Section position="2" start_page="65" end_page="67" type="sub_section"> <SectionTitle> 3.2 Per Subset and Per State </SectionTitle> <Paragraph position="0"> Next, we discuss two variants (per subset and per state) in which the treatment of ε-moves is integrated with the subset construction algorithm. We will show later that such an integrated approach is in practice often more efficient than the per graph approach if there are many ε-moves. The per subset and per state approaches are also more suitable for a lazy implementation of the subset construction algorithm (in such a lazy implementation, subsets are only computed with respect to a given input string).</Paragraph> <Paragraph position="1"> The per subset and the per state algorithms use a simplified variant of the transitive closure algorithm for graphs. Instead of computing the transitive closure of a given graph, this algorithm only computes the closure for a given set of states. Such an algorithm is given in Figure 4. (Figure 3: Difference in sizes of deterministic automata constructed with either efree^s or efree^t, for randomly generated input automata consisting of 100 states, 15 symbols, and various numbers of transitions and jumps (cf. Section 4). Note that all states in the input are co-accessible; the difference in size is due solely to the effect illustrated in Figure 2.)</Paragraph> <Paragraph position="3"> while there is an unmarked state t ∈ D do</Paragraph> <Paragraph position="5"> Figure 4: Epsilon closure algorithm.</Paragraph> <Paragraph position="7"> In both of the two integrated approaches, the subset construction algorithm is initialized with an agenda containing a single subset that is the ε-CLOSURE of the set of start states of the input; furthermore, the way in which new transitions are computed also takes the effect of ε-moves into account. Both differences are accounted for by an alternative definition of the epsilon_closure function.</Paragraph> <Paragraph position="8"> The approach in which the transitive closure is computed for one state at a time is defined by the following definition of the epsilon_closure function (variant 2: per state). Note that we make sure that the transitive closure computation is only performed once for each state. In the case of the per subset approach, the closure algorithm is applied to each subset. We also memoize the closure function, in order to ensure that the closure computation is performed only once for each subset. This can be useful, since the same subset can be generated many times during subset construction. The definition simply is (variant 3: per subset).</Paragraph> <Paragraph position="10"> The motivation for the per state variant is the insight that in this case the closure algorithm is called at most |Q| times. 
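A Python rendering of the closure algorithm of Figure 4, together with the per state variant built on top of it, might look as follows (the encoding and names are illustrative, not the paper's Prolog code; the memo table is assumed to belong to one input automaton):

```python
def closure_of_set(D, eps_moves):
    """Figure 4, roughly: extend D with every state ε-reachable from it."""
    D = set(D)
    unmarked = list(D)
    while unmarked:                    # while there is an unmarked state t in D
        t = unmarked.pop()
        for u in eps_moves.get(t, ()):  # follow a single ε-move
            if u not in D:
                D.add(u)
                unmarked.append(u)
    return D

state_memo = {}  # per state memo: at most |Q| closure computations per automaton

def epsilon_closure_per_state(U, eps_moves):
    """Variant 2 (per state): union of the memoized closures of each state."""
    result = set()
    for q in U:
        if q not in state_memo:
            state_memo[q] = closure_of_set({q}, eps_moves)
        result |= state_memo[q]
    return result

eps = {0: {1}, 1: {2}}
assert closure_of_set({0}, eps) == {0, 1, 2}
assert epsilon_closure_per_state({0, 3}, eps) == {0, 1, 2, 3}
```

The union over the states of U is exactly the overhead mentioned in the text: each call pays for combining per-state results, in exchange for never computing the closure of the same state twice.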
In contrast, in the per subset approach the transitive closure algorithm may need to be called 2^|Q| times. On the other hand, in the per state approach some overhead must be accepted for computing the union of the results for each state. Moreover, in practice, the number of subsets is often much smaller than 2^|Q|. In some cases, the number of reachable subsets is smaller than the number of states encountered in those subsets.</Paragraph> </Section> <Section position="3" start_page="67" end_page="67" type="sub_section"> <SectionTitle> 3.3 Implementation </SectionTitle> <Paragraph position="0"> In order to implement the algorithms efficiently in Prolog, it is important to use efficient data structures. In particular, we use an implementation of (non-updatable) arrays based on the N+K trees of O'Keefe (1990, 142-145), with N = 95 and K = 32.</Paragraph> <Paragraph position="1"> On top of this data structure, a hash array is implemented using the SICStus library predicate term_hash/4, which constructs a key for a given term. In such hashes, a value in the underlying array is a partial list of key-value pairs; thus collisions are resolved by chaining. This provides efficient access in practice, although such arrays are quite memory-intensive: care must be taken to ensure that the deterministic algorithms are indeed implemented without introducing choice-points at runtime.</Paragraph> </Section> </Section> <Section position="5" start_page="67" end_page="72" type="metho"> <SectionTitle> 4. Experiments </SectionTitle> <Paragraph position="0"> Two sets of experiments have been performed. In the first set of experiments, random automata are generated according to a number of criteria based on Leslie (1995). In the second set of experiments, results are provided for a number of (much larger) automata that surfaced during actual development work on finite-state approximation techniques. 
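The per subset memoization of Section 3.2, keyed on the subset itself as in the hashing scheme of Section 3.3, can be sketched in Python with frozenset keys (an illustrative stand-in for the Prolog term-hash-with-chaining implementation; names are invented):

```python
def make_per_subset_closure(eps_moves):
    """Variant 3 (per subset): a memoized ε-closure on subsets of states."""
    memo = {}
    def epsilon_closure(U):
        key = frozenset(U)             # the subset itself is the hash key
        if key not in memo:
            closure, agenda = set(U), list(U)
            while agenda:
                t = agenda.pop()
                for u in eps_moves.get(t, ()):
                    if u not in closure:
                        closure.add(u)
                        agenda.append(u)
            memo[key] = closure
        return memo[key]
    return epsilon_closure

ec = make_per_subset_closure({0: {1}, 1: {2}})
assert ec({0}) == {0, 1, 2}
assert ec({0}) is ec({0})   # the second call for the same subset hits the memo
```

This is useful precisely because the same subset can be generated many times during subset construction; the closure for it is then computed only once.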
Absolute transition density is defined as the number of transitions divided by the square of the number of states multiplied by the number of symbols (i.e., the number of transitions divided by the maximum number of &quot;possible&quot; transitions, or, in other words, the probability that a possible transition in fact exists). Deterministic transition density is the number of transitions divided by the number of states multiplied by the number of symbols (i.e., the ratio of the number of transitions and the maximum number of &quot;possible&quot; transitions in a deterministic machine).</Paragraph> <Paragraph position="1"> In both of these definitions, the number of transitions should be understood as the number of nonduplicate transitions that do not lead to a sink state. A sink state is a state from which there exists no sequence of transitions to a final state. In the randomly generated automata, states are accessible and co-accessible by construction; sink states and associated transitions are not represented.</Paragraph> <Paragraph position="2"> Leslie (1995) shows that deterministic transition density is a reliable measure for the difficulty of subset construction. Exponential blow-up can be expected for input automata with a deterministic transition density of around 2. He concludes (page 66): randomly generated automata exhibit the maximum execution time, and the maximum number of states, at an approximate deterministic density of 2. Most of the area under the curve occurs within 0.5 and 2.5 deterministic density--this is the area in which subset construction is expensive.</Paragraph> <Paragraph position="3"> Conjecture. For a given NFA, we can compute the expected numbers of states and transitions in the corresponding DFA, produced by subset construction, from the deterministic density of the NFA. 
In addition, this functional relationship gives rise to a Poisson-like curve, with its peak at a deterministic density of approximately 2.</Paragraph> <Paragraph position="4"> A number of automata were generated randomly, according to the number of states, symbols, and transitions. For the first experiment, automata were generated consisting of 15 symbols, 25 states, and various densities (and no ε-moves). The results are summarized in Figure 5. CPU-time was measured on an HP 9000/785 machine running HP-UX 10.20. Note that our timings include neither the start-up of the Prolog engine nor the time required for garbage collection.</Paragraph> <Paragraph position="5"> In order to establish that the differences we obtain later are genuinely due to differences in the underlying algorithm, and not due to &quot;accidental&quot; implementation details, we have compared our implementation with the determinizer of AT&T's FSM utilities (Mohri, Pereira, and Riley 1998). For automata without ε-moves, we find that FSM normally is faster: for automata with very small transition densities, FSM is up to four times as fast; for automata with larger densities, the results are similar. A new concept called absolute jump density is introduced to specify the number of ε-moves. It is defined as the number of ε-moves divided by the square of the number of states (i.e., the probability that an ε-move exists for a given pair of states). Furthermore, deterministic jump density is the number of ε-moves divided by the number of states (i.e., the average number of ε-moves that leave a given state).</Paragraph> 
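For concreteness, the four density measures used in these experiments can be computed directly from the counts (a small illustrative sketch; the function name is invented, not from the paper):

```python
def densities(n_states, n_symbols, n_transitions, n_jumps):
    """Density measures for a random automaton (cf. Leslie 1995).

    Transitions are non-ε transitions; jumps are ε-moves.
    """
    return {
        # probability that a possible transition exists
        "absolute_transition_density": n_transitions / (n_states**2 * n_symbols),
        # transitions per (state, symbol) pair
        "deterministic_transition_density": n_transitions / (n_states * n_symbols),
        # probability that an ε-move exists for a given pair of states
        "absolute_jump_density": n_jumps / n_states**2,
        # average number of ε-moves leaving a state
        "deterministic_jump_density": n_jumps / n_states,
    }

d = densities(n_states=100, n_symbols=15, n_transitions=3000, n_jumps=200)
assert d["deterministic_transition_density"] == 2.0  # the expensive region
assert d["deterministic_jump_density"] == 2.0
```

A deterministic transition density near 2 thus marks the region where blow-up is expected, and the deterministic jump density is the knob varied in the experiments below.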
<Paragraph position="6"> In order to measure the differences between the three implementations, a number of automata have been generated consisting of 15 states and 15 symbols, using various transition densities between 0.01 and 0.3 (for larger densities, the automata tend to collapse to an automaton for Σ*). (Figure captions: CPU-time per implementation; fsm represents the CPU-time required by AT&T's FSM library; states represents the sum of the number of states of the input and output automata.) For each of these transition densities, deterministic jump densities were chosen in the range 0 to 2.5 (again, for larger values, the automata tend to collapse). In Figures 6 to 9, the outcomes of these experiments are summarized by listing the average amount of CPU-time required per deterministic jump density (for each of the algorithms), using automata with 15, 20, 25, and 100 states, respectively. Thus, every dot represents the average for determinizing a number of different input automata with various absolute transition densities and the same deterministic jump density.</Paragraph> <Paragraph position="7"> The striking aspect of these experiments is that the integrated per subset and per state variants are much more efficient for larger deterministic jump densities. The per graph^t variant is typically the fastest of the nonintegrated versions. However, in these experiments all states in the input are co-accessible by construction; moreover, all states in the input are final states. Therefore, the advantages of the per graph^{t,c} algorithm could not be observed here.</Paragraph> <Paragraph position="8"> The turning point is a deterministic jump density of around 0.8: for smaller densities per graph^t is typically slightly faster; for larger densities the per state algorithm is much faster. For densities beyond 1.5, the per subset algorithm tends to perform better than the per state algorithm. 
Interestingly, this generalization is supported by the experiments on automata generated by approximation techniques (although the results for randomly generated automata are more consistent than the results for &quot;real&quot; examples).</Paragraph> <Paragraph position="9"> Figure caption: Average amount of CPU-time versus jump density for each of the algorithms, and FSM. Input automata have 20 states. Absolute transition densities: 0.01-0.3.</Paragraph> <Paragraph position="10"> Figure caption: Average amount of CPU-time versus deterministic jump density for each of the algorithms, and FSM. Input automata have 100 states. Absolute transition densities: 0.001-0.0035. Comparison with the FSM Library. We also provide the results for AT&T's FSM library. FSM is designed to treat weighted automata for very general weight sets. The initial implementation of the library consisted of an on-the-fly computation of the epsilon closures combined with determinization. This was abandoned for two reasons: it could not be generalized to the case of general weight sets, and it did not output the intermediate epsilon-free machine (which might be of interest in itself). In the current version, ε-moves must be removed before determinization is possible. This mechanism thus is comparable to our per graph variant. Apparently, FSM employs an algorithm equivalent to our per graph^{s,a}: the resulting determinized machines are generally larger than the machines produced by our integrated variants and by the variants that incorporate ε-moves on the target side of transitions. The timings below are obtained for the pipe fsmrmepsilon | fsmdeterminize. This is somewhat unfair, since it includes the time to write and read the intermediate machine. 
Even so, it is interesting to note that the FSM library is a constant factor faster than our per graph^{s,a}; for larger numbers of jumps the per state and per subset variants consistently beat the FSM library.</Paragraph> <Paragraph position="11"> Experiment: Automata Generated by Approximation Algorithms. The automata used in the previous experiments were randomly generated. However, it may well be that in practice the automata that are to be treated by the algorithm have typical properties not reflected in this test data. For this reason, results are presented for a number of automata that were generated using approximation techniques for context-free grammars; in particular, for automata created by Nederhof, using the technique described in Nederhof (1997), and a small number of automata created using the technique of Pereira and Wright (1997) (as implemented by Nederhof). We have restricted our attention to automata with at least 1,000 states in the input.</Paragraph> <Paragraph position="12"> The automata typically contain many jumps. Moreover, the number of states of the resulting automaton is often smaller than the number of states in the input automaton. Results are given in Tables 1 and 2. One of the most striking examples is the ygrim automaton, consisting of 3,382 states and 9,124 jumps. For this example, the per graph implementations ran out of memory (after a long time), whereas the implementation of the per subset algorithm produced the determinized automaton (containing only 9 states) within a single CPU-second. The FSM implementation took much longer for this example (whereas for many of the other examples it is faster than our implementations). Note that this example has the highest ratio of number of jumps to number of states. This confirms the observation that the per subset algorithm performs better on inputs with a high deterministic jump density.</Paragraph> </Section> </Paper>