<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-1002">
  <Title>Incremental Construction of Minimal Acyclic Finite-State Automata</Title>
  <Section position="3" start_page="0" end_page="4" type="metho">
    <SectionTitle>
2. Mathematical Preliminaries
</SectionTitle>
    <Paragraph position="0"> We define a deterministic finite-state automaton to be a 5-tuple M = (Q, ~, 6, q0, F), where Q is a finite set of states, q0 E Q is the start state, F C Q is a set of final states, is a finite set of symbols called the alphabet, and 6 is a partial mapping 6: Q x G ~ Q denoting transitions. When 6(q,a) is undefined, we write ~(q,a) = _L. We can extend the 6 mapping to partial mapping 6*: Q x ~* ~ Q as follows (where a E Y,, x E ~*):</Paragraph>
    <Paragraph position="2"> Let DAFSA be the set of all deterministic finite-state automata in which the transition function 6 is acyclic--there is no string w and state q such that 6&amp;quot; (q, w) = q.</Paragraph>
    <Paragraph position="3"> We define PS(M) to be the language accepted by automaton M:</Paragraph>
    <Paragraph position="5"> The size of the automaton, IMI, is equal to the number of states, IQ\[. ~(G*) is the set of all languages over G. Define the function 2: Q ~ 7~(G *) to map a state q to the set of all strings on a path from q to any final state in M. More precisely,</Paragraph>
    <Paragraph position="7"> Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs a state can also be defined recursively: (q)= {aPS (6(q,a))\[ac~A6(q,a)~ _L } U {{~ } ifqEF otherwise One may ask whether such a recursive definition has a unique solution. Most texts on language theory, for example Moll, Arbib, and Kfoury (1988), show that the solution is indeed unique--it is the least fixed-point of the equation.</Paragraph>
    <Paragraph position="8"> We also define a property of an automaton specifying that all states can be reached from the start state:</Paragraph>
    <Paragraph position="10"> The property of being a minimal automaton is traditionally defined as follows (see Watson \[1993b, 1995\]):</Paragraph>
    <Paragraph position="12"> We will, however, use an alternative definition of minimality, which is shown to be equivalent: Minimal(M) = (Vq,q, cQ(q ~ q' ~PS (q) #PS (q'))) A Reachable(M) A general treatment of automata minimization can be found in Watson (1995). A formal proof of the correctness of the following algorithm can be found in Mihov (1998).</Paragraph>
  </Section>
  <Section position="4" start_page="4" end_page="8" type="metho">
    <SectionTitle>
3. Construction from Sorted Data
</SectionTitle>
    <Paragraph position="0"> A trie is a dictionary with a tree-structured transition graph in which the start state is the root and all leaves are final states. 2 An example of a dictionary in a form of a trie is given in Figure 1. We can see that many subtrees in the transition graph are isomorphic. The equivalent minimal dictionary (Figure 2) is the one in which only one copy of each isomorphic subtree is kept. This means that, pointers (edges) to all isomorphic subtrees are replaced by pointers (edges) to their unique representative. null The traditional method of obtaining a minimal dictionary is to first create a (not necessarily minimal) dictionary for the language and then minimize it using any one of a number of algorithms (again, see Watson \[1993b, 1995\] for numerous examples of such algorithms). The first stage is usually done by building a trie, for which there are fast and well-understood algorithms. Dictionary minimization algorithms are quite efficient in terms of the size of their input dictionary--for some algorithms, the memory and time requirements are both linear in the number of states. Unfortunately, even such good performance is not sufficient in practice, where the intermediate dictionary (the trie) can be much larger than the available physical memory. (Some effort towards decreasing the memory requirement has been made; see Revuz \[1991\].) This paper presents a way to reduce these intermediate memory requirements and decrease the 2 There may also be nonleaf, in other words interior, states that are final.</Paragraph>
    <Paragraph position="1"> Computational Linguistics Volume 26, Number 1 Figure 1 A trie whose language is the French regular endings of verbs of the first group.</Paragraph>
    <Paragraph position="2"> Figure 2 The unique minimal dictionary whose language is the French regular endings of verbs of the first group.</Paragraph>
    <Paragraph position="3"> total construction time by constructing the minimal dictionary incrementally (word by word, maintaining an invariant of minimality), thus avoiding ever having the entire trie in memory.</Paragraph>
    <Paragraph position="4"> Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs The central part of most automata minimization algorithms is a classification of states. The states of the input dictionary are partitioned such that the equivalence classes correspond to the states of the equivalent minimal automaton. Assuming the input dictionary has only reachable states (that is, Reachable is true), we can deduce (by our alternative definition of minimality) that each state in the minimal dictionary must have a unique right language. Since this is a necessary and sufficient condition for minimality, we can use equality of right languages as the equivalence relation for our classes. Using our definition of right languages, it is easily shown that equality of right languages is an equivalence relation (it is reflexive, symmetric, and transitive). We will denote two states, p and q, belonging to the same equivalence class by p = q (note that = here is different from its use for logical equivalence of predicates). In the literature, this relation is sometimes written as E.</Paragraph>
    <Paragraph position="5"> To aid in understanding, let us traverse the trie (see Figure 1) with the postorder method and see how the partitioning can be performed. For each state we encounter, we must check whether there is an equivalent state in the part of the dictionary that has already been analyzed. If so, we replace the current state with the equivalent state. If not, we put the state into a register, so that we can find it easily. It follows that the register has the following property: it contains only states that are pairwise inequivalent. We start with the (lexicographically) first leaf, moving backward through the trie toward the start state. All states up to the first forward-branching state (state with more than one outgoing transition) must belong to different classes and we immediately place them in the register, since there will be no need to replace them by other states. Considering the other branches, and starting from their leaves, we need to know whether or not a given state belongs to the same class as a previously registered state. For a given state p (not in the register), we try to find a state q in the register that would have the same right language. To do this, we do not need to compare the languages themselves---comparing sets of strings is computationally expensive. We can use our recursive definition of the right language. State p belongs to the same  they are either both final or both nonfinal; and they have the same number of outgoing transitions; and corresponding outgoing transitions have the same labels; and corresponding outgoing transitions lead to states that have the same right languages.</Paragraph>
    <Paragraph position="6"> Because the postorder method ensures that all states reachable from the states already visited are unique representatives of their classes (i.e., their right languages are unique in the visited part of the automaton), we can rewrite the last condition as: 4'. corresponding transitions lead to the same states.</Paragraph>
    <Paragraph position="7"> If all the conditions are satisfied, the state p is replaced by q. Replacing p simply involves deleting it while redirecting all of its incoming transitions to q. Note that all Computational Linguistics Volume 26, Number 1 leaf states belong to the same equivalence class. If some of the conditions are not satisfied, p must be a representative of a new class and therefore must be put into the register.</Paragraph>
    <Paragraph position="8"> To build the dictionary one word at a time, we need to merge the process of adding new words to the dictionary with the minimization process. There are two crucial questions that must be answered. First, which states (or equivalence classes) are subject to change when new words are added? Second, is there a way to add new words to the dictionary such that we minimize the number of states that may need to be changed during the addition of a word? Looking at Figures 1 and 2, we can reproduce the same postorder traversal of states when the input data is lexicographically sorted. (Note that in order to do this, the alphabet G must be ordered, as is the case with ASCII and Unicode). To process a state, we need to know its right language. According to the method presented above, we must have the whole subtree whose root is that state. The subtree represents endings of subsequent (ordered) words. Further investigation reveals that when we add words in this order, only the states that need to be traversed to accept the previous word added to the dictionary may change when a new word is added. The rest of the dictionary remains unchanged, because a new word either begins with a symbol different from the first symbols of all words already in the automaton; the beginning symbol of the new word is lexicographically placed after those symbols; or it shares some (or even all) initial symbols of the word previously added to the dictionary; the algorithm then creates a forward branch, as the symbol on the label of the transition must be later in the alphabet than symbols on all other transitions leaving that state.</Paragraph>
    <Paragraph position="9"> When the previous word is a prefix of the new word, the only state that is to be modified is the last state belonging to the previous word. The new word may share its ending with other words already in the dictionary, which means that we need to create links to some parts of the dictionary. Those parts, however, are not modified. This discovery leads us to Algorithm 1, shown below.</Paragraph>
    <Paragraph position="11"> Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs func common_prefix(Word) return the longest prefix w of Word such that ~* (q0, w) ~ 3_</Paragraph>
    <Paragraph position="13"> if 3qEQ( q E Register A q = Child) --,</Paragraph>
    <Paragraph position="15"> The main loop of the algorithm reads subsequent words and establishes which part of the word is already in the automaton (the CommonPrefix), and which is not (the CurrentSuffix). An important step is determining what the last state (here called LastState) in the path of the common prefix is. If LastState already has children, it means that not all states in the path of the previously added word are in the path of the common prefix. In that case, by calling the function replace_or_register, we can let the minimization process work on those states in the path of the previously added word that are not in the common prefix path. Then we can add to the LastState a chain of states that would recognize the CurrentSuffix.</Paragraph>
    <Paragraph position="16"> The function common_prefix finds the longest prefix (of the word to be added) that is a prefix of a word already in the automaton. The prefix can be empty (since = q).</Paragraph>
    <Paragraph position="17"> The function add_suffix creates a branch extending out of the dictionary, which represents the suffix of the word being added (the maximal suffix of the word which is not a prefix of any other word already in the dictionary). The last state of this branch is marked as final.</Paragraph>
    <Paragraph position="18"> The function last_child returns a reference to the state reached by the lexicographically last transition that is outgoing from the argument state. Since the input data is lexicographically sorted, last_child returns the outgoing transition (from the state) most recently added (during the addition of the previous word). The function replace_or_register effectively works on the last child of the argument state. It is called with the argument that is the last state in the common prefix path (or the initial state in the last call). We need the argument state to modify its transition in those instances in which the child is to be replaced with another (equivalent) state. Firstly, the function calls itself recursively until it reaches the end of the path of the previously added word.</Paragraph>
    <Paragraph position="19"> Note that when it encounters a state with more than one child, it takes the last one, as it belongs to the previously added word. As the length of words is limited, so is the depth of recursion. Then, returning from each recursive call, it checks whether a state equivalent to the current state can be found in the register. If this is true, then the state is replaced with the equivalent state found in the register. If not, the state is registered as a representative of a new class. Note that the function replace-or_register processes only the states belonging to the path of the previously added word (a part, or possibly all, of those created with the previous call to add_suffix), and that those Computational Linguistics Volume 26, Number 1 states are never reprocessed. Finally, has_children returns true if, and only if, there are outgoing transitions from the state.</Paragraph>
    <Paragraph position="20"> During the construction, the automaton states are either in the register or on the path for the last added word. All the states in the register are states in the resulting minimal automaton. Hence the temporary automaton built during the construction has fewer states than the resulting automaton plus the length of the longest word.</Paragraph>
    <Paragraph position="21"> Memory is needed for the minimized dictionary that is under construction, the call stack, and for the register of states. The memory for the dictionary is proportional to the number of states and the total number of transitions. The memory for the register of states is proportional to the number of states and can be freed once construction is complete. By choosing an appropriate implementation method, one can achieve a memory complexity O(n) for a given alphabet, where n is the number of states of the minimized automaton. This is an important advantage of our algorithm. null For each letter from the input list, the algorithm must either make a step in the function common_prefix or add a state in the procedure add_suyqx. Both operations can be performed in constant time. Each new state that has been added in the procedure add~ufix has to be processed exactly once in the procedure replace_or_register. The number of states that have to be replaced or registered is clearly smaller than the number of letters in the input list. 3 The processing of one state in the procedure consists of one register search and possibly one register insertion. The time complexity of the search is (c)(log n),where n is the number of states in the (minimized) dictionary. The time complexity of adding a state to the register is also O(log n). In practice, however, by using a hash table to represent the register (and equivalence relation), the average time complexity of those operations can be made almost constant. Hence the time complexity of the whole algorithm is 0(I log n), where l is the total number of letters in the input list.</Paragraph>
  </Section>
  <Section position="5" start_page="8" end_page="14" type="metho">
    <SectionTitle>
4. Construction from Unsorted Data
</SectionTitle>
    <Paragraph position="0"> Sometimes it is difficult or even impossible to sort the input data before constructing a dictionary. For example, there may be insufficient time or storage space to sort the data or the data may originate in another program or physical source. An incremental dictionary-building algorithm would still be very useful in those situations, although unsorted data makes it more difficult to merge the trie-building and the minimization processes. We could leave the two processes disjoint, although this would lead to the traditional method of constructing a trie and minimizing it afterwards. A better solution is to minimize everything on-the-fly, possibly changing the equivalence classes of some states each time a word is added. Before actually constructing a new state in the dictionary, we first determine if it would be included in the equivalence class of a preexisting state. Similarly, we may need to change the equivalence classes of previously constructed states since their right languages may have changed. This leads to an incremental construction algorithm. Naturally, we would want to create the states for a new word in an order that would minimize the creation of new equivalence classes.</Paragraph>
    <Paragraph position="1"> As in the algorithm for sorted data, when a new word w is added, we search for the prefix of w already in the dictionary. This time, however, we cannot assume  Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs a \b Figure 3 The result of blindly adding the word bae to a minimized dictionary (appearing on the left) containing abd and bad. The rightmost dictionary inadvertently contains abe as well. The lower dictionary is correct--state 3 had to be cloned.</Paragraph>
    <Paragraph position="2"> that the states traversed by this common prefix will not be changed by the addition of the word. If there are any preexisting states traversed by the common prefix that are already targets of more than one in-transition (known as confluence states), then blindly appending another transition to the last state in this path (as we would in the sorted algorithm) would accidentally add more words than desired (see Figure 3 for an example of this).</Paragraph>
    <Paragraph position="3"> To avoid generation of such spurious words, all states in the common prefix path from the first confluence state must be cloned. Cloning is the process of creating a new state that has outgoing transitions on the same labels and to the same destination states as a given state. If we compare the minimal dictionary (Figure 1) to an equivalent trie (Figure 2), we notice that a confluence state can be seen as a root of several original, isomorphic subtrees merged into one (as described in the previous section). One of the isomorphic subtrees now needs to be modified (leaving it no longer isomorphic), so it must first be separated from the others by cloning of its root. The isomorphic subtrees hanging off these roots are unchanged, so the original root and its clone have the same outgoing transitions (that is, transitions on the same labels and to the same destination states).</Paragraph>
    <Paragraph position="4"> In Algorithm 1, the confluence states were never traversed during the search for the common prefix. The common prefix was not only the longest common prefix of the word to be added and all the words already in the automaton, it was also the longest common prefix of the word to be added and the last (i.e., the previous) word added to the automaton. As it was the function replace_or_register that created confluence states, and that function was never called on states belonging to the path of the last word added to the automaton, those states could never be found in the common prefix path.</Paragraph>
    <Paragraph position="5"> Once the entire common prefix is traversed, the rest of the word must be appended. If there are no confluence states in the common prefix, then the method of adding the rest of the word does not differ from the method used in the algorithm for sorted data. However, we need to withdraw (from the register) the last state in the common prefix path in order not to create cycles. This is in contrast to the situation in the algorithm for sorted data where that state is not yet registered. Also, CurrentSuffix could be matched with a path in the automaton containing states from the common prefix path (including the last state of the prefix).</Paragraph>
    <Paragraph position="6">  Consider an automaton (shown in solid lines on the left-hand figure) accepting abcde and fghde. Suppose we want to add fgh@de. As the common prefix path (shown in thicker lines) contains a confluence state, we clone state 5 to obtain state 9, add the suffix to state 9, and minimize it. When we also consider the dashed lines in the left-hand figure, we see that state 8 became a new confluence state earlier in the common prefix path. The right-hand figure shows what could happen if we did not rescan the common prefix path for confluence states. State 10 is a clone of state 4.</Paragraph>
    <Paragraph position="7"> When there is a confluence state, then we need to clone some states. We start with the last state in the common prefix path, append the rest of the word to that clone and minimize it. Note that in this algorithm, we do not wait for the next word to come, so we can minimize (replace or register the states of) CurrentSuffix state by state as they are created. Adding and minimizing the rest of the word may create new confluence states earlier in the common prefix path, so we need to rescan the common prefix path in order not to create cycles, as illustrated in Figure 4. Then we proceed with cloning and minimizing the states on the path from the state immediately preceding the last state to the current first confluence state.</Paragraph>
    <Paragraph position="8"> Another, less complicated but also less economical, method can be used to avoid the problem of creating cycles in the presence of confluence states. In that solution, we proceed from the state immediately preceding the confluence state towards the end of the common prefix path, cloning the states on the way. But first, the state immediately preceding the first confluence state should be removed from the register. At the end of the common prefix path, we add the suffix. Then, we call replace_or_register with the predecessor of the state immediately preceding the first confluence state. The following should be noted about this solution: memory requirements are higher, as we keep more than one isomorphic state at a time, the function replace_or_register must remain recursive (as in the sorted version), and the argument to replace_or_register must be a string, not a symbol, in order to pass subsequent symbols to children.</Paragraph>
    <Paragraph position="9"> When the process of traversing the common prefix (up to a confluence state) and adding the suffix is complete, further modifications follow. We must recalculate the equivalence class of each state on the path of the new word. If any equivalence class changes, we must also recalculate the equivalence classes of all of the parents of all of the states in the changed class. Interestingly, this process could actually make the new dictionary smaller. For example, if we add the word abe to the dictionary at the bottom of Figure 3 while maintaining minimality, we obtain the dictionary shown in  Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs the right of Figure 3, which is one state smaller. The resulting algorithm is shown in  Algorithm 2.</Paragraph>
    <Paragraph position="10"> Algorithm 2.</Paragraph>
    <Paragraph position="12"> if 3q E Q(q c Register A q = Child)  Computational Linguistics Volume 26, Number 1</Paragraph>
    <Paragraph position="14"> The main loop reads the words, finds the common prefix, and tries to find the first confluence state in the common prefix path. Then the remaining part of the word (CurrentSuf-fi'x) is added.</Paragraph>
    <Paragraph position="15"> If a confluence state is found (i.e., FirstState points to a state in the automaton), all states from the first confluence state to the end of the common prefix path are cloned, and then considered for replacement or registering. Note that the inner loop (with i as the control variable) begins with the penultimate state in the common prefix, because the last state has already been cloned and the function replace~r_register acts on a child of its argument state.</Paragraph>
    <Paragraph position="16"> Addition of a new suffix to the last state in the common prefix changes the right languages of all states that precede that state in the common prefix path. The last part of the main loop deals with that situation. If the change resulted in such modification of the right language of a state that an equivalent state can be found somewhere else in the automaton, then the state is replaced with the equivalent one and the change propagates towards the initial state. If the replacement of a given state cannot take place, then (according to our recursive definition of the right language) there is no need to replace any preceding state.</Paragraph>
    <Paragraph position="17"> Several changes to the functions used in the sorted algorithm are necessary to handle the general case of unsorted data. The replace~r_register procedure needs to be modified slightly. Since new words are added in arbitrary order, one can no longer assume that the last child (lexicographically) of the state (the one that has been added most recently) is the child whose equivalence class may have changed. However, we know the label on the transition leading to the altered child, so we use it to access that state. Also, we do not need to call the function recursively. We assume that add~uffix replaces or registers the states in the CurrentSuffix in the correct order; later we process one path of states in the automaton, starting from those most distant from the initial state, proceeding towards the initial state q0. So in every situation in which we call replace_or_register, all children of the state Child are already unique representatives of their equivalence classes.</Paragraph>
    <Paragraph position="18"> Also, in the sorted algorithm, add_suffix is never passed ~ as an argument, whereas this may occur in the unsorted version of the algorithm. The effect is that the LastState should be marked as final since the common prefix is, in fact, the entire word. In the sorted algorithm, the chain of states created by add_suffix was left for further treatment until new words are added (or until the end of processing). Here, the automaton is completely minimized on-the-fly after adding a new word, and the function add~suffix can call replace_or_register for each state it creates (starting from the end of the suffix). Finally, the new function first_state simply traverses the dictionary using the given word prefix and returns the first confluence state it encounters. If no such state exists, first_state returns 0.</Paragraph>
    <Paragraph position="19"> As in the sorted case, the main loop of the unsorted algorithm executes m times, where m is the number of words accepted by the dictionary. The inner loops are executed at most Iwl times for each word. Putting a state into the register takes O(logn), although it may be constant when using a hash table. The same estimation is valid  Daciuk, Mihov, Watson, and Watson Incremental Construction of FSAs for a removal from the register. In this case, the time complexity of the algorithm remains the same, but the constant changes. Similarly, hashing can be used to provide an efficient method of determining the state equivalence classes. For sorted data, only a single path through the dictionary could possibly be changed each time a new word is added. For unsorted data, however, the changes frequently fan out and percolate all the way back to the start state, so processing each word takes more time.</Paragraph>
    <Section position="1" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
4.1 Extending the Algorithms
</SectionTitle>
      <Paragraph position="0"> These new algorithms can also be used to construct transducers. The alphabet of the (transducing) automaton would be G1 x G2, where G1 and ~2 are the alphabet of the levels. Alternatively, elements of G~ can be associated with the final states of the dictionary and only output once a valid word from G~ is recognized.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>