<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1029">
  <Title>Compilation of Weighted Finite-State Transducers from Decision Trees</Title>
  <Section position="3" start_page="215" end_page="215" type="metho">
    <SectionTitle>
2 Quick Review of Tree-Based
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="215" end_page="215" type="sub_section">
      <SectionTitle>
Modeling
</SectionTitle>
      <Paragraph position="0"> A general introduction to classification and regression trees ('CART') including the algorithm for growing trees from data can be found in (Breiman et al., 1984). Applications of tree-based modeling to problems in speech and NLP are discussed in (Riley, 1989; Riley, 1991; Wang and Hirschberg, 1992; Magerman, 1995, inter alia). In this section we presume that one has already trained a tree or set of trees, and we merely remind the reader of the salient points in the interpretation of those trees.</Paragraph>
      <Paragraph position="1"> Consider the tree depicted in Figure 1, which was trained on the TIMIT database (Fisher et al., 1987), and which models the phonetic realization of the English phoneme /aa/ (/ɑ/) in various environments (Riley, 1991). When this tree is used to predict the allophonic form of a particular instance of /aa/, one starts at the root of the tree and asks questions about the environment in which the /aa/ is found. Each non-leaf node n dominates two daughter nodes conventionally labeled 2n and 2n + 1; the decision on whether to go left to 2n or right to 2n + 1 depends on the answer to the question that is being asked at node n. Footnote 1: The work reported here can thus be seen as complementary to recent reports on methods for directly inferring transducers from data (Oncina et al., 1993; Gildea and Jurafsky, 1995).</Paragraph>
      <Paragraph position="2"> A concrete example will serve to illustrate. Suppose that we have /aa/ in some environment. The first question that is asked concerns the number of segments, including the /aa/ itself, that occur to the left of the /aa/ in the word in which /aa/ occurs. (See Table 1 for an explanation of the symbols used in Figure 1.) In this case, if the /aa/ is initial, i.e., lseg is 1, one goes left; if there are one or more segments to the left in the word, one goes right. Let us assume that this /aa/ is initial in the word, in which case we go left. The next question concerns the consonantal 'place' of articulation of the segment to the right of /aa/: if it is alveolar, go left; otherwise, if it is of some other quality, or if the segment to the right of /aa/ is not a consonant, go right. Let us assume that the segment to the right is /z/, which is alveolar, so we go left. This lands us at terminal node 4. The tree in Figure 1 shows us that in the training data 119 out of 308 occurrences of /aa/ in this environment were realized as [ao], or in other words that we can estimate the probability of /aa/ being realized as [ao] in this environment as .385. The full set of realizations at this node with estimated non-zero probabilities is as follows (see Table 2 for a relevant set of ARPABET-IPA correspondences): [table with columns: phone, probability, -log prob. (weight)] An important point to bear in mind is that a decision tree in general is a complete description, in the sense that for any new data point, there will be some leaf node that corresponds to it. So for the tree in Figure 1, each novel instance of /aa/ will be handled by (exactly) one leaf node in the tree, depending upon the environment in which the /aa/ finds itself.</Paragraph>
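      <Paragraph> The walk to leaf node 4 and the conversion of a leaf count into a weight can be sketched as follows. This is only an illustration under stated assumptions: the dictionary QUESTIONS, the feature name cp1 (the place of articulation of the consonant one segment to the right, by analogy with Table 1), and the helper names find_leaf and weight are hypothetical, only the two questions described in the text are modeled, and every other subtree is collapsed into a placeholder leaf. The natural logarithm is assumed for the weights.

```python
import math

# Hypothetical encoding of the top of the /aa/ tree from Figure 1.
# Only the two questions described in the text are modeled; all other
# subtrees are collapsed into placeholder leaves.
QUESTIONS = {
    1: lambda env: env["lseg"] == 1,          # is /aa/ word-initial?
    2: lambda env: env["cp1"] == "alveolar",  # is the next segment an alveolar consonant?
}

def find_leaf(env, questions=QUESTIONS, root=1):
    """Walk from the root: 'yes' goes left to 2n, 'no' goes right to 2n + 1."""
    n = root
    while n in questions:
        n = 2 * n if questions[n](env) else 2 * n + 1
    return n

def weight(count, total):
    """Negative natural-log probability estimated from training counts."""
    return -math.log(count / total)

leaf = find_leaf({"lseg": 1, "cp1": "alveolar"})
print(leaf)                        # 4
print(round(weight(119, 308), 2))  # 0.95
```

With the counts quoted above for [ao] at node 4, the weight comes out at about 0.95; because every environment yields exactly one answer at each question, the walk always ends at exactly one leaf.</Paragraph>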
      <Paragraph position="3"> Another important point is that each decision tree considered here has the property that its predictions specify how to rewrite a symbol (in context) in an input string. In particular, they specify a two-level mapping from a set of input symbols (phonemes) to a set of output symbols (allophones).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="215" end_page="216" type="metho">
    <SectionTitle>
3 Quick Review of Rule Compilation
</SectionTitle>
    <Paragraph position="0"> Work on finite-state phonology (Johnson, 1972; Koskenniemi, 1983; Kaplan and Kay, 1994) has shown that systems of rewrite rules of the familiar form $\phi \rightarrow \psi \,/\, \lambda \_\_ \rho$, where $\phi$, $\psi$, $\lambda$ and $\rho$ are regular expressions, can be represented computationally as finite-state transducers (FSTs): note that $\phi$ represents the rule's input, $\psi$ the output, and $\lambda$ and $\rho$, respectively, the left and right contexts.</Paragraph>
    <Paragraph position="1"> Kaplan and Kay (1994) have presented a concrete algorithm for compiling systems of such rules into FSTs. These methods can be extended slightly to include the compilation of probabilistic or weighted rules into weighted finite-state transducers (WFSTs; see (Pereira et al., 1994)): Mohri and Sproat (1996) describe a rule-compilation algorithm which is more efficient than the Kaplan-Kay algorithm, and which has been extended to handle weighted rules. For present purposes it is sufficient to observe that, given this extended algorithm, we can allow $\psi$ in the expression $\phi \rightarrow \psi \,/\, \lambda \_\_ \rho$ to represent a weighted regular expression. The compiled transducer corresponding to that rule will replace $\phi$ with $\psi$, with the appropriate weights, in the context $\lambda \_\_ \rho$.</Paragraph>
  </Section>
  <Section position="5" start_page="216" end_page="220" type="metho">
    <SectionTitle>
4 The Tree Compilation Algorithm
</SectionTitle>
    <Paragraph position="0"> The key requirements on the kind of decision trees that we can compile into WFSTs are (1) the predictions at the leaf nodes specify how to rewrite a particular symbol in an input string, and (2) the decisions at each node are stateable as regular expressions over the input string. Each leaf node represents a single rule. The regular expressions for each branch describe one aspect of the left context $\lambda$, the right context $\rho$, or both. The left and right contexts for the rule consist of the intersections of the partial descriptions of these contexts defined for each branch traversed between the root and the leaf node. The input $\phi$ is predefined for the entire tree, whereas the output $\psi$ is defined as the union of the set of outputs, along with their weights, that are associated with the leaf node. The weighted rule belonging to the leaf node can then be compiled into a transducer using the weighted-rule-compilation algorithm referenced in the preceding section. The transducer for the entire tree can be derived by the intersection of the entire set of transducers associated with the leaf nodes. Note that while regular relations are not generally closed under intersection, the subset of same-length (or, more strictly speaking, length-preserving) relations is closed; see below.</Paragraph>
    <Paragraph position="1"> To see how this works, let us return to the example in Figure 1. To start with, we know that this tree models the phonetic realization of /aa/, so we can immediately set $\phi$ to be aa for the whole tree. Next, consider again the traversal of the tree from the root node to leaf node 4. The first decision concerns the number of segments to the left of the /aa/ in the word, either none for the left branch, or one or more for the right branch.
Table 1: Explanation of the symbols used in Figure 1.
  cpn, cp-n: place of articulation of the consonant n segments to the right (cpn) or to the left (cp-n); values: alveolar; bilabial; labiodental; dental; palatal; velar; pharyngeal; n/a if the segment is a vowel, or there is no such segment
  vpn, vp-n: place of articulation of the vowel n segments to the right (vpn) or to the left (vp-n); values: central-mid-high; back-low; back-mid-low; back-high; front-low; front-mid-low; front-mid-high; front-high; central-mid-low; back-mid-high; n/a if the segment is a consonant, or there is no such segment
  lseg: number of preceding segments, including the segment of interest, within the word
  rseg: number of following segments, including the segment of interest, within the word; values (lseg and rseg): 1, 2, 3, many
  str: stress assigned to this vowel; values: primary, secondary, no (zero) stress; n/a if there is no stress mark
</Paragraph>
    <Paragraph position="3"> Assuming that we have a symbol $\sigma$ representing a single segment, the symbol # representing a word boundary, and allowing for the possibility of intervening optional stress marks ' which do not count as segments, these two possibilities can be represented by the regular expressions for $\lambda$ in (a) of Table 3.[2] At this node there is no decision based on the right-hand context, so the right-hand context is free. We can represent this by setting $\rho$ at this node to be $\Sigma^*$, where $\Sigma$ (conventionally) represents the entire alphabet: note that the alphabet is defined to be an alphabet of all $\phi$:$\psi$ correspondence pairs that were determined empirically to be possible.</Paragraph>
    <Paragraph position="4"> The decision at the left daughter of the root node concerns whether or not the segment to the right is an alveolar. Assuming we have defined classes of segments alv, blab, and so forth (represented as unions of segments), we can represent the regular expression for $\rho$ as in (b) of Table 3. In this case it is $\lambda$ which is unrestricted, so we can set it to $\Sigma^*$.</Paragraph>
    <Paragraph position="5"> We can derive the $\lambda$ and $\rho$ expressions for the rule at leaf node 4 by intersecting the expressions for these contexts defined for each branch traversed on the way to the leaf. For leaf node 4, $\lambda = \#\,\mathrm{Opt}(') \cap \Sigma^* = \#\,\mathrm{Opt}(')$, and $\rho = \Sigma^* \cap \mathrm{Opt}(')(\mathrm{alv}) = \mathrm{Opt}(')(\mathrm{alv})$.[3] The rule input $\phi$ has already been given as aa. The output $\psi$ is defined as the union of all of the possible expressions, at the leaf node in question, that aa could become, with their associated weights (negative log probabilities), which we represent here as subscripted floating-point numbers; for instance, the [ao] realization, with estimated probability .385, contributes the term $\mathrm{ao}_{0.95}$, since $-\ln 0.385 \approx 0.95$.</Paragraph>
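    <Paragraph position="7"> Thus the entire weighted rule can be written as follows: $\mathrm{aa} \rightarrow \psi \,/\, \#\,\mathrm{Opt}(') \_\_ \mathrm{Opt}(')(\mathrm{alv})$, where $\psi$ is the weighted union of realizations at leaf node 4. Footnote 3: [...] branch may define expressions of different lengths, it is necessary to left-pad each $\lambda$ with $\Sigma^*$, and right-pad each $\rho$ with $\Sigma^*$. We gloss over this point here in order to make the regular expressions somewhat simpler to understand.</Paragraph>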
    <Paragraph position="9"> By a similar construction, the rule at node 6, for example, would be represented as: $\mathrm{aa} \rightarrow (\mathrm{aa}_{0.40} \cup \mathrm{ao}_{1.11}) \,/\, [\ldots] \cap (\Sigma^*((\mathrm{cmh}) \cup (\mathrm{bl}) \cup (\mathrm{bml}) \cup (\mathrm{bh}))) \_\_ \Sigma^*$. Each leaf node thus represents a rule which states that a mapping occurs between the input symbol $\phi$ and the weighted expression $\psi$ in the condition described by $\lambda \_\_ \rho$. Now, in cases where $\phi$ finds itself in a context that is not subsumed by $\lambda \_\_ \rho$, the rule behaves exactly as a two-level surface coercion rule (Koskenniemi, 1983): it freely allows $\phi$ to correspond to any $\psi$ as specified by the alphabet of pairs. These $\phi$:$\psi$ correspondences are, however, constrained by other rules derived from the tree, as we shall see directly.</Paragraph>
    <Paragraph position="10"> The interpretation of the full tree is that it represents the conjunction of all such mappings: for rules $1, 2, \ldots, n$, $\phi$ corresponds to $\psi_1$ given condition $\lambda_1 \_\_ \rho_1$, and $\phi$ corresponds to $\psi_2$ given condition $\lambda_2 \_\_ \rho_2$, ..., and $\phi$ corresponds to $\psi_n$ given condition $\lambda_n \_\_ \rho_n$. But this conjunction is simply the intersection of the entire set of transducers defined for the leaves of the tree. Observe now that the $\phi$:$\psi$ correspondences that were left free by the rule of one leaf node are constrained by intersection with the other leaf nodes: since, as noted above, the tree is a complete description, it follows that for any leaf node $i$, and for any context $\lambda \_\_ \rho$ not subsumed by $\lambda_i \_\_ \rho_i$, there is some leaf node $j$ such that $\lambda_j \_\_ \rho_j$ subsumes $\lambda \_\_ \rho$.</Paragraph>
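    <Paragraph> The way a leaf's context arises as the intersection of the per-branch expressions, and the sense in which the tree is a complete description, can be illustrated with ordinary regular expressions. The sketch below rests on several assumptions that are not part of the paper: segments are single characters, ALV is a toy class of alveolars, ".*" stands in for the padded, unrestricted context, node 3 is treated as a leaf, and all names (ROOT_LEFT, LEAVES, in_context, leaves_for) are hypothetical. It tests membership of a context in each leaf's intersection; it does not compile a transducer.

```python
import re

ALV = "[tdsznl]"  # toy class of alveolar segments (an assumption)

# Each branch contributes (regex for the left context, regex for the right
# context). ".*" stands in for Sigma*, i.e. an unrestricted, padded context.
ROOT_LEFT = (r".*#'?", r".*")                    # /aa/ is word-initial
ROOT_RIGHT = (r".*#'?[^#']('?[^#'])*'?", r".*")  # one or more segments to the left
NODE2_LEFT = (r".*", r"'?" + ALV + r".*")        # next segment is alveolar
NODE2_RIGHT = (r".*", r"(?!'?" + ALV + r").*")   # anything else

# Branches traversed from the root to each leaf; node 3 is treated as a leaf.
LEAVES = {4: [ROOT_LEFT, NODE2_LEFT], 5: [ROOT_LEFT, NODE2_RIGHT], 3: [ROOT_RIGHT]}

def in_context(path, left, right):
    """A context belongs to a leaf iff it matches every branch on the path,
    which is membership in the intersection of the per-branch expressions."""
    return all(
        re.fullmatch(lam, left) is not None and re.fullmatch(rho, right) is not None
        for lam, rho in path
    )

def leaves_for(left, right):
    """Return every leaf whose intersected context covers (left, right)."""
    return [leaf for leaf, path in LEAVES.items() if in_context(path, left, right)]

for left, right in [("#", "zi#"), ("#", "ki#"), ("#k", "zi#")]:
    print(repr(left), repr(right), leaves_for(left, right))
```

For these three contexts the sketch selects leaves 4, 5 and 3 respectively, one leaf each, which is the property the intersection of the leaf transducers relies on.</Paragraph>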
    <Paragraph position="11"> Thus, the transducers compiled for the rules at nodes 4 and 6 are intersected together, along with the transducers for all the other leaf nodes. Now, as noted above, and as discussed by Kaplan and Kay (1994), regular relations (the algebraic counterpart of FSTs) are not in general closed under intersection; however, the subset of same-length regular relations is closed under intersection, since they can be thought of as finite-state acceptors expressed over pairs of symbols.[4]</Paragraph>
    <Paragraph position="13"> Table 3 (fragment): (a) left branch: $\lambda = \#\,\mathrm{Opt}(')$. Caption: [...] 4 in the tree in Figure 1. Note that, as per convention, superscript '+' denotes one or more instances of an expression.</Paragraph>
    <Paragraph position="14"> This point can be extended somewhat to include relations that involve bounded deletions or insertions: this is precisely the interpretation necessary for systems of two-level rules (Koskenniemi, 1983), where a single transducer expressing the entire system may be constructed via intersection of the transducers expressing the individual rules (Kaplan and Kay, 1994, pages 367-376). Indeed, our decision tree represents neither more nor less than a set of weighted two-level rules. Each of the symbols in the expressions for $\lambda$ and $\rho$ actually represents (sets of) pairs of symbols: thus alv, for example, represents all lexical alveolars paired with all of their possible surface realizations. And just as each tree represents a system of weighted two-level rules, so a set of trees (e.g., where each tree deals with the realization of a particular phone) represents a system of weighted two-level rules, where each two-level rule is compiled from one of the individual trees.</Paragraph>
    <Paragraph position="15"> We can summarize this discussion more formally as follows. We presume a function Compile which, given a rule, returns the WFST computing that rule. The WFST for a single leaf $L$ is thus defined as follows, where $\phi_T$ is the input symbol for the entire tree, $\psi_L$ is the output expression defined at $L$, $P_L$ represents the path traversed from the root node to $L$, $p$ is an individual branch on that path, and $\lambda_p$ and $\rho_p$ are the expressions for $\lambda$ and $\rho$ defined at $p$: $$\mathrm{Rule}_L = \mathrm{Compile}\Big(\phi_T \rightarrow \psi_L \,/\, \bigcap_{p \in P_L} \lambda_p \;\_\_\; \bigcap_{p \in P_L} \rho_p\Big)$$ The transducer for an entire tree $T$ is the intersection of the transducers for its leaves, $$\mathrm{Rule}_T = \bigcap_{L \in T} \mathrm{Rule}_L,$$ and the transducer for a forest $F$ of trees is the intersection of the transducers for its trees, $$\mathrm{Rule}_F = \bigcap_{T \in F} \mathrm{Rule}_T.$$ Footnote 4: One can thus define intersection for transducers analogously with intersection for acceptors. Given two machines $G_1$ and $G_2$, with transition functions $\delta_1$ and $\delta_2$, one can define the transition function $\delta$ of $G$ as follows: for an input-output pair $(i, o)$, $\delta((q_1, q_2), (i, o)) = (q_1', q_2')$ if and only if $\delta_1(q_1, (i, o)) = q_1'$ and $\delta_2(q_2, (i, o)) = q_2'$.</Paragraph>
    <Paragraph position="16"> The algorithm just described has been empirically verified on the Resource Management (RM) continuous speech recognition task (Price et al., 1988). Following somewhat the discussion in (Pereira et al., 1994; Pereira and Riley, 1996), we can represent the speech recognition task as the problem of finding the best path in the composition of a grammar (language model) $G$, the transitive closure of a dictionary $D$ mapping between words and their phonemic representation, a model of phone realization $\Phi$, and a weighted lattice representing the acoustic observations $A$.</Paragraph>
    <Paragraph position="17"> Thus: $$\mathrm{BestPath}(G \circ D^* \circ \Phi \circ A) \qquad (1)$$ The transducer $\Phi = \bigcap_{T \in F} \mathrm{Rule}_T$ can be constructed out of the forest $F$ of 40 trees, one for each phoneme, trained on the TIMIT database.</Paragraph>
    <Paragraph position="18"> The sizes of the trees range from 1 to 23 leaf nodes, with a total of 291 leaves for the entire forest.</Paragraph>
    <Paragraph position="19"> The model was tested on 300 sentences from the RM task containing 2560 word tokens, and approximately 10,500 phonemes. A version of the model of recognition given in expression (1), where $\Phi$ is a transducer computed from the trees, was compared with a version where the trees were used directly following a method described in (Ljolje and Riley, 1992). The phonetic realizations and their weights were identical for both methods, thus verifying the correctness of the compilation algorithm described here.</Paragraph>
    <Paragraph position="20"> The sizes of the compiled transducers can be quite large; in fact they were sufficiently large that instead of constructing $\Phi$ beforehand, we intersected the 40 individual transducers with the lattice $D^*$ at runtime. Table 4 gives sizes for the entire set of phone trees: tree sizes are listed in terms of number of rules (terminal nodes) and raw size in bytes; transducer sizes are listed in terms of number of states and arcs. Note that the entire alphabet comprises 215 symbol pairs. Also given in Table 4 are the compilation times for the individual trees on a Silicon Graphics R4400 machine running at 150 MHz with 1024 Mbytes of memory.</Paragraph>
    <Paragraph position="21"> The times are somewhat slow for the larger trees, but still acceptable for off-line compilation.</Paragraph>
    <Paragraph position="22"> While the sizes of the resulting transducers seem at first glance to be unfavorable, it is important to bear in mind that size is not the only consideration in deciding upon a particular representation. WFSTs possess several nice properties that are not shared by trees, or handwritten rule-sets for that matter. In particular, once compiled into a WFST, a tree can be used in the same way as a WFST derived from any other source, such as a lexicon or a language model; a compiled WFST can be used directly in a speech recognition model such as that of (Pereira and Riley, 1996) or in a speech synthesis text-analysis model such as that of (Sproat, 1996). Use of a tree directly requires a special-purpose interpreter, which is much less flexible.</Paragraph>
    <Paragraph position="23"> It should also be borne in mind that the size explosion evident in Table 4 also characterizes transducers that are compiled from hand-built rewrite rules (Kaplan and Kay, 1994; Mohri and Sproat, 1996). For example, the text-analysis ruleset for the Bell Labs German text-to-speech (TTS) system (see (Sproat, 1996; Mohri and Sproat, 1996)) contains sets of rules for the pronunciation of various orthographic symbols. The ruleset for &lt;a&gt;, for example, contains 25 ordered rewrite rules.</Paragraph>
    <Paragraph position="24"> Over an alphabet of 194 symbols, this compiles, using the algorithm of (Mohri and Sproat, 1996), into a transducer containing 213,408 arcs and 1,927 states. This is 72% as many arcs and 48% as many states as the transducer for /ah/ in Table 4. The size explosion is not quite as great here, but the resulting transducer is still large compared to the original rule file, which only requires 1428 bytes of storage. Again, the advantages of representing the rules as a transducer outweigh the problems of size.[5]</Paragraph>
  </Section>
  <Section position="6" start_page="220" end_page="220" type="metho">
    <SectionTitle>
6 Future Applications
</SectionTitle>
    <Paragraph position="0"> We have presented a practical algorithm for converting decision trees inferred from data into weighted finite-state transducers that directly implement the models implicit in the trees, and we have empirically verified that the algorithm is correct. Several interesting areas of application come to mind. In addition to speech recognition, where we hope to apply the phonetic realization models described above to the much larger North American Business task (Paul and Baker, 1992), there are also applications to TTS where, for example, the decision trees for prosodic phrase-boundary prediction discussed in (Wang and Hirschberg, 1992) can be compiled into transducers and used directly in the WFST-based model of text analysis used in the multi-lingual version of the Bell Laboratories TTS system, described in (Sproat, 1995; Sproat, 1996).</Paragraph>
  </Section>
</Paper>