<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1307">
  <Title>Learning Finite-State Models for Language Understanding*</Title>
  <Section position="3" start_page="69" end_page="71" type="metho">
    <SectionTitle>
2 Subsequential Transduction
</SectionTitle>
    <Paragraph position="0"> The following definitions follow closely those given in Berstel \[4\], with some small variations for the sake of brevity. A Finite State Transducer (FST) is a six-tuple τ = (Q, X, Y, q0, QF, E), where Q is a finite set of states, X and Y are the input and output alphabets, q0 ∈ Q is an initial state, QF ⊆ Q is a set of final states, and E ⊆ Q × X* × Y* × Q is the set of edges or transitions. The output associated by τ with an input string x is obtained by concatenating the output strings of the edges of τ that are used to parse the successive symbols of x.</Paragraph>
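To make the definition concrete, here is a minimal Python sketch of an FST and of how an output is obtained by concatenating edge outputs along an accepting path. The class name, the edge encoding, and the restriction to non-empty input labels are illustrative choices, not part of the paper.

```python
class FST:
    """Finite State Transducer tau = (Q, X, Y, q0, QF, E); states are implicit
    in the edge list. Edges are (source, input_string, output_string, target)."""

    def __init__(self, initial, finals, edges):
        self.initial = initial
        self.finals = set(finals)
        self.edges = list(edges)

    def transduce(self, x):
        """Return the set of outputs obtained by concatenating the output
        strings of the edges used to parse the successive symbols of x."""
        results = set()
        stack = [(self.initial, 0, "")]   # (state, input position, output so far)
        while stack:
            q, i, out = stack.pop()
            if i == len(x) and q in self.finals:
                results.add(out)
            for (p, u, v, r) in self.edges:
                # for simplicity, only edges with non-empty input labels are followed
                if p == q and u and x.startswith(u, i):
                    stack.append((r, i + len(u), out + v))
        return results

# e.g. a two-edge FST mapping "ab" to "xy"
fst = FST(initial=0, finals={2}, edges=[(0, "a", "x", 1), (1, "b", "y", 2)])
print(fst.transduce("ab"))   # {'xy'}
```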
    <Paragraph position="1"> One problem with using Finite State Transducers in our framework is that learning general Finite State Transducers is at least as hard as learning a general Finite State Automaton, which is well known to be probably intractable. So we need a less general type of transducer. A Sequential Transducer (ST) is a five-tuple τ = (Q, X, Y, q0, E), where E ⊆ Q × X × Y* × Q, all the states are accepting (QF = Q), and the edges are deterministic; i.e., (q,a,u,r), (q,a,v,s) ∈ E ⇒ (u = v ∧ r = s). An important restriction of STs is that they preserve increasing-length input-output prefixes; i.e., if t is a sequential transduction, then t(λ) = λ and t(uv) ∈ t(u)Y*, where λ is the empty or nil string.</Paragraph>
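As an illustration of the determinism and prefix-preservation properties, the following is a small sketch of a sequential transducer in which transitions are stored as a dictionary keyed by (state, input symbol); this representation is an assumption made for clarity.

```python
class SequentialTransducer:
    """All states accepting, deterministic transitions:
    delta maps (state, input_symbol) -> (output_string, next_state)."""

    def __init__(self, initial, delta):
        self.initial = initial
        self.delta = delta

    def transduce(self, x):
        q, out = self.initial, ""
        for a in x:
            if (q, a) not in self.delta:
                return None          # x is outside the domain of the transducer
            v, q = self.delta[(q, a)]
            out += v                 # output grows with each consumed input symbol
        return out

# Because output is emitted symbol by symbol, t(uv) always extends t(u),
# which is exactly the prefix-preservation property t(uv) in t(u)Y*.
```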
    <Paragraph position="2"> While the use of sequential translation models has proved useful for LU in a number of rather simple tasks \[21, 19, 20, 26\], the limitations of this approach clearly show up as the conceptual complexity of the task increases. The main concern is that the required sequentiality assumption often prevents the use of "semantic languages" that are expressive enough to correctly cover the underlying semantic space and/or to actually introduce the required semantic constraints. As we will see below, input-output sequentiality requirements can be significantly relaxed through the use of Subsequential Transduction. This would allow us to use more powerful semantic languages that need only be subsequential with the input.</Paragraph>
    <Paragraph position="3"> A Subsequential Transducer (SST) is defined to be a six-tuple τ = (Q, X, Y, q0, E, σ), where τ' = (Q, X, Y, q0, E) is a Sequential Transducer and σ : Q → Y* is a partial state output function \[4\]. An output string of τ is obtained by concatenating σ(q) to the usual sequential output string τ'(x), where q is the last state reached with the input x. Examples of SSTs are shown in Fig. 1.</Paragraph>
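A subsequential transducer only differs from the sequential sketch above in the extra state-output function σ, applied once the input is exhausted. Below is a hedged, self-contained sketch; the toy instance at the end is consistent with two of the sample pairs used in Fig. 1 but is not the figure's actual transducer.

```python
class SubsequentialTransducer:
    def __init__(self, initial, delta, sigma):
        self.initial = initial
        self.delta = delta       # (state, input_symbol) -> (output_string, next_state)
        self.sigma = sigma       # partial map: state -> output string

    def transduce(self, x):
        q, out = self.initial, ""
        for a in x:
            if (q, a) not in self.delta:
                return None
            v, q = self.delta[(q, a)]
            out += v
        if q not in self.sigma:
            return None                  # sigma is partial: undefined means x is rejected
        return out + self.sigma[q]       # append sigma(q) for the last state reached

# toy example: sigma lets "A" map to "b" while "AB" maps to "bb"
sst = SubsequentialTransducer(0, {(0, "A"): ("b", 1), (1, "B"): ("", 2)},
                              {1: "", 2: "b"})
print(sst.transduce("A"), sst.transduce("AB"))   # b bb
```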
    <Paragraph position="4">  Two SSTs are equivalent if they perform the same input-output mapping. Among equivalent SSTs there always exists one that is canonical. This transducer always adopts an "onward" form, in which the output substrings are assigned to the edges in such a way that they are as "close" to the initial state as they can be (see Oncina et al., 1993 \[15\], and Reutenauer, 1990 \[22\]; for a recent re-elaboration of these concepts see Mohri, 1997 \[13\]). On the other hand, any finite (training) set of input-output pairs of strings can be properly represented as a Tree Subsequential Transducer (TST), which can then be easily converted into a corresponding Onward Tree Subsequential Transducer (OTST). Fig. 1 (left and center) illustrates these concepts (and this construction), which are the basis of the so-called Onward Subsequential Transducer Inference Algorithm (OSTIA) by Oncina \[14, 15\].</Paragraph>
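The following sketch builds a TST from a finite sample and converts it into onward form by pulling longest common prefixes of the outputs towards the root, state by state. States are simply input prefixes; this encoding and the helper names are assumptions made for illustration, not the paper's construction code.

```python
import os

def build_tst(sample):
    """Tree Subsequential Transducer: one state per input prefix, edges emit
    nothing, and each full output is stored in sigma at the state for its input."""
    delta, sigma = {}, {}
    for x, y in sample:
        for i, a in enumerate(x):
            delta[(x[:i], a)] = ("", x[:i + 1])
        sigma[x] = y
    return delta, sigma

def make_onward(delta, sigma, state="", is_root=True):
    """Push output substrings as close to the initial state as possible and
    return the prefix that can still be pushed past `state` (empty at the root)."""
    children = [(a, tgt) for (src, a), (_, tgt) in delta.items() if src == state]
    pieces = []
    for a, tgt in children:
        pushed = make_onward(delta, sigma, tgt, is_root=False)
        out, _ = delta[(state, a)]
        delta[(state, a)] = (out + pushed, tgt)
        pieces.append(out + pushed)
    if state in sigma:
        pieces.append(sigma[state])
    if is_root:
        return ""                          # nothing can be pushed past the initial state
    lcp = os.path.commonprefix(pieces) if pieces else ""
    for a, tgt in children:                # strip the common prefix locally ...
        out, _ = delta[(state, a)]
        delta[(state, a)] = (out[len(lcp):], tgt)
    if state in sigma:
        sigma[state] = sigma[state][len(lcp):]
    return lcp                             # ... and hand it to the incoming edge

# the sample of Fig. 1
T = [("A", "b"), ("B", "ab"), ("AA", "ba"), ("AB", "bb"), ("BB", "aab")]
delta, sigma = build_tst(T)
make_onward(delta, sigma)                  # delta, sigma now encode OTST(T)
```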
    <Paragraph position="5"> Given an input-output training sample T, the OSTI Algorithm works by merging states in the OTST(T) as follows \[15\]: all pairs of states of OTST(T) are considered in order, level by level, starting at the root, and, for each of these pairs, the states are tentatively merged. If this results in a non-deterministic state, then an attempt is made to restore determinism by recursively pushing back some output substrings towards the leaves of the transducer (i.e., partially undoing the onward construction), while performing the necessary additional state merge operations. If the resulting transducer is subsequential, then all the mergings are accepted; otherwise, the next pair of states is considered in the previous transducer. A transducer produced by this procedure from the OTST of Fig. 1 (center) is shown in Fig. 1 (right). Note that this resulting transducer is consistent with all the training pairs in T and makes a suitable generalization thereof.</Paragraph>
    <Paragraph position="6"> All these operations can be very efficiently implemented, yielding an extremely fast algorithm that can easily handle huge sets of training data. It has formally been shown that OSTIA always converges to any target subsequential transduction for a sufficiently large number of training pairs of this transduction \[15\].</Paragraph>
    <Paragraph position="7"> Figure 1. Learning a Subsequential Transducer from the input-output sample T = {(A,b), (B,ab), (AA,ba), (AB,bb), (BB,aab)}. Left: Tree Subsequential Transducer TST(T); Center: Onward Tree Subsequential Transducer OTST(T); Right: transducer yielded by OSTIA. Each state contains the output string that the function σ associates with this state.</Paragraph>
    <Paragraph position="8"> The learning strategy followed by OSTIA tries to generalize the training pairs as much as possible. This often leads to very compact transducers that accurately translate correct input text. However, this compactness often entails excessive over-generalization of the input and output languages, allowing nearly meaningless input sentences to be accepted and translated into even more meaningless output! While this is not actually a problem for perfectly correct text input, it leads to dramatic failures when dealing with not exactly correct text or (even "correct") speech input.</Paragraph>
    <Paragraph position="9"> A possible way to overcome this problem is to limit generalization by imposing adequate Language Model (LM) constraints: the learned SSTs should not accept input sentences or produce output sentences which are not consistent with given LMs of the input and output languages. These LMs are also known as Domain and Range models \[17\]. Learning with Domain and/or Range constraints can be carried out with a version of OSTIA called OSTIA-DR \[16, 17\]. This version was used in the work presented in this paper.</Paragraph>
    <Paragraph position="10"> Subsequential Transducers and the OSTI (or OSTI-DR) Algorithm have been very successfully applied to learning several quite contrived (artificial) translation tasks \[15\]. They have also recently been applied to Language Translation \[25, 9, 1\] and to Language Understanding, as will be discussed below. Among the many possibilities for (finite-state) modeling of the input and output languages, here we have adopted the well-known bigrams \[8\], which can be easily learned from the same (input and output) training sentences used for OSTIA-DR.</Paragraph>
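For concreteness, this is a minimal sketch of how bigram counts could be collected from the same training sentences; viewed as a finite-state acceptor (one state per word, an arc wherever a bigram was observed), such a model provides the kind of domain/range constraint OSTIA-DR uses. Sentence-boundary markers and the absence of smoothing are simplifying assumptions.

```python
from collections import defaultdict

def train_bigram(sentences):
    """sentences: iterable of word lists. Returns P(next_word | previous_word)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = ["<s>"] + list(sent) + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {w: c / total for w, c in nexts.items()}
    return probs

def bigram_acceptor(probs):
    """Finite-state view of the bigram model: an edge (prev, word, word) exists
    only for observed bigrams; this is the language-model constraint."""
    return [(prev, cur, cur) for prev, nexts in probs.items() for cur in nexts]
```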
  </Section>
  <Section position="4" start_page="71" end_page="72" type="metho">
    <SectionTitle>
3 Reducing the demand for training data
</SectionTitle>
    <Paragraph position="0"> The amount of training data required by OSTIA(-DR) learning is directly related to the size of the vocabularies and to the amount of input-output asynchrony of the translation task considered.</Paragraph>
    <Paragraph position="1"> This is due to the need to "delay" the output until enough input has been seen. In the worst case, the number of states required by an SST to achieve this delaying mechanism can grow as much as O(n^k), where n is the number of (functionally equivalent) words and k is the length of the delay.</Paragraph>
    <Paragraph position="2"> Techniques to reduce the impact of k were studied in \[29\]. The proposed methods rely on reordering the words of the (training) output sentences on the basis of partial alignments obtained by statistical translation methods \[5\]. Obviously, adequate mechanisms are provided to recover the correct word order for the translation of new test input sentences \[29\].</Paragraph>
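The reordering idea itself can be sketched very simply: each output word is moved so that the output follows the order of the input words it is aligned with, and the permutation is remembered so the original order can be restored after translating a test sentence. The alignment format (one input position per output word) is an assumption for illustration; the methods in \[29, 5\] compute such alignments statistically.

```python
def reorder_output(output_words, alignment):
    """alignment[j] = position of the input word that output word j is aligned with.
    Returns the reordered output and the permutation needed to undo it."""
    order = sorted(range(len(output_words)), key=lambda j: alignment[j])
    return [output_words[j] for j in order], order

# e.g. a "crossed" alignment becomes monotone, reducing the delay OSTIA must encode:
# reorder_output(["y1", "y2", "y3"], [2, 0, 1]) -> (["y2", "y3", "y1"], [1, 2, 0])
```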
    <Section position="1" start_page="71" end_page="72" type="sub_section">
      <SectionTitle>
3.1 Using word/phrase Categorization
</SectionTitle>
      <Paragraph position="0"> On the other hand, techniques to cut down the impact of vocabulary size were studied in \[28\].</Paragraph>
      <Paragraph position="1"> The basic idea was to substitute words or groups of words by labels representing their syntactic (or semantic) category within a limited range of options. Learning was thus carried out with the categorized sentences, which involved a (much) smaller effective vocabulary. The steps followed for introducing categories into the learning and transducing processes began with category identification and categorization of the corpus. Once the categorized corpus was available, it was used for training a model: the base transducer. Also, for each category, a simple transducer was built: its category transducer. Finally, category expansion was needed to obtain the final sentence-transducer: the arcs in the base transducer corresponding to the different categories were expanded using their category transducers.</Paragraph>
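A hedged sketch of the expansion step follows: every arc of the base transducer whose input symbol is a category label is replaced by a fresh copy of that category's transducer, connected to the arc's endpoints by empty-input arcs. The transition-list encoding and label conventions are illustrative assumptions, not the paper's data structures.

```python
import itertools

def expand_categories(base_edges, category_transducers):
    """base_edges: list of (src, input_symbol, output_string, dst).
    category_transducers: {category_label: (initial, finals, edges)}."""
    fresh = itertools.count()
    expanded = []
    for (src, inp, out, dst) in base_edges:
        if inp not in category_transducers:
            expanded.append((src, inp, out, dst))       # ordinary arc: keep as is
            continue
        init, finals, cat_edges = category_transducers[inp]
        copy = next(fresh)                              # fresh states for this expansion
        for (p, a, v, r) in cat_edges:
            expanded.append(((inp, copy, p), a, v, (inp, copy, r)))
        # empty-input arcs splice the copy in; the base arc's output is kept on entry
        expanded.append((src, "", out, (inp, copy, init)))
        for f in finals:
            expanded.append(((inp, copy, f), "", "", dst))
    return expanded
```

Note that, in this sketch, the splicing introduces empty-input arcs and possibly several copies of a category, which already suggests why the expanded transducer need not remain deterministic.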
      <Paragraph position="2"> Note that, while all the transducers learned by OSTIA-DR are subsequential and therefore deterministic, this embedding of categories generally results in final transducers that are no longer subsequential and can often be ambiguous. Consequently, translation cannot be performed through deterministic parsing, and Viterbi-like Dynamic Programming is required.</Paragraph>
      <Paragraph position="3"> Obviously, categorization has to be done for input/output paired clusters; therefore, adequate techniques are needed to represent the actual identity of the input and output words in the clusters and to recover this identity when parsing test input sentences. This recovery is done by keeping references between category labels and then resolving them with a post-processing filter.</Paragraph>
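One possible shape of that post-processing filter is sketched below: instances recognized inside each category transducer are recorded during parsing, and the filter replaces every category label in the output with the next recorded instance of that category. The label format and the per-category queues are assumptions made for the example, not the mechanism described in \[1\].

```python
from collections import deque

def resolve_references(output_tokens, captured):
    """captured: {category_label: list of instance translations, in parse order}."""
    queues = {cat: deque(vals) for cat, vals in captured.items()}
    resolved = []
    for tok in output_tokens:
        if tok in queues and queues[tok]:
            resolved.append(queues[tok].popleft())   # substitute the next recorded instance
        else:
            resolved.append(tok)
    return resolved

# e.g. resolve_references(["at", "$HOUR", "+", "$DAYS", "days"],
#                         {"$HOUR": ["17:30"], "$DAYS": ["3"]})
# -> ["at", "17:30", "+", "3", "days"]
```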
      <Paragraph position="4"> This method is explained in detail in \[1\]. Text-input experiments using these techniques were presented in \[28\]. While the direct approach degrades rapidly with increasing vocabulary sizes, categorization keeps the accuracy essentially unchanged.</Paragraph>
    </Section>
    <Section position="2" start_page="72" end_page="72" type="sub_section">
      <SectionTitle>
3.2 Coping with undertraining through Error Correcting
</SectionTitle>
      <Paragraph position="0"> The performance achieved by an SST model (and by many other types of models) tends to be poor if the input sentences do not strictly comply with the syntactic restrictions imposed by the model. This is the case for syntactically incorrect sentences, or for correct sentences whose precise "structure" has not been exactly captured because it was not present in the training data.</Paragraph>
      <Paragraph position="1"> Both of these problems can be approached by means of Error-Correcting Decoding (ECD) \[3, 29\]. Under this approach, the input sentence x is considered as a corrupted version of some sentence x̂ ∈ L, where L is the domain or input language of the SST. The corruption process is modeled by means of an Error Model that accounts for insertion, substitution and deletion "edit errors". In practice, these "errors" should account for likely vocabulary variations, word disappearances, superfluous words, repetitions, and so on. Recognition can then be seen as an ECD process: given x, find a sentence x̂ in L such that the distance from x̂ to x, measured in terms of edit operations (insertions, deletions and substitutions), is minimum (see footnote 2).</Paragraph>
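The following is a self-contained sketch of error-correcting decoding against a finite-state domain model with unit edit costs: it computes, by dynamic programming over input positions and model states, the minimum number of insertions, deletions and substitutions separating the input from the closest sentence in L. Backpointers (omitted here) would recover x̂ itself; the acceptor encoding and unit costs are illustrative assumptions rather than the error model of \[3\].

```python
def ec_decode(x, initial, finals, edges):
    """x: input word list; edges: list of (src_state, word, dst_state) of an
    acceptor for L. Returns the minimum edit distance from x to any sentence in L."""
    states = {initial} | set(finals)
    for (p, _, q) in edges:
        states |= {p, q}
    INF = float("inf")
    # dist[i][q] = cheapest way to account for x[:i] while reaching state q
    dist = [{q: INF for q in states} for _ in range(len(x) + 1)]
    dist[0][initial] = 0

    def relax_deletions(row):
        # deletion errors: a word of the model sentence is missing from the input,
        # so the model advances without consuming input (iterate to a fixed point)
        changed = True
        while changed:
            changed = False
            for (p, w, q) in edges:
                if row[p] + 1 < row[q]:
                    row[q] = row[p] + 1
                    changed = True

    relax_deletions(dist[0])
    for i, word in enumerate(x, start=1):
        for q in states:
            dist[i][q] = dist[i - 1][q] + 1        # insertion error: superfluous input word
        for (p, w, q) in edges:
            cost = 0 if w == word else 1           # exact match or substitution error
            dist[i][q] = min(dist[i][q], dist[i - 1][p] + cost)
        relax_deletions(dist[i])
    return min(dist[len(x)][f] for f in finals)
```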
      <Paragraph position="2"> Given the finite-state nature of SST Models, Error Models can be tightly integrated, and combined error-correcting decoding and translation can be performed very efficiently using fast ECD beam-search, Viterbi-based techniques such as those proposed in \[3\].</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="72" end_page="75" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The task chosen for our experiments was the translation of Spanish sentences specifying times and dates into sentences of a formal semantic language. This is in fact an important subtask that is common to many real-world LU applications of much interest to industry and society. Examples of this kind of application are flight, train or hotel reservations, appointment scheduling, etc. \[7, 11, 12\]. Therefore, having an adequate solution to this subtask can significantly simplify the building of successful systems for these applications (other work on this subtask can be found in \[6\]).</Paragraph>
    <Paragraph position="1"> The chosen formal language is the one used by the UNIX command "at". This simple language allows both absolute and relative descriptions of time. From these descriptions, the "at" interpreter can be used directly to obtain date/time interpretations in the desired format. The correct syntax of "at" commands is described in the standard UNIX documentation (see, e.g., \[30\]). Fig. 2 shows some training pairs that have been selected from the training material.</Paragraph>
    <Paragraph position="2"> Starting from the given context-free-style syntax description of the "at" command \[30\] and from knowledge-based patterns of typical ways of expressing dates and times in natural, spontaneous Spanish, a large corpus of pairs of "natural-language"/at-language sentences has been artificially constructed. This is intended to be the first step in a bootstrapping development. On-going work on this task is aimed at (semi-automatically) obtaining additional corpora produced by native speakers. The corpus generation procedure incorporated certain "category labels", such as hour, month, day of week, etc. We have used a similar process for defining and generating subcorpora in which every input and its corresponding semantic coding belong to one of these categories. We finally obtained an uncategorized version of the categorized corpus by randomly instantiating the category marks in the samples. The examples found in Figure 2 come from this uncategorized corpus, while Figure 3 shows the corresponding categorized pairs.</Paragraph>
    <Paragraph position="3"> 2 Note that while only simple deterministic ECD is considered in this paper, ECD can easily be formulated in a more powerful, stochastic manner \[2\].</Paragraph>
    <Paragraph position="4">  We have generated a training corpus of 48353 different, uncategorized translation pairs, and a disjoint test set with 1331 translation pairs. We presented OSTIA-DR with 8 training subsets of sizes increasing from 1817 up to 48353 pairs. We also presented OSTIA-DR with the same, but categorized, training subsets. In this case, the number of different pairs went from 1384 up to 12381. Figure 4 shows the size of the categorized corpora vs. the uncategorized corpora. The input language vocabulary has 108 words, and the output language has 125 semantic symbols.</Paragraph>
    <Paragraph position="5"> We have used 11 different category labels.</Paragraph>
    <Paragraph position="6"> In the categorized experiments, a sentence-transducer was inferred from the categorized sentences, along with a (small) category-transducer for each one of the categories. The final transducer, which is able to translate non-categorized sentences, was built up by embedding the category-transducers into the sentence-transducer. The output yielded by this final transducer includes category labels and their corresponding instances, as found during the translation process. The definitive translations of the test-set inputs are obtained by means of a simple filter that resolves the dependencies. The sizes of the inferred transducers are shown in Figure 5.</Paragraph>
    <Paragraph position="7"> Performance has been measured in terms of both semantic-symbol error and full-sentence matching rates. The translation of the test-set inputs has been computed using both the standard Viterbi algorithm and the Error-Correcting techniques outlined in Sections 3.1 and 3.2. The results are shown in Figure 6.</Paragraph>
    <Paragraph position="8"> A big difference in performance between the uncategorized and categorized training procedures can be observed. Semantic-symbol error rates are much lower in the categorized experiments than in the uncategorized ones. We can also appreciate a remarkable decrease in the semantic-symbol error rates of Error-Correcting translation with respect to Viterbi translation, especially for the smaller training corpora. The full-sentence matching rate also exhibited a strong improvement with categorization: while uncategorized training only achieves a 30%-40% matching rate, categorized training yields up to 98%.</Paragraph>
    <Paragraph position="9"> [Figure 5 caption (fragment): "base" stands for the transducer containing category labels, "cats" for the final sentence-transducer obtained by embedding the (small) category-transducers into the "base" one, and "plain" for the uncategorized sentence-transducer. Figure 6 caption (fragment): "cats" stands for the categorized experiments and "non-cats" for the non-categorized ones; transductions labelled "EC" have been computed using Error-Correcting techniques, and those labelled "Vit" using the standard Viterbi algorithm.]</Paragraph>
  </Section>
</Paper>