<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0407">
  <Title>Using Categories in the EUTRANS System</Title>
  <Section position="4" start_page="44" end_page="46" type="metho">
    <SectionTitle>
3 Introducing Word Categories in the Learning and Translation
</SectionTitle>
    <Paragraph position="0"> An approach for using categories together with SSTs was presented in (Vilar et al., 1995), where it proved useful in reducing the number of examples required for learning. However, that approach was not easily integrable into a speech recognition system and did not provide for the case in which the categories include units larger than a word.</Paragraph>
    <Paragraph position="1"> For the EUTRANS project, the approach was changed so that a single USST would comprise all the information for the translation, including elementary transducers for the categories. These steps were followed: * CATEGORY IDENTIFICATION. Seven categories were used in EUTRANS: masculine names, feminine names, surnames, dates, hours, room numbers, and general numbers.</Paragraph>
    <Paragraph position="2"> The selection of these categories was done in keeping with the example-based nature of the project. In particular, the categories chosen do not need very specific rules for recognising them, the translation rules they follow are quite simple, and the amount of special linguistic knowledge introduced was very low.</Paragraph>
    <Paragraph position="3"> * CORPUS CATEGORIZATION. Once the categories were defined, simple scripts substituted the words in the categories by the appropriate labels, so that the pair (déme la llave de la habitación ciento veintitrés - give me the key to room one two three) became (déme la llave de la habitación $ROOM - give me the key to room $ROOM), where $ROOM is the category label for room numbers.</Paragraph>
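The substitution step above can be sketched as a small script. The lexicons, label names and accent-free spellings below are illustrative assumptions, not the project's actual resources.

```python
# Hypothetical sketch of corpus categorisation: phrases belonging to a
# category are replaced by the category label in both sides of a pair.
ROOM_ES = {"ciento veintitres": "$ROOM", "trescientos diez": "$ROOM"}
ROOM_EN = {"one two three": "$ROOM", "three one zero": "$ROOM"}

def categorise(sentence, lexicon):
    """Replace every phrase found in the lexicon by its category label."""
    for phrase, label in lexicon.items():
        sentence = sentence.replace(phrase, label)
    return sentence

pair = ("deme la llave de la habitacion ciento veintitres",
        "give me the key to room one two three")
categorised = (categorise(pair[0], ROOM_ES), categorise(pair[1], ROOM_EN))
```

A real implementation would also handle longest-match conflicts and tokenisation, which this sketch ignores.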
    <Paragraph position="4"> * INITIAL MODEL LEARNING. The categorised corpus was used for training a model, the initial SST.</Paragraph>
    <Paragraph position="5"> * CATEGORY MODELLING. For each category, a simple SST was built: its category SST (cSST).</Paragraph>
    <Paragraph position="6"> * CATEGORY EXPANSION. The arcs in the initial SST corresponding to the different categories were expanded using their cSSTs. A general view of the process can be seen in Figure 1. The left part represents the elements involved in the learning of the expanded USST, exemplified with a single training pair. The right part of the diagram gives a schematic representation of the use of this transducer. The category expansion step is a bit more complex than just substituting each category-labeled arc by the corresponding cSST. The main problems are: (1) how to insert the output of the cSST within the output of the initial transducer; (2) how to deal with more than one final state in the cSST; (3) how to deal with cycles in the cSST involving its initial state.</Paragraph>
    <Paragraph position="7"> The problem with the output has certain subtleties, since the translation of a category label can appear before or after the label has been seen in the input. For example, consider the transducer in Figure 2(a) and a Spanish sentence categorised as me voy a $HOUR, which corresponds to the categorised English one I am leaving at $HOUR. Once me voy a is seen, the continuation can only be $HOUR, so the initial SST, before seeing this category label in the input, has already produced the whole output (including $HOUR). Taking this into account, we decided to keep the output of the initial SST and to include there the information necessary for removing the category labels. To do this, the label for the category was considered as a variable that acts as a placeholder in the output sentence and whose contents are fixed by an assignment appearing elsewhere within that sentence. In our example, the expected output for me voy a las tres y media could be I am leaving at $HOUR $HOUR = [half past three]. This assumes that each category appears at most once within each sentence.</Paragraph>
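The placeholder convention just described can be sketched as a small post-processing step on the output sentence. The function name and regular expression are illustrative assumptions.

```python
import re

def resolve_placeholders(output):
    """Substitute each category placeholder by the value fixed by its
    assignment '$CAT = [ ... ]' appearing elsewhere in the sentence,
    then drop the assignment itself.  Assumes each category appears
    at most once per sentence, as stated in the text."""
    assignments = dict(re.findall(r"(\$\w+) = \[([^\]]*)\]", output))
    # remove the assignment fragments from the sentence
    output = re.sub(r"\s*\$\w+ = \[[^\]]*\]", "", output)
    for label, value in assignments.items():
        output = output.replace(label, value)
    return output.strip()

s = "I am leaving at $HOUR $HOUR = [half past three]"
resolve_placeholders(s)  # "I am leaving at half past three"
```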
    <Paragraph position="8"> The expanded model is obtained by an iterative procedure which starts with the initial SST. Each time the procedure finds an arc whose input symbol is a category label, it expands this arc with the appropriate cSST, producing a new model.</Paragraph>
    <Paragraph position="9"> This expansion can introduce non-determinism, so these new models are now USSTs. When every arc of this kind has been expanded, we have the expanded USST. The expansion of each arc follows these steps: * Eliminate the arc.</Paragraph>
    <Paragraph position="10"> * Create a copy of the cSST corresponding to the category label.</Paragraph>
    <Paragraph position="11"> * Add new arcs linking the new cSST with the USST. These arcs have to ensure that the output produced in the cSST is enclosed between c=[ and ], c being the category label. * Eliminate useless states.</Paragraph>
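The four steps above can be sketched on a toy transducer representation. The tuple encoding of arcs, the state-renaming scheme and the exact bracketing of the output are illustrative assumptions, and useless-state pruning is omitted.

```python
# Toy sketch of the arc-expansion step.  Arcs are (source, input, output,
# target) tuples; states of the cSST copy are renamed by pairing them with
# the category label so they stay disjoint from the USST's states.
def expand_arc(arcs, arc, csst_arcs, csst_init, csst_finals, label):
    """Replace one category-labelled arc by a fresh copy of its cSST."""
    src, _, out, dst = arc
    remaining = [a for a in arcs if a != arc]        # 1. eliminate the arc
    rename = lambda q: (label, q)                    # 2. fresh copy of the cSST states
    copy = [(rename(p), i, o, rename(q)) for (p, i, o, q) in csst_arcs]
    # 3. link the copy into the USST; the cSST output is enclosed between
    #    'label=[' and ']' so it can later be recovered as an assignment
    entry = [(src, i, out + " " + label + "=[ " + o, rename(q))
             for (p, i, o, q) in csst_arcs if p == csst_init]
    exits = [(rename(f), "", "]", dst) for f in csst_finals]
    return remaining + copy + entry + exits          # 4. (useless states not pruned)
```

The empty-input exit arcs are what introduce the non-determinism mentioned in the text.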
    <Paragraph position="12">  Formally, we have a USST τ = (X, Y, Q, q0, E, σ), a cSST τc = (X, Y, Qc, q0c, Ec, σc), where we assume that σc(q0c) = λ, and an arc whose input symbol is the category label c.</Paragraph>
    <Paragraph position="14"> The new elements are: * The set Q'c is disjoint from Q and there exists a bijection φ : Qc → Q'c.</Paragraph>
    <Paragraph position="15"> * The new set of arcs is obtained from E by removing the expanded arc, adding the image under φ of every arc in Ec, and adding the linking arcs described above.</Paragraph>
    <Paragraph position="17"> Note that this solves the problems deriving from the cSST having multiple final states or cycles involving the initial state. The price to pay is the introduction of non-determinism in the model.</Paragraph>
    <Paragraph position="18"> * The new state emission function is σ'(s) = σ(s) if s ∈ Q, and σ'(s) = σc(φ⁻¹(s)) if s ∈ Q'c. Finally, the useless states that may appear during this construction are removed.</Paragraph>
    <Paragraph position="19"> A simple example of the effects of this procedure can be seen in Figure 2. Drawing (a) depicts the initial SST, (b) is a cSST for the hours between one and three (in o'clock and half past forms), and the expanded USST is in (c).</Paragraph>
  </Section>
  <Section position="5" start_page="46" end_page="47" type="metho">
    <SectionTitle>
4 Overview of the Speech
Translation System
</SectionTitle>
    <Paragraph position="0"> A possible scheme for speech translation consists of translating the output of a conventional Continuous Speech Recognition (CSR) front-end. This implies that some restrictions present in the translation and the output language, which could enhance the acoustic search, are not taken into account. In this sense, it is preferable to integrate the translation model within a conventional CSR system to carry out a simultaneous search for the recognised sentence and its corresponding translation. This integration can be done by using a SST as language and translation model, since its learning process has incorporated the restrictions introduced by the translation and the output language. Experimental results show that better performance is achieved (Jiménez et al., 1994; Jiménez et al., 1995).</Paragraph>
    <Paragraph position="1"> Thus, our system can be seen as the result of integrating a series of finite state models at different levels:  * ACOUSTIC LEVEL. Individual phones are represented by means of Hidden Markov Models (HMMs).</Paragraph>
    <Paragraph position="2"> * LEXICAL LEVEL. Individual words are represented by means of finite state automata with  arcs labeled by phones.</Paragraph>
  </Section>
  <Section position="6" start_page="47" end_page="47" type="metho">
    <SectionTitle>
* SYNTACTIC AND TRANSLATION LEVEL. The
</SectionTitle>
    <Paragraph position="0"> syntactic constraints and translation rules are represented by an USST.</Paragraph>
    <Paragraph position="1"> In our case, the integration means the substitution of the arcs of the USST by the automata describing the input language words, followed by the substitution of the arcs in these expanded automata by the corresponding HMMs. In this way, a conventional Viterbi search (Forney, 1973) for the most likely path in the resulting network, given the input acoustic observations, can be performed, and both the recognised sentence and its translation are found by following the optimal path.</Paragraph>
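The Viterbi search over the integrated network can be sketched in a minimal form. The states, transition and emission log-scores below are toy assumptions; a real system scores acoustic observations against HMM emission densities rather than a lookup table.

```python
import math

def viterbi(states, init, trans, emit, observations):
    """Return the best-scoring state path for an observation sequence.
    init, trans and emit hold log-scores; missing entries count as -inf."""
    score = {s: init.get(s, -math.inf) + emit[s].get(observations[0], -math.inf)
             for s in states}
    backpointers = []
    for obs in observations[1:]:
        new, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans[p].get(s, -math.inf))
            new[s] = score[prev] + trans[prev].get(s, -math.inf) + emit[s].get(obs, -math.inf)
            bp[s] = prev
        score = new
        backpointers.append(bp)
    # trace back from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

In the integrated system the recovered path carries both the recognised input words and their translation, since each arc of the expanded network contributes an output string.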
  </Section>
  <Section position="7" start_page="47" end_page="49" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="47" end_page="48" type="sub_section">
      <SectionTitle>
5.1 The Traveler Task
</SectionTitle>
      <Paragraph position="0"> The Traveler Task (Amengual et al., 1996b) was defined within the EUTRANS project (Amengual et al., 1996a). It is more realistic than the one in (Castellanos et al., 1994), but, unlike other corpora such as the Hansards (Brown et al., 1990), it is not unrestricted.</Paragraph>
      <Paragraph position="1"> The general framework established for the Traveler Task aims at covering the usual sentences that a traveler visiting a foreign country whose language he/she does not speak may need in typical scenarios. This framework includes a great variety of different translation scenarios, and is thus appropriate for progressive experimentation with increasing levels of complexity. In a first phase, the scenario has been limited to some human-to-human communication situations in the reception of a hotel:  * Asking for rooms, wake-up calls, keys, the bill, a taxi and moving the luggage.</Paragraph>
      <Paragraph position="2"> * Asking about rooms (availability, features, price).</Paragraph>
      <Paragraph position="3"> * Having a look at rooms, complaining about and changing them.</Paragraph>
      <Paragraph position="4"> * Notifying a previous reservation.</Paragraph>
      <Paragraph position="5"> * Signing the registration form.</Paragraph>
      <Paragraph position="6"> * Asking and complaining about the bill.</Paragraph>
      <Paragraph position="7"> * Notifying the departure.</Paragraph>
      <Paragraph position="8"> * Other common expressions.</Paragraph>
      <Paragraph position="9">  The Traveler Task text corpora are sets of pairs, each pair consisting of a sentence in the input language and its corresponding translation in the output language. They were automatically built by using a set of Stochastic Syntax-Directed Translation Schemata (Gonzalez and Thomason, 1978) with the help of a data generation tool specially developed for the EUTRANS project. This software allows the use of several syntactic extensions to these schema specifications in order to express optional rules, permutation of phrases, concordance (of gender, number and case), etc. [Table 1 examples. Spanish: Por favor, ¿quieren pedirnos un taxi para la habitación trescientos diez? English: Will you ask for a taxi for room number three one oh for us, please? Spanish: Desearía reservar una habitación tranquila con teléfono y televisión hasta pasado mañana. German: Ich möchte ein ruhiges Zimmer mit Telefon und Fernseher bis übermorgen reservieren. Spanish: ¿Me pueden dar las llaves de la habitación, por favor? Italian: Mi potreste dare le chiavi della stanza, per favore?]</Paragraph>
      <Paragraph position="10"> The use of automatic corpora generation was convenient due to the time constraints of the first phase of the EUTRANS project and to cost-effectiveness. Moreover, the complexity of the task can be controlled. The languages considered were Spanish as input and English, German and Italian as output, giving a total of three independent corpora of 500,000 pairs each. Some examples of sentence pairs are shown in Table 1. Some features of the corpora can be seen in Table 2. For each language, the test set perplexity has been computed by training a trigram model (with simple flat smoothing) on a set of 20,000 random sentences and computing the probabilities yielded by this model for a set of 10,000 independent random sentences. The lower perplexity of the output languages derives from a design decision: multiple variants of the input sentences were introduced to account for different ways of expressing the same idea, but they were given the same translation.</Paragraph>
      <Paragraph position="11"> Finally, a multispeaker speech corpus for the task was acquired. It consists of 2,000 utterances in Spanish. Details can be found in (Amengual et al., 1997a).</Paragraph>
    </Section>
    <Section position="2" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
5.2 Text Input Experiments
</SectionTitle>
      <Paragraph position="0"> Our approach was tested with the three text corpora. Each one was divided into training and test sets, with 490,000 and 10,000 pairs, respectively.</Paragraph>
      <Paragraph position="1"> A sequence of models was trained with increasing subsets of the training set. Each model was tested using only those sentences in the test set that were not seen in training. This was done because a model trained with OSTIA-DR is guaranteed to reproduce exactly those sentences it has seen during learning. The performance was evaluated in terms of Word Error Rate (WER), which is the percentage of output words that have to be inserted, deleted or substituted for them to exactly match the corresponding expected translations.</Paragraph>
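The WER definition above corresponds to a word-level edit distance, which can be sketched directly:

```python
# Minimal WER computation: the minimum number of insertions, deletions and
# substitutions turning the hypothesis into the reference, as a percentage
# of the reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i-1][j-1] + (ref[i-1] != hyp[j-1])
            dp[i][j] = min(sub, dp[i-1][j] + 1, dp[i][j-1] + 1)
    return 100.0 * dp[-1][-1] / len(ref)

wer("give me the key to room one two three",
    "give me a key to room one three")  # one substitution, one deletion
```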
      <Paragraph position="2"> The results for the three corpora can be seen in Table 3. The columns labeled "Different" and "Categ." refer to the number of different sentences in the training set and the number of different sentences after categorization. Graphical representations of the same results are in Figures 3, 4 and 5. As expected, the use of lexical categories had a major impact on the learning algorithm.</Paragraph>
      <Paragraph position="3"> The differences in WER attributable to the use of lexical categories can be as high as about 40% in the early stages of the learning process and decrease as the number of examples grows. The large increase in performance is a natural consequence of the fact that the categories help in reducing the total variability that can be found in the corpora (although sentences do exhibit a great deal of variability, the underlying syntactic structure is actually much less diverse). They also have the advantage of allowing an easier extension of the vocabulary of the task without a negative effect on the performance of the models so obtained (Vilar et al., 1995).</Paragraph>
    </Section>
    <Section position="3" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
5.3 Speech Input Experiments
</SectionTitle>
      <Paragraph position="0"> A set of Spanish to English speaker-independent translation experiments was performed by integrating in our speech input system (as described in section 4) the following models:</Paragraph>
      <Paragraph position="1"> * ACOUSTIC LEVEL. The phones were represented by context-independent continuous-density HMMs. Each HMM consisted of six states following a left-to-right topology with loops and skips. The emission distribution of each state was modeled by a mixture of Gaussians. Actually, there were only three emission distributions per HMM, since the states were tied in pairs (the first with the second, the third with the fourth, and the fifth with the sixth). Details about the corpus used in training these models and its parametrization can be found in (Amengual et al., 1997a).</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="49" end_page="49" type="metho">
    <SectionTitle>
* LEXICAL LEVEL. Spanish phonetics allows
</SectionTitle>
    <Paragraph position="0"> the representation of each word as a sequence of phones that can be derived from standard rules. This sequence can be represented by a simple chain. There were a total of 31 phones, including stressed and unstressed vowels plus two types of silence.</Paragraph>
  </Section>
  <Section position="9" start_page="49" end_page="49" type="metho">
    <SectionTitle>
* SYNTACTIC AND TRANSLATION LEVEL. We
</SectionTitle>
    <Paragraph position="0"> used the best of the transducers obtained in the Spanish to English text experiments. It was enriched with probabilities estimated by parsing the same training data with the final model and using relative frequencies of use as probability estimates.</Paragraph>
    <Paragraph position="1"> The Viterbi search for the most likely path was sped up by using beam search at two levels: independent beam widths were used in the states of the SST (empirically fixed to 300) and in the states of the HMMs. Other details of the experiments can be found in (Amengual et al., 1997a). Table 4 shows that good translation results (a WER of 6.4%) can be achieved with a Real Time Factor (RTF) of just 2.2. It is worth noting that these results were obtained on a HP-9735 workstation without resorting to any type of specialised hardware or signal processing device. When translation accuracy is the main concern, a more detailed acoustic model and a wider beam in the search can be used to achieve a WER of 1.9%, but with a RTF of 11.3.</Paragraph>
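The two-level beam pruning can be sketched as a single pruning step applied, with independent widths, to the SST-level and HMM-level hypothesis sets. The state names and scores below are made-up illustrations.

```python
# Beam pruning: at each frame, keep only hypotheses whose log score lies
# within a fixed beam width of the current best hypothesis.
def prune(hypotheses, beam_width):
    """hypotheses maps state -> log score; keep states near the best one."""
    best = max(hypotheses.values())
    return {s: v for s, v in hypotheses.items() if v >= best - beam_width}

active = {"q1": -10.0, "q2": -310.5, "q3": -12.3}
prune(active, 300.0)  # q2 falls outside the beam and is discarded
```

Widening the beam trades search time (a higher RTF) for a lower risk of pruning the best path, which matches the accuracy/RTF trade-off reported above.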
  </Section>
</Paper>