<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0407">
  <Title>Using Categories in the EUTRANS System</Title>
  <Section position="3" start_page="0" end_page="44" type="intro">
    <SectionTitle>
2 Basic Concepts and Notation
</SectionTitle>
    <Paragraph position="0"> Given an alphabet $X$, $X^*$ is the free monoid of strings over $X$. The symbol $\lambda$ represents the empty string, the first letters of the alphabet ($a, b, c, \ldots$) represent individual symbols of the alphabets, and the last letters ($z, y, x, \ldots$) represent strings of the free monoids. We refer to the individual elements of a string by means of subindices, as in $x = a_1 \ldots a_n$. Given two strings $x, y \in X^*$, $xy$ denotes the concatenation of $x$ and $y$.</Paragraph>
    <Section position="1" start_page="0" end_page="44" type="sub_section">
      <SectionTitle>
2.1 Subsequential Transducers
</SectionTitle>
      <Paragraph position="0"> A Subsequential Transducer (SST) (Berstel, 1979) is a deterministic finite state network that accepts sentences from a given input language and produces associated sentences of an output language. A SST is composed of states and arcs. Each arc connects two states and is associated with an input symbol and an output substring (which may be empty). The translation of an input sentence is obtained by starting from the initial state, following the path through the network that corresponds to the symbols of the sentence, and concatenating the output substrings found along that path.</Paragraph>
      <Paragraph position="1"> Formally, a SST is a tuple $\tau = (X, Y, Q, q_0, E, \sigma)$, where $X$ and $Y$ are the input and output alphabets, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $E \subseteq Q \times X \times Y^* \times Q$ is a set of arcs satisfying the determinism condition, and $\sigma : Q \rightarrow Y^*$ is a state emission function (footnote 2). Those states for which $\sigma$ is defined are usually called final states. The determinism condition means that, if $(p, a, y, q)$ and $(p, a, y', q')$ belong to $E$, then $y = y'$ and $q = q'$. Given a string $x = a_1 \ldots a_n \in X^*$, a sequence $(q_0, a_1, y_1, q_1), \ldots, (q_{n-1}, a_n, y_n, q_n)$ is a valid path if $(q_{i-1}, a_i, y_i, q_i)$ belongs to $E$ for every $i$ in $1, \ldots, n$, and $q_n$ is a final state. If such a valid path exists for $x$, the translation of $x$ by $\tau$ is $y_1 \ldots y_n \sigma(q_n)$. Otherwise, the translation is undefined. Note that, due to the determinism condition, there can be no more than one valid path, and hence at most one translation, for a given input string. Therefore, $\tau$ defines a function between an input language, $L_I \subseteq X^*$, and an output language, $L_O \subseteq Y^*$. Both $L_I$ and $L_O$ are regular languages and their corresponding automata are easily obtainable from the SST. In particular, an automaton for $L_I$ can be obtained by eliminating the outputs of the arcs and states, and taking the set of final states of the automaton to be the same as in the SST. A state is useless if it is not contained in any valid path. Useless states can be eliminated from a SST without changing the function it defines.</Paragraph>
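The definition above can be made concrete with a small sketch. The Python code below is only an illustration under stated assumptions: the names build_arcs and translate, the state labels and the toy transducer are all hypothetical and do not come from the EUTRANS system, and input symbols are single characters for brevity (in a translation task they would be words).

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical encoding: (state, input symbol) -> (output substring, next state).
Arcs = Dict[Tuple[str, str], Tuple[str, str]]


def build_arcs(arc_list: Iterable[Tuple[str, str, str, str]]) -> Arcs:
    """Collect arcs (p, a, y, q), rejecting violations of the determinism condition."""
    arcs: Arcs = {}
    for p, a, y, q in arc_list:
        if (p, a) in arcs and arcs[(p, a)] != (y, q):
            raise ValueError(f"two different arcs leave {p!r} on symbol {a!r}")
        arcs[(p, a)] = (y, q)
    return arcs


def translate(arcs: Arcs, emission: Dict[str, str], q0: str, x: str) -> Optional[str]:
    """Return the translation of x, or None when no valid path exists."""
    state, outputs = q0, []
    for symbol in x:
        step = arcs.get((state, symbol))
        if step is None:  # no arc for this symbol: translation undefined
            return None
        y, state = step
        outputs.append(y)
    if state not in emission:  # a valid path must end in a final state
        return None
    return "".join(outputs) + emission[state]


# Toy transducer invented for illustration; q2 and q3 are its final states.
ARCS = build_arcs([("q0", "a", "X", "q1"), ("q1", "b", "Y", "q2"), ("q0", "b", "", "q3")])
EMISSION = {"q2": "!", "q3": "Z"}
```

With this toy transducer, translate(ARCS, EMISSION, "q0", "ab") concatenates the arc outputs "X" and "Y" with the emission of q2, while the input "a" has no translation because q1 is not a final state.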
      <Paragraph position="2"> In section 3, we will relax the model. Instead of imposing the determinism condition, we will only enforce the existence of at most one valid path in the transducer for each input string (non-ambiguity). We will call such transducers Unambiguous SSTs (USSTs). Standard algorithms for finding the path corresponding to a string in an unambiguous finite state automaton (see for instance (Hopcroft and Ullman, 1979)) can be used for finding the translation in a USST. When the problem is the search for the best path in the expanded model during speech translation (see section 4), the use of the Viterbi algorithm (Forney, 1973) guarantees that the most likely path will be found.</Paragraph>
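To illustrate what dropping the determinism condition implies for translation, the following Python sketch keeps every partial path alive while reading the input and checks at the end that at most one of them is valid. It is a simple breadth-first simulation written for this note, not the algorithm used in the paper; the function name, the arc encoding and the toy transducer are assumptions. Best-path search with the Viterbi algorithm is not covered.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: (state, symbol) -> list of (output substring, next state).
NArcs = Dict[Tuple[str, str], List[Tuple[str, str]]]


def translate_unambiguous(
    arcs: NArcs, emission: Dict[str, str], q0: str, x: str
) -> Optional[str]:
    """Translate x with a transducer that may be nondeterministic but unambiguous."""
    # Each hypothesis is a pair (current state, output produced so far).
    hypotheses = [(q0, "")]
    for symbol in x:
        hypotheses = [
            (nxt, out + y)
            for state, out in hypotheses
            for y, nxt in arcs.get((state, symbol), [])
        ]
        if not hypotheses:  # every partial path died: translation undefined
            return None
    complete = [out + emission[state] for state, out in hypotheses if state in emission]
    if len(complete) > 1:
        raise ValueError("more than one valid path: the transducer is ambiguous")
    return complete[0] if complete else None


# Toy USST invented for illustration: the output for "a" depends on the next symbol,
# so two arcs labelled "a" leave q0, yet every input has at most one valid path.
N_ARCS = {
    ("q0", "a"): [("X", "q1"), ("Y", "q2")],
    ("q1", "b"): [("1", "q3")],
    ("q2", "c"): [("2", "q4")],
}
N_EMISSION = {"q3": "", "q4": ""}
```

If the transducer has no useless states, non-ambiguity implies that at most one hypothesis can reach any given state, so the list of hypotheses never grows beyond the number of states.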
    </Section>
    <Section position="2" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
2.2 Inference of Subsequential Transducers
</SectionTitle>
      <Paragraph position="0"> The use of SSTs to model limited domain translation tasks has the distinctive advantage of allowing translation models to be learned automatically and efficiently from sets of examples. An inference algorithm known as OSTIA (Onward Subsequential Transducer Inference Algorithm) yields a SST that correctly models the translation of a given task, provided that the training set is representative (in a formal sense) of the task (Oncina et al., 1993). Nevertheless, although the SSTs learned by OSTIA are usually good translation models, they are often poor input language models. In practice, they translate correct input sentences very accurately, but they also accept and translate incorrect sentences, producing meaningless results. This has undesirable effects in the case of noisy input, such as that obtained by OCR or speech recognition.</Paragraph>
      <Paragraph position="1"> Footnote 2: In this paper, the term function refers to partial functions. We will use $f(x) = \emptyset$ to denote that the function $f$ is undefined for $x$.</Paragraph>
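OSTIA itself is considerably more involved than can be shown here and is not reproduced. As a rough illustration of learning a transducer from example pairs, the Python sketch below builds a tree-shaped SST that simply memorizes the training set: states are input prefixes, every arc outputs the empty string, and the whole translation is emitted at the final state. State-merging inference in the style of (Oncina et al., 1993) starts from a tree of this general kind, after moving outputs toward the root, and generalizes it by merging states; those steps are omitted. The function name and data layout are assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple


def build_tree_transducer(pairs: Iterable[Tuple[str, str]]):
    """Build a tree-shaped SST that memorizes (input, output) training pairs.

    The initial state is the empty prefix "". Returns (arcs, emission), where
    arcs maps (state, symbol) to (output substring, next state).
    """
    arcs: Dict[Tuple[str, str], Tuple[str, str]] = {}
    emission: Dict[str, str] = {}
    for x, y in pairs:
        for i, symbol in enumerate(x):
            # The state reached after reading a prefix is named by that prefix.
            arcs[(x[:i], symbol)] = ("", x[: i + 1])
        if emission.get(x, y) != y:
            raise ValueError(f"conflicting translations for {x!r}")
        emission[x] = y
    return arcs, emission


# Toy training set invented for illustration.
TREE_ARCS, TREE_EMISSION = build_tree_transducer([("ab", "XY"), ("ac", "XZ")])
```

Such a tree translates exactly the training inputs and nothing else, which is why a generalization step is needed before the model is useful on unseen sentences.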
      <Paragraph position="2"> To overcome this problem, the algorithm OSTIA-DR (Oncina and Varó, 1996) uses finite state domain (input language) and range (output language) models, which make it possible to learn SSTs that only accept input sentences and only produce output sentences compatible with those language models. OSTIA-DR can make use of any kind of finite state model. In particular, the models can be n-testable automata, which are equivalent to n-grams (Vidal et al., 1995) and can also be learned automatically from examples.</Paragraph>
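As an illustration of a language model that can be learned from examples, the Python sketch below implements a simplified 2-testable (bigram) acceptor: it records which words may start a sentence, which may end one, and which adjacent word pairs were observed. The class name and the toy sentences are assumptions, one-word sentences are handled only approximately, and the sketch shows membership testing only, not how OSTIA-DR uses domain and range models during inference.

```python
from typing import Iterable, List, Set, Tuple


class BigramModel:
    """Simplified 2-testable acceptor learned from example sentences."""

    def __init__(self, sentences: Iterable[List[str]]):
        self.first: Set[str] = set()    # words allowed at the start
        self.last: Set[str] = set()     # words allowed at the end
        self.bigrams: Set[Tuple[str, str]] = set()  # allowed adjacent pairs
        for words in sentences:
            if not words:
                continue
            self.first.add(words[0])
            self.last.add(words[-1])
            self.bigrams.update(zip(words, words[1:]))

    def accepts(self, words: List[str]) -> bool:
        """Accept a sentence whose start, end and adjacent pairs were all observed."""
        if not words:  # simplification: the empty sentence is always rejected
            return False
        return (
            words[0] in self.first
            and words[-1] in self.last
            and all(pair in self.bigrams for pair in zip(words, words[1:]))
        )


# Toy training sentences invented for illustration.
MODEL = BigramModel([["a", "red", "box"], ["a", "blue", "box"], ["the", "red", "ball"]])
```

The model generalizes beyond its training data (it accepts "the red box", which was never seen) while still rejecting word orders such as "box red a".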
    </Section>
  </Section>
</Paper>