<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1404">
  <Title>Approximating Context-Free by Rational Transduction for Example-Based MT</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Reordering as postprocessing
</SectionTitle>
    <Paragraph position="0"> In the following section we will discuss an algorithm that was devised for context-free grammars.</Paragraph>
    <Paragraph position="1"> To make it applicable to transduction, we propose a way to represent bilexical transduction grammars as ordinary context-free grammars. In the new productions, symbols from the source and target alphabets occur side by side, but whereas source symbols are matched by the parser to the input, the target symbols are gathered into output strings. In our case, the unique output string the parser eventually produces from an input string is obtained from the most likely derivation that matches that input string.</Paragraph>
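The mixed-alphabet encoding can be sketched in a few lines; the SRC/TGT tagging scheme (standing in for the paper's overline marking of target symbols) and all names below are our own illustration, not the paper's notation:

```python
# Sketch of the mixed-alphabet encoding of Section 3: in the new CFG
# productions, source and target symbols occur side by side; the parser
# matches source symbols against the input and gathers target symbols
# into the output string.  The SRC/TGT tags play the role of the
# paper's horizontal-line marking; names are illustrative assumptions.

SRC, TGT = "src", "tgt"

def split_yield(mixed):
    """Split a derivation's mixed yield into the source side (matched
    against the input) and the gathered output string."""
    source = [s for kind, s in mixed if kind == SRC]
    output = [s for kind, s in mixed if kind == TGT]
    return source, output
```

For instance, a yield pairing English "like" with French "plait" splits as `split_yield([(SRC, "like"), (TGT, "plait")])`, giving `(["like"], ["plait"])`.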
    <Paragraph position="2"> That bilexical transduction grammars are less powerful than arbitrary context-free transduction grammars can be shown formally; cf. Section 3.2.3 of (Aho and Ullman, 1972).</Paragraph>
    <Paragraph position="3"> That the nonterminals in both halves of a RHS in the transduction grammar may occur in a different order is solved by introducing three special symbols, the reorder operators, which are interpreted after the parsing phase. These three operators will be written as "⟨", "|" and "⟩". In a given string, there should be matching triples of these operators, in such a way that if there are two such triples, then they either occur in two isolated substrings, or one occurs nested between the "⟨" and the "|" or nested between the "|" and the "⟩" of the other triple. The interpretation of an occurrence of a triple, say in an output string w1 ⟨ w2 | w3 ⟩ w4, is that the two enclosed substrings should be reordered, so that we obtain w1 w3 w2 w4. Both the reorder operators and the symbols of the target alphabet will here be marked by a horizontal line to distinguish them from the source alphabet. For example, the two productions</Paragraph>
    <Paragraph position="5"> [Example productions not recoverable from the extraction.] In the first production, the RHS nonterminals occur in the same order as in the left half of the original production, but reorder operators have been added to indicate that, after parsing, some substrings of the output string are to be reordered.</Paragraph>
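The interpretation of matching triples can be sketched as a small recursive procedure; it assumes well-formed (matching) triples, and the function and token names are our own illustration:

```python
# Sketch of the reorder-operator interpretation from Section 3: each
# matching triple  ⟨ x | y ⟩  in a token string is replaced by  y x,
# recursively for nested triples.  Assumes well-formed triples.

def apply_reorder(tokens):
    def parse(i, stop):
        out = []
        while i < len(tokens) and tokens[i] not in stop:
            if tokens[i] == "⟨":
                left, i = parse(i + 1, {"|"})    # up to the matching "|"
                right, i = parse(i + 1, {"⟩"})   # up to the matching "⟩"
                out += right + left              # swap the enclosed parts
                i += 1                           # step past "⟩"
            else:
                out.append(tokens[i])
                i += 1
        return out, i
    result, _ = parse(0, set())
    return result
```

For instance, `apply_reorder("w1 ⟨ w2 | w3 ⟩ w4".split())` yields `["w1", "w3", "w2", "w4"]`, matching the reordering described above.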
    <Paragraph position="6"> Our reorder operators are similar to the two operators from (Vilar and others, 1999), but the former are more powerful, since the latter allow only single words to be moved instead of whole phrases.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Finite-state approximation
</SectionTitle>
    <Paragraph position="0"> There are several methods to approximate context-free grammars by regular languages (Nederhof, 2000). We will consider here only the so-called RTN method, which is applied in a simplified form. As opposed to (Nederhof, 2000), we assume here that all nonterminals are mutually recursive, and that the grammar contains self-embedding. We have observed that typical grammars that we obtain in the context of this article indeed have the property that almost all nonterminals belong to the same mutually recursive set.</Paragraph>
    <Paragraph position="1"> A finite automaton is constructed as follows.</Paragraph>
    <Paragraph position="2"> For each nonterminal A from the grammar we introduce two states q_A and q'_A. For each production A -> X_1 ... X_m we further introduce intermediate states q_(A,0), ..., q_(A,m), with epsilon transitions from q_A to q_(A,0) and from q_(A,m) to q'_A. The initial state of the automaton is q_S and the only final state is q'_S, where S is the start symbol of the grammar.</Paragraph>
    <Paragraph position="3"> If a symbol X_i in the RHS of a production is a terminal, then we add a transition from q_(A,i-1) to q_(A,i) labelled by X_i. If a symbol X_i in the RHS is a nonterminal B, then we add epsilon transitions from q_(A,i-1) to q_B and from q'_B to q_(A,i).</Paragraph>
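The construction just described can be sketched as follows; the grammar representation (a list of (lhs, rhs) pairs) and the state naming are assumptions of this illustration:

```python
# Sketch of the simplified RTN construction: states q_A and q'_A per
# nonterminal A, plus intermediate states per production.  None labels
# epsilon transitions.  Grammar format and naming are our assumptions.

def rtn_automaton(grammar, start, nonterminals):
    """grammar: list of (lhs, rhs) pairs, rhs a list of symbols.
    Returns (transitions, initial_state, final_state)."""
    trans = []
    q = {A: ("q", A) for A in nonterminals}        # q_A
    qf = {A: ("q'", A) for A in nonterminals}      # q'_A
    for k, (A, rhs) in enumerate(grammar):
        # intermediate states q_(A,0) .. q_(A,m) for production number k
        mid = [("mid", A, k, i) for i in range(len(rhs) + 1)]
        trans.append((q[A], None, mid[0]))
        trans.append((mid[-1], None, qf[A]))
        for i, X in enumerate(rhs):
            if X in nonterminals:
                trans.append((mid[i], None, q[X]))       # descend into X
                trans.append((qf[X], None, mid[i + 1]))  # return from X
            else:
                trans.append((mid[i], X, mid[i + 1]))    # read terminal X
    return trans, q[start], qf[start]
```

On a self-embedding toy grammar such as S -> a S b | epsilon, the construction produces the expected epsilon cycle through q_S, which is why the approximated language is regular while the grammar's is not.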
    <Paragraph position="5"> The resulting automaton is determinized and minimized to allow fast processing of input. Note that if we apply the approximation to the type of context-free grammar discussed in Section 3, the transitions include symbols from both source and target alphabets, but we treat both uniformly as input symbols for the purpose of determinizing and minimizing. This means that the driver for the finite automaton still encounters nondeterminism while processing an input string, since a state may have several outgoing transitions for different output symbols.</Paragraph>
    <Paragraph position="6"> Furthermore, we ignore any weights that might be attached to the context-free productions, since determinization is problematic for weighted automata in general and in particular for the type of automaton that we would obtain when carrying over the weights from the context-free grammar onto the approximating language following (Mohri and Nederhof, 2001).</Paragraph>
    <Paragraph position="7"> Instead, weights for the transitions of the finite automaton are obtained by training, using strings that are produced as a side effect of the computation of the grammar from the corpus.</Paragraph>
    <Paragraph position="8"> These strings contain the symbols from both the source and target strings mixed together, plus occurrences of the reorder operators where needed.</Paragraph>
    <Paragraph position="9"> An English/French example might be: ⟨ I me like plaît | him il ⟩ The way these strings were obtained ensures that they are included in the language generated by the context-free grammar, and they are therefore also accepted by the approximating automaton, due to properties of the RTN approximation. The weights are the negative log of the probabilities obtained by maximum likelihood estimation.</Paragraph>
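The training step reduces, per state, to relative-frequency estimation over observed transitions; a minimal sketch, in which the data and names are hypothetical:

```python
import math
from collections import Counter

# Sketch of the weight estimation: transition weights are the negative
# log of relative frequencies observed during training.  The (state,
# symbol) representation and the example data are hypothetical.

def mle_weights(observed):
    """observed: list of (state, symbol) transition occurrences."""
    counts = Counter(observed)
    totals = Counter(state for state, _ in observed)
    return {(s, a): -math.log(c / totals[s]) for (s, a), c in counts.items()}
```

For example, if state 0 is left twice via "il" and once via "me", the weight of the transition (0, "il") is -log(2/3); a transition observed every time its state is visited gets weight 0.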
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Robustness
</SectionTitle>
    <Paragraph position="0"> The approximating finite automaton cannot ensure that the reorder operators "⟨", "|" and "⟩" occur in matching triples in output strings. There are two possible ways to deal with this problem.</Paragraph>
    <Paragraph position="1"> First, we could extend the driver of the finite automaton to consider only derivations in which the operators are matched. This however runs counter to our need for very efficient processing, since we are not aware of any practical algorithm of less than cubic complexity for finding matching brackets in paths in a graph.</Paragraph>
    <Paragraph position="2"> Therefore, we have chosen a second approach, viz. to make the postprocessing robust, by inserting missing occurrences of "⟨" or "⟩" and removing redundant occurrences of brackets. This means that any string containing symbols from the target alphabet and occurrences of the reorder operators is turned into a string without reorder operators, with a change of word order where necessary. Both the transduction grammar and, to a lesser extent, the approximating finite automaton suffer from not being able to handle all strings of symbols from the source alphabet. With finite-state processing, however, it is rather easy to obtain robustness, by making the following three provisions: 1. To the nondeterministic finite automaton we add one epsilon transition from the initial state to q_A, for each nonterminal A. This means that from the initial state we may recognize an arbitrary phrase generated by some nonterminal from the grammar.</Paragraph>
    <Paragraph position="3"> 2. After the training phase of the weighted (minimal deterministic) automaton, all transitions that have not been visited obtain a fixed high (but finite) weight. This means that such transitions are only applied if all others fail.</Paragraph>
    <Paragraph position="4"> 3. The driver of the automaton is changed so that it restarts at the initial state when it gets stuck at some input word, and when necessary, that input word is deleted. The output string with the lowest weight obtained so far (preferably attached to final states, or to other states with outgoing transitions labelled by input symbols) is then concatenated with the output string resulting from processing subsequent input.</Paragraph>
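A minimal sketch of one possible repair strategy for unmatched operators (an assumption: stray operators are simply dropped rather than completed by insertion, a simplification of the procedure described above; names are illustrative):

```python
# Sketch of a robustness repair for the reorder operators: keep only
# operators that form complete matching  ⟨ ... | ... ⟩  triples, and
# drop any stray operator.  This drops rather than inserts brackets,
# which simplifies the repair described in the text.

def drop_unmatched(tokens):
    keep = set()          # positions of operators in complete triples
    stack = []            # open triples: [position of "⟨", position of "|"]
    for i, t in enumerate(tokens):
        if t == "⟨":
            stack.append([i, None])
        elif t == "|" and stack and stack[-1][1] is None:
            stack[-1][1] = i
        elif t == "⟩" and stack and stack[-1][1] is not None:
            a, b = stack.pop()
            keep |= {a, b, i}
    return [t for i, t in enumerate(tokens)
            if t not in {"⟨", "|", "⟩"} or i in keep]
```

After this pass the string contains only matching triples, so the reorder interpretation of Section 3 can be applied without further checks.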
  </Section>
</Paper>