<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1023">
  <Title>INVITED TALK Head Automata and Bilingual Tiling: Translation with Minimal Representations</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Until the advent of statistical methods in the mainstream of natural language processing, syntactic and semantic representations were becoming progressively more complex. This trend is now reversing itself, in part because statistical methods reduce the burden of detailed modeling required by constraint-based grammars, and in part because statistical models for converting natural language into complex syntactic or semantic representations is not well understood at present. At the same time, lexically centered views of language have continued to increase in popularity. We can see this in lexicalized grammatical theories, head-driven parsing and generation, and statistical disambiguation based on lexical associations.</Paragraph>
    <Paragraph position="1"> These themes -- simple representations, statistical modeling, and lexicalism -- form the basis for the models and algorithms described in the bulk of this paper. The primary purpose is to build effective mechanisms for machine translation, the oldest and still the most commonplace application of nonsuperficial natural language processing. A secondary motivation is to test the extent to which a non-trivial language processing task can be carried out without complex semantic representations.</Paragraph>
    <Paragraph position="2"> In Section 2 we present reversible mono-lingual models consisting of collections of simple automata associated with the heads of phrases. These head automata are applied by an algorithm with admissible incremental pruning based on semantic association costs, providing a practical solution to the problem of combinatoric disambiguation (Church and Patil 1982). The model is intended to combine the lexical sensitivity of N-gram models (Jelinek et al.</Paragraph>
    <Paragraph position="3"> 1992) and the structural properties of statistical context free grammars (Booth 1969) without the computational overhead of statistical lexicalized tree-adjoining grammars (Schabes 1992, Resnik 1992).</Paragraph>
    <Paragraph position="4"> For translation, we use a model for mapping dependency graphs written by the source language head automata. This model is coded entirely as a bilingual lexicon, with associated cost parameters. The transfer algorithm described in Section 4 searches for the lowest cost 'tiling' of the target dependency graph with entries from the bilingual lexicon. Dynamic programming is again used to make exhaustive search tractable, avoiding the combinatoric explosion of shake-and-bake translation (Whitelock 1992, Brew 1992).</Paragraph>
    <Paragraph position="5"> In Section 5 we present a general framework for associating costs with the solutions of search processes, pointing out some benefits of cost functions other than log likelihood, including an error-minimization cost function for unsupervised training of the parameters in our translation application. Section 6 briefly describes an English-Chinese translator employing the models and algorithms. We also present experimental results comparing the performance of different cost assignment methods.</Paragraph>
    <Paragraph position="6"> Finally, we return to the more general discussion of representations for machine translation and other natural language processing tasks, arguing the case for simple representations close to natural language itself.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="168" type="metho">
    <SectionTitle>
2 Head Automata Language Models
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="167" type="sub_section">
      <SectionTitle>
2.1 Lexical and Dependency Parameters
</SectionTitle>
      <Paragraph position="0"> Head automata mono-lingual language models consist of a lexicon, in which each entry is a pair (w, m) of a word w from a vocabulary V and a head automaton m (defined below), and a parameter table giving an assignment of costs to events in a generative process involving the automata.</Paragraph>
      <Paragraph position="1">  We first describe the model in terms of the familiar paradigm of a generative statistical model, presenting the parameters as conditional probabilities. This gives us a stochastic version of dependency grammar (Hudson 1984).</Paragraph>
      <Paragraph position="2"> Each derivation in the generative statistical model produces an ordered dependency tree, that is, a tree in which nodes dominate ordered sequences of left and right subtrees and in which the nodes have labels taken from the vocabulary V and the arcs have labels taken from a set R of relation symbols. When a node with label w immediately dominates a node with label w' via an arc with label r, we say that w' is an r-dependent of the head w. The interpretation of this directed arc is that relation r holds between particular instances of w and w'. (A word may have several or no r-dependents for a particular relation r.) A recursive left-parent-right traversal of the nodes of an ordered dependency tree for a derivation yields the word string for the derivation.</Paragraph>
      <Paragraph position="3"> A head automaton m of a lexical entry (w, m) defines possible ordered local trees immediately dominated by w in derivations. Model parameters for head automata, together with dependency parameters and lexical parameters, give a probability distribution for derivations.</Paragraph>
      <Paragraph position="4"> A dependency parameter P( L w'lw, r') is the probability, given a head w with a dependent arc with label r', that w' is the r'-dependent for this arc.</Paragraph>
      <Paragraph position="5"> A lexical parameter P(m, qlr, t, w) is the probability that a local tree immediately dominated by an r-dependent w is derived by starting in state q of some automaton m in a lexieal entry (w, m). The model also includes lexieal parameters P(w,m, qlt&gt;) for the probability that w is the head word for an entire derivation initiated from state q of automaton m.</Paragraph>
    </Section>
    <Section position="2" start_page="167" end_page="168" type="sub_section">
      <SectionTitle>
2.2 Head Automata
</SectionTitle>
      <Paragraph position="0"> A head automaton is a weighted finite state machine that writes (or accepts) a pair of sequences of relation symbols from R: ((rl... r,)).</Paragraph>
      <Paragraph position="1"> These correspond to the relations between a head word and the sequences of dependent phrases to its left and right (see Figure 1). The machine consists of a finite set q0, * * &amp;quot;, qs of states and an action table specifying the finite cost (non-zero probability) actions the automaton can undergo.</Paragraph>
      <Paragraph position="2"> There are three types of action for an automaton m: left transitions, right transitions, and stop actions. These actions, together with associated probabilistic model parameters, are as follows.</Paragraph>
      <Paragraph position="4"> * Left transition: if in state qi-1, m can write a symbol r onto the right end of the current left sequence and enter state qi with probability P(~, qi, rlqi-1, m).</Paragraph>
      <Paragraph position="5"> * Right transition: if in state qi-1, m can write a symbol r onto the left end of the current right sequence and enter state qi with probability P(--* , qi, rlqi-1, m).</Paragraph>
      <Paragraph position="6"> * Stop: if in state q, m can stop with probability P(t31q , m), at which point the sequences are considered complete.</Paragraph>
      <Paragraph position="7">  For a consistent probabilistic model, the probabilities of all transitions and stop actions from a state q must sum to unity. Any state of a head automaton can be an initial state, the probability of a particular initial state in a derivation being specified by lexical parameters. A derivation of a pair of symbol sequence thus corresponds to the selection of an initial state, a sequence of zero or more transitions (writing the symbols) and a stop action. The probability, given an initial state q, that automaton m will a generate a pair of sequences, i.e.</Paragraph>
      <Paragraph position="8"> P((rl'.. rk), (rk+l&amp;quot;'' rn)Ira, q) is the product of the probabilities of the actions taken to generate the sequences. The case of zero transitions will yield empty sequences, corresponding to a leaf node of the dependency tree. From a linguistic perspective, head automata allow for a compact, graded, notion of lexical subcategorization (Gazdar et al. 1985) and the linear order of a head and its dependent phrases. Lexical parameters can control the saturation of a lexical item (for example a verb that is both transitive and intransitive) by starting the same automaton in different states. Head automata can also be used to code a grammar in which states of an automaton for word w corresponds to X-bar levels (Jaekendoff 1977) for phrases headed by w.</Paragraph>
      <Paragraph position="9"> Head automata are formally more powerful than finite state automata that accept regular languages in the following sense. Each head automaton defines a formal language with alphabet R whose strings are the concatenation of the left and right sequence pairs  written by the automaton. The class of languages defined in this way clearly includes all regular languages, since strings of a regular language can be generated, for example, by a head automaton that only writes a left sequence. Head automata can also accept some non-regular languages requiring coordination of the left and right sequences, for example the language anb ~ (requiring two states), and the language of palindromes over a finite alphabet.</Paragraph>
    </Section>
    <Section position="3" start_page="168" end_page="168" type="sub_section">
      <SectionTitle>
2.3 Derivation Probability
</SectionTitle>
      <Paragraph position="0"> Let the probability of generating an ordered dependency subtree D headed by an r-dependent word w be P(D\]w, r). The recursive process of generating this subtree proceeds as follows:  1. Select an initial state q of an automaton m for w with lexical probability P(m, q\[r, ~, w).</Paragraph>
      <Paragraph position="1"> 2. Run the automaton m0 with initial state q to generate a pair of relation sequences with probability P((rl... rk), (rk+l-&amp;quot;&amp;quot; r,,)lm, q). 3. For each relation ri in these sequences, select a dependent word wi with dependency probability P(l, wi\[w, ri).</Paragraph>
      <Paragraph position="2"> 4. For each dependent wi, recursively generate a  subtree with probability P(D~ Iwi, ri).</Paragraph>
      <Paragraph position="3"> We can now express the probability P(Do) for an entire ordered dependency tree derivation Do headed by a word w0 as</Paragraph>
      <Paragraph position="5"> YIl &lt;i&lt;n P(l, wilwo, ri)P( Di Iwi, ri).</Paragraph>
      <Paragraph position="6"> In the translation application we search for the highest probability derivation (or more generally, the Nhighest probability derivations). For other purposes, the probability of strings may be of more interest.</Paragraph>
      <Paragraph position="7"> The probability of a string according to the model is the sum of the probabilities of derivations of ordered dependency trees yielding the string.</Paragraph>
      <Paragraph position="8"> In practice, the number of parameters in a head automaton language model is dominated by the dependency parameters, that is, O(\]V\]2\]RI) parameters. This puts the size of the model somewhere in between 2-gram and 3-gram model. The similarly motivated link grammar model (Lafferty, Sleator and Temperley 1992) has O(\[VI 3) parameters. Unlike simple N-gram models, head automata models yield an interesting distribution of sentence lengths.</Paragraph>
      <Paragraph position="9"> For example, the average sentence length for Monte-Carlo generation with our probabilistic head automata model for ATIS was 10.6 words (the average was 9.7 words for the corpus it was trained on).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="168" end_page="169" type="metho">
    <SectionTitle>
3 Analysis and Generation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
3.1 Analysis
</SectionTitle>
      <Paragraph position="0"> Head automaton models admit efficient lexically driven analysis (parsing) algorithms in which partial analyses are costed incrementally as they are constructed. Put in terms of the traditional parsing issues in natural language understanding, &amp;quot;semantic&amp;quot; associations coded as dependency parameters are applied at each parsing step allowing semantically suboptimal analyses to be eliminated, so the analysis with the best semantic score can be identified without scoring an exponential number of syntactic parses. Since the model is lexical, linguistic constructions headed by lexical items not present in the input are not involved in the search the way they are with typical top-down or predictive parsing strategies.</Paragraph>
      <Paragraph position="1"> We will sketch an algorithm for finding the lowest cost ordered dependency tree derivation for an input string in polynomial time in the length of the string.</Paragraph>
      <Paragraph position="2"> In our experimental system we use a more general version of the algorithm to allow input in the form of word lattices.</Paragraph>
      <Paragraph position="3"> The algorithm is a bottom-up tabular parser (Younger 1967, Early 1970) in which constituents are constructed &amp;quot;head-outwards&amp;quot; (Kay 1989, Sata and Stock 1989). Since we are analyzing bottom-up with generative model automata, the algorithm 'runs' the automata backwards. Edges in the parsing lattice (or &amp;quot;chart&amp;quot;) are tuples representing partial or complete phrases headed by a word w from position i to position j in the string: (w,t,i,j,m,q,c).</Paragraph>
      <Paragraph position="4"> Here m is the head automaton for w in this derivation; the automaton is in state q; t is the dependency tree constructed so far, and c is the cost of the partial derivation. We will use the notation C(zly ) for the cost of a model event with probability P(zIy); the assignment of costs to events is discussed in Section 5.</Paragraph>
      <Paragraph position="5"> Initialization: For each word w in the input between positions i and j, the lattice is initialized with phrases {w,{},i,j,m,q$,c$) for any lexical entry (w, m) and any final state q! of the automaton m in the entry. A final state is one for which the stop action cost c! = C(DJq!, m) is finite.</Paragraph>
      <Paragraph position="6"> Transitions: Phrases are combined bottom-up to form progressively larger phrases. There are two types of combination corresponding to left and right transitions of the automaton for the word acting as the head in the combination. We will specify left combination; right combination is the mirror image of left combination. If the lattice contains two phrases abutting at position k in the string:  (Wl, tl, i, k, ml, ql, Cl) (W2, t2, k, j, ra2, q2, c2),  and the parameter table contains the following finite costs parameters (a left v-transition of m2, a lexical parameter for wl, and an r-dependency parameter):</Paragraph>
      <Paragraph position="8"> then build a new phrase headed by w2 with a tree t~ formed by adding tl to t~ as an r-dependent of w2: (w2, t~, i, j, m2, q~, cl + c2 + c3 + c4 -4- cs).</Paragraph>
      <Paragraph position="9"> When no more combinations are possible, for each phrase spanning the entire input we add the appropriate start of derivation cost to these phrases and select the one with the lowest total cost.</Paragraph>
      <Paragraph position="10"> Pruning: The dynamic programming condition for pruning suboptimal partial analyses is as follows.</Paragraph>
      <Paragraph position="11"> Whenever there are two phrases p: (w,t,i,j,m,q,c) p' = (w, t', i, j, m, q, c'), and c ~ is greater than c, then we can remove p~ because for any derivation involving p~ that spans the entire string, there will be a lower cost derivation involving p. This pruning condition is effective at curbing a combinatorial explosion arising from, for example, prepositional phrase attachment ambiguities (coded in the alternative trees t and t'). The worst case asymptotic time complexity of the analysis algorithm is O(min(n 2, IY12)n3), where n is the length of an input string and IVI is the size of the vocabulary. This limit can be derived in a similar way to cubic time tabular recognition algorithms for context free grammars (Younger 1967) with the grammar related term being replaced by the term min(n 2, IVI 2) since the words of the input sentence also act as categories in the head automata model.</Paragraph>
      <Paragraph position="12"> In this context &amp;quot;recognition&amp;quot; refers to checking that the input string can be generated from the grammar.</Paragraph>
      <Paragraph position="13"> Note that our algorithm is for analysis (in the sense of finding the best derivation) which, in general, is a higher time complexity problem than recognition.</Paragraph>
    </Section>
    <Section position="2" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
3.2 Generation
</SectionTitle>
      <Paragraph position="0"> By generation here we mean determining the lowest cost linear surface ordering for the dependents of each word in an unordered dependency structure resulting from the transfer mapping described in Section 4. In general, the output of transfer is a dependency graph and the task of the generator involves a search for a backbone dependency tree for the graph, if necessary by adding dependency edges to join up unconnected components of the graph.</Paragraph>
      <Paragraph position="1"> For each graph component, the main steps of the search process, described non-deterministically, are  1. Select a node with word label w having a finite start of derivation cost C(w, m, ql t&gt;).</Paragraph>
      <Paragraph position="2"> 2. Execute a path through the head automaton m starting at state q and ending at state q' with a finite stop action cost C(Olq' , m). When making a transition with relation ri in the path, select a graph edge with label ri from w to some previously unvisited node wi with finite dependency cost C(~,wilw, ri). Include the cost of the transition (e.g. C(---% ql, rilqi-1, m)) in the running total for this derivation.</Paragraph>
      <Paragraph position="3"> 3. For each dependent node wi, select a lexical entry with cost C(mi, qilri, J., wi), and recursively apply the machine rni from state ql as in step 2.</Paragraph>
      <Paragraph position="4"> 4. Perform a left-parent-right traversal of the  nodes of the resulting dependency tree, yielding a target string.</Paragraph>
      <Paragraph position="5"> The target string resulting from the lowest cost tree that includes all nodes in the graph is selected as the translation target string. The independence assumptions implicit in head automata models mean that we can select lowest cost orderings of local dependency trees, below a given relation r, independently in the search for the lowest cost derivation.</Paragraph>
      <Paragraph position="6"> When the generator is used as part of the translation system, the dependency parameter costs are not, in fact, applied by the generator. Instead, because these parameters are independent of surface order, they are applied earlier by the transfer component, influencing the choice of structure passed to the generator.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="169" end_page="172" type="metho">
    <SectionTitle>
4 Transfer Maps
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
4.1 Transfer Model Bilingual Lexicon
</SectionTitle>
      <Paragraph position="0"> The transfer model defines possible mappings, with associated costs, of dependency trees with source-language word node labels into ones with target-language word labels. Unlike the head automata monolingual models, the transfer model operates with unordered dependency trees, that is, it treats the dependents of a word as an unordered bag. The model is general enough to cover the common translation problems discussed in the literature (e.g. Lindop and Tsujii 1991 and Dorr 1994) including many-to-many word mapping, argument switching, and head switching.</Paragraph>
      <Paragraph position="1"> A transfer model consists of a bilingual lexicon and a transfer parameter table. The model uses dependency tree fragments, which are the same as unordered dependency trees except that some nodes may not have word labels. In the bilingual lexicon, an entry for a source word wi (see top portion of Figure 2) has the form (wi, Hi, hi, Gi, fi) where Hi is a source language tree fragment, ni (the primary node) is a distinguished node of Hi with label wi, Gi is a target tree fragment, and fi is a  mapping function, i.e. a (possibly partial) function from the nodes of Hi to the nodes of Gi.</Paragraph>
      <Paragraph position="2"> The transfer parameter table specifies costs for the application of transfer entries. In a context-independent model, each entry has a single cost parameter. In context-dependent transfer models, the cost function takes into account the identities of the labels of the arcs and nodes dominating wi in the source graph. (Context dependence is discussed further in Section 5.) The set of transfer parameters may also include costs for the null transfer entries for wi, for use in derivations in which wi is translated by the entry for another word v. For example, the entry for v might be for translating an idiom involving wi as a modifier.</Paragraph>
      <Paragraph position="3"> Each entry in the bilingual lexicon specifies a way of mapping part of a dependency tree, specifically that part &amp;quot;matching&amp;quot; (as explained below) the source fragment of the entry, into part of a target graph, as indicated by the target fragment. Entry mapping functions specify how the set of target fragments for deriving a translation are to be combined: whenever an entry is applied, a global node-mapping function is extended to include the entry mapping function.</Paragraph>
    </Section>
    <Section position="2" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
4.2 Matching, Tiling, and Derivation
</SectionTitle>
      <Paragraph position="0"> Transfer mapping takes a source dependency tree S from analysis and produces a minimum cost derivation of a target graph T and a (possibly partial) function f from source nodes to target nodes. In fact, the transfer model is applicable to certain types of source dependency graphs that are more general than trees, although the version of the head automata model described here only produces trees.</Paragraph>
      <Paragraph position="1"> We will say that a tree fragment H matches an unordered dependency tree S if there is a function g (a matching function) from the nodes of H to the nodes of S such that * g is a total one-one function; * if a node n of H has a label, and that label is word w, then the word label for g(n) is also w; * for every arc in H with label r from node nl to node n2, there is an arc with label r from g(nz) to g(n2).</Paragraph>
      <Paragraph position="2"> Unlike first order unification, this definition of matching is not commutative and is not deterministic in that there may be multiple matching functions for applying a bilingual entry to an input source tree. A particular match of an entry against a dependency tree can be represented by the matching function g, a set of arcs A in S, and the (possibly context dependent) cost c of applying the entry.</Paragraph>
      <Paragraph position="3"> A tiling of a source graph with respect to a transfer model is a set of entry matches {(El, gz, A1, cl), * * &amp;quot;, (E~, gk, At, ck)}  * k is the number of nodes in the source tree S.</Paragraph>
      <Paragraph position="4"> * Each Ei, 1 &lt; i ~ k, is a bilingual entry (wi, Hi, hi, Gi, fil matching S with function gi (see Figure 2) and arcs Ai.</Paragraph>
      <Paragraph position="5"> * For primary nodes nl and nj of two distinct entries Ei and Ej, gi(ni) and gi(nj) are distinct. * The sets of edges Ai form a partition of the edges of S.</Paragraph>
      <Paragraph position="6"> * The images gi(Li) form a partition of the nodes of S, where Li is the set of labeled source nodes in the source fragment Hi of Ei.</Paragraph>
      <Paragraph position="7"> * ci is the cost of the match specified by the parameter table.</Paragraph>
      <Paragraph position="8"> A tiling of S yields a costed derivation of a target dependency graph T as follows: * The cost of the derivation is the sum of the costs ci for each match in the tiling.</Paragraph>
      <Paragraph position="9"> * The nodes and arcs of T are composed of the nodes and arcs of the target fragments Gi for the entries Ei.</Paragraph>
      <Paragraph position="10"> * Let fi and fj be the mapping functions for en- null tries Ei and Ej. For any node n of S for which target nodes fi(g\[l(n)) and fj(g~l(n)) are defined, these two nodes are identified as a single node f(n) in T.</Paragraph>
      <Paragraph position="11"> The merging of target fragment nodes in the last condition has the effect of joining the target fragments in a consistent fashion. The node mapping function f for the entire tree thus has a different role from the alignment function in the IBM statistical translation model (Brown et al. 1990, 1993); the role of the latter includes the linear ordering of words in the target string. In our approach, target word order is handled exclusively by the target monolingual model.</Paragraph>
    </Section>
    <Section position="3" start_page="170" end_page="172" type="sub_section">
      <SectionTitle>
4.3 Transfer Algorithm
</SectionTitle>
      <Paragraph position="0"> The main transfer search is preceded by a bilingual lexicon matching phase. This leads to greater efficiency as it avoids repeating matching operations  during the search phase, and it allows a static analysis of the matching entries and source tree to identify subtrees for which the search phase can safely prune out suboptimal partial translations.</Paragraph>
      <Paragraph position="1"> Transfer Configurations In order to apply target language model relation costs incrementally, we need to distinguish between complete and incomplete arcs: an arc is complete if both its nodes have labels, otherwise it is incomplete. The output of the lexicon matching phrase, and the partial derivations manipulated by the search phase are both in the form of transfer configurations (S,R,T,P,f,c,I) where S is the set of source nodes and arcs consumed so far in the derivation, R the remaining source nodes and arcs, f the mapping function built so far, T the set of nodes and complete arcs of the target graph, P the set of incomplete target arcs, c the partial derivation cost, and I a set of source nodes for which entries have yet to be applied.</Paragraph>
      <Paragraph position="2"> Lexical matching phase The algorithm for lexical matching has a similar control structure to standard unification algorithms, except that it can result in multiple matches. We omit the details. The lexicon matching phase returns, for each source node i, a set of runtime entries. There is one runtime entry for each successful match and possibly a null entry for the node if the word label for i is included in successful matches for other entries. Runtime entries are transfer configurations of the form (Hi, C/, Gi, Pi, fi, ci, {i}) in which Hi is the source fragment for the entry with each node replaced by its image under the applicable matching function; Gi the target fragment for the entry, except for the incomplete arcs Pi of this fragment; fi the composition of mapping function for the entry with the inverse of the matching function; ci the cost of applying the entry in the context of its match with the source graph plus the cost in the target model of the arcs in Gi.</Paragraph>
      <Paragraph position="3"> Transfer Search Before the transfer search proper, the resulting runtime entries together with the source graph are analyzed to determine decomposition nodes. A decomposition node n is a source tree node for which it is safe to prune suboptimal translations of the subtree dominated by n. Specifically, it is checked that n is the root node of all source fragments Hn of runtime entries in which both n and its node label are included, and that fn(n) is not dominated by (i.e. not reachable via directed arcs from) another node in the target graph Gn of such entries.</Paragraph>
      <Paragraph position="4"> Transfer search maintains a set M of active run-time entries. InitiMly, this is the set of runtime entries resulting from the lexicon matching phase.</Paragraph>
      <Paragraph position="5"> Overall search control is as follows:  1. Determine the set of decomposition nodes.</Paragraph>
      <Paragraph position="6"> 2. Sort the decomposition nodes into a list D such that if nl dominates n2 in S then n2 precedes nl in D.</Paragraph>
      <Paragraph position="7"> 3. If D is empty, apply the subtree transfer search (given below) to S, return the lowest cost solution, and stop.</Paragraph>
      <Paragraph position="8"> 4. Remove the first decomposition node n from D and apply the subtree transfer search to the sub-tree S ~ dominated by n, to yield solutions (s', C/, T', C/, f', c', C/).</Paragraph>
      <Paragraph position="9"> 5. Partition these solutions into subsets with the same word label for the node fl(n), and select the solution with lowest cost c' from each subset. null 6. Remove from M the set of runtime entries for nodes in S ~.</Paragraph>
      <Paragraph position="10"> 7. For each selected subtree solution, add to M a new runtime entry (S', C/, T', f', c', {n}).</Paragraph>
      <Paragraph position="11"> 8. Repeat from step 3.</Paragraph>
      <Paragraph position="12"> The subtree transfer search maintains a queue Q of configurations corresponding to partial derivations for translating the subtree. Control follows a standard non-deterministic search paradigm: 1. Initialize Q to contain a single configuration (C/, R0, C/, C/, C/, 0, I0) with the input subtree R0 and the set of nodes I0 in R0.</Paragraph>
      <Paragraph position="13"> 2. If Q is empty, return the lowest cost solution found and stop.</Paragraph>
      <Paragraph position="14"> 3. Remove a configuration iS, R, T, P, f, c, I) from the queue.</Paragraph>
      <Paragraph position="15"> 4. If R is empty, add the configuration to the set of subtree solutions.</Paragraph>
      <Paragraph position="16"> 5. Select a node i from I.</Paragraph>
      <Paragraph position="17"> 6. For each runtime entry (Hi, C/, Gi, Pi, fi, cl, {i})  for i, if Hi is a subgraph of R, add to Q a configuration iS 0 Hi, R - Hi, T O Gi 0 G', P U Pi -G', fO fi, c +ci +cv, , I--{ i} ), where G' is the set of newly completed arcs (those in P t3 Pi with both node labels in T U Gi O P 0 Pi) and cg, is the cost of the arcs G' in the target language model.</Paragraph>
      <Paragraph position="18"> 7. For any source node n for which f(n) and fi(n) are both defined, merge these two target nodes.</Paragraph>
      <Paragraph position="19"> 8. Repeat from step 2.</Paragraph>
      <Paragraph position="20"> Keeping the arcs P separate in the configuration allows efficient incremental application of target dependency costs cv, during the search, so these costs are taken into account in the pruning step of the overall search control. This way we can keep the benefits of monolingual/bilingual modularity (Isabelle and Macklovitch 1986) without the computationM overhead of transfer-and-filter (Alshawi et al. 1992).</Paragraph>
      <Paragraph position="21">  It is possible to apply the subtree search directly to the whole graph starting with the initial runtime entries from lexical matching. However, this would result in an exponential search, specifically a search tree with a branching factor of the order of the number of matching entries per input word. Fortunately, long sentences typically have several decomposition nodes, such as the heads of noun phrases, so the search as described is factored into manageable components. null</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="172" end_page="173" type="metho">
    <SectionTitle>
5 Cost Functions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="172" end_page="172" type="sub_section">
      <SectionTitle>
5.1 Costed Search Processes
</SectionTitle>
      <Paragraph position="0"> The head automata model and transfer model were originally conceived as probabilistic models. In order to take advantage of more of the information available in our training data, we experimented with cost functions that make use of incorrect translations as negative examples and also to treat the correctness of a translation hypothesis as a matter of degree.</Paragraph>
      <Paragraph position="1"> To experiment with different models, we implemented a general mechanism for associating costs to solutions of a search process. Here, a search process is conceptualized as a non-deterministic computation that takes a single input string, undergoes a sequence of state transitions in a non-deterministic fashion, then outputs a solution string. Process states are distinct from, but may include, head automaton states.</Paragraph>
      <Paragraph position="2"> A cost function for a search process is a real valued function defined on a pair of equivalence classes of process states. The first element of the pair, a context c, is an equivalence class of states before transitions. The second element, an event e, is an equivalence class of states after transitions. (The equivalence relations for contexts and events may be different.) We refer to an event-context pair as a choice, for which we use the notation (efc) borrowed from the special case of conditional probabilities. The cost of a derivation of a solution by the process is taken to be the sum of costs of choices involved in the derivation.</Paragraph>
      <Paragraph position="3"> We represent events and contexts by finite sequences of symbols (typically words or relation symbols in the translation application). We write C(al'&amp;quot;anlbl'&amp;quot;bk) for the cost of the event represented by (al ..-a,~) in the context represented by(b1 ..-bk).</Paragraph>
      <Paragraph position="4"> &amp;quot;Backed off&amp;quot; costs can be computed by averaging over larger equivalence classes (represented by shorter sequences in which positions are eliminated systematically). A similar smoothing technique has been applied to the specific case of prepositional phrase attachment by Collins and Brooks (1995).</Paragraph>
      <Paragraph position="5"> We have used backed off costs in the translation application for the various cost functions described below. Although this resulted in some improvement in testing, so far the improvement has not been statistically significant.</Paragraph>
    </Section>
    <Section position="2" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
5.2 Model Cost Functions
</SectionTitle>
      <Paragraph position="0"> Taken together, the events, contexts, and cost function constitute a process cost model, or simply a model. The cost function specifies the model parameters; the other components are the model structure.</Paragraph>
      <Paragraph position="1"> We have experimented with a number of model types, including the following.</Paragraph>
      <Paragraph position="2"> Probabilistic model: In this model we assume a probability distribution on the possible events for a context, that is, E~ P(elc) = 1.</Paragraph>
      <Paragraph position="3"> The cost parameters of the model are defined as: C(elc) = -ln(P(elc)).</Paragraph>
      <Paragraph position="4"> Given a set of solutions from executions of a process, let n+(e\]e) be the number of times choice (e\[c) was taken leading to acceptable solutions (e.g. correct translations) and n+(c) be the number of times context c was encountered for these solutions. We can then estimate the probabilistic model costs with C(elc ) ~ ln(n+(c)) -ln(n+(elc)).</Paragraph>
      <Paragraph position="5"> Discriminative model: The costs in this model are likelihood ratios comparing positive and negative solutions, for example correct and incorrect translations. (See Dunning 1993 on the application of likelihood ratios in computational linguistics.) Let n-(elc ) be the count for choice (e\]c) leading to negative solutions. The cost function for the discriminative model is estimated as</Paragraph>
      <Paragraph position="7"> Mean distance model: In the mean distance model, we make use of some measure of goodness of a solution ts for some input s by comparing it against an ideal solution is for s with a distance metric h: h(t,,i,) ~ d in which d is a non-negative real number. A parameter for choice (e\]c) in the distance model</Paragraph>
      <Paragraph position="9"> is the mean value of h(t~,t~) for solutions t, produced by derivations including the choice (eIc).</Paragraph>
      <Paragraph position="10"> Normalized distance model: The mean distance model does not use the constraint that a particular choice faced by a process is always a choice between events with the same context. It is also somewhat sensitive to peculiarities of the distance function h.</Paragraph>
      <Paragraph position="11"> With the same assumptions we made for the mean distance model, let Eh(c) be the average of h(t~, ts) for solutions derived from sequences of choices including the context c. The cost parameter for (elc) in the normalized distance model is</Paragraph>
      <Paragraph position="13"> that is, the ratio of the expected distance for derivations involving the choice and the expected distance for all derivations involving the context for that choice.</Paragraph>
      <Paragraph position="14"> Reflexive Training If we have a manually translated corpus, we can apply the mean and normalized distance models to translation by taking the ideal solution t~ for translating a source string s to be the manual translation for s. In the absence of good metrics for comparing translations, we employ a heuristic string distance metric to compare word selection and word order in t~ and ~s.</Paragraph>
      <Paragraph position="15"> In order to train the model parameters without a manually translated corpus, we use a &amp;quot;reflexive&amp;quot; training method (similar in spirit to the &amp;quot;wakesleep&amp;quot; algorithm, Hinton et al. 1995). In this method, our search process translates a source sentence s to ts in the target language and then translates t~ back to a source language sentence #. The original sentence s can then act as the ideal solution of the overall process. For this training method to be effective, we need a reasonably good initial model, i.e. one for which the distance h(s, #) is inversely correlated with the probability that t~ is a good translation of s.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>