<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1508">
  <Title>Treebank Transfer</Title>
  <Section position="4" start_page="74" end_page="80" type="metho">
    <SectionTitle>
2 The Probability Model
</SectionTitle>
    <Paragraph position="0"> Our approach assumes that two kinds of resources are available: a source-language treebank, and a target-language text corpus. This is a realistic assumption, which is applicable to many sourcelanguage/target-language pairs. Furthermore, some knowledge of the mapping between source-language syntax and target-language syntax needs to be incorporated into the model. Parallel corpora are not required, but may help when constructing this mapping. null We view the source-language treebank as a sequence of trees S1,...,Sn, and assume that these trees are generated by a common process from a corresponding sequence of latent target-language trees T1,...,Tn. The parameter vector of the process which maps target-language trees to source-language trees will be denoted by Ks. The mapping itself is expressed as a conditional probability distribution p(Si  |Ti,Ks) over source-language trees. The parameter vector Ks is assumed to be generated according to a prior distribution p(Ks  |x) with hyper-parameter x, assumed to be fixed and known.</Paragraph>
    <Paragraph position="1"> We further assume that each target-language tree Ti is generated from a common language model L for the target language, p(Ti  |L). For expository reasons we assume that L is a bigram language model over the terminal yield (also known as the fringe) of Ti. Generalizations to higher-order n-gram models are completely straightforward; more general models that can be expressed as stochastic finite automata are also possible, as discussed in Section 5.</Paragraph>
    <Paragraph position="2"> Let t1,...,tk be the terminal yield of tree T . Then</Paragraph>
    <Paragraph position="4"> where # marks the beginning of the string and $ marks the end of the string.</Paragraph>
    <Paragraph position="5"> There are two options for incorporating the language model L into the overall probability model. In the first case - which we call the full model L is generated by an informative prior distribution p(L |l) with hyper-parameter l. In the second case - the reduced model - the language model L is fixed.</Paragraph>
    <Paragraph position="6"> The structure of the full model is specified graphically in Figure 1. In a directed acyclic graphical model such as this one, we equate vertices with random variables. Directed edges are said to go from a parent to a child node. Each vertex depends directly on all of its parents. Any particular vertex is conditionally independent from all other vertices given its parents, children, and the parents of its children.</Paragraph>
    <Paragraph position="7"> The portion of the full model we are interested in is the following factored distribution, as specified by</Paragraph>
    <Paragraph position="9"> In the reduced model, we drop the leftmost term/ vertex, corresponding to the prior for L with hyper-parameter l, and condition on L instead:</Paragraph>
    <Paragraph position="11"> The difference between the full model (1) and the reduced model (2) is that the reduced model assumes that the language model L is fixed and will not be informed by the latent target-language trees Ti. This is an entirely reasonable assumption in a situation where the target-language text corpus is much larger than the source-language treebank. This will typically be the case, since it is usually very easy to collect large corpora of unannotated text which exceed the largest existing annotated corpora by several orders of magnitude. When a sufficiently large target-language text corpus is available, L is simply a smoothed bigram model which is estimated once from the target-language corpus.</Paragraph>
    <Paragraph position="12"> If the target-language corpus is relatively small, then the bigram model L can be refined on the basis of the imputed target-language trees. A bigram  model is simply a discrete collection of multinomial distributions. A simple prior for L takes the form of a product of Dirichlet distributions, so that the hyper-parameter l is a vector of bigram counts. In the full model (1), we assume l is fixed and set it to the observed bigram counts (plus a constant) in the target-language text corpus. This gives us an informative prior for L. If the bigram counts are sufficiently large, L will be fully determined by this informative prior distribution, and the reduced model (2) can be used instead.</Paragraph>
    <Paragraph position="13"> By contrast, usually very little is known a priori about the syntactic transfer model Ks. Instead Ks needs to be estimated from data. We assume that Ks too is a discrete collection of multinomial distributions, governed by Dirichlet priors. However, unlike in the case of L, the priors for Ks are noninformative. This is not a problem, since a lot of information about the target language is provided by the language model L.</Paragraph>
    <Paragraph position="14"> As one can see in Figure 1 and equation (1), the overall probability model constrains the latent target-language trees Ti in two ways: From the left, the language model L serves as a prior distribution over target-language trees. On the one hand, L is an informative prior, based on large bigram counts obtained from the target-language text corpus; on the other hand, it only informs us about the fringe of the target-language trees and has very little directly to say about their syntactic structure. From the right, the observed source-language trees constrain the latent target-language trees in a complementary fashion. Each target-language tree Ti gives rise to a corresponding source-language tree Si according to the syntactic transfer mapping Ks. This mapping is initially known only qualitatively, and comes with a noninformative prior distribution.</Paragraph>
    <Paragraph position="15"> Our goal is now to simultaneously estimate the transfer parameter Ks and impute the latent trees Ti.</Paragraph>
    <Paragraph position="16"> This is simplified by the following observation: if T1,...,Tn are known, then finding Ks is easy; vice versa, if Ks is known, then finding Ti is easy. Simultaneous inference for Ks and T1,...,Tn is possible via Data Augmentation (Tanner and Wong, 1987), or, more generally, Gibbs sampling (Geman and Geman, 1984).</Paragraph>
    <Paragraph position="17"> 3 Simulation of the Joint Posterior Distribution We now discuss the simulation of the joint posterior distribution over the latent trees T1,...,Tn, the transfer model parameter Ks, and the language model parameter L. This joint posterior is derived from the overall full probability model (1). Using the reduced model (2) instead of the full model amounts to simply omitting L from the joint posterior. We will deal primarily with the more general full model in this section, since the simplification which results in the reduced model will be straightforward.</Paragraph>
    <Paragraph position="18"> The posterior distribution we focus on is p(T1,...,Tn,L,Ks  |S1,...,Sn,l,x), which provides us with information about all the variables of interest, including the latent target-language trees Ti, the syntactic transfer model Ks, and the target-language  language model L. It is possible to simulate this joint posterior distribution using simple sampling-based approaches (Gelfand and Smith, 1990), which are instances of the general Markov-chain Monte Carlo method (see, for example, Liu, 2001).</Paragraph>
    <Paragraph position="19"> Posterior simulation proceeds iteratively, as follows. In each iteration we draw the three kinds of random variables - latent trees, language model parameters, and transfer model parameters - from their conditional distributions while holding the values of  all other variables fixed. Specifically: * Initialize L and Ks by drawing each from its prior distribution.</Paragraph>
    <Paragraph position="20"> * Iterate the following three steps: 1. Draw each Ti from its posterior distribution given Si, L, and Ks.</Paragraph>
    <Paragraph position="21"> 2. Draw L from its posterior distribution given T1,...,Tn and l.</Paragraph>
    <Paragraph position="22"> 3. Draw Ks from its posterior distribution  given S1,...,Sn, T1,...,Tn, and x.</Paragraph>
    <Paragraph position="23"> This simulation converges in the sense that the draws of T1,...,Tn, L, and Ks converge in distribution to the joint posterior distribution over those variables. Further details can be found, for example, in Liu, 2001, as well as the references cited above.</Paragraph>
    <Paragraph position="24"> We assume that the bigram model L is a family of multinomial distributions, and we write L(t j  |t j[?]1) for the probability of the word t j following t j[?]1. Using creative notation, L(* |t j[?]1) can be seen as a multinomial distribution. Its conjugate prior is a Dirichlet distribution whose parameter vector lw are the counts of words types occurring immediately after the word type w of t j[?]1. Under the conventional assumptions of exchangeability and independence, the prior distribution for L is just a product of Dirichlet priors. Since we employ a conjugate prior, the posterior distribution of L</Paragraph>
    <Paragraph position="26"> has the same form as the prior - it is likewise a product of Dirichlet distributions. In fact, for each word type w the posterior Dirichlet density has parameter lw+cw, where lw is the parameter of the prior distribution and cw is a vector of counts for all word forms appearing immediately after w along the fringe of the imputed trees.</Paragraph>
    <Paragraph position="27"> We make similar assumptions about the syntactic transfer model Ks and its posterior distribution, which</Paragraph>
    <Paragraph position="29"> In particular, we assume that syntactic transfer involves only multinomial distributions, so that the prior and posterior for Ks are products of Dirichlet distributions. This means that sampling L and Ks from their posterior distributions is straightforward.</Paragraph>
    <Paragraph position="30"> The difficult part is the first step in each scan of the Gibbs sampler, which involves sampling each target-language latent tree from the corresponding posterior distribution. For a particular tree Tj, the posterior takes the following form:</Paragraph>
    <Paragraph position="32"> The next section discusses sampling from this posterior distribution in the context of a concrete example and presents an algorithmic solution.</Paragraph>
    <Paragraph position="33"> 4 Sampling from the Latent Tree Posterior We are faced with the problem of sampling Tj from its posterior distribution, which is proportional to the product of its language model prior p(Tj  |L) and transfer model likelihood p(S j  |Tj,Ks). Rejection sampling using the prior as the proposal distribution will not work, for two reasons: first, the prior is only defined on the yield of a tree and there are potentially very many tree structures with the same fringe; second, even if the first problem could be overcome, it is unlikely that a random draw from an n-gram prior would result in a target-language tree that corresponds to a particular source-language tree, as the prior has no knowledge of the source-language tree.</Paragraph>
    <Paragraph position="34"> Fortunately, efficient direct sampling from the latent tree posterior is possible, under one very reasonable assumption: the set of all target-language trees which map to a given source-language tree S j  order within a sentence, and prenominal adjectives within noun phrases.</Paragraph>
    <Paragraph position="35"> should be finite and representable as a packed forest. More specifically, we assume that there is a compact (polynomial space) representation of potentially exponentially many trees. Moreover, each tree in the packed forest has an associated weight, corresponding to its likelihood under the syntactic transfer model.</Paragraph>
    <Paragraph position="36"> If we rescale the weights of the packed forest so that it becomes a normalized probabilistic context-free grammar (PCFG), we can sample from this new distribution (corresponding to the normalized likelihood) efficiently. For example, it is then possible to use the PCFG as a proposal distribution for rejection sampling.</Paragraph>
    <Paragraph position="37"> However, we can go even further and sample from the latent tree posterior directly. The key idea is to intersect the packed forest with the n-gram language model and then to normalize the resulting augmented forest. The intersection operation is a special case of the intersection construction for context-free grammars and finite automata (Bar-Hillel et al., 1961, pp. 171-172). We illustrate it here for a bigram language model.</Paragraph>
    <Paragraph position="38"> Consider the tree in Figure 2 and assume it is a source-language tree, whose root is a clause (C) which consists of a subject (S), verb (v) and object (O). The subject and object are noun phrases consisting of an adjective (a) and a noun (n). For simplicity, we treat the part-of-speech labels (a, n, v) as terminal symbols and add numbers to distinguish multiple occurrences. The syntactic transfer model is stated as a conditional probability distribution over source-language trees conditional on target language trees. Syntactic transfer amounts to independently changing the order of the subject, verb, and object, and changing the order of adjectives and nouns, for example as follows:</Paragraph>
    <Paragraph position="40"> Under this transfer model, the likelihood of a target-language tree [A v[S a1 n1][O n2 a2]] corresponding to the source-language tree shown in Figure 2 is Ks5 x Ks7 xKs8. It is easy to construct a packed forest of all target-language trees with non-zero likelihood that give rise to the source-language tree in Figure 2.</Paragraph>
    <Paragraph position="41"> Such a forest is shown in Figure 3. Forest nodes are shown as ellipses, choice points as rectangles connected by dashed lines. A forest node is to be understood as an (unordered) disjunction of the choice points directly underneath it, and a choice point as an (ordered, as indicated by numbers) conjunction of the forest nodes directly underneath it. In other words, a packed forest can be viewed as an acyclic and-or graph, where choice points represent and-nodes (whose children are ordered). As a simplifying convention, for nodes that dominate a single choice node, that choice node is not shown. The forest in Figure 3 represents SvO, SOv, and vSO permutations at the sentence level and an, na permutations below the two noun phrases. The twelve overall permutations are represented compactly in terms of two choices for the subject, two choices for the object, and three choices for the root clause.</Paragraph>
    <Paragraph position="42">  We intersect/compose the packed forest with the bigram language model L by augmenting each node in the forest with a left context word and a right peripheral word: a node N is transformed into a triple (a,N,b) that dominates those trees which N dominates in the original forest and which can occur after a word a and end with a word b. The algorithm is roughly1 as shown in Figure 5 for binary branching forests; it requires memoization (not shown) to be efficient. The generalization to forests with arbitrary branching factors is straightforward, but the presentation of that algorithm less so. At the root level, we call forest_composition with a left context of # (indicating the start of the string) and add dummy nodes of the form (a,$,$) (indicating the end of the string). Further details can be found in the prototype implementation. Each node in the original forest is augmented with two words; if there are n leaf nodes in the original forest, the total number of nodes in the augmented forest will be at most n2 times larger than in the original forest. This means that the compact encoding property of the packed forest (exponentially many trees can be represented in polynomial space) is preserved by the composition algorithm. An example of composing a packed forest 1A detailed implementation is available from http://www.</Paragraph>
    <Paragraph position="43"> cs.columbia.edu/[?]jansche/transfer/.</Paragraph>
    <Paragraph position="44"> with a bigram language model appears in Figure 4, which shows the forest that results from composing the forest in Figure 3 with a bigram language model.</Paragraph>
    <Paragraph position="45"> The result of the composition is an augmented forest from which sampling is almost trivial. The first thing we have to do is to recursively propagate weights from the leaves upwards to the root of the forest and associate them with nodes. In the non-recursive case of leaf nodes, their weights are provided by the bigram score of the augmented forest: observe that leaves in the augmented forest have labels of the form (a,b,b), where a and b are terminal symbols, and a represents the immediately preceding left context. The score of such a leaf is simply L(b  |a). There are two recursive cases: For choice nodes (and-nodes), their weight is the product of the weights of the node's children times a local likelihood score. For example, the node (v,O,n) in Figure 4 dominates a single choice node (not shown, per the earlier conventions), whose weight is L(a  |v) L(n  |a) Ks7. For other forest nodes (ornodes), their weight is the sum of the weights of the node's children (choice nodes).</Paragraph>
    <Paragraph position="46"> Given this very natural weight-propagation algorithm (and-nodes correspond to multiplication, or-nodes to summation), it is clear that the weight of the root node is the sum total of the weights of all trees in the forest, where the weight of a tree is the prod- null forest_composition(N, a): if N is a terminal:  uct of the local likelihood scores times the language model score of the tree's terminal yield. We can then associate outgoing normalized weights with the children (choice points) of each or-node, where the probability of going to a particular choice node from a given or-node is equal to the weight of the choice node divided by the weight of the or-node.</Paragraph>
    <Paragraph position="47"> This means we have managed to calculate the normalizing constant of the latent tree posterior (5) without enumerating the individual trees in the forest. Normalization ensures that we can sample from the augmented and normalized forest efficiently, by proceeding recursively in a top-down fashion, picking a child of an or-node at random with probability proportional to the outgoing weight of that choice.</Paragraph>
    <Paragraph position="48"> It is easy to see (by a telescoping product argument) that by multiplying together the probabilities of each such choice we obtain the posterior probability of a latent tree. We thus have a method for sampling latent trees efficiently from their posterior distribution. The sampling procedure described here is very similar to the lattice-based generation procedure with n-gram rescoring developed by Langkilde (2000), and is in fact based on the same intersection construction (Langkilde seems to be unaware that the CFG-intersection construction from (Bar-Hillel et al., 1961) is involved). However, Langkilde is interested in optimization (finding the best tree in the forest), which allows her to prune away less probable trees from the composed forest in a procedure that combines composition, rescoring, and pruning.</Paragraph>
    <Paragraph position="49"> Alternatively, for a somewhat different but related formulation of the probability model, the sampling method developed by Mark et al. (1992) can be used.</Paragraph>
    <Paragraph position="50"> However, its efficiency is not well understood.</Paragraph>
  </Section>
class="xml-element"></Paper>