File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-3228_metho.xml
Size: 17,807 bytes
Last Modified: 2025-10-06 14:09:30
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3228"> <Title>Dependencies vs. Constituents for Tree-Based Alignment</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The Tree-to-Tree Model </SectionTitle> <Paragraph position="0"> A tree-to-tree alignment model has tree transformation operations for reordering a node's children, inserting and deleting nodes, and translating individual words at the leaves of the parse trees. The transformed tree must not only match the surface string of the target language, but also the tree structure assigned to the string by the parser. In order to provide enough flexibility to make this possible, tree transformation operations allow a single node in the source tree to produce two nodes in the target tree, or two nodes in the source tree to be grouped together and produce a single node in the target tree.</Paragraph> <Paragraph position="1"> The model can be thought of as a synchronous tree substitution grammar, with probabilities parameterized to generate the target tree conditioned on the structure of the source tree.</Paragraph> <Paragraph position="2"> The probability P(TbjTa) of transforming the source tree Ta into target tree Tb is modeled in a sequence of steps proceeding from the root of the target tree down. At each level of the tree: 1. At most one of the current node's children is grouped with the current node in a single elementary tree, with probability Pelem(taj&quot;a ) children(&quot;a)), conditioned on the current node &quot;a and its children (ie the CFG production expanding &quot;a).</Paragraph> <Paragraph position="3"> 2. An alignment of the children of the current elementary tree is chosen, with probability</Paragraph> <Paragraph position="5"> operation is similar to the re-order operation in the tree-to-string model, with the extension that 1) the alignment can include insertions and deletions of individual children, as nodes in either the source or target may not correspond to anything on the other side, and 2) in the case where two nodes have been grouped into ta, their children are re-ordered together in one step.</Paragraph> <Paragraph position="6"> In the final step of the process, as in the tree-to-string model, lexical items at the leaves of the tree are translated into the target language according to a distribution Pt(fje).</Paragraph> <Paragraph position="7"> Allowing non-1-to-1 correspondences between nodes in the two trees is necessary to handle the fact that the depth of corresponding words in the two trees often differs. A further consequence of allowing elementary trees of size one or two is that some reorderings not allowed when reordering the children of each individual node separately are now possible. For example, with our simple tree A</Paragraph> <Paragraph position="9"> if nodes A and B are considered as one elementary tree, with probability Pelem(tajA ) BZ), their collective children will be reordered with probability</Paragraph> <Paragraph position="11"> giving the desired word ordering XZY. However, computational complexity as well as data sparsity prevent us from considering arbitrarily large elementary trees, and the number of nodes considered at once still limits the possible alignments. 
<Paragraph position="14"> In order to generate the complete target tree, one more step is necessary to choose the structure on the target side, specifically whether the elementary tree has one or two nodes, what labels the nodes have, and, if there are two nodes, whether each child attaches to the first or the second. Because we are ultimately interested in predicting the correct target string, regardless of its structure, we do not assign probabilities to these steps. The nonterminals on the target side are ignored entirely, and while the alignment algorithm considers possible pairs of nodes as elementary trees on the target side during training, the generative probability model should be thought of as only generating single nodes on the target side. Thus, the alignment algorithm is constrained by the bracketing on the target side, but does not generate the entire target tree structure. The operations of the model and their parameterization are summarized below:

  Operation                  Parameterization
  elementary tree grouping   P_elem(t_a | ε_a → children(ε_a))
  re-order                   P_align(α | ε_a → children(t_a))
  insertion                  α can include "insertion" symbol
  lexical translation        P_t(f | e)
  cloning                    P_makeclone(ε); α can include "clone" symbol</Paragraph> <Paragraph position="15"> While the probability model for tree transformation operates from the top of the tree down, probability estimation for aligning two trees takes place by iterating through pairs of nodes from each tree in bottom-up order, as sketched below:

  for all nodes ε_a in source tree T_a in bottom-up order do
    for all elementary trees t_a rooted in ε_a do
      for all nodes ε_b in target tree T_b in bottom-up order do
        for all elementary trees t_b rooted in ε_b do
          for all alignments α of the children of t_a and t_b do
            accumulate the probability of t_a, of α, and of the previously computed scores of the aligned child pairs into the score of the node pair (ε_a, ε_b)
          end for
        end for
      end for
    end for
  end for</Paragraph> <Paragraph position="16"> The outer two loops, iterating over nodes in each tree, require O(|T|^2). Because we restrict our elementary trees to include at most one child of the root node on either side, choosing elementary trees for a node pair is O(m^2), where m refers to the maximum number of children of a node. Computing the alignment between the 2m children of the elementary tree on either side requires choosing which subset of source nodes to delete, O(2^(2m)), which subset of target nodes to insert (or clone), O(2^(2m)), and how to reorder the remaining nodes from source to target tree, O((2m)!). Thus the overall complexity of the algorithm is O(|T|^2 m^2 4^(2m) (2m)!), quadratic in the size of the input sentences, but exponential in the fan-out of the grammar.</Paragraph>
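The growth of this bound is easy to appreciate numerically. The following back-of-the-envelope Python sketch (ours, not from the paper) evaluates |T|^2 m^2 4^(2m) (2m)! for a tree of 50 nodes at several fan-outs; this rapid growth in m is what motivates capping the number of children per node during training (Section 4).

from math import factorial

# Rough operation count for the alignment dynamic program, following the
# bound O(|T|^2 * m^2 * 4^(2m) * (2m)!).
def dp_operations(num_nodes, max_children):
    t, m = num_nodes, max_children
    return t ** 2 * m ** 2 * 4 ** (2 * m) * factorial(2 * m)

# Sentences are capped at 25 words, so |T| is on the order of a few dozen nodes.
for m in range(2, 8):
    print(f"m = {m}: about {dp_operations(50, m):.1e} operations")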
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Clone Operation </SectionTitle> <Paragraph position="0"> Both our constituent and dependency models make use of the "clone" operation introduced by Gildea (2003), which allows words to be aligned even in cases of radically mismatched trees, at a cost in the probability of the alignment. Allowing m-to-n matching of up to two nodes on either side of the parallel treebank allows for limited non-isomorphism between the trees. However, even given this flexibility, requiring alignments to match two input trees rather than one often makes tree-to-tree alignment more constrained than tree-to-string alignment. For example, even alignments with no change in word order may not be possible if the structures of the two trees are radically mismatched.</Paragraph> <Paragraph position="1"> Thus, it is helpful to allow departures from the constraints of the parallel bracketing, if this can be done without dramatically increasing computational complexity.</Paragraph> <Paragraph position="2"> The clone operation allows a copy of a node from the source tree to be made anywhere in the target tree. After the clone operation takes place, the transformation of source into target tree proceeds using the tree decomposition and subtree alignment operations as before. The basic algorithm of the previous section remains unchanged, with the exception that the alignments between children of two elementary trees can now include cloned, as well as inserted, nodes on the target side. Given that α specifies a new cloned node as a child of ε_j, the choice of which node ε_i to clone is made as in the tree-to-string model: P_clone(ε_i) = P_makeclone(ε_i) / Σ_k P_makeclone(ε_k).</Paragraph> <Paragraph position="4"> Because a node from the source tree is cloned with equal probability regardless of whether it has already been "used" or not, the probability of a clone operation can be computed under the same dynamic programming assumptions as the basic tree-to-tree model. As with the tree-to-string cloning operation, this independence assumption is essential to keep the complexity polynomial in the size of the input sentences.</Paragraph>
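Because the normalization runs over all source nodes and ignores whether a node has already been cloned, the distribution is straightforward to compute. The following is a minimal Python sketch of it (our illustration, with made-up scores, not code from the paper).

# Each candidate source node has an unnormalized make-clone score; the cloned
# node is chosen by normalizing over all source nodes, independently of
# whether that node is already used elsewhere in the alignment.
def clone_distribution(makeclone_scores):
    total = sum(makeclone_scores.values())
    return {node: score / total for node, score in makeclone_scores.items()}

# Hypothetical scores for three nodes of a small source tree.
scores = {"NP_1": 0.02, "VP_2": 0.01, "NN_3": 0.07}
print(clone_distribution(scores))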
</Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Dependency Tree-to-Tree Alignments </SectionTitle> <Paragraph position="0"> Dependencies were found to be more consistent than constituent structure between French and English by Fox (2002), though this study used a tree representation on the English side only. We wish to investigate whether dependency trees are also more suited to tree-to-tree alignment.</Paragraph> <Paragraph position="1"> Figure 1 shows a typical Xinhua newswire sentence with the Chinese parser output, and the sentence's English translation with its parse tree. The conversion to dependency representation is shown below the original parse trees.</Paragraph> <Paragraph position="2"> Examination of the trees shows both cases where the dependency representation is more similar across the two languages and cases that reveal its potential pitfalls. The initial noun phrase, "14 Chinese open border cities", has two subphrases with a level of constituent structure (the QP and the lower NP) not found in the English parse. In this case, the difference in constituent structure derives primarily from differences in the annotation style between the original English and Chinese treebanks (Marcus et al., 1993; Xue and Xia, 2000; Levy and Manning, 2003). These differences disappear in the dependency representation. In general, the number of levels of constituent structure in a tree can be relatively arbitrary, while it is easier for people (whether professional syntacticians or not) to agree on the word-to-word dependencies.</Paragraph> <Paragraph position="3"> In some cases, differences in the number of levels may be handled by the tree-to-tree model, for example by grouping the subject NP and its base NP child together as a single elementary tree. However, this introduces unnecessary variability into the alignment process. In cases with a large difference in the depths of the two trees, the aligner may not be able to align the corresponding terminal nodes because only one merge is possible at each level.</Paragraph> <Paragraph position="4"> In this case the aligner will clone the subtree, at an even greater cost in probability.</Paragraph> <Paragraph position="5"> The rest of our example sentence, however, shows cases where the conversion to dependency structure can exacerbate differences in constituent structure. For example, jingji and jianshe are sisters in the original constituent structure, as are their English translations, economic and construction. In the conversion to Chinese dependency structure, they remain sisters, both dependent on the noun chengjiu (achievements), while in English, economic is a child of construction. The correspondence of a three-noun compound in Chinese to a noun modified by a prepositional phrase and an adjective-noun relation in English means that the conversion rules select different heads even for pieces of tree that are locally similar.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Dependency Alignment Model </SectionTitle> <Paragraph position="0"> While the basic tree-to-tree alignment algorithm is the same for dependency trees, a few modifications to the probability model are necessary.</Paragraph> <Paragraph position="1"> First, the lexical translation operation takes place at each node in the tree, rather than only at the leaves. Lexical translation probabilities are maintained for each word pair as before, and are included in the alignment cost of each elementary tree. When both elementary trees contain two words, either alignment is possible between the two. The direct alignment between nodes within the elementary tree has probability 1 − P_swap. A new parameter P_swap gives the probability of the upper node in the elementary tree in English corresponding to the lower node in Chinese, and vice versa. Thus, a transformation in which the two nodes of a two-word elementary tree exchange levels between the Chinese and English trees incurs the factor P_swap in its probability.</Paragraph> <Paragraph position="3"> Our model does not represent the position of the head among its children. While this choice would have to be made in generating MT output, for the purposes of alignment we simply score how many tree nodes are correctly aligned, without flattening our trees into a string.</Paragraph> <Paragraph position="4"> We further extended the tree-to-tree alignment algorithm by conditioning the reordering of a node's children on the node's lexical item as well as its syntactic category and the categories of its children. The lexicalized reordering probabilities were smoothed with the nonlexicalized probabilities (which are themselves smoothed with a uniform distribution).</Paragraph> <Paragraph position="5"> We smooth using a linear interpolation of lexicalized and unlexicalized probabilities, with weights proportional to the number of observations for each type of event.</Paragraph>
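A minimal Python sketch of one such count-proportional interpolation follows (our reading; the paper does not give the exact weighting formula, and the numbers below are made up).

# Interpolate a lexicalized and an unlexicalized estimate of the same
# reordering probability, weighting each by the number of observations
# supporting it.
def interpolate(p_lex, n_lex, p_unlex, n_unlex):
    lam = n_lex / (n_lex + n_unlex)
    return lam * p_lex + (1.0 - lam) * p_unlex

# Hypothetical reordering probability for one alignment choice: 15 lexicalized
# observations of this head word versus 400 unlexicalized observations.
print(interpolate(p_lex=0.40, n_lex=15, p_unlex=0.25, n_unlex=400))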
</Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We trained our translation models on a parallel corpus of Chinese-English newswire text. We restricted ourselves to sentences of no more than 25 words in either language, resulting in a training corpus of 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words. The Chinese data were automatically segmented into tokens, and English capitalization was retained. We replaced words occurring only once with an unknown word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words. The Chinese data were parsed using the parser of Bikel (2002), and the English data using the parser of Collins (1999). Our hand-aligned test data were those used in Hwa et al. (2002), and consisted of 48 sentence pairs, also with fewer than 25 words in either language, for a total of 788 English words and 580 Chinese words. The hand-aligned data consisted of 745 individual aligned word pairs. Words could be aligned one-to-many in either direction. This limits the performance achievable by our models; the IBM models allow one-to-many alignments in one direction only, while the tree-based models allow only one-to-one alignment unless the cloning operation is used. A separate set of 49 hand-aligned sentence pairs was used to control overfitting in training our models.</Paragraph> <Paragraph position="2"> We evaluate our translation models in terms of agreement with human-annotated word-level alignments between the sentence pairs. For scoring the Viterbi alignments of each system against gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words: AER = 1 − 2|A ∩ G| / (|A| + |G|), where A is the set of word pairs aligned by the automatic system, and G the set aligned in the gold standard. (While Och and Ney (2000) differentiate between sure and possible hand-annotated alignments, our gold standard alignments come in only one variety.) For a better understanding of how the models differ, we break this figure down into precision, |A ∩ G| / |A|, and recall, |A ∩ G| / |G|.</Paragraph>
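Since our gold standard has only one kind of link, AER reduces to one minus the balanced F-measure of precision and recall, and all three scores can be computed directly from the two sets of word pairs, as in this small Python sketch (our own code, with a made-up example alignment).

# A and G are sets of (source_position, target_position) word pairs.
def precision(a, g):
    return len(a & g) / len(a)

def recall(a, g):
    return len(a & g) / len(g)

def aer(a, g):
    return 1.0 - 2.0 * len(a & g) / (len(a) + len(g))

auto = {(0, 0), (1, 2), (2, 1), (3, 3)}   # hypothetical system output
gold = {(0, 0), (1, 2), (2, 2), (3, 3)}   # hypothetical gold standard
print(precision(auto, gold), recall(auto, gold), aer(auto, gold))  # 0.75 0.75 0.25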
<Paragraph position="7"> Since none of the systems presented in this comparison make use of hand-aligned data, they may differ in the overall proportion of words that are aligned, rather than inserted or deleted. This affects the precision/recall tradeoff; better results with respect to human alignments may be possible by adjusting an overall insertion probability in order to optimize AER.</Paragraph> <Paragraph position="8"> Table 2 provides a comparison of results using the tree-based models with the word-level IBM models.</Paragraph> <Paragraph position="9"> IBM Models 1 and 4 refer to Brown et al. (1993).</Paragraph> <Paragraph position="10"> We used the GIZA++ package, including the HMM model of Och and Ney (2000). We trained each model until AER began to increase on our held-out cross-validation data, resulting in running Model 1 for three iterations, then the HMM model for three iterations, and finally Model 4 for two iterations (the optimal number of iterations for Models 2 and 3 was zero). "Constituent Tree-to-Tree" indicates the model of Section 2 trained and tested directly on the trees output by the parser, while "Dependency Tree-to-Tree" incorporates the modifications to the model described in Section 3. For reasons of computational efficiency, our constituent-based training procedure skipped sentences for which either tree had a node with more than five children, and the dependency-based training skipped trees with more than six children. Thus, the tree-based models were effectively trained on less data than IBM Model 4: 11,422 out of 18,773 sentence pairs for the constituent model and 10,662 sentence pairs for the dependency model. Our tree-based models were initialized with lexical translation probabilities trained using IBM Model 1, and uniform probabilities for the tree reordering operations. The models were trained until AER began to rise on our held-out cross-validation data, though in practice AER was nearly constant for both tree-based models after the first iteration.</Paragraph> </Section> </Paper>