<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1060">
<Title>Syntax-Based Alignment: Supervised or Unsupervised?</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 The Inversion Transduction Grammar </SectionTitle>
<Paragraph position="0"> The Inversion Transduction Grammar of Wu (1997) can be thought of as a generative process which simultaneously produces strings in both languages through a series of synchronous context-free grammar productions. The grammar is restricted to binary rules, whose right-hand-side symbols can appear in the same order in both languages, represented with square brackets:</Paragraph>
<Paragraph position="1"> X → [Y Z] </Paragraph>
<Paragraph position="2"> or may appear in reverse order in the two languages, indicated by angle brackets:</Paragraph>
<Paragraph position="3"> X → ⟨Y Z⟩ </Paragraph>
<Paragraph position="4"> Individual lexical translations between English words e and French words f take place at the leaves of the tree, generated by grammar rules with a single right-hand-side symbol in each language: X → e/f. Given a bilingual sentence pair, a synchronous parse can be built using a two-dimensional extension of chart parsing, where chart items are indexed by their nonterminal Y, beginning and ending positions l, m in the source language string, and beginning and ending positions i, j in the target language string. For Expectation Maximization training, we compute inside probabilities β(Y, l, m, i, j) from the bottom up as outlined below:
for all l, m, n such that 1 ≤ l < m < n ≤ N_s do
  for all i, j, k such that 1 ≤ i < j < k ≤ N_t do
    for all rules X → Y Z ∈ G do</Paragraph>
<Paragraph position="5"> β(X, l, n, i, k) += P(X → [Y Z]) β(Y, l, m, i, j) β(Z, m, n, j, k) + P(X → ⟨Y Z⟩) β(Y, l, m, j, k) β(Z, m, n, i, j) </Paragraph>
<Paragraph position="6"> A similar recursion is used to compute outside probabilities for each chart item, and the inside and outside probabilities are combined to derive expected counts for the occurrence of each grammar rule, including the rules corresponding to individual lexical translations. In our experiments we use a grammar with a start symbol S, a single preterminal C, and two nonterminals A and B used to ensure that only one parse can generate any given word-level alignment (ignoring insertions and deletions) (Wu, 1997; Zens and Ney, 2003). The individual lexical translations produced by the grammar may include a NULL word on either side, in order to represent insertions and deletions.</Paragraph>
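To make the chart indexing in this recursion concrete, here is a minimal Python sketch of the inside pass. It is illustrative only, not the implementation used in the paper; the lex_items and rules arguments are hypothetical placeholders for the lexical translation and rule probabilities that EM would estimate.

    from collections import defaultdict
    from itertools import product

    def itg_inside(src_len, tgt_len, lex_items, rules):
        """Bottom-up inside pass for an Inversion Transduction Grammar.

        lex_items: dict mapping (X, l, m, i, j) -> probability of a lexical
            translation covering source span (l, m] and target span (i, j]
            (width-one spans, possibly paired with NULL); hypothetical input.
        rules: dict mapping (X, Y, Z) -> (p_straight, p_inverted), the
            probabilities of the rules X -> [Y Z] and X -> ⟨Y Z⟩.
        """
        beta = defaultdict(float)
        beta.update(lex_items)
        # Build larger chart items from smaller ones; both children of a rule
        # cover strictly smaller source and target spans than their parent, so
        # processing span widths in order of increasing total width is safe.
        for s_width, t_width in sorted(product(range(2, src_len + 1),
                                               range(2, tgt_len + 1)), key=sum):
            for l in range(src_len - s_width + 1):
                n = l + s_width
                for i in range(tgt_len - t_width + 1):
                    k = i + t_width
                    for m in range(l + 1, n):      # source split point
                        for j in range(i + 1, k):  # target split point
                            for (X, Y, Z), (p_str, p_inv) in rules.items():
                                # X -> [Y Z]: same order in both languages
                                beta[X, l, n, i, k] += (p_str
                                    * beta[Y, l, m, i, j] * beta[Z, m, n, j, k])
                                # X -> ⟨Y Z⟩: order inverted in the target
                                beta[X, l, n, i, k] += (p_inv
                                    * beta[Y, l, m, j, k] * beta[Z, m, n, i, j])
        return beta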
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 The Tree-To-String Model </SectionTitle>
<Paragraph position="0"> The model of Yamada and Knight (2001) can be thought of as a generative process taking a tree in one language as input and producing a string in the other through a sequence of probabilistic operations. If we follow the process of an English sentence's transformation into French, the English sentence is first given a syntactic tree representation by a statistical parser (Collins, 1999). As the first step in the translation process, the children of each node in the tree can be re-ordered. For any node with m children, m! re-orderings are possible, each of which is assigned a probability P_order conditioned on the syntactic categories of the parent node and its children. As the second step, French words can be inserted at each node of the parse tree. Insertions are modeled in two steps, the first predicting whether an insertion to the left, an insertion to the right, or no insertion takes place with probability P_ins, conditioned on the syntactic category of the node and that of its parent. The second step is the choice of the inserted word, P_t(f|NULL), which is predicted without any conditioning information. The final step, a French translation of each original English word at the leaves of the tree, is chosen according to a distribution P_t(f|e). The French word is predicted conditioned only on the English word, and each English word can generate at most one French word, or can generate a NULL symbol, representing deletion. Given the original tree, the re-ordering, insertion, and translation probabilities at each node are independent of the choices at any other node. These independence relations are analogous to those of a stochastic context-free grammar, and allow for efficient parameter estimation by an inside-outside Expectation Maximization algorithm. The computation of inside probabilities β, outlined below, considers possible reorderings of nodes in the original tree in a bottom-up manner:
for all nodes ε_i in input tree T do
  for all k, l such that 1 ≤ k < l ≤ N do
    for all orderings ρ of the children ε_1 ... ε_m of ε_i do
      for all partitions of span k, l into k_1, l_1, ..., k_m, l_m do</Paragraph>
<Paragraph position="1"> β(ε_i, k, l) += P_order(ρ | ε_i) ∏_{j=1..m} β(ε_j, k_j, l_j) </Paragraph>
<Paragraph position="2"> As with the Inversion Transduction Grammar, many alignments between source and target sentences are not allowed. As a minimal example, take the tree:</Paragraph>
<Paragraph position="3"> (A (B X Y) Z), in which node A has children B and Z, and node B has children X and Y. </Paragraph>
<Paragraph position="4"> Of the six possible re-orderings of the three terminals, the two which would involve crossing the bracketing of the original tree (XZY and YZX) are not allowed. While this constraint gives us a way of using syntactic information in translation, it may in many cases be too rigid. In part to deal with this problem, Yamada and Knight (2001) flatten the trees in a pre-processing step by collapsing nodes with the same lexical head-word. This allows, for example, an English subject-verb-object (SVO) structure, which is analyzed as having a VP node spanning the verb and object, to be re-ordered as VSO in a language such as Arabic. Larger syntactic divergences between the two trees may require further relaxation of this constraint, and in practice we expect such divergences to be frequent. For example, a nominal modifier in one language may show up as an adverbial in the other, or, due to choices such as which information is represented by a main verb, the syntactic correspondence between the two sentences may break down completely. While having flatter trees makes more reorderings possible than with the binary Inversion Transduction Grammar trees, fixing the tree in one language generally has a much stronger opposite effect, dramatically restricting the number of permissible alignments.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Tree-to-String With Cloning </SectionTitle>
<Paragraph position="0"> In order to provide more flexibility in alignments, a cloning operation was introduced for tree-to-string alignment by Gildea (2003). The model is modified to allow a copy of a (translated) subtree from the English sentence to occur, with some cost, at any point in the resulting French sentence. For example, in the case of the input tree above,</Paragraph>
<Paragraph position="1"> a copy of the subtree rooted at Z can be inserted as a new child of B. </Paragraph>
<Paragraph position="2"> This operation, combined with the deletion of the original node Z, produces the alignment (XZY) that was disallowed by the original tree reordering model.</Paragraph>
<Paragraph position="3"> The probability of adding a clone of original node ε_i as a child of node ε_j is calculated in two steps: first, the choice of whether to insert a clone under ε_j, with probability P_ins(clone|ε_j), and second, the choice of which original node to copy, with probability</Paragraph>
<Paragraph position="4"> P_clone(ε_i | clone ∈ ε_j) = P_makeclone(ε_i) / Σ_k P_makeclone(ε_k) </Paragraph>
<Paragraph position="5"> where P_makeclone is the probability of an original node producing a copy. In our implementation, P_ins(clone) is estimated by the Expectation Maximization algorithm, conditioned on the label of the parent node ε_j, and P_makeclone is a constant, meaning that the node to be copied is chosen from all the nodes in the original tree with uniform probability.</Paragraph>
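To illustrate the reordering constraint and the extra freedom the cloning operation buys, the Python sketch below enumerates the leaf orders reachable for the example tree (A (B X Y) Z). It is a toy illustration under simplifying assumptions, not the paper's model: probabilities, insertions, and NULL translations are ignored, and the clone step is approximated as copying one leaf to a new position while deleting its original occurrence.

    from itertools import permutations, product

    # The example tree from the text: A has children B and Z; B has children X and Y.
    TREE = ('A', [('B', [('X', []), ('Y', [])]), ('Z', [])])

    def leaf_orders(node):
        """Leaf sequences reachable by independently permuting the children of
        every node (the reordering step of the tree-to-string model, with
        probabilities, insertions, and NULL translations ignored)."""
        label, children = node
        if not children:                       # a leaf contributes itself
            return {(label,)}
        orders = set()
        for perm in permutations(children):    # m! orderings of the m children
            for parts in product(*(leaf_orders(c) for c in perm)):
                orders.add(tuple(leaf for part in parts for leaf in part))
        return orders

    def leaf_orders_with_clone(node):
        """Crude stand-in for cloning: additionally allow one leaf to be copied
        to any position while its original occurrence is deleted."""
        base = leaf_orders(node)
        leaves = set(next(iter(base)))         # the leaf labels of the tree
        extended = set(base)
        for order in base:
            for cloned in leaves:
                rest = [x for x in order if x != cloned]   # delete the original
                for pos in range(len(rest) + 1):           # re-insert the clone
                    extended.add(tuple(rest[:pos] + [cloned] + rest[pos:]))
        return extended

    print(sorted(leaf_orders(TREE)))
    # [('X','Y','Z'), ('Y','X','Z'), ('Z','X','Y'), ('Z','Y','X')]: XZY, YZX missing
    print(('X', 'Z', 'Y') in leaf_orders_with_clone(TREE))  # True: cloning recovers XZY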
</Section>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> We trained our translation models on a parallel corpus of Chinese-English newswire text. We restricted ourselves to sentences of no more than 25 words in either language, resulting in a training corpus of 18,773 sentence pairs with a total of 276,113 Chinese words and 315,415 English words. The Chinese data were automatically segmented into tokens, and English capitalization was retained. We replaced words occurring only once with an unknown-word token, resulting in a Chinese vocabulary of 23,783 words and an English vocabulary of 27,075 words. Our hand-aligned data consisted of 48 sentence pairs, also with no more than 25 words in either language, for a total of 788 English words and 580 Chinese words. A separate development set of 49 sentence pairs was used to control overfitting. These sets were the data used by Hwa et al. (2002). The hand-aligned test data consisted of 745 individual aligned word pairs. Words could be aligned one-to-many in either direction. This limits the performance achievable by our models; the IBM models allow one-to-many alignments in one direction only, while the tree-based models allow only one-to-one alignment unless the cloning operation is used.</Paragraph>
<Paragraph position="1"> Our French-English experiments were based on data from the Canadian Hansards made available by Ulrich Germann. We used as training data 20,000 sentence pairs of no more than 25 words in either language. Our test data consisted of 447 sentence pairs of no more than 30 words, hand aligned by Och and Ney (2000). A separate development set of 37 sentences was used to control overfitting.</Paragraph>
<Paragraph position="2"> We used a vocabulary of words occurring at least 10 times in the entire Hansard corpus, resulting in 19,304 English words and 22,906 French words.</Paragraph>
<Paragraph position="3"> Our test set is that used in the alignment evaluation organized by Mihalcea and Pedersen (2003), though we retained sentence-initial capitalization, used a closed vocabulary, and restricted ourselves to a smaller training corpus. We parsed the English side of the data with the Collins parser. As an artifact of the parser's probability model, it outputs sentence-final punctuation attached at the lowest level of the tree. We raised sentence-final punctuation to be a daughter of the tree's root before training our parse-based model. As our Chinese-English test data did not include sentence-final punctuation, we also removed it from our French-English test set.</Paragraph>
<Paragraph position="4"> We evaluate our translation models in terms of agreement with human-annotated word-level alignments between the sentence pairs. For scoring the Viterbi alignments of each system against the gold-standard annotated alignments, we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words:</Paragraph>
<Paragraph position="5"> AER = 1 − (|A ∩ G_S| + |A ∩ G_P|) / (|A| + |G_S|) </Paragraph>
<Paragraph position="6"> where A is the set of word pairs aligned by the automatic system, G_S is the set marked in the gold standard as &quot;sure&quot;, and G_P is the set marked as &quot;possible&quot; (including the &quot;sure&quot; pairs). In our Chinese-English data, only one type of alignment was marked, meaning that G_P = G_S. For a better understanding of how the models differ, we break this figure down into precision and recall:</Paragraph>
<Paragraph position="7"> precision = |A ∩ G_P| / |A|,   recall = |A ∩ G_S| / |G_S| </Paragraph>
<Paragraph position="8"> Since none of the systems presented in this comparison make use of hand-aligned data, they may differ in the overall proportion of words that are aligned, rather than inserted or deleted. This affects the precision/recall tradeoff; better results with respect to human alignments may be possible by adjusting an overall insertion probability in order to optimize AER.</Paragraph>
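As a concrete reading of these definitions, the short Python sketch below computes precision, recall, and AER from sets of word-index pairs. It simply transcribes the formulas above; the toy alignment sets at the bottom are invented for illustration.

    def alignment_scores(A, sure, possible):
        """Precision, recall, and alignment error rate (AER) over word-pair sets.

        A        -- set of (source_index, target_index) pairs proposed by a system
        sure     -- gold pairs marked "sure" (G_S)
        possible -- gold pairs marked "possible" (G_P), a superset of the sure pairs
        """
        precision = len(A & possible) / len(A)
        recall = len(A & sure) / len(sure)
        aer = 1 - (len(A & sure) + len(A & possible)) / (len(A) + len(sure))
        return precision, recall, aer

    # Invented toy example: 4 proposed links, 3 sure links, 1 extra possible link.
    proposed = {(0, 0), (1, 2), (2, 1), (3, 3)}
    sure = {(0, 0), (1, 2), (2, 2)}
    possible = sure | {(3, 3)}
    print(alignment_scores(proposed, sure, possible))  # approx. (0.75, 0.667, 0.286)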
<Paragraph position="9"> Table 1 provides a comparison of results using the tree-based models with the word-level IBM models.</Paragraph>
<Paragraph position="10"> IBM Models 1 and 4 refer to Brown et al. (1993).</Paragraph>
<Paragraph position="11"> We used the GIZA++ package, including the HMM model of Och and Ney (2000). We ran Model 1 for three iterations, then the HMM model for three iterations, and finally Model 4 for two iterations, training each model until AER began to increase on our held-out cross-validation data. &quot;Inversion Transduction Grammar&quot; (ITG) is the model of Wu (1997), &quot;Tree-to-String&quot; is the model of Yamada and Knight (2001), and &quot;Tree-to-String, Clone&quot; allows the node cloning operation described above. Our tree-based models were initialized from uniform distributions for both the lexical translation probabilities and the tree reordering operations, and were trained until AER began to rise on our held-out cross-validation data, which turned out to be four iterations for the tree-to-string models and three for the Inversion Transduction Grammar. French-English results are shown in Table 2. Here, IBM Model 1 was trained for 12 iterations, then the HMM model for 5 iterations and Model 4 for 5 iterations. The ITG and tree-to-string models were both trained for 5 iterations. A learning curve for the Inversion Transduction Grammar is shown in Figure 1, showing both perplexity on held-out data and alignment error rate.</Paragraph>
<Paragraph position="12"> In general we found that while all models would increase in AER if trained for too many iterations, the increases were only a few percent.</Paragraph>
</Section>
</Paper>