<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2124"> <Title>BiTAM: Bilingual Topic AdMixture Models for Word Alignment</Title> <Section position="7" start_page="972" end_page="975" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> We evaluate the BiTAM models on word alignment accuracy and translation quality. For word alignment accuracy, the F-measure is reported, i.e., the harmonic mean of precision and recall against a gold-standard reference set; for translation quality, Bleu (Papineni et al., 2002) and its NIST variant are reported.</Paragraph> <Paragraph position="1"> We have two training data settings of different sizes (see Table 1). The small one consists of 316 document-pairs from the Treebank (LDC2002E17). For the large training data setting, we collected additional document-pairs from FBIS (LDC2003E14, Beijing part), Sinorama (LDC2002E58), and Xinhua News (LDC2002E18; document boundaries are kept by our sentence-aligner (Zhao and Vogel, 2002)).</Paragraph> <Paragraph position="2"> There are 27,940 document-pairs, containing 327K sentence-pairs, or 12 million (12M) English tokens and 11M Chinese tokens. To evaluate word alignment, we hand-labeled 627 sentence-pairs from 95 document-pairs sampled from the TIDES'01 dryrun data; they contain 14,769 alignment-links. To evaluate translation quality, the TIDES'02 Eval.</Paragraph> <Paragraph position="3"> test set is used as the development set, and the TIDES'03 Eval. test set is used as the unseen test data.</Paragraph> <Section position="1" start_page="972" end_page="973" type="sub_section"> <SectionTitle> 5.1 Model Settings </SectionTitle> <Paragraph position="0"> First, we explore the effects of the Null word and the smoothing strategies. Empirically, we find that adding the &quot;Null&quot; word is always beneficial to all models, regardless of the number of topics selected.</Paragraph> <Paragraph position="1"> Table 3 shows the top English words of each topic, estimated from the topic-specific English sentences weighted by {φ_dnk}. 
33 functional words were removed to highlight the main content of each topic. Topic A concerns US-China economic relations; Topic B relates to mergers of Chinese companies; Topic C covers sports for handicapped people.</Paragraph> <Paragraph position="2"> The interpolation smoothing in §4.2 is effective, and it gives slightly better performance than Laplace smoothing across different numbers of topics for BiTAM-1. However, the interpolation leverages the competing baseline lexicon, which can blur the evaluation of BiTAM's own contributions.</Paragraph> <Paragraph position="3"> We therefore choose Laplace smoothing to better highlight BiTAM's strength. Without any smoothing, the F-measure drops quickly beyond two topics. In all the following experiments, we use both the Null word and Laplace smoothing for the BiTAM models.</Paragraph> <Paragraph position="4"> For comparison, we train IBM-1, IBM-4, and HMM models with 8 iterations of IBM-1, 7 of HMM, and 3 of IBM-4 (the 18h743 scheme), with the Null word and a maximum fertility of 3 for Chinese-English.</Paragraph> <Paragraph position="5"> Choosing the number of topics is a model selection problem. We performed a ten-fold cross-validation, and three topics were chosen for both the small and the large training data sets. The overall computational complexity of BiTAM is linear in the number of hidden topics.</Paragraph> </Section> <Section position="2" start_page="973" end_page="973" type="sub_section"> <SectionTitle> 5.2 Variational Inference </SectionTitle> <Paragraph position="0"> Under a non-symmetric Dirichlet prior, the hyperparameter α is initialized randomly; B (the K translation lexicons) is initialized uniformly, as in IBM-1. Better initialization of B can help to avoid poor local optima, as shown in § 5.5.</Paragraph> <Paragraph position="1"> With the learned B and α fixed, the variational parameters to be computed in Eqn. 
(8-10) are initialized randomly; the fixed-point iterative updates stop when the change in likelihood is smaller than 10^-5. The converged variational parameters, corresponding to the highest likelihood among 20 random restarts, are used to retrieve the word alignment for unseen document-pairs. To estimate B, β (for BiTAM-2), and α, at most eight variational EM iterations are run on the training data. Figure 2 shows an absolute 2-3% gain in F-measure over the variational EM iterations, using two and three topics for BiTAM-1, compared with IBM-1.</Paragraph> <Paragraph position="2"> (Figure 2 caption fragment: [...] smoothing; IBM-1 is shown over eight EM iterations for comparison.)</Paragraph> </Section> <Section position="3" start_page="973" end_page="973" type="sub_section"> <SectionTitle> 5.3 Topic-Specific Translation Lexicons </SectionTitle> <Paragraph position="0"> The topic-specific lexicons Bk are smaller than the IBM-1 lexicon, and they typically capture topic trends. For example, in our training data, North Korean is usually related to politics and translated into &quot;ChaoXian&quot; (朝鲜); South Korean occurs more often in economic contexts and is translated as &quot;HanGuo&quot; (韩国). BiTAMs discriminate between the two by considering the topics of the context. Table 2 shows the lexicon entries for &quot;Korean&quot; learned by a 3-topic BiTAM-1. The topic-specific values are sharper, and each clearly favors one of the two candidates. The co-occurrence count, however, only favors &quot;HanGuo&quot;, and this can easily dominate the decisions of the IBM and HMM models, which ignore the topical context. Monolingual topics learned by BiTAMs are, roughly speaking, fuzzy, especially when the number of topics is small. 
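To make the topic-conditioned disambiguation concrete, the following sketch mixes topic-specific lexicon probabilities with document-level topic weights. The lexicon values, topic weights, and function names are invented for illustration; they are not the entries reported in Table 2.

```python
# Hypothetical sketch of topic-admixture disambiguation: the effective
# translation probability is a topic-weighted mixture
#   p(f | e, d) = sum_k p(f | e, k) * p(k | d).
# All lexicon values and topic weights below are illustrative only.

def mixture_prob(e, f, topic_lexicons, topic_weights):
    """Topic-weighted translation probability p(f | e, d)."""
    return sum(topic_lexicons[k].get((e, f), 0.0) * topic_weights[k]
               for k in range(len(topic_weights)))

# Topic-specific lexicon entries for the English word "Korean".
topic_lexicons = [
    {("Korean", "ChaoXian"): 0.9, ("Korean", "HanGuo"): 0.1},  # politics
    {("Korean", "ChaoXian"): 0.1, ("Korean", "HanGuo"): 0.9},  # economics
    {("Korean", "ChaoXian"): 0.5, ("Korean", "HanGuo"): 0.5},  # sports
]

# A document whose inferred topic weights lean toward economics.
weights = [0.1, 0.8, 0.1]
p_chaoxian = mixture_prob("Korean", "ChaoXian", topic_lexicons, weights)
p_hanguo = mixture_prob("Korean", "HanGuo", topic_lexicons, weights)
```

With economics-leaning weights the mixture favors HanGuo, while politics-leaning weights would flip the preference, mirroring the disambiguation behavior described above.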
With proper filtering, we find that BiTAMs do capture some topics, as illustrated in Table 3.</Paragraph> </Section> <Section position="4" start_page="973" end_page="974" type="sub_section"> <SectionTitle> 5.4 Evaluating Word Alignments </SectionTitle> <Paragraph position="0"> We evaluate word alignment accuracies in various settings. Notably, BiTAM allows testing alignments in two directions: English-to-Chinese (EC) and Chinese-to-English (CE). Additional heuristics are applied to further improve the accuracies. Inter takes the intersection of the two directions and generates high-precision alignments; Union of the two directions gives high recall; Refined grows the intersection with the neighboring word-pairs seen in the union, and yields high-precision and high-recall alignments. (Table 4 caption fragment: [...] Models, and HMMs with a training scheme of 18h743 on the Treebank data listed in Table 1. For each column, the highlighted alignment, the best one under that model setting, is picked to further evaluate translation quality.)</Paragraph> <Paragraph position="1"> As shown in Table 4, the baseline IBM-1 gives its best performance of 36.27% in the CE direction; the UDA alignments from BiTAM-1–3 give 40.13%, 40.26%, and 40.47%, respectively, which are significantly better than IBM-1. A close look at the three BiTAMs does not reveal significant differences. BiTAM-3 is slightly better in most settings; BiTAM-1 is slightly worse than the other two, because the topics sampled at the sentence level are not very concentrated. The BDA alignments of BiTAM-1–3 yield 48.26%, 48.63%, and 49.02%, which are even better than HMM and IBM-4, whose best performances are 44.26% and 45.96%, respectively. This is because BDA partially utilizes similar heuristics on the approximated posterior matrix {φ_dnji}, instead of operating directly on the alignments of the two directions as the Refined heuristic does. 
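The Inter, Union, and Refined heuristics, together with the F-measure used throughout this section, can be sketched as follows. The grow loop is a simplified rendering of the Refined heuristic, not the exact procedure used in the experiments, and the toy alignments are illustrative.

```python
# Sketch of alignment symmetrization heuristics and the F-measure.
# Alignments are sets of (src, tgt) index pairs.

def inter(ec, ce):
    """High-precision: links found in both directions."""
    return ec.intersection(ce)

def union(ec, ce):
    """High-recall: links found in either direction."""
    return ec.union(ce)

def refined(ec, ce):
    """Grow the intersection with neighboring links seen in the union."""
    aligned = set(inter(ec, ce))
    candidates = union(ec, ce) - aligned
    changed = True
    while changed:
        changed = False
        for (i, j) in sorted(candidates - aligned):
            # accept a union link if it neighbors an accepted link
            if any((i + di, j + dj) in aligned
                   for di in (-1, 0, 1) for dj in (-1, 0, 1)):
                aligned.add((i, j))
                changed = True
    return aligned

def f_measure(hypothesis, gold):
    """Harmonic mean of precision and recall against a gold standard."""
    if not hypothesis or not gold:
        return 0.0
    correct = len(hypothesis.intersection(gold))
    precision = correct / len(hypothesis)
    recall = correct / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two directional alignments and a gold reference.
ec = {(0, 0), (1, 1), (2, 1)}   # English-to-Chinese direction
ce = {(0, 0), (1, 1), (1, 2)}   # Chinese-to-English direction
gold = {(0, 0), (1, 1), (2, 1)}
```

On this toy pair, Inter yields two high-precision links, while the grow step recovers both remaining union links, illustrating why the refined alignments trade a little precision for recall.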
In practice, we also apply these heuristics to IBM-1, HMM, and IBM-4; the best achieved performances are 40.56%, 46.52%, and 49.18%, respectively. Overall, the BiTAM models achieve performances close to, or higher than, those of HMM, using only a very simple IBM-1-style alignment model.</Paragraph> <Paragraph position="2"> Similar improvements over the IBM models and HMM are preserved after applying the three kinds of heuristics above. As expected, since BDA already encodes some heuristics, it is only slightly improved by the Union heuristic; UDA, similar to the Viterbi-style alignment in IBM and HMM, benefits more from the Refined heuristic.</Paragraph> <Paragraph position="3"> We also test BiTAM-3 on the large training data, and similar improvements over the baseline models are observed (see Table 5).</Paragraph> </Section> <Section position="5" start_page="974" end_page="974" type="sub_section"> <SectionTitle> 5.5 Boosting BiTAM Models </SectionTitle> <Paragraph position="0"> The translation lexicons B_{f,e,k} were initialized uniformly in our previous experiments. Better initializations can potentially lead to better performance, because they help to avoid undesirable local optima in the variational EM iterations.</Paragraph> <Paragraph position="1"> We use the lexicons from IBM Model-4 to initialize B_{f,e,k} to boost the BiTAM models. This is one way of applying the proposed BiTAM models in current state-of-the-art SMT systems for further improvement. The boosted alignments are denoted as BUDA and BBDA in Table 5, corresponding to the uni-directional and bi-directional alignments, respectively. We see an improvement in alignment quality.</Paragraph> </Section> <Section position="6" start_page="974" end_page="975" type="sub_section"> <SectionTitle> 5.6 Evaluating Translations </SectionTitle> <Paragraph position="0"> To further evaluate our BiTAM models, the word alignments are used in a phrase-based decoder to evaluate translation quality. 
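As a rough illustration of how phrase-pairs can be read off a word alignment under a consistency constraint, the sketch below enumerates source spans and keeps only span pairs with no alignment link crossing the pair boundary. This is a simplified textbook-style procedure with hypothetical names, not the exact coherence-constraint filtering used here.

```python
# A minimal sketch of consistent phrase-pair extraction from a word
# alignment (a set of (src, tgt) index pairs).  A span pair is kept only
# if no alignment link connects a target word inside the span to a
# source word outside it.

def extract_phrases(alignment, src_len, max_len=4):
    """Return consistent ((i1, i2), (j1, j2)) span pairs, inclusive ends."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            src_span = range(i1, i2 + 1)
            # target positions linked to the source span
            tgt = [j for (i, j) in alignment if i in src_span]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 + 1 > max_len:
                continue
            tgt_span = range(j1, j2 + 1)
            # consistency: every link touching the target span must
            # originate inside the source span
            if all(i in src_span for (i, j) in alignment if j in tgt_span):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Toy 3x3 sentence pair with one crossing link.
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(alignment, src_len=3)
```

On this toy alignment, the crossing links are extracted only as a two-word block or as single words, which is exactly the behavior the consistency constraint is meant to enforce.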
Similar to the Pharaoh package (Koehn, 2004), we extract phrase-pairs directly from the word alignment, using coherence constraints (Fox, 2002) to remove noisy pairs. We use the TIDES Eval'02 CE test set as development data to tune the decoder parameters; the Eval'03 data (919 sentences) is the unseen test data. A trigram language model is built from 180 million English words. Across all the reported comparative settings, the key difference is the bilingual n-gram identity of the phrase-pairs, which are collected directly from the underlying word alignment.</Paragraph> <Paragraph position="1"> Table 4 shows the results for the small-data track; the large-data track results are in Table 5. For the small-data track, the baseline Bleu scores for IBM-1, HMM, and IBM-4 are 15.70, 17.70, and 18.25, respectively. The UDA alignment of BiTAM-1 improves over the baseline IBM-1 from 15.70 to 17.93, close to HMM's performance, even though BiTAM does not exploit any sequential word structure. The proposed BiTAM-2 and BiTAM-3 are slightly better than BiTAM-1. Similar improvements are observed for the large-data track (see Table 5). Note that the boosted BiTAM-3, using IBM-4 as the seed lexicon, outperforms the Refined IBM-4: from 23.18 to 24.07 in Bleu, and from 7.83 to 8.23 in NIST. (Table 5 caption fragment: [...] HMMs, and boosted BiTAMs using all the training data listed in Table 1; other experimental conditions are similar to Table 4.) This result suggests a straightforward way of leveraging BiTAMs to improve statistical machine translation.</Paragraph> </Section> </Section> </Paper>