<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2117"> <Title>Boosting Statistical Word Alignment Using Labeled and Unlabeled Data</Title> <Section position="3" start_page="0" end_page="913" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Word alignment was first proposed as an intermediate result of statistical machine translation (Brown et al., 1993). In recent years, many researchers have built alignment links with bilingual corpora (Wu, 1997; Och and Ney, 2003; Cherry and Lin, 2003; Wu et al., 2005; Zhang and Gildea, 2005). These methods train the alignment models on unlabeled data in an unsupervised manner.</Paragraph> <Paragraph position="1"> A natural question is whether we can further improve the performance of word aligners with the available data and alignment models. One possible solution is the boosting method (Freund and Schapire, 1996), one of the ensemble methods (Dietterich, 2000). The underlying idea of boosting is to combine simple &quot;rules&quot; into an ensemble whose performance is better than that of any single rule. The AdaBoost (Adaptive Boosting) algorithm of Freund and Schapire (1996) was developed for supervised learning.</Paragraph> <Paragraph position="2"> When applied to word alignment, it must solve the problem of building a reference set for the unlabeled data. Wu and Wang (2005) developed an unsupervised AdaBoost algorithm that improves alignment results by automatically building a pseudo reference set for the unlabeled data.</Paragraph> <Paragraph position="3"> Large amounts of unlabeled data are easy to obtain, whereas labeled data is costly; however, labeled data is valuable for improving the performance of learners. 
Consequently, semi-supervised learning, which combines both labeled and unlabeled data, has been applied to NLP tasks such as word sense disambiguation (Yarowsky, 1995; Pham et al., 2005), classification (Blum and Mitchell, 1998; Thorsten, 1999), clustering (Basu et al., 2004), named entity classification (Collins and Singer, 1999), and parsing (Sarkar, 2001).</Paragraph> <Paragraph position="4"> In this paper, we propose a semi-supervised boosting method to improve statistical word alignment with both limited labeled data and large amounts of unlabeled data. The proposed approach extends the supervised AdaBoost algorithm to a semi-supervised learning algorithm by incorporating the unlabeled data. To do so, it must address three problems. The first is to build a word alignment model with both labeled and unlabeled data. With the labeled data, we build a supervised model by directly estimating the parameters of the model instead of using the Expectation Maximization (EM) algorithm of Brown et al. (1993).</Paragraph> <Paragraph position="5"> With the unlabeled data, we build an unsupervised model by estimating the parameters with the EM algorithm. Based on these two word alignment models, an interpolated model is built through linear interpolation, and this interpolated model serves as the learner in the semi-supervised AdaBoost algorithm. The second problem is to build a reference set for the unlabeled data; it is built automatically with a modified version of the &quot;refined&quot; combination method described in Och and Ney (2000). The third is to calculate the error rate on each round. Although we build a reference set for the unlabeled data, it still contains alignment errors. 
Thus, to calculate the error rate on each round, we use the reference set of the labeled data instead of that of the entire training data.</Paragraph> <Paragraph position="6"> With the interpolated model as the learner in the semi-supervised AdaBoost algorithm, we investigate two boosting methods to improve statistical word alignment. The first method uses the unlabeled data only in the interpolated model; during training, it changes only the distribution of the labeled data. The second method changes the distribution of both the labeled and the unlabeled data during training. Experimental results show that both methods improve the performance of statistical word alignment.</Paragraph> <Paragraph position="7"> In addition, we combine the final results of the two semi-supervised boosting methods.</Paragraph> <Paragraph position="8"> Experimental results indicate that this combination outperforms the unsupervised boosting method of Wu and Wang (2005), achieving a relative error rate reduction of 19.52%. It also achieves a reduction of 28.29% compared with the supervised boosting method that uses only the labeled data.</Paragraph> <Paragraph position="9"> The remainder of this paper is organized as follows. Section 2 briefly introduces the statistical word alignment model. Section 3 describes the parameter estimation method using the labeled data. Section 4 presents our semi-supervised boosting method. Section 5 reports the experimental results. Finally, Section 6 concludes.</Paragraph> </Section></Paper>
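The two computational steps the introduction relies on, linearly interpolating a supervised and an unsupervised alignment model and re-weighting the labeled training pairs with an error rate computed on the labeled reference only, can be sketched as follows. This is a minimal hypothetical illustration, not the authors' implementation: the dictionary representation of model probabilities, the function names, and the boolean per-sentence-pair correctness flags are all assumptions made for the sketch.

```python
import math


def interpolate(p_sup, p_unsup, lam):
    """Linear interpolation of two alignment models:
    p(k) = lam * p_sup(k) + (1 - lam) * p_unsup(k).
    Both models are represented here as dicts mapping an event
    (e.g. a word pair) to a probability; missing events count as 0."""
    return {k: lam * p_sup.get(k, 0.0) + (1 - lam) * p_unsup.get(k, 0.0)
            for k in set(p_sup) | set(p_unsup)}


def adaboost_round(weights, correct):
    """One AdaBoost re-weighting step over the labeled sentence pairs.
    `correct[i]` is True if the learner aligned pair i without error
    against the labeled reference set (the unlabeled data's noisy
    pseudo reference is deliberately excluded from the error rate).
    Returns the normalized new weights and the learner weight alpha."""
    total = sum(weights)
    error = sum(w for w, c in zip(weights, correct) if not c) / total
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misaligned pairs are up-weighted, correctly aligned pairs down-weighted.
    new = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new], alpha
```

Under this sketch, the paper's first method would re-weight only the labeled pairs as above, while the second method would apply the same update to the unlabeled pairs against their pseudo reference as well.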