<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-2004">
  <Title>A Class-based Approach to Word Alignment</Title>
  <Section position="2" start_page="0" end_page="314" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Brown, Cocke, Della Pietra, Della Pietra, Jelinek, Lafferty, Mercer, and Roossin (1990) advocate a statistical approach to machine translation (MT) and initiate much of the recent interest in bilingual corpora. Statistical machine translation (SMT) can be understood as a word-by-word model consisting of two submodels: a language model for generating a source text segment S and a translation model for mapping S to its translation T. They recommend using a bilingual corpus to train the parameters of the translation probability, Pr(S | T), in the translation model. For MT and other purposes, many methods have been proposed for sentence alignment of the Hansards, an English-French corpus of Canadian parliamentary debates (Brown, Lai, and Mercer 1991; Gale and Church 1991a; Simard, Foster, and Isabelle 1992; Chen 1993; Gale and Church 1993), and for other language pairs, including English-German, English-Chinese, and English-Japanese (Kay and Röscheisen 1993; Church, Dagan, Gale, Fung, Helfman, and Satish 1993; Fung and McKeown 1994; Wu 1994). Alignment at other levels of resolution is obviously useful. A section, paragraph, sentence, phrase, collocation, or word can be aligned to its translation (Kupiec 1993; Smadja, McKeown, and Hatzivassiloglou 1996). Other logical approaches involve aligning parse trees of a sentence and its translation (Matsumoto, Ishimoto, and Utsuro 1993; Meyers, Yangarber, and Grishman 1996), or simultaneously generating parse trees and alignment arrangements (Wu 1995).</Paragraph>
    <Paragraph position="1"> * Department of Computer Science, National Tsing Hua University, Hsinchu, 30043, Taiwan, ROC. E-mail: ksj@volans.cs.scu.edu.tw; jschang@cs.nthu.edu.tw. © 1997 Association for Computational Linguistics. Computational Linguistics, Volume 23, Number 2. In addition to machine translation, many applications for aligned corpora have been suggested, including machine-aided translation (Shemtov 1993), translation assessment and critiquing tools (Isabelle 1992; des Tombe and Armstrong-Warwick 1993; Macklovitch 1994), text generation (Smadja 1992; Smadja, McKeown, and Hatzivassiloglou 1996), bilingual lexicography (Klavans and Tzoukermann 1990; Church and Gale 1991; Daille, Gaussier, and Lange 1994; Kupiec 1993; van der Eijk 1993; Li 1994; Wu and Xia 1994), and word-sense disambiguation (Gale, Church, and Yarowsky 1992; Chang, Chen, Sheng, and Ker 1996). For these applications, we must go one step beyond sentence alignment and identify alignment at the word level. In the process of word alignment, the translation of each source word is identified. This study concentrates primarily on identifying alignment at the word level for a given sentence and its translation.</Paragraph>
    <Paragraph position="2"> In the context of SMT, Brown et al. (1993) present a series of five models of Pr(S | T) for word alignment. Model 1 assumes that Pr(S | T) depends only on lexical translation probability (LTP) t(s | t), that is, the probability that the ith word s in S translates into the jth word t in T. The pair of words (s, t), or more precisely (s, t, i, j) since there could be more than one instance of s or t, is called a connection. Model 2 enhances Model 1 by considering the dependence of Pr(S | T) on the distortion probability (DP) d(i | j, l, m), where l and m are the respective lengths of S and T measured in number of words. Brown et al. (1990) propose using an adaptive Expectation and Maximization (EM) algorithm to estimate the parameters for LTP and DP from a bilingual corpus.</Paragraph>
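To make the Model 1 description above concrete, the following Python sketch scores a sentence pair and picks the most probable connection for each source word. It is an illustration under stated assumptions, not the formulation of Brown et al.: the null word and the length constant are left out, every target position is assumed equally likely, and the names model1_likelihood, best_connections, and the toy ltp table are hypothetical.

```python
from math import prod


def model1_likelihood(source, target, ltp):
    """Pr(S | T) in a simplified Model 1 form (assumption: no null word).

    ltp maps (s, t) pairs to t(s | t); pairs missing from the table count as 0.
    """
    m = len(target)
    # Each source word may connect to any of the m target words with equal weight.
    per_word = [sum(ltp.get((s, t), 0.0) for t in target) / m for s in source]
    return prod(per_word)


def best_connections(source, target, ltp):
    """Return one connection (s, t, i, j) per source word, maximizing t(s | t)."""
    connections = []
    for i, s in enumerate(source, start=1):
        j, t = max(enumerate(target, start=1),
                   key=lambda jt: ltp.get((s, jt[1]), 0.0))
        connections.append((s, t, i, j))
    return connections


# Toy, hand-made probabilities (hypothetical values, for illustration only).
ltp = {("la", "the"): 0.8, ("la", "house"): 0.1,
       ("maison", "the"): 0.05, ("maison", "house"): 0.9}
source = ["la", "maison"]
target = ["the", "house"]
likelihood = model1_likelihood(source, target, ltp)
connections = best_connections(source, target, ltp)
```

Model 2 would additionally weight each term by a distortion probability d(i | j, l, m) instead of the uniform factor 1/m used in this sketch.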
    <Paragraph position="3"> The EM algorithm iterates between two phases to estimate LTP and DP until both functions converge. In the expectation phase, the parameters t(s | t) and d(i | j, l, m) in the SMT model for all possible values of s, t, i, j, l, and m are estimated from the sample of an aligned bilingual corpus. In the maximization phase, each sentence-translation pair in the corpus is aligned by maximizing the translation probability, Pr(S | T). Brown et al. examine the feasibility of aligning the English-French Hansards corpus using the SMT model, on both the sentence level and the word level. The SMT model is then tested for the task of machine translation. The model produces 35 acceptable translations for 73 sentences. However, to our knowledge, the degree of success of word alignment has not yet been explored.</Paragraph>
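The iteration can be sketched for lexical translation probabilities alone. The code below follows the textbook EM loop for Model 1 (collect expected connection counts, then re-estimate), which may differ in detail from the adaptive variant summarized above. Distortion probabilities are omitted, there is no null word, and the function name train_ltp and the three-sentence corpus are hypothetical.

```python
from collections import defaultdict


def train_ltp(pairs, iterations=10):
    """Estimate t(s | t) from a list of (source_words, target_words) pairs."""
    source_vocab = {s for src, _ in pairs for s in src}
    uniform = 1.0 / len(source_vocab)
    ltp = defaultdict(lambda: uniform)  # t(s | t), uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each connection (s, t)
        total = defaultdict(float)  # expected count of t over all connections
        for src, tgt in pairs:
            for s in src:
                # Share one unit of count for s among the target words,
                # in proportion to the current t(s | t).
                norm = sum(ltp[(s, t)] for t in tgt)
                for t in tgt:
                    frac = ltp[(s, t)] / norm
                    count[(s, t)] += frac
                    total[t] += frac
        # Re-estimate so that, for each t, the values t(s | t) sum to 1.
        ltp = defaultdict(float,
                          {(s, t): c / total[t] for (s, t), c in count.items()})
    return ltp


# Toy corpus (hypothetical): German source sentences with English translations.
corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]
ltp = train_ltp(corpus)
```

On this toy corpus, t(das | the) grows from one iteration to the next while t(Haus | the) shrinks, because "das" is the only source word that co-occurs with "the" in both sentences that contain it.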
    <Paragraph position="4"> Dagan, Church, and Gale (1993) observe that reliably distinguishing sentence boundaries in a noisy bilingual text scanned by an OCR device is quite difficult. In such circumstances, they recommend aligning words directly, without the preprocessing phase of sentence alignment. Under that proposal, a rough character-by-character alignment is performed first. Guided by the character alignment, words are then aligned using a modified version of Brown et al.'s Model 2. The authors report that 60.5% of 65,000 words in a noisy document are correctly aligned. For 84% of the words, the offset from the correct alignment is at most 3.</Paragraph>
    <Paragraph position="5"> Gale and Church (1991b) present an alternative algorithm that does not estimate and store probabilities for all word pairs, which reduces memory requirements and makes probability estimation more robust. Instead, for each source word s, only a handful of target words strongly associated with s are found and stored. This is achieved by applying a χ²-like statistic. They report that the method produces highly precise (95%) alignment for 61% of the words in the 800 sentences tested.</Paragraph>
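One χ²-like statistic on a 2×2 contingency table is φ². The sketch below assumes that form and assumes co-occurrence is counted over aligned sentence pairs; it is meant only to show how a handful of strongly associated target words could be kept for a source word, and may not match the exact procedure of Gale and Church. The names phi_squared and top_associates, and the cutoff k, are hypothetical.

```python
from collections import Counter


def phi_squared(a, b, c, d):
    """Association score from a 2x2 table.

    a: pairs containing both s and t     b: pairs containing s but not t
    c: pairs containing t but not s      d: pairs containing neither
    """
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denominator


def top_associates(pairs, s, k=3):
    """Return the k target words most strongly associated with source word s."""
    n = len(pairs)
    targets_with_s = [set(tgt) for src, tgt in pairs if s in src]
    n_s = len(targets_with_s)
    target_freq = Counter(t for _, tgt in pairs for t in set(tgt))
    both = Counter(t for tgt in targets_with_s for t in tgt)
    scores = {}
    for t, a in both.items():
        b = n_s - a
        c = target_freq[t] - a
        d = n - a - b - c
        scores[t] = phi_squared(a, b, c, d)
    # Only the best-scoring candidates are stored, not a full probability table.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because only the top k candidates per source word are retained, storage grows with the vocabulary size rather than with the number of word pairs.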
    <Paragraph position="6"> This paper is motivated by the following observations: First, the above survey clearly reveals that word-based methods offer only limited coverage even after they are trained with an extremely large bilingual corpus. Second, we believe that for most applications, low coverage is just as serious as low precision. For aligned corpora to be useful for NLP tasks such as machine translation and word-sense disambiguation, a coverage rate higher than 60% is desirable, even at the expense of a slightly lower precision rate.</Paragraph>
    <Paragraph position="7"> Sue J. Ker and Jason S. Chang, Word Alignment. This paper presents a word alignment algorithm based on classification in existing thesauri. The proposed algorithm, called ClassAlign, relies on an automatic procedure to acquire class-based alignment rules; it does not employ word-by-word translation probabilities, nor does it use an iterative EM algorithm for estimating such probabilities. Experimental results indicate that classification based on existing thesauri is highly effective in broadening coverage while maintaining a high precision rate.</Paragraph>
    <Paragraph position="8"> The rest of this paper is organized as follows: In Section 2 we briefly discuss the nature of text and translation that justifies a class-based approach. A set of three algorithms leading to class-based alignment is outlined in Section 3. The algorithms' effectiveness is demonstrated through examples and their translations in the LecDOCE (Longman Group 1992), a bilingual version of the Longman Dictionary of Contemporary English (LDOCE, Proctor 1988), as well as sentences from bilingual texts in the LightShip User's Guide (Pilot Software Inc. 1993; Galaxy Software Services 1994). The experiments we undertook to assess the performance of these algorithms are the topic of Section 4, which also summarizes the quantitative results. In Section 5, we analyze the experimental results and consider ways in which the proposed algorithms might be extended and improved. Concluding remarks are made in Section 6.</Paragraph>
  </Section>
</Paper>