File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/p05-1058_metho.xml
Size: 2,610 bytes
Last Modified: 2025-10-06 14:09:50
<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1058"> <Title>Alignment Model Adaptation for Domain-Specific Word Alignment</Title> <Section position="3" start_page="467" end_page="468" type="metho"> <SectionTitle> 2 Statistical Word Alignment </SectionTitle> <Paragraph position="0"> According to the IBM models (Brown et al., 1993), the statistical word alignment model can be represented in general form as in Equation (1).</Paragraph> <Paragraph position="1"> In this paper, we use a simplified IBM model 4 (Al-Onaizan et al., 1999), which is shown in Equation (2). This simplified version does not take into account the word classes described in Brown et al. (1993).</Paragraph> <Paragraph position="3"> In Equation (2), l and m are the lengths of the target sentence and the source sentence, respectively.</Paragraph> <Paragraph position="4"> j is the position index of the source word. a_j is the position of the target word aligned to the j-th source word.</Paragraph> <Paragraph position="6"> p is the fertility probability for e , and is the distortion probability for the remaining words of the cept.</Paragraph> <Paragraph position="8"> is the center of cept i.</Paragraph> <Paragraph position="9"> During training, IBM model 3 is trained first, and its parameters are then used to train model 4. During testing, the trained model 3 is likewise used to obtain an initial alignment, which the trained model 4 then improves. For convenience, we describe model 3 in Equation (3).</Paragraph> <Paragraph position="10"> The main difference between model 3 and model 4 lies in the calculation of the distortion probability.</Paragraph> <Paragraph position="12"> However, neither model 3 nor model 4 takes multi-word cepts into account: only one-to-one and many-to-one word alignments are considered. Thus, some multi-word units in the domain-specific corpus cannot be correctly aligned. 
To deal with this problem, we perform word alignment in two directions (source to target, and target to source), as described in Och and Ney (2000). The GIZA++ toolkit is used to perform statistical word alignment.</Paragraph> <Paragraph position="13"> We use and to represent the bi-directional alignment sets, which are shown in Equations (4) and (5). For alignments in both sets, we use j for source words and i for target words. If a target word in position i is connected to source words in positions and , then .</Paragraph> <Paragraph position="14"> We call an element in the alignment set an alignment link.</Paragraph> </Section> </Paper>