<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1054"> <Title>Text Alignment in a Tool for Translating Revised Documents</Title> <Section position="3" start_page="0" end_page="451" type="intro"> <SectionTitle> 3 Alignment </SectionTitle> <Paragraph position="0"> Length-based alignment algorithms [Gale and Church, 1991b; Brown et al., 1991] are computationally efficient, which makes them attractive for aligning large quantities of text. Their main problem is that they expect that, by and large, every sentence in one language has a corresponding sentence in the other (there can be insertions and deletions, but they must be minor). In the character-based algorithm, for example, this is implicit in the assumption that the number of characters of the SL text at each point (counting from the beginning of the text) is a predictor for the number of characters in the TL. This assumption may hold for some texts, but it cannot be relied on. As a consequence of nationalization, one text may be substantially longer than the other, and this makes the length-correspondence assumption incorrect (if the additions and omissions were not reflected in the lengths of the two texts, the situation would be even worse). Simply put, the cumulative length of the text is no longer a good predictor of the length of its translation. This problem affects the treatment of the text as a whole; locally, however, the length-correspondence assumption can still be maintained. Gale and Church hint that their method works well for aligning sentences within paragraphs and that they use different means to find the correspondence (or lack thereof) of paragraphs. A more detailed description of such an approach is given by Brown et al., who use structural information to drive the correspondence of larger quantities of text. However, such clues are not always available. In order to address this problem more generally, I developed an algorithm that is more robust in detecting insertions and deletions, which I use for aligning paragraphs.</Paragraph> <Section position="1" start_page="449" end_page="450" type="sub_section"> <SectionTitle> 3.1 Aligning Paragraphs </SectionTitle> <Paragraph position="0"> The paragraph alignment algorithm relies on the observation that long segments of text translate into long segments and short ones into short ones. Unlike the approach taken by Gale and Church, it does not assume that for each text segment in the SL version there is a corresponding segment in the TL. Instead, the algorithm calculates for each pair of text segments (paragraphs in this case) a score based on their lengths. For each potential pair of segments, several editing assumptions (one-to-one, one-to-many, etc.) are considered and the one with the best score is chosen. Dynamic programming is then used to collect the set of pairs which yields the maximum-likelihood alignment. The score needs to favor pairing segments of roughly the same length, but since there is more variability as the length of the segments increases, the score needs to be more tolerant with longer segments. This effect is achieved by the following formula, which provides the basis for scoring: $s(i,j) = |l_i - l_j| / \sqrt{l_i + l_j}$. It approaches zero as the lengths get closer, but it does so faster as the absolute length of the segments gets longer. So, for example, $s_{10,20} = 1.8257$, but $s_{160,170} = 0.5504$ (the square root of the sum is used instead of simply the sum so that $s_{10,20}$ would be different from $s_{100,200}$).</Paragraph>
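As a minimal sketch (not the paper's code), the formula and the worked examples above can be reproduced as follows; the function name is mine, and the second example uses the pair lengths reconstructed from the printed values:

import math

def length_score(l_i: int, l_j: int) -> float:
    # Basis for scoring a candidate pair of segments: approaches zero
    # as the lengths get closer, and for a fixed absolute difference
    # it shrinks as the segments get longer, giving more tolerance
    # to long segments.
    return abs(l_i - l_j) / math.sqrt(l_i + l_j)

print(length_score(10, 20))    # 1.8257...
print(length_score(160, 170))  # 0.5504...
# With the plain sum in the denominator, s(10,20) and s(100,200)
# would both be 0.333...; the square root keeps them distinct:
print(length_score(100, 200))  # 5.7735...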
<Paragraph position="1"> This simple heuristic seems to work well for the purpose of distinguishing correlated text segments. However, since paragraphs can be quite long and the degree of variability between them grows proportionally, this score is not always sufficient to put things in order. To augment it, more information is considered. The actual score for deciding that two paragraphs are matched also takes into consideration a sequence of paragraphs immediately preceding and following them (see figure 1 for an illustration). This is based on the observation that the potential for aligning a pair of segments also depends on the potential of them being in a context of alignable pairs of segments. According to this scheme, a pair with a relatively low score can still be taken as a correspondence if there are segments of text preceding and following it which are likely to form correspondences.</Paragraph> <Paragraph position="2"> [Figure 1: the gray squares identify the context used for aligning the pair denoted by the black squares; the marked path stands for the correct alignment.] </Paragraph> <Paragraph position="3"> This scheme lends itself to calculating a score for the assumption that a given paragraph is an insertion (or deletion). So, if segment i is an insertion, the context for considering it will consist of the following pairs: ... i-2/j-2, i-1/j-1, i+1/j, i+2/j+1 ... This way, a score is assigned to the assumption that a certain segment in one text has no corresponding segment in the other text. Likewise, if j and j+1 are insertions into the other text, the score considers ... i-2/j-2, i-1/j-1, i/j+2, i+1/j+3 ... as the appropriate context for calculating the score.</Paragraph> <Paragraph position="4"> It is easy to see how this works for insertions of short sequences, but it remains to be explained how arbitrarily long sequences are handled. In principle, it would be best if, for each n (the length of a sequence of insertions), the following context consisted of i+n/j, i+n+1/j+1, etc., but obviously this is not practical. This is related to another potential problem, which has to do with the contexts calculated near insertions or deletions. Figure 1 depicts this situation.</Paragraph> <Paragraph position="5"> The alignment score of a segment preceding an insertion is based on appropriate preceding context but irrelevant following context (the reverse holds for a segment following an insertion).1 To minimize the effect of this situation, a threshold is introduced so that when the score on one side of the context is good, the effect of a very bad score on the other side is kept below a certain value. Note also that although some noise is introduced into the calculation of these scores, other editing assumptions are likely to be considered even worse. Occasionally this has an effect on the exact placement of the insertion, but in most cases the dynamic programming approach, by seeking a global maximum, picks up the correct alignment.</Paragraph> <Paragraph position="6"> 1 This is an important factor for selecting the amount of context. It could be assumed that the wider the window of segments around each pair, the more accurate the determination of its alignment will be. However, this is not exactly the case, because occasionally the algorithm has to consider some 'noise'. Empirical experimentation revealed that a window of 6 segments (3 to each side) provides the best compromise between beneficial information and noise.</Paragraph>
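To make the windowing concrete, here is a rough sketch under stated assumptions: the window of 3 segments per side follows the footnote above, but the cap value, the per-side averaging, and all names are illustrative guesses rather than the paper's actual choices:

import math

def length_score(l_i, l_j):
    return abs(l_i - l_j) / math.sqrt(l_i + l_j)

def side_score(src, tgt, pairs):
    # Average base score over one side of the context window,
    # ignoring pairs that fall outside either text.
    vals = [length_score(src[a], tgt[b])
            for a, b in pairs if 0 <= a < len(src) and 0 <= b < len(tgt)]
    return sum(vals) / len(vals) if vals else 0.0

def context_score(src, tgt, i, j, window=3, cap=2.0):
    # src, tgt: paragraph lengths. Score the hypothesis that segments
    # i and j form a pair, in the context of the pairs around them.
    before = side_score(src, tgt, [(i - d, j - d) for d in range(1, window + 1)])
    after = side_score(src, tgt, [(i + d, j + d) for d in range(1, window + 1)])
    # Threshold: when one side of the context is good, a very bad
    # score on the other side is kept below a certain value ('cap'
    # is an assumed name and value).
    good, bad = sorted((before, after))
    return length_score(src[i], tgt[j]) + good + min(bad, cap)

def insertion_context(i, j, window=3):
    # Context pairs for the hypothesis that source segment i is an
    # insertion: ... i-2/j-2, i-1/j-1 before it; i+1/j, i+2/j+1 after.
    before = [(i - d, j - d) for d in range(1, window + 1)]
    after = [(i + d, j + d - 1) for d in range(1, window + 1)]
    return before, after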
<Paragraph position="7"> Now, let me return to the issue of long sequences of insertions. The situation is that in one location there is a sequence of high-quality alignment, then there is a disruption with scores calculated for arbitrary pairs of text segments, and then another sequence of high-quality alignment begins. What happens in most cases is that between these two points, the scores for insertions or deletions are better than the scores assigned to random pairs of segments. Here too, the effect of global maximization forces the algorithm to pass through the points where the insertion begins, resume synchronization where it ends, and consider the points in between as a long sequence of unpaired segments of text. In other words, once the edges are set correctly, the remainder of the chain is almost always also correct, even though it is not based on appropriate contexts.</Paragraph> <Paragraph position="8"> This potential problem is the weakest aspect of the algorithm, but essentially it does not have an impact on the quality of the alignment. Note also that even if the exact locus of an insertion (or deletion) is not known, the fact that the algorithm detects the presence of text with no corresponding translation is the crucial matter. This way, the synchronization of the text segments can be maintained, and alignment errors, even when they happen, have only a very local effect. To demonstrate this, let us consider a concrete example. The English and French versions of a software manual contain 628 and 640 paragraphs, respectively. In all, 30 paragraphs embedded in them have no translation (some in fact do, but due to reordering of the text, these were treated as a deletion from one location and an insertion in another). The algorithm matched 618 pairs of paragraphs, only 11 of which were actually wrong. Note that between the two texts there were 13 different insertions and deletions, of sequences varying from 1 to 6 paragraphs in length. The algorithm has proven to be extremely reliable in detecting segments of text that do not have a translation, and this makes it very useful in dealing with what I have called "real-life" texts.</Paragraph> <Paragraph position="9"> To summarize, this algorithm relies on the general assumption that the length of a segment of text is correlated with the length of its translation. It uses a sliding window to determine, for each segment, the likelihood of its being part of a sequence of aligned text. This technique treats correspondence as a local phenomenon, thereby allowing segments of text to appear in one text without corresponding segments in its translation.</Paragraph> </Section>
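Purely as an illustration of how dynamic programming collects the best chain of editing assumptions, here is a minimal sketch; the insertion penalty, the cost-minimizing formulation, and the restriction to a handful of editing operations are assumptions of mine, and the actual algorithm scores pairs in context as described above:

import math

def align(src, tgt, score, ins_penalty=3.0):
    # src, tgt: segment lengths; score: a pair-scoring function such
    # as length_score above (lower is better). Returns the cost of the
    # best global alignment; backpointers (omitted here) would recover
    # the pairing itself.
    n, m = len(src), len(tgt)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = best[i][j]
            if here == math.inf:
                continue
            if i < n and j < m:  # one-to-one
                c = here + score(src[i], tgt[j])
                best[i + 1][j + 1] = min(best[i + 1][j + 1], c)
            if i < n:            # source segment has no counterpart
                best[i + 1][j] = min(best[i + 1][j], here + ins_penalty)
            if j < m:            # target segment has no counterpart
                best[i][j + 1] = min(best[i][j + 1], here + ins_penalty)
            if i + 1 < n and j < m:  # two-to-one
                c = here + score(src[i] + src[i + 1], tgt[j])
                best[i + 2][j + 1] = min(best[i + 2][j + 1], c)
            if i < n and j + 1 < m:  # one-to-two
                c = here + score(src[i], tgt[j] + tgt[j + 1])
                best[i + 1][j + 2] = min(best[i + 1][j + 2], c)
    return best[n][m]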
<Section position="2" start_page="450" end_page="450" type="sub_section"> <SectionTitle> 3.2 Aligning Sentences </SectionTitle> <Paragraph position="0"> Sentences within paragraphs are aligned with the character-based probabilistic algorithm [Gale and Church, 1991b]. I used their algorithm because, compared to the algorithm described in the previous section, it rests on firmer theoretical grounds, and within paragraphs the assumptions it is based on are usually met.</Paragraph> <Paragraph position="1"> However, there can be cases where it is advantageous to use the new algorithm even at the sentence level. In texts where paragraphs are very long and contain sequences of inserted sentences, the character-based alignment will not perform well, for the same reasons discussed above.</Paragraph> <Paragraph position="2"> Even a small number of additions to or omissions from one of the texts completely throws off alignment algorithms that do not entertain this possibility. In this respect, the new algorithm is more general than previous length-based approaches to alignment.</Paragraph> </Section> <Section position="3" start_page="450" end_page="451" type="sub_section"> <SectionTitle> 3.3 Minimizing alignment errors </SectionTitle> <Paragraph position="0"> An inherent property of the dynamic programming technique is that the effect of errors is kept at the local level; a single wrong pairing of two segments does not force all the following pairs to be incorrect as well. This behavior is achieved by forcing another error, close to the first one, which compensates for the mistake and restores synchronization. As a result, errors in the alignment usually occur in pairs of opposite directionality (if the first error is to insert a sentence into one of the texts, the second is to insert a sentence into the other text). This situation is depicted in figure 2.</Paragraph> <Paragraph position="1"> This, of course, can be a perfectly legitimate alignment, but it is more likely to be the result of an error. These cases are easy to detect with a simple algorithm which, at the expense of losing some information, can yield much better overall accuracy.</Paragraph> <Paragraph position="2"> Each pair in the alignment is assigned one of three values: $\alpha$ if it is a many-to-one (or one-to-zero) alignment, $\beta$ if it is a one-to-one alignment, and $\gamma$ if it is a one-to-many (or zero-to-one) alignment. Intuitively, these values correspond to which text grows faster as a result of each pair of aligned segments. Having done that, the algorithm is simply a finite-state automaton that detects sequences of the form $\alpha\beta^k\gamma$ (or $\gamma\beta^k\alpha$), where k ranges from 0 to n (a predefined window size). The effect is that when an error occurs in one position and there is another "error" (with the opposite contribution to the relative length of the text) within a certain number of segments, it is interpreted as a case of compensation; if it occurs farther away, the situation is interpreted as involving two independent editing operations. The window is set to 4, since the dynamic programming approach is very fast in recovering from local errors.</Paragraph> <Paragraph position="3"> When such a sequence is found, all the segments included in it are marked as insertions, so the resulting alignment contains two contiguous sequences of inserted material, one in each of the texts. This prevents wrong pairings from occurring between the two identified alignment errors. For example, in figure 2, the pairing of segments 5/8 and 6/9 is undone, as it is likely to be incorrect.</Paragraph>
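A sketch of the finite-state detection, assuming each aligned pair has already been labelled as above; the label encoding ('a', 'b', 'g' for alpha, beta, gamma) and the list-based scan are mine:

def compensating_errors(labels, window=4):
    # labels[p] is 'a' for a many-to-one (or one-to-zero) pair,
    # 'b' for one-to-one, 'g' for one-to-many (or zero-to-one).
    # Detects spans of the form a b^k g (or g b^k a) with k <= window
    # and returns them so that all their segments can be re-marked
    # as insertions.
    spans = []
    for start, lab in enumerate(labels):
        if lab == 'b':
            continue
        opposite = 'g' if lab == 'a' else 'a'
        for k in range(window + 1):
            end = start + k + 1
            if end >= len(labels):
                break
            if any(l != 'b' for l in labels[start + 1:end]):
                break  # only one-to-one pairs may intervene
            if labels[end] == opposite:
                spans.append((start, end))
                break
    return spans

# Example: an insertion into one text compensated two pairs later by
# an insertion into the other is flagged as a likely error pair:
print(compensating_errors(['b', 'a', 'b', 'b', 'g', 'b']))  # [(1, 4)]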
<Paragraph position="4"> Another possibility for minimizing the effect of alignment errors has to do with the fact that occasionally the exact location of an insertion cannot be determined completely accurately. I found that by disregarding a very small region around each instance of an insertion or deletion, the number of alignment mistakes can be reduced even further. At the moment I have found this to be unnecessary, but it may be advantageous for other applications, such as obtaining even higher-quality pairs for the purpose of extracting word correspondences.</Paragraph> </Section> </Section> </Paper>