<?xml version="1.0" standalone="yes"?> <Paper uid="P91-1023"> <Title>A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA</Title> <Section position="3" start_page="178" end_page="179" type="metho"> <SectionTitle> 2. A Dynamic Programming Framework </SectionTitle> <Paragraph position="0"> Now, let us consider how sentences can be aligned within a paragraph. The program makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of lengths of the two sentences (in characters) and the variance of this We will have little to say about how sentence boundaries are identified. Identifying sentence boundaries is not always as easy as it might appear, for reasons described in Liberman and Church (to appear). It would be much easier if periods were always used to mark sentence boundaries, but unfortunately, many periods have other purposes. In the Brown Corpus, for example, only 90% of the periods are used to mark sentence boundaries; the remaining 10% appear in numerical expressions, abbreviations and so forth. In the Wall Street Journal, there is even more discussion of dollar amounts and percentages, as well as more use of abbreviated titles such as Mr.; consequently, only 53% of the periods in the Wall Street Journal are used to identify sentence boundaries.</Paragraph> <Paragraph position="1"> For the UBS data, a simple set of heuristics was used to identify sentence boundaries. The dataset was sufficiently small that it was possible to correct the remaining mistakes by hand. For a larger dataset, such as the Canadian Hansards, it was not possible to check the results by hand. We used the same procedure which is used in (Church, 1988). 
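The period-disambiguation problem described above can be illustrated with a minimal sketch. This is not the Baker procedure used in the paper; it is a simple illustrative heuristic that splits on sentence-final punctuation followed by whitespace and a capital letter, and skips a small hypothetical abbreviation list:

```python
import re

# Illustrative abbreviation list (hypothetical; the paper's heuristics are not reproduced here).
ABBREVS = {"Mr.", "Mrs.", "Dr.", "St.", "No."}

def split_sentences(text):
    """Split text on '.', '!' or '?' followed by whitespace and an uppercase
    letter, unless the period belongs to a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        # Last token before the candidate boundary, including its period.
        token = text[start:m.end()].split()[-1]
        if token in ABBREVS:
            continue  # period marks an abbreviation, not a sentence end
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```

Note that a period inside a numerical expression such as "$3.5" is never followed by whitespace plus a capital, so it is not treated as a boundary; this mirrors the 10% of Brown Corpus periods discussed above.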
This procedure was developed by Kathryn Baker (private communication).</Paragraph> <Paragraph position="2"> ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences.</Paragraph> <Paragraph position="3"> We were led to this approach after noting that the lengths (in characters) of English and German paragraphs are highly correlated (.991), as illustrated in the following figure.</Paragraph> <Paragraph position="4"> [Figure caption: The horizontal scale shows the lengths of English paragraphs, while the vertical scale shows the lengths of the corresponding German paragraphs. Note that the correlation is quite large (.991).]</Paragraph> <Paragraph position="5"> Dynamic programming is often used to align two sequences of symbols in a variety of settings, such as genetic code sequences from different species, speech sequences from different speakers, gas chromatograph sequences from different compounds, and geologic sequences from different locations (Sankoff and Kruskal, 1983).</Paragraph> <Paragraph position="6"> We could expect these matching techniques to be useful, as long as the order of the sentences does not differ too radically between the two languages.</Paragraph> <Paragraph position="7"> Details of the alignment techniques differ considerably from one application to another, but all use a distance measure to compare two individual elements within the sequences, and a dynamic programming algorithm to minimize the total distances between aligned elements within two sequences. We have found that the sentence alignment problem fits fairly well into this framework.</Paragraph> </Section> <Section position="4" start_page="179" end_page="180" type="metho"> <SectionTitle> 3. The Distance Measure </SectionTitle> <Paragraph position="0"> It is convenient for the distance measure to be based on a probabilistic model so that information can be combined in a consistent way. 
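The reported correlation of .991 between paragraph lengths is an ordinary Pearson correlation coefficient, which can be sketched as follows (the function below is a standard textbook computation, not code from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. English and German paragraph lengths in characters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1, as observed for the UBS paragraphs, is what licenses using length alone as the alignment signal.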
Our distance measure is an estimate of -log Prob(match | δ), where δ depends on l1 and l2, the lengths of the two portions of text under consideration. The log is introduced here so that adding distances will produce desirable results.</Paragraph> <Paragraph position="1"> This distance measure is based on the assumption that each character in one language, L1, gives rise to a random number of characters in the other language, L2. We assume these random variables are independent and identically distributed with a normal distribution. The model is then specified by the mean, c, and variance, s^2, of this distribution. c is the expected number of characters in L2 per character in L1, and s^2 is the variance of the number of characters in L2 per character in L1. We define δ to be (l2 - l1 c)/sqrt(l1 s^2) so that it has a normal distribution with mean zero and variance one (at least when the two portions of text under consideration actually do happen to be translations of one another).</Paragraph> <Paragraph position="2"> The parameters c and s^2 are determined empirically from the UBS data. We could estimate c by counting the number of characters in German paragraphs and then dividing by the number of characters in corresponding English paragraphs. We obtain 81105/73481 = 1.1. The same calculation on French and English paragraphs yields c = 72302/68450 = 1.06 as the expected number of French characters per English character. As will be explained later, performance does not seem to be very sensitive to these precise language-dependent quantities, and therefore we simply assume c = 1, which simplifies the program considerably.</Paragraph> <Paragraph position="3"> The model assumes that s^2 is proportional to length. The constant of proportionality is determined by the slope of a robust regression.</Paragraph> <Paragraph position="4"> The result for English-German is s^2 = 7.3, and for English-French is s^2 = 5.6. 
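The estimation of c and the standardization of δ can be sketched directly from the definitions above (the UBS character counts 81105/73481 are from the paper; the function names are mine):

```python
import math

def estimate_c(l1_char_counts, l2_char_counts):
    """Estimate c, the expected number of L2 characters per L1 character,
    as total L2 characters divided by total L1 characters, as the paper
    does for German/English (81105/73481 = 1.1)."""
    return sum(l2_char_counts) / sum(l1_char_counts)

def delta(l1, l2, c=1.0, s2=6.8):
    """delta = (l2 - l1*c) / sqrt(l1 * s2).

    Approximately standard normal (mean 0, variance 1) when the two text
    portions are actually translations of one another. Defaults c = 1 and
    s^2 = 6.8 are the simplified values the paper adopts."""
    return (l2 - l1 * c) / math.sqrt(l1 * s2)
```

For two portions of equal length, δ is exactly zero; a long/short mismatch pushes δ into the tails of the normal distribution.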
Again, we have found that the difference in the two slopes is not too important. Therefore, we can combine the data across languages, and adopt the simpler language-independent estimate s^2 = 6.8, which is what is actually used in the program.</Paragraph> <Paragraph position="5"> We now appeal to Bayes' theorem to estimate Prob(match | δ) as a constant times Prob(δ | match) Prob(match). The constant can be ignored since it will be the same for all proposed matches. The conditional probability is</Paragraph> <Paragraph position="6"> Prob(δ | match) = 2 (1 - Prob(|δ|))</Paragraph> <Paragraph position="7"> where Prob(|δ|) is the probability that a random variable, z, with a standardized (mean zero, variance one) normal distribution, has magnitude at least as large as |δ|. The program computes δ directly from the lengths of the two portions of text, l1 and l2, and the two parameters, c and s^2. That is,</Paragraph> <Paragraph position="8"> δ = (l2 - l1 c)/sqrt(l1 s^2)</Paragraph> <Paragraph position="9"> Then Prob(|δ|) is computed by integrating a standard normal distribution (with mean zero and variance 1).</Paragraph> <Paragraph position="10"> Many statistics textbooks include a table for computing this.</Paragraph> <Paragraph position="11"> The prior probability of a match, Prob(match), is fit with the values in Table 5 (below), which were determined from the UBS data. We have found that a sentence in one language normally matches exactly one sentence in the other language (1-1); three additional possibilities are also considered: 1-0 (including 0-1), 2-1 (including 1-2), and 2-2. Table 5 shows all four possibilities.</Paragraph> <Paragraph position="12"> This completes the discussion of the distance measure. Prob(match | δ) is computed as an (irrelevant) constant times Prob(δ | match) Prob(match). 
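Instead of a textbook table, the two-sided normal tail can be computed with the complementary error function: 2(1 - Φ(|δ|)) = erfc(|δ|/√2). A minimal sketch of the resulting distance (the prior is left as a parameter rather than hard-coding the Table 5 values):

```python
import math

def prob_delta_given_match(d):
    """Prob(delta | match) = 2 * (1 - Phi(|delta|)), the two-sided tail of
    a standard normal distribution, computed via erfc."""
    return math.erfc(abs(d) / math.sqrt(2))

def distance(d, prior):
    """-log(Prob(delta | match) * Prob(match)).

    The constant Prob(delta) from Bayes' theorem is dropped since it is
    the same for all proposed matches; prior is the Prob(match) value for
    the match type (1-1, 1-0, 2-1, or 2-2), taken from Table 5."""
    p = prob_delta_given_match(d)
    if p == 0.0:
        return float('inf')  # guard against underflow for huge |delta|
    return -math.log(p) - math.log(prior)
```

Because of the -log, distances add across a proposed alignment, which is exactly what the dynamic programming step in the next section needs.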
Prob(match) is computed using the values in Table 5.</Paragraph> <Paragraph position="13"> Prob(δ | match) is computed by assuming that</Paragraph> <Paragraph position="14"> Prob(δ | match) = 2 (1 - Prob(|δ|)).</Paragraph> <Paragraph position="15"> That is, we first calculate δ as (l2 - l1 c)/sqrt(l1 s^2), and then Prob(|δ|) is computed by integrating a standard normal distribution.</Paragraph> <Paragraph position="16"> The distance function, two_side_distance, is defined in a general way to allow for insertions, deletions, substitutions, etc. The function takes four arguments: x1, y1, x2, y2.</Paragraph> <Paragraph position="17"> 1. Let two_side_distance(x1, y1; 0, 0) be the cost of substituting x1 with y1, 2. two_side_distance(x1, 0; 0, 0) be the cost of deleting x1, 3. two_side_distance(0, y1; 0, 0) be the cost of inserting y1, 4. two_side_distance(x1, y1; x2, 0) be the cost of contracting x1 and x2 to y1, 5. two_side_distance(x1, y1; 0, y2) be the cost of expanding x1 to y1 and y2, and 6. two_side_distance(x1, y1; x2, y2) be the cost of merging x1 and x2 and matching with y1 and y2.</Paragraph> <Paragraph position="18"> 4. The Dynamic Programming Algorithm The algorithm is summarized in the following recursion equation. Let si, i = 1 ... I, be the sentences of one language, and tj, j = 1 ... J, be the translations of those sentences in the other language. Let d be the distance function (two_side_distance) described in the previous section, and let D(i,j) be the minimum distance between sentences s1, ..., si and their translations t1, ..., tj, under the maximum likelihood alignment. D(i,j) is computed recursively, where the recurrence minimizes over six cases (substitution, deletion, insertion, contraction, expansion and merger) which, in effect, impose a set of slope constraints. That is, D(i,j) is calculated by the following recurrence, with the initial condition D(0,0) = 0:</Paragraph> <Paragraph position="20"> D(i,j) = min [ D(i, j-1) + d(0, tj; 0, 0), D(i-1, j) + d(si, 0; 0, 0), D(i-1, j-1) + d(si, tj; 0, 0), D(i-1, j-2) + d(si, tj; 0, tj-1), D(i-2, j-1) + d(si, tj; si-1, 0), D(i-2, j-2) + d(si, tj; si-1, tj-1) ]</Paragraph> </Section> </Paper>
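The six-case recurrence above can be sketched as a small dynamic program over sentence lengths. The priors below are illustrative placeholders, not the Table 5 values, and the cost function is the length-based distance from Section 3 with the paper's c = 1 and s^2 = 6.8:

```python
import math

# Placeholder priors for the four match types; the paper's actual values
# are in Table 5 and are not reproduced here.
PRIOR = {(1, 1): 0.9, (1, 0): 0.01, (2, 1): 0.08, (2, 2): 0.01}

def match_cost(l1, l2, n1, n2, c=1.0, s2=6.8):
    """-log(Prob(delta|match) * Prob(match)) for an n1-n2 match of text
    portions of length l1 and l2 (characters)."""
    d = (l2 - l1 * c) / math.sqrt(max(l1, 1) * s2)
    p = math.erfc(abs(d) / math.sqrt(2))  # 2 * (1 - Phi(|delta|))
    if p == 0.0:
        return float('inf')
    key = (n1, n2) if (n1, n2) in PRIOR else (n2, n1)
    return -math.log(p) - math.log(PRIOR[key])

def align_cost(src, tgt):
    """D(I, J): minimum total distance under the six-case recurrence.

    src, tgt are lists of sentence lengths in each language."""
    I, J = len(src), len(tgt)
    INF = float('inf')
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0  # initial condition
    for i in range(I + 1):
        for j in range(J + 1):
            if i == j == 0:
                continue
            cands = []
            if j >= 1:             # insertion (0-1)
                cands.append(D[i][j-1] + match_cost(0, tgt[j-1], 0, 1))
            if i >= 1:             # deletion (1-0)
                cands.append(D[i-1][j] + match_cost(src[i-1], 0, 1, 0))
            if i >= 1 and j >= 1:  # substitution (1-1)
                cands.append(D[i-1][j-1] + match_cost(src[i-1], tgt[j-1], 1, 1))
            if i >= 2 and j >= 1:  # contraction (2-1)
                cands.append(D[i-2][j-1] + match_cost(src[i-2] + src[i-1], tgt[j-1], 2, 1))
            if i >= 1 and j >= 2:  # expansion (1-2)
                cands.append(D[i-1][j-2] + match_cost(src[i-1], tgt[j-2] + tgt[j-1], 1, 2))
            if i >= 2 and j >= 2:  # merger (2-2)
                cands.append(D[i-2][j-2] + match_cost(src[i-2] + src[i-1], tgt[j-2] + tgt[j-1], 2, 2))
            D[i][j] = min(cands)
    return D[I][J]
```

A full implementation would also record which case won at each cell so the alignment itself can be read off by backtracking; only the minimum distance is returned here.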