An Experiment in Hybrid Dictionary 
and Statistical Sentence Alignment 
Nigel Collier, Kenji Ono and Hideki Hirakawa 
Communication and Information Systems Laboratories 
Research and Development Center, Toshiba Corporation 
1 Komukai Toshiba-cho, Kawasaki-shi, Kanagawa 210-8582, Japan 
{nigel, ono, hirakawa}@eel.rdc.toshiba.co.jp 
Abstract 
The task of aligning sentences in parallel corpora of 
two languages has been well studied using pure sta- 
tistical or linguistic models. We developed a linguis- 
tic method based on lexical matching with a bilin- 
gual dictionary and two statistical methods based 
on sentence length ratios and sentence offset prob- 
abilities. This paper seeks to further our knowl- 
edge of the alignment task by comparing the per- 
formance of the alignment models when used sepa- 
rately and together, i.e. as a hybrid system. Our 
results show that for our English-Japanese corpus of 
newspaper articles, the hybrid system using lexical 
matching and sentence length ratios outperforms the 
pure methods. 
1 Introduction 
There have been many approaches proposed to solve 
the problem of aligning corresponding sentences in 
parallel corpora. With a few notable exceptions 
however, much of this work has focussed on ei- 
ther corpora containing European language pairs or 
clean-parallel corpora where there is little reformat- 
ting. In our work we have focussed on developing 
a method for robust matching of English-Japanese 
sentences, based primarily on lexical matching. The 
method combines statistical information from byte 
length ratios. We show in this paper that this hybrid 
model is more effective than its constituent parts 
used separately. 
The task of sentence alignment is a critical first 
step in many automatic applications involving the 
analysis of bilingual texts such as extraction of bilingual 
vocabulary, extraction of translation templates, 
word sense disambiguation, word and phrase align- 
ment, and extraction of parameters for statistical 
translation models. Many software products which 
aid human translators now contain sentence align- 
ment tools as an aid to speeding up editing and ter- 
minology searching. 
Various methods have been developed for sentence 
alignment which we can categorise as either lexical 
such as (Chen, 1993), based on a large-scale bilin- 
gual lexicon; statistical such as (Brown et al., 1991) 
(Church, 1993) (Gale and Church, 1993) (Kay and 
Röscheisen, 1993), based on distributional regularities 
of words or byte-length ratios and possibly inducing 
a bilingual lexicon as a by-product; or hybrid 
such as (Utsuro et al., 1994) (Wu, 1994), based on 
some combination of the other two. Neither of the 
pure approaches is entirely satisfactory for the fol- 
lowing reasons: 
• Text volume limits the usefulness of statistical 
approaches. We would often like to be able to 
align small amounts of text, or texts from var- 
ious domains which do not share the same sta- 
tistical properties. 
• Bilingual dictionary coverage limitations mean 
that we will often encounter problems establish- 
ing a correspondence in non-general domains. 
• Dictionary-based approaches are founded on an 
assumption of lexical correspondence between 
language pairs. We cannot always rely on this 
for non-cognate language pairs, such as English 
and Japanese. 
• Texts are often heavily reformatted in trans- 
lation, so we cannot assume that the corpus 
will be clean, i.e. contain many one-to-one sen- 
tence mappings. In this case statistical methods 
which rely on structure correspondence such as 
byte-length ratios may not perform well. 
These factors suggest that some hybrid method 
may give us the best combination of coverage and 
accuracy when we have a variety of text domains, 
text sizes and language pairs. In this paper we seek 
to fill a gap in our understanding and to show how 
the various components of the hybrid method influ- 
ence the quality of sentence alignment for Japanese 
and English newspaper articles. 
2 Bilingual Sentence Alignment 
The task of sentence alignment is to match corresponding 
sentences in a text from one language to 
sentences in a translation of that text in another 
language. Of particular interest to us is the ap- 
plication to Asian language pairs. Previous stud- 
ies such as (Fung and Wu, 1994) have commented 
that methods developed for Indo-European language 
pairs using alphabetic characters have not addressed 
important issues which occur with European-Asian 
language pairs. For example, the language pairs are 
unlikely to be cognates, and they may place sentence 
boundaries at different points in the text. It has also 
been suggested by (Wu, 1994) that sentence length 
ratio correlations may arise partly out of historic 
cognate-based relationships between Indo-European 
languages. Methods which perform well for Indo- 
European language pairs have therefore been found 
to be less effective for non-Indo-European language 
pairs. 
In our experiments the languages we use are En- 
glish (source) and Japanese (translation). Although 
in our corpus (described below) we observe that, in 
general, sentences correspond one-to-one we must 
also consider multiple sentence correspondences as 
well as one-to-zero correspondences. These cases are 
summarised below. 
1. 1:1 The sentences match one-to-one. 
2. 1:n One English sentence matches to more than 
one Japanese sentence. 
3. m:1 More than one English sentence matches to 
one Japanese sentence. 
4. m:n More than one English sentence matches to 
more than one Japanese sentence. 
5. m:0 The English sentence/s have no correspond- 
ing Japanese sentence. 
6. 0:n The Japanese sentence/s have no corre- 
sponding English sentence. 
In the case of 1:n, m:1 and m:n correspondences, 
translation has involved some reformatting and the 
meaning correspondence is no longer solely at the 
sentence level. Ideally we would like smaller units of 
text to match because it is easier later on to establish 
word alignment correspondences. In the worst case 
of multiple correspondence, the translation is spread 
across multiple non-consecutive sentences. 
3 Corpus 
Our primary motivation is knowledge acquisition 
for machine translation and consequently we are interested 
in acquiring vocabulary and other bilingual 
knowledge which will be useful for users of such sys- 
tems. Recently there has been a move towards In- 
ternet page translation and we consider that one in- 
teresting domain for users is international news. 
The bilingual corpus we use in our experiments is 
made from Reuter news articles which were translated 
by the Gakken translation agency from English 
into Japanese [1]. The translations are quite literal 
and the contents cover international news for 
the period February 1995 to December 1996. We 
currently have over 20,000 articles (approximately 
47 Mb). From this corpus we randomly chose 50 
article pairs and aligned them by hand using a human 
bilingual checker to form a judgement set. The 
judgement set consists of 380 English sentences and 
453 Japanese sentences. On average each English 
article has 8 lines and each Japanese article 9 lines. 
[1] The corpus was generously made available to us by special 
arrangement with Gakken. 
The articles themselves form a boundary within 
which to align constituent sentences. The corpus 
is quite well behaved. We observe many 1:1 corre- 
spondences, but also a large proportion of 1:2 and 
1:3 correspondences as well as reorderings. Omis- 
sions seem to be quite rare, so we didn't see many 
m:0 or 0:n correspondences. 
An example news article is shown in Figure 1 
which highlights several interesting points. Al- 
though the news article texts are clean and in 
machine-tractable format we still found that it was 
a significant challenge to reliably identify sentence 
boundaries. A simple illustration of this is shown by 
the first Japanese line J1 which usually corresponds 
to the first two English lines E1 and E2. This is 
a result of our general-purpose sentence segmenta- 
tion algorithm which has difficulty separating the 
Japanese title from the first sentence. 
Sentences usually corresponded linearly in our 
corpus, with few reorderings, so the major chal- 
lenge was to identify multiple correspondences and 
zero correspondences. We can see an example of a 
zero correspondence as E5 has no translation in the 
Japanese text. A 1:n correspondence is shown by E7 
aligning to both J5 and J6. 
4 Alignment Models 
In our investigation we examined the performance of 
three different matching models (lexical matching, 
byte-length ratios and offset probabilities). The basic 
models incorporate dynamic programming to find 
the least cost alignment path over the set of English 
and Japanese sentences, with cost being determined by 
the model's scores. The alignment space includes all 
possible combinations of multiple matches up to and 
including 3:3 alignments. The basic models are now 
outlined below. 
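The search described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the names `align`, `score`, `max_group` and `skip_cost` are hypothetical, cost is taken to be the negative log of a model score, and m:0 and 0:n steps are charged a fixed penalty per skipped sentence.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One aligned unit: ((e_start, e_end), (j_start, j_end)), end indices exclusive.
Bead = Tuple[Tuple[int, int], Tuple[int, int]]

def align(english: Sequence[str], japanese: Sequence[str],
          score: Callable[[Sequence[str], Sequence[str]], float],
          max_group: int = 3, skip_cost: float = 5.0) -> List[Bead]:
    """Least-cost monotone alignment with groups of up to max_group sentences."""
    n, m = len(english), len(japanese)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di in range(max_group + 1):
                for dj in range(max_group + 1):
                    ni, nj = i + di, j + dj
                    if (di == 0 and dj == 0) or ni > n or nj > m:
                        continue
                    if di == 0 or dj == 0:
                        # m:0 or 0:n step: fixed penalty (an assumption)
                        step = skip_cost * (di + dj)
                    else:
                        s = score(english[i:ni], japanese[j:nj])
                        step = -math.log(max(s, 1e-9))
                    if cost[i][j] + step < cost[ni][nj]:
                        cost[ni][nj] = cost[i][j] + step
                        back[ni][nj] = (i, j)
    # Trace the best path back from the end of both texts.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    return beads[::-1]
```

Any of the three models, or a combination of them, could be passed in as `score`, provided it returns a positive value for a candidate pair of sentence groups.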
4.1 Model 1: Lexical vector matching 
The lexical approach is perhaps the most robust for 
aligning texts in cognate language pairs, or where 
there is a large amount of reformatting in trans- 
lation. It has also been shown to be particularly 
successful within the vector space model in multilin- 
gual information retrieval tasks, e.g. (Collier et al., 
1998a),(Collier et al., 1998b), for aligning texts in 
non-cognate languages at the article level. 
The major limitation with lexical matching is 
clearly the assumption of lexical correspondence, 
which is particularly weak for English and Asian 
language pairs where structural and semantic differences 
mean that transfer often occurs at a level 
above the lexicon. This is a motivation for incorporating 
statistics into the alignment process, but 
in the initial stage we wanted to treat pure lexical 
matching as our baseline performance. 
E1. Taiwan ruling party sees power struggle in China 
E2. TAIPEI , Feb 9 ( Reuter ) - Taiwan's ruling Nationalist Party said a struggle to succeed Deng 
Xiaoping as China's most powerful man may have already begun. 
E3. "Once Deng Xiaoping dies, a high tier power struggle among the Chinese communists is inevitable," 
a Nationalist Party report said. 
E4. China and Taiwan have been rivals since the Nationalists lost the Chinese civil war in 1949 and 
fled to Taiwan. 
E5. Both Beijing and Taipei sometimes portray each other in an unfavourable light. 
E6. The report said that the position of Deng's chosen successor, President Jiang Zemin, may have 
been subtly undermined of late. 
E7. It based its opinion on the fact that two heavyweight political figures have recently used the 
phrase the "solid central collective leadership and its core" instead of the accepted "collective leadership 
centred on Jiang Zemin" to describe the current leadership structure. 
E8. "Such a sensitive statement should not be an unintentional mistake ... 
E9. Does this mean the power struggle has gradually surfaced while Deng Xiaoping is still alive ?," 
said the report , distributed to journalists. 
E10. "At least the information sends a warning signal that the 'core of Jiang' has encountered some 
subtle changes," it added . 
[J1 to J6: Japanese sentences of the article; the text is not recoverable in this copy] 
Figure 1: Example English-Japanese news article pair 
We translated each Japanese sentence into En- 
glish using dictionary term lookup. Each Japanese 
content word was assigned a list of possible English 
translations and these were used to match against 
the normalised English words in the English sen- 
tences. For an English text segment E and the En- 
glish term list produced from a Japanese text seg- 
ment J, which we considered to be a possible unit 
of correspondence, we calculated similarity using 
Dice's coefficient score shown in Equation 1. This 
rather simple measure captures frequency, but not 
positional information. The weights of words are 
their frequencies inside a sentence. 
$$\mathrm{Dice}(E, J) = \frac{2 f_{EJ}}{f_E + f_J} \qquad (1)$$
where $f_{EJ}$ is the number of lexical items which 
match in E and J, $f_E$ is the number of lexical items 
in E and $f_J$ is the number of lexical items in J. 
The translation lists for each Japanese word are used 
disjunctively, so if one word in the list matches then 
we do not consider the other terms in the list. In 
this way we maintain term independence. 
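As a rough illustration of Equation 1 with disjunctive translation lists, the score might be computed as below. The function name `dice_score` and its argument layout are hypothetical, and the rule that each English token can be matched at most once is an assumption made for the sketch rather than something stated in the text.

```python
from collections import Counter
from typing import Iterable, List, Sequence

def dice_score(english_terms: Sequence[str],
               translation_lists: Iterable[List[str]]) -> float:
    """Dice(E, J) = 2 * f_EJ / (f_E + f_J).

    english_terms: normalised content words of the English segment E.
    translation_lists: one list of candidate English translations per
    Japanese content word in the segment J.
    """
    remaining = Counter(english_terms)   # frequency of each term in E
    f_e = len(english_terms)
    f_j = 0
    f_ej = 0
    for candidates in translation_lists:
        f_j += 1
        for term in candidates:
            if remaining[term] > 0:
                remaining[term] -= 1     # assumption: one match per English token
                f_ej += 1
                break                    # disjunctive: skip the rest of the list
    total = f_e + f_j
    return 2.0 * f_ej / total if total else 0.0
```

For example, six English terms and four Japanese content words with three matches would give 2 x 3 / (6 + 4) = 0.6.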
Our transfer dictionary contained some 79,000 En- 
glish words in full form together with the list of 
translations in Japanese. Of these English words 
some 14,000 were proper nouns which were directly 
relevant to the vocabulary typically found in interna- 
tional news stories. Additionally we perform lexical 
normalisation before calculating the matching score 
and remove function words with a stop list. 
4.2 Model 2: Byte-length ratios 
For Asian language pairs we cannot rely entirely 
on dictionary term matching. Moreover, algorithms 
which rely on matching cognates cannot be applied 
easily to English and some Asian languages. We 
were motivated by statistical alignment models such 
as (Gale and Church, 1991) to investigate whether 
byte-length probabilities could improve or replace 
the lexical matching based method. The underlying 
assumption is that characters in an English sentence 
are responsible for generating some fraction of each 
character in the corresponding Japanese sentence. 
We derived a probability density function by making 
the assumption that English and Japanese sentence 
length ratios are normally distributed. The 
parameters required for the model are the mean, $\mu$, 
and variance, $\sigma^2$, which we calculated from a training 
set of 450 hand-aligned sentences. These are then 
entered into Equation 2 to find the probability of 
any two sentences (or combinations of sentences for 
multiple alignments) being in an alignment relation 
given that they have a length ratio of x. 
The byte length ratios were calculated as the 
length of the Japanese text segment divided by the 
length of the English text segment. So in this way we 
can incorporate multiple sentence correspondences 
into our model. Byte lengths for English sentences 
are calculated according to the number of non-white 
space characters, with a weighting of 1 for each valid 
character including punctuation. For the Japanese 
text we counted 2 for each non-white space char- 
acter. White spaces were treated as having length 
0. The ratios for the training set are shown as a 
histogram in Figure 2 and seem to support the as- 
sumption of a normal distribution. 
The resulting normal curve with $\sigma = 0.33$ and 
$\mu = 0.76$ is given in Figure 3, and this can then be 
used to provide a probability score for any English 
and Japanese sentence being aligned in the Reuters' 
corpus. 
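The byte-length score can be sketched as follows. Equation 2 is not reproduced in this text, so the sketch assumes it is the standard normal density; the function names are hypothetical, while the weights (1 per English character, 2 per Japanese character, 0 for white space) and the parameter values are those given above.

```python
import math
from typing import Sequence

MU, SIGMA = 0.76, 0.33   # mean and standard deviation reported for the training set

def byte_length(sentences: Sequence[str], weight: int) -> int:
    """Count non-white-space characters, each weighted by `weight`."""
    return weight * sum(1 for s in sentences for ch in s if not ch.isspace())

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Standard normal density (assumed form of Equation 2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def len_score(english: Sequence[str], japanese: Sequence[str]) -> float:
    """Score a candidate pair of sentence groups by their length ratio."""
    e_len = byte_length(english, 1)    # 1 per English character
    j_len = byte_length(japanese, 2)   # 2 per Japanese character
    if e_len == 0:
        return 0.0                     # ratio undefined; treated as no evidence
    return normal_pdf(j_len / e_len, MU, SIGMA)
```

Because both arguments are sequences of sentences, the same function covers multiple-sentence correspondences: the lengths of the grouped sentences are simply summed before the ratio is taken.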
Clearly it is not enough simply to assume that our 
sentence pair lengths follow the normal distribution. 
We tested this assumption using a standard test, by 
plotting the ordered ratio scores against the values 
calculated for the normal curve in Figure 3. If the 
distribution is indeed normal then we would expect 
the plot in Figure 4 to yield a straight line. We can 
see that this is the case for most, although not all, 
of the observed scores. 
[Figure 2: Sentence length ratios in training set] 
[Figure 3: Sentence length ratio normal curve] 
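A check of this kind can be sketched as a normal probability plot: each ordered observation is paired with the value a fitted normal distribution would predict at the same rank, and the pairs fall on a straight line if the data are normal. The function name and the plotting position (rank - 0.5) / n are assumptions for this sketch; the text does not say which convention was used.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple

def normal_check_points(ratios: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (expected, observed) pairs for a normal probability plot."""
    fitted = NormalDist(mean(ratios), stdev(ratios))
    n = len(ratios)
    points = []
    for rank, observed in enumerate(sorted(ratios), start=1):
        p = (rank - 0.5) / n              # plotting position (one common choice)
        points.append((fitted.inv_cdf(p), observed))
    return points
```

Plotting the second element of each pair against the first shows deviations from the diagonal, typically at the extremes, where the normality assumption is weakest.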
Although the curve in Figure 4 shows that our 
training set deviated from the normal distribution at 
the extremes we nevertheless continued 
with our simulations using this model, considering 
that the deviations occurred at the extreme ends of 
the distribution where relatively few samples were 
found. The weakness of this assumption however 
does add extra evidence to doubts which have been 
raised, e.g. (Wu, 1994), about whether the byte-length 
model by itself can perform well. 
[Figure 4: Sentence length ratio normal check curve] 
[Figure 5: Sentence offsets in training set] 
4.3 Model 3: Offset ratios 
We calculated the offsets in the sentence indexes for 
English and Japanese sentences in an alignment re- 
lation in the hand-aligned training set. An offset 
difference was calculated as the Japanese sentence 
index minus the English sentence index within a 
bilingual news article pair. The values are shown 
as a histogram in Figure 5. 
As with the byte-length ratio model, we started 
from an assumption that sentence correspondence 
offsets were normally distributed. We then cal- 
culated the mean and variance for our sample set 
shown in Figure 5 and used this to form a normal 
probability density function (where $\sigma = 0.50$ and 
$\mu = 1.45$) shown in Figure 6. 
The test for normality of the distribution is the 
same as for byte-length ratios and is given in Figure 
7. We can see that the assumption of normality is 
particularly weak for the offset distribution, but we 
are motivated to see whether such a noisy probabil- 
ity model can improve alignment results. 
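The offset model can be sketched in the same way. The names `fit_offset_model` and `offset_score` are hypothetical, the use of the population standard deviation is an assumption, and any index pairs passed in are the caller's own hand-aligned data.

```python
from statistics import NormalDist, mean, pstdev
from typing import Iterable, Tuple

def fit_offset_model(aligned_pairs: Iterable[Tuple[int, int]]) -> NormalDist:
    """Fit a normal distribution to sentence index offsets.

    aligned_pairs: (english_index, japanese_index) for sentences in an
    alignment relation, with indexes counted within each article pair.
    """
    offsets = [j - e for e, j in aligned_pairs]   # Japanese minus English index
    return NormalDist(mean(offsets), pstdev(offsets))

def offset_score(model: NormalDist, english_index: int, japanese_index: int) -> float:
    """Density of the observed offset under the fitted model."""
    return model.pdf(japanese_index - english_index)
```

Note that the fit needs at least two distinct offsets; with a zero standard deviation the density is undefined.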
5 Experiments 
In this section we present the results of using dif- 
ferent combinations of the three basic methods. We 
combined the basic methods to make hybrid models 
simply by taking the product of the scores for the 
models given above. Although this is simplistic we 
felt that in the first stage of our investigation it was 
better to give equal weight to each method. 
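The equal-weight combination can be sketched as a plain product of scores. The helper names `hybrid` and `all_combinations` are hypothetical, and the sketch assumes that every basic model is exposed as a function of a candidate English and Japanese sentence group.

```python
from itertools import combinations
from math import prod
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[str], Sequence[str]], float]

def hybrid(*scorers: Scorer) -> Scorer:
    """Combine models with equal weight by multiplying their scores."""
    def combined(english: Sequence[str], japanese: Sequence[str]) -> float:
        return prod(s(english, japanese) for s in scorers)
    return combined

def all_combinations(basic: Dict[str, Scorer]) -> Dict[str, Scorer]:
    """Build every non-empty combination, named like 'DICE+LEN'."""
    out: Dict[str, Scorer] = {}
    names = list(basic)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            out["+".join(combo)] = hybrid(*(basic[name] for name in combo))
    return out
```

With three basic models this yields the three pure methods, three pairwise hybrids and one three-way hybrid. A single near-zero score from one model dominates a product, which is one reason a weak model can pull a hybrid down.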
[Figure 6: Sentence offsets normal curve] 
[Figure 7: Sentence offsets normal check curve] 
The seven methods we tested are coded as follows: 
DICE: sentence alignment using bilingual dictionary 
and Dice's coefficient scores; LEN: sentence alignment 
using sentence length ratios; OFFSET: sentence 
alignment using offset probabilities. Hybrid 
methods are written by joining these codes with "+", 
e.g. DICE+LEN. 
We performed sentence alignment on our test set 
of 380 English sentences and 453 Japanese sentences. 
The results are shown as recall and precision which 
we define in the usual way as follows: 
$$\mathrm{recall} = \frac{\#\,\text{correctly matched sentences retrieved}}{\#\,\text{matched sentences in the test collection}} \qquad (3)$$
$$\mathrm{precision} = \frac{\#\,\text{correctly matched sentences retrieved}}{\#\,\text{matched sentences retrieved}} \qquad (4)$$
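Equations 3 and 4 can be sketched as set operations. The sketch assumes that a match is counted as a link between one English and one Japanese sentence index, so that a 1:2 correspondence contributes two links; the text does not state the counting unit, and the names below are hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]   # (english_sentence_index, japanese_sentence_index)

def recall_precision(retrieved: Set[Link], gold: Set[Link]) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved links against the judgement set."""
    correct = len(retrieved & gold)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(retrieved) if retrieved else 0.0
    return recall, precision
```

For instance, if 3 of 5 retrieved links are among 4 hand-aligned links, recall is 3/4 and precision is 3/5.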
The results are shown in Table 1. We see that the 
baseline method using lexical matching with a bilin- 
gual lexicon, DICE, performs better than either of 
the two statistical methods LEN or OFFSET used 
separately. Offset probabilities in particular performed 
poorly, showing that we cannot expect the 
correctly matching sentence to appear constantly in 
the same highest probability position. 
Method              Rec. (%)   Pr. (%) 
DICE (baseline)        84         85 
LEN                    82         83 
OFFSET                 50         57 
LEN+OFFSET             70         70 
DICE+LEN               89         87 
DICE+OFFSET            80         80 
DICE+LEN+OFFSET        88         85 
Table 1: Sentence alignment results as recall and 
precision. 
Considering the hybrid methods, we see that 
DICE+LEN gives clearly better recall and precision 
than either DICE or LEN used separately. On 
inspection we found that DICE by itself could not 
distinguish clearly between many candidate sentences. 
This occurred for two reasons. 
1. As a result of the limited domain in which news 
articles report, there was a strong lexical over- 
lap between candidate sentences in a news arti- 
cle. 
2. Where the lexical overlap between the English 
sentence and the Japanese translation was poor, 
DICE scores were low. 
The second reason can be attributed to low coverage 
of the news domain by the bilingual lexicon. 
If we had set a minimum threshold 
limit for overlap frequency then we would have 
ruled out many correct matches which were found. 
In both cases LEN provides a decisive clue and en- 
ables us to find the correct result more reliably. Fur- 
thermore, we found that LEN was particularly ef- 
fective at identifying multi-sentence correspondences 
compared to DICE, possibly because some sentences 
are very small and provide weak evidence for lexi- 
cal matching, whereas when they are combined with 
neighbours they provide significant evidence for the 
LEN model. 
Using all methods together however in 
DICE+LEN+OFFSET seems less promising and we 
believe that the offset probabilities are not a reliable 
model. Possibly this is due to lack of data in the 
training stage when we calculated $\sigma$ and $\mu$, or the 
data set may not in fact be normally distributed as 
indicated by Figure 7. 
Finally, we noticed that a consistent factor in the 
English and Japanese text pairs was that the first 
two lines of the English were always matched to the 
first line of the Japanese. This was because the En- 
glish text separated the title and first line, whereas 
our sentence segmenter could not do this for the 
Japanese. This factor was consistent for all the 50 
article pairs in our test collection and may have led 
to a small deterioration in the results, so the figures 
we present are the minimum of what we can expect 
when sentence segmentation is performed correctly. 
6 Conclusion 
The assumption that a partial alignment at the word 
level from lexical correspondences can clearly in- 
dicate full sentence alignment is flawed when the 
texts contain many sentences with similar vocabu- 
lary. This is the case with the news stories used in 
our experiments and even technical vocabulary and 
proper nouns are not adequate to clearly discrimi- 
nate between alternative alignment choices because 
the vocabulary range inside the news article is not 
large. Moreover, the basic assumption of the lexical 
approach, that the coverage of the bilingual dictio- 
nary is adequate, cannot be relied on if we require 
robustness. This has shown the need for some hybrid 
model. 
For our corpus of newspaper articles, the hybrid 
model has been shown to clearly improve sentence 
alignment results compared with the pure models 
used separately. In the future we would like to make 
extensions to the lexical model by incorporating 
term weighting methods from information retrieval 
such as inverse document frequency which may help 
to identify more important terms for matching. In 
order to test the generalisability of our method we 
also want to extend our investigation to parallel cor- 
pora in other domains. 
Acknowledgements 
We would like to thank Reuters and Gakken for al- 
lowing us to use the corpus of news stories in our 
work. We are grateful to Miwako Shimazu for hand 
aligning the judgement set used in the experiments 
and to Akira Kumano and Satoshi Kinoshita for 
useful discussions. Finally we would also like to 
express our appreciation to the anonymous reviewers 
for their helpful comments. 

References 
P. Brown, J. Lai, and R. Mercer. 1991. Aligning sentences 
in parallel corpora. In 29th Annual Meeting 
of the Association for Computational Linguistics, 
Berkeley, California, USA. 
S. Chen. 1993. Aligning sentences in bilingual corpora 
using lexical information. In 31st Annual Meeting 
of the Association for Computational Linguistics, 
Ohio, USA, 22-26 June. 
K. Church. 1993. Char_align: a program for align- 
ing parallel texts at the character level. In 31st 
Annual Meeting of the Association for Computa- 
tional Linguistics, Ohio, USA, pages 1-8, 22-26 
June. 
N. Collier, H. Hirakawa, and A. Kumano. 1998a. 
Creating a noisy parallel corpus from newswire 
articles using multi-lingual information retrieval. 
Trans. of Information Processing Society of Japan 
(to appear). 
N. Collier, H. Hirakawa, and A. Kumano. 1998b. 
Machine translation vs. dictionary term transla- 
tion - a comparison for English-Japanese news 
article alignment. In Proceedings of COLING- 
ACL'98, University of Montreal, Canada, 10th 
August. 
P. Fung and D. Wu. 1994. Statistical augmenta- 
tion of a Chinese machine readable dictionary. In 
Second Annual Workshop on Very Large Corpora, 
pages 69-85, August. 
W. Gale and K. Church. 1991. A program for aligning 
sentences in bilingual corpora. In Proceedings 
of the 29th Annual Conference of the Association 
for Computational Linguistics (ACL-91), Berkeley, 
California, pages 177-184. 
W. Gale and K. Church. 1993. A program for aligning 
sentences in bilingual corpora. Computational 
Linguistics, 19(1):75-102. 
M. Kay and M. Röscheisen. 1993. Text-translation 
alignment. Computational Linguistics, 19:121- 
142. 
T. Utsuro, H. Ikeda, M. Yamane, Y. Matsumoto, 
and N. Nagao. 1994. Bilingual text match- 
ing using bilingual dictionary and statistics. In 
COLING-94, 15th International Conference, Ky- 
oto, Japan, volume 2, August 5-9. 
D. Wu. 1994. Aligning a parallel English-Chinese 
corpus statistically with lexical criteria. In 32nd 
Annual Meeting of the Association for Computa- 
tional Linguistics, New Mexico, USA, pages 80- 
87, June 27-30. 
