Extraction of Translation Unit from Chinese-English Parallel Corpora 
 
CHANG Baobao 
Institute of Computational Linguistics 
Peking University,  
Beijing, P.R.China, 100871  
chbb@pku.edu.cn 
Pernilla DANIELSSON and 
Wolfgang TEUBERT 
Centre for Corpus Linguistics 
Birmingham University,  
Birmingham, B15 2TT United Kingdom 
pernilla@ccl.bham.ac.uk 
teubertw@hhs.bham.ac.uk 
 
Abstract  
More and more researchers have recognized 
the potential value of the parallel corpus in the 
research on Machine Translation and Machine 
Aided Translation. This paper examines how 
Chinese English translation units could be 
extracted from parallel corpus. An iterative 
algorithm based on degree of word association is 
proposed to identify the multiword units for 
Chinese and English. Then the Chinese-English 
Translation Equivalent Pairs.are extracted from 
the parallel corpus. We also made comparison 
between different statistical association 
measurement in this paper.  
 
Keywords: Parallel Corpus, Translation 
Unit , Automatic Extraction of Translation 
unit  
Introduction 
The field of machine translation has changed 
remarkably little since its earliest days in the 
fifties. So far, useful machine translation could 
only obtained in very restricted domain. We 
believe one of the problems of traditional 
machine translation lies in how it deals with unit 
of translation. Normally Rule-Based Machine 
Translation system takes word as basic 
translation unit. However, word is normally 
polysemous and therefore ambiguous, which 
causes many difficulties in selecting proper 
target equivalent words in machine translation, 
especially in translation between unrelated 
language pairs, such as Chinese and English. On 
the other hand, human translation is rarely 
word-based. Human translators always translate 
group of words as a whole, which means human 
do not view words as the basic translation units, 
and it seems they view language expressions that 
can transfer meaning unambiguously as basic 
translation units instead. Following this 
observation, we believe translation unit shall be 
not only words but also words groups 
(Multi-Word Unit) and a collection of bilingual 
translation unit will be certainly a very useful 
resource to machine translation.    
Manual compilation of such a database of 
translation unit is certainly labor intensive. But 
following the recent progress in Corpus 
Linguistics, especially in parallel corpus 
research such as Gale,W. (1991), Tufis,D. (2001), 
Wu, D., Xia, X.(1994). Automatic identification of 
translation unit and its target equivalents from 
existed authentic translation might be a feasible 
solution; at least it can be used to produce a 
candidate list of bilingual translation unit.  
As a first step towards building a database of 
bilingual translation units, we selected the Hong 
Kong Legal Documents Corpus (HKLDC) as the 
parallel corpus for the feasibility study. This 
paper elaborates the methods we adopted. We 
will first give our model of (semi-) automatic 
acquisition of bilingual translation unit based on 
parallel corpora in section 1. Then we will show 
how the corpus could be preprocessed in section 
2. In section 3, several statistic measurements 
will be introduced which will serve as a basis for 
late steps in extracting of bilingual translation 
units. Section 4 will focuses on identification of 
multi-word units. Section 5 will describe how 
translation equivalents could be extracted. In 
section 6, we give some evaluation regarding to 
the performance in extracting the translation 
equivalent pairs.  
1 Framework of automatic acquisition of 
bilingual translation unit 
The whole process of identification of bilingual 
translation unit could be further divided into 
three major steps as depicted in Figure 1. 
(1) Preprocessing of the parallel corpora  
For the purpose of extracting bilingual 
translation unit, some prior processing of the 
corpus is necessary. These include alignment of 
the bilingual texts at sentence level and 
unilingual annotation of the Chinese and English 
texts respectively. 
(2)Identification of multi-word unit in the 
aligned the texts 
As we mentioned before, translation unit shall 
not be only single words, but also multi-word 
units. In this step, Both the Chinese and English 
multi-word units are identified separately from 
the corpus. 
(3) Extraction of the bilingual translation units     
After identification of the multi-word units for 
texts of both languages, this step tries to set the 
correspondence between Chinese and English 
translation units. The result of this step will be a 
list of bilingual Translation Equivalent Pairs and 
every TEP is composed of a Chinese Translation 
unit and an English Translation unit. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 1. Framework of translation unit 
acquisition 
2 Preprocessing the corpus 
The Hong Kong legal documents were collected 
from Internet. The corpus is composed of laws 
and amendments issued by the Hong Kong 
Special Administration Region (HKSAR) during. 
All the texts in it are in both Chinese and 
English. We selected about 6 million words of 
both Chinese texts and English words 
(6,833,762 Chinese words and 6,391,919 
English words). 
All the Chinese texts in the corpus are 
encoded with Big-5 code. Since all our Chinese 
tools can only deal with Chinese GB code. We 
firstly converted all the Chinese texts from 
Big-5 code into GB code. Then the Corpus was 
aligned with a length-based sentence aligner. For 
the legal documents have been already well 
arranged with section by section, which makes 
the sentence alignment much easier and the 
precision is high. The Chinese texts were then 
segmented and pos-tagged with a program 
developed by the institute of Computational 
linguistics, Peking University. And all the 
English Texts were tokenized, lemmatized, and 
pos-tagged with a freely available tree-based 
tagger. Two tag sets were used for Chinese and 
English respectively, ICL/PKU tag set for 
Chinese texts and UPENN tag set for English 
texts. Figure 2. shows a sample of the corpus 
after preprocessing. 
Chinese texts English Texts 
… 
<s id=5> 
本  r 
条例  n 
可  d 
… 
通则  n 
条例  n 
》  w 
。  w 
<s id=6> 
附注  n 
：  w 
… 
… 
<s_id=5> 
This  DT this 
Ordinance NN ordinance 
may  MD may 
... 
General  JJ general 
Clauses  NNS clause 
Ordinance NN ordinance 
.  . . 
<s_id=6> 
Remarks NNS remark 
: : : 
Figure 2. Samples of the corpus after 
preprocessing 
In Figure 2., both corpus was arranged one token 
per line. The XML-like tag <s> marks the start 
of the sentence. The single-letter tags right to the 
Chinese tokens are their part of speech tags. The 
two columns right to the English tokens are part 
of speech tags and lemmas. 
3 Statistical measurement used 
Four statistical measurements were used in 
identification of unilingual multi-word units and 
the correspondences of the bilingual translation 
(3) 
TEP Extractor 
(1) 
Sentence Alignment 
Chinese Annotation English Annotation
(2) 
Chinese MWU 
Identification 
English MWU 
Identification 
units. All four statistical formulas measures the 
degree of association of two random events. 
Given two random events, X and Y, they 
might be two Chinese words appears in the 
Chinese texts and two translation units appears 
in an aligned region of the corpus. The 
distribution of the two events could be depicted 
by a 2 by 2 contingency table. 
 Y  Y¬  
X  a b 
X¬  c d 
Figure 3. A 2 by 2 contingency table 
The numbers in the four cells of the table has the 
following meanings: 
a : all counts of the cases the two events X 
and Y co-occur. 
b : all counts of the cases that X occurs but 
Y does not 
c : all counts of the cases that X does not 
occur but Y does 
d : all counts of the cases that both X and Y 
do not occur 
Based on the above-mentioned contingency 
table, different kinds of measurements could be 
used. We have tried four of them, namely, 
point-wise mutual information, DICE coefficient, 
2
χ score and Log-likelihood score. One other 
measurement used by Gale(1991) is 
2
φ score, 
which is equivalent to the 
2
χ  score. All the 
four measurements could be easily calculated 
using the following formula. 
(1) Point-wise mutual information  
)()(
log),(
2
caba
an
ttstMI
+×+
×
=  
(2) DICE coefficient 
)()(
2
),(
caba
a
ttstDICE
+×+
=  
(3) 
2
χ score 
)()()()(
)(
),(χ
2
2
dcdbcaba
cbdan
ttst
+×+×+×+
×−××
=  
(4) Log-Likelihood score 
)
)()(
log
)()(
log                   
)()(
log
)()(
log(2),(
dbdc
nd
d
cadc
nc
c
dbba
nb
b
caba
na
attstLL
+×+
×
×+
+×+
×
×+
+×+
×
×+
+×+
×
××=
 
4. Identification of multi-word units 
What might constitute multi-word units is 
probably a question critical to identification of 
them. It seems rational to assume Multi-word 
units are something between phrases and words, 
which might have the following properties: 
1) The component words of a multi-word 
unit should tend to co-occur frequently. 
In the significance of statistics, 
multi-word unit should be word group 
that co-occur more frequently than 
expectation. 
2) Multi-words units are not arbitrary 
combinations of arbitrary words; they 
shall form valid syntactic structure in 
the meaning of linguistics. 
Based on the above-mentioned observations, 
we used an iterative algorithm using both 
statistical and linguistics means. The algorithm 
runs as follows: firstly the algorithm tries to find 
all word pairs that show strong coherence. This 
could be done using the measurements listed in 
section 3. After this step, all the word pairs in 
both of Chinese texts and English Texts whose 
association value is greater than a predefined 
threshold are marked. But this can only list of 
word groups of length of 2. Word groups of 
length more than 3 words could not be found by 
only one run of the algorithm. But apparently 
they could be found by a series of runs until 
there are no word groups having greater 
association value than the threshold anymore. 
The algorithm is designed as recursive structure, 
it marks longer word groups by viewing the 
shorter word group marked in the previous run 
as one word.  
It is no doubt that pure statistics cannot 
perform very reliable. Some word groups found 
by the algorithm are awkward to be accepted as 
multi-word unit. The result of the algorithm 
shall be viewed as a candidate list of 
multi-words units. Some kind of refinement of 
the results might be required. For thinking that 
multi-word unit shall form valid syntactic 
pattern, we use a filter module which check all 
the word groups found and see if they fall into a 
set of predefined syntactic patterns.  
"a+n",  
"b+n",  
"n+n",  
… 
"MWU+n", 
"n+MWU",  
"MWU+MWU" 
"NN+NN",  
"NN+NNS", 
…… 
"NN+IN<of>" 
"JJ+NN", 
… 
"MWU+MWU" 
Figure 4. Syntactic patterns 
 Figure 4. shows some patterns used by the 
filter. Patterns in the left side are for Chinese 
while the right side for English. 
5. Extracting of the bilingual translation units 
We adopt the same hypothesis-testing approach 
to set the correspondence between the 
Chinese-English translation units. It follows the 
observations that words are translation of each 
other are more likely to appear in aligned 
regions(Gale,W. (1991), Tufis,D. (2001)). But we 
also take the multi-word units into 
consideration. 
The whole procedure could be divided 
logically into two phases. The first phase could 
be called a generative phase, which lists all 
possible translation equivalent pairs from the 
aligned corpus. And the second phase can be 
viewed as a testing operation, which selects the 
Translation Equivalent Correspondences show 
an association measure higher than expected 
under the independence assumption as 
translation equivalence pairs. Again we use 
DICE coefficient, point-wise mutual information, 
LL score and 
2
χ  score to measure the degree 
of association. 
One of problems of above-mentioned 
approach is its inefficiency in processing large 
corpus. Because in the generative phase, the 
above-mentioned approach will list all 
translation equivalent pairs and can lead to huge 
search space. To make the approach more 
efficient, we adopted the following assumption: 
Source translation units tend to be translated into 
translation units of the same syntactic categories. 
For example, English nouns tend to be translated 
into Chinese nouns, and English pattern 
“JJ+NN” tend to be translated into Chinese 
pattern “a+n” or “b+n”. Apparently, this 
assumption is not always true for translation of 
Chinese into English and vice versa. But it really 
makes the algorithm much more efficient while 
the precision does not fall severely. 
6. Experiments and Results 
We have performed some preliminary 
experiments to test the performance of different 
statistic measurements, performance change 
when the categorial hypothesis is used. 
For the experiments, we used a very small 
portion of the corpus of 500 sentence pairs. 
Figure 5. show the performance of Chinese 
MultiWord Unit Identification, we count how 
many correct MWUs are there in the first 
hundred of candidate MWUs produced by the 
program.  
 MI DICE LL 
2
χ  
Correct 63 31 76 74 
Incorrect 37 69 24 26 
Accuracy  63％31％ 76％ 74% 
Figure 5. Performance variations of different 
statistical measurements for identification of 
MWU 
Figure 6 shows the performance of the TEP 
extraction using different statitical means. we 
count how many correct and partially correct 
correspondences there are in the first hundred of 
translation equivalent pairs produced by the 
algorithm. 
 MI DICE LL 
2
χ  
Correct 39 5 70 75 
Partially correct 5 1 10 6 
Accuracy  44％6％ 80％ 81% 
Figure 6. Performance variations of different 
statistical measurements for TEP extraction 
  
Both Figure 5. and Figure 6 shows LL score and 
2
χ  score achieves better accuracy over mutual 
information and DICE coefficient.  
Experiments also show the categorial 
hypothesis might lead to fall in accuracy, we did 
tests on the above-mention 500 sentence pair 
corpus using the hypothesis, the precision fall by 
4% but the efficiency improved by more than 
200%. 
Figure 7. shows a sample of extracted 
translation equivalent pair from the test corpus. 
Some of them are wrong(see no 2), but most of 
them are correct translation equivalent pairs.The 
numbers in the right are 
2
χ  scores 
 1. 见 see /* 496.471 */ 
2. 追溯力_的 see /* 496.471 */ 
3. 款 subsection  /* 496.237 */ 
4. 废除 repeal /* 495.814 */ 
5. 命令 order /* 493.195 */ 
7. 豁免 exemption /* 490.829 */ 
25. 附属_法例 subsidiary_legislation /* 477.173 */ 
26. 公共_机构 public_body /* 475.711 */ 
28. 财政司_司长 Financial_Secretary /* 475.711 */ 
31. 条例 ordinance /* 470.081 */ 
34. 基本_文书 primary_instrument /* 468.068 */ 
41. 卫生_主任 health_officer /* 468.068 */ 
42. 裁判官 magistrate /* 468.068 */ 
43. 担当 discharge /* 468.068 */ 
45. 合约 contract /* 468.068 */ 
46. 终审_法院_首席_法官  
Chief_Justice_of_Final Appeal /* 468.068 */ 
53. 香港_特别_行政区  
Hong_Kong_Special_Administrative_region  
/* 448.576 */ 
63. 审裁处 tribunal /* 420.579 */ 
64. 宣布 declare /* 420.579 */  
Figure 7. sample of results extracted from the 
corpus 
Conclusion 
As we see in the last section, the approach used 
in this paper does really list many real 
translation equivalent pairs from the corpus. It 
seems not all the results could be taken as 
translation units, but it really offers a candidate 
list from which useful translation unit could be 
selected by means of human validation. For a 
complete evaluation of the approach, large scale 
experiments are still needed, which are now 
underway.    
Acknowledgements 
We would like to give our thanks to Professor 
Dan Tufis. His help in the lexical alignment and 
suggestions are very important for our work. We 
also would like to give thanks to all our 
colleagues who help us in many kinds of forms. 

References  
Teubert,W.(1997). Translation and the corpus, 
proceedings of the second TELRI seminar, 147-164.  
Gale,W. (1991). Identifying words correspondences 
in parallel Texts, DARPA speech and Natural 
language workshop. Asilomar, CA.  
Tufis,D. (2001), Computational bilingual 
lexicography: automatic extraction of translation 
dictionaries, In Journal of Information Science and 
Technology, Romanian Academy, Vol. 4, No. 3  
Maynard, D., Term Recognition using Combined 
Knowledge Sources, PH. D. thesis, Manchester 
University, United Kingdom. 
Yu Shiwen, Specification of Chinese text 
segmentation and POS tagging, see: 
http://www.icl.pku.edu.cn/research/corpus/coprus-an
notation.htm 
Manual of Upenn Tree bank tag set, see: 
http://www.cis.upenn.edu/~treebank/ 
Wu, D., Xia, X.(1994), Leaning an English-Chinese 
Lexicon from a Parallel Corpus, in AMTA-94, 
Association for MT in the Americas, Columbia, MD 
