An IR Approach for Translating New Words 
from Nonparallel, Comparable Texts 
Pascale Fung and Lo Yuen Yee 
HKUST 
Human Language Technology Center 
Department of Electrical and Electronic Engineering 
University of Science and Technology 
Clear Water Bay, Hong Kong 
{pascale, eeyy}@ee.ust.hk
1 Introduction 
In recent years, there has been phenomenal growth in the amount of online text available from the World Wide Web, the largest information repository known. Various traditional information retrieval (IR) techniques, combined with natural language processing (NLP) techniques, have been re-targeted to enable efficient access to the WWW: search engines, indexing, relevance feedback, query term and keyword weighting, document analysis, document classification, etc. Most of these techniques aim at efficient online search for information already on the Web.
Meanwhile, the corpus linguistics community regards the WWW as a vast potential source of corpus material. It is now possible to download a large amount of text with automatic tools when one needs to compute, for example, a list of synonyms, or to download domain-specific monolingual texts by specifying a keyword to a search engine and then use this text to extract domain-specific terms. It remains to be seen how we can also make use of the multilingual texts as NLP resources.
In the years since the appearance of the first papers on using statistical models for bilingual lexicon compilation and machine translation (Brown et al., 1993; Brown et al., 1991; Gale and Church, 1993; Church, 1993; Simard et al., 1992), a large amount of human effort and time has been invested in collecting parallel corpora of translated texts. Our goal is to alleviate this effort and enlarge the scope of corpus resources by looking into monolingual, comparable texts. Texts of this type are known as nonparallel corpora. Such nonparallel, monolingual texts should be much more prevalent than parallel texts. However, previous attempts at using nonparallel corpora for terminology translation were constrained by the inadequate availability of same-domain, comparable texts in electronic form. The nonparallel texts obtained from the LDC or university libraries were often restricted, and were usually out of date as soon as they became available. For new word translation, the timeliness of corpus resources is a prerequisite, and so is the continuous and automatic availability of nonparallel, comparable texts in electronic form. The data collection effort should not inhibit the actual translation effort. Fortunately, the World Wide Web now provides us with a daily increase of fresh, up-to-date multilingual material, together with the archived versions, all easily downloadable by software tools running in the background. It is possible to specify the URL of the online site of a newspaper, together with start and end dates, and automatically download all the daily newspaper materials between those dates.
In this paper, we describe a new method which combines IR and NLP techniques to extract new word translations from automatically downloaded English-Chinese nonparallel newspaper texts.
2 Encountering new words 
To improve the performance of a machine translation system, it is often necessary to update its bilingual lexicon, either by human lexicographers or by statistical methods using large corpora. Until recently, statistical bilingual lexicon compilation relied largely on parallel corpora. This is an undesirable constraint at times.
In using a broad-coverage English-Chinese MT system to translate some text recently, we discovered that it was unable to translate 流感/liougan, which occurs very frequently in the text. Other words which the system could not find in its 20,000-entry lexicon include proper names such as the Taiwanese president Lee Teng-Hui and the Hong Kong Chief Executive Tung Chee-Hwa. To our disappointment, we could not locate any parallel texts which include such words, since they only started to appear frequently in recent months.
A quick search on the Web turned up archives of multiple local newspapers in English and Chinese. Our challenge is to find the translation of 流感/liougan and other words from this online nonparallel, comparable corpus of newspaper materials. We choose to use issues of the English newspaper Hong Kong Standard and the Chinese newspaper Mingpao, from December 12, 1997 to December 31, 1997, as our corpus. The English text contains about 3 Mb of text whereas the Chinese text contains 8.8 Mb of 2-byte character text, so the two texts are comparable in size. Since they are both local mainstream newspapers, it is reasonable to assume that their contents are comparable as well.
3 流感/liougan is associated with flu but not with Africa
Unlike in parallel texts, the position of a word in a nonparallel text does not give us information about its translation in the other language. Rapp (1995) and Fung and McKeown (1997) suggest that a content word is closely associated with some words in its context. As a tutorial example, we postulate that the words which appear in the context of 流感/liougan should be similar to the words appearing in the context of its English translation, flu. We can form a vector space model of a word in terms of its context word indices, similar to the vector space model of a text in terms of its constituent word indices (Salton and Buckley, 1988; Salton and Yang, 1973; Croft, 1984; Turtle and Croft, 1992; Bookstein, 1983; Korfhage, 1995; Jones, 1979).
The value of the i-th dimension of a word 
vector W is f if the i-th word in the lexicon 
appears f times in the same sentences as W. 
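The definition above can be illustrated with a minimal sketch. It assumes the text has already been split into sentences and tokenized; the function name `context_vector` and the toy sentences are illustrative assumptions, not part of our system or corpus.

```python
def context_vector(target, sentences, lexicon):
    """Count how often each lexicon word occurs in a sentence containing `target`.

    The i-th entry of the result is f if the i-th lexicon word appears
    f times in the same sentences as `target`.
    """
    index = {word: i for i, word in enumerate(lexicon)}
    vector = [0] * len(lexicon)
    for sentence in sentences:
        if target not in sentence:
            continue  # only sentences containing the target word contribute
        for word in sentence:
            if word != target and word in index:
                vector[index[word]] += 1
    return vector


# Toy, made-up sentences (already tokenized); not taken from the corpus.
sentences = [
    ["bird", "flu", "virus", "spread"],
    ["flu", "virus", "symptoms"],
    ["government", "health", "policy"],
]
lexicon = ["bird", "virus", "spread", "symptoms", "government"]
flu_vector = context_vector("flu", sentences, lexicon)
print(flu_vector)  # [1, 2, 1, 1, 0]
```

In this toy example, virus receives the value 2 because it shares two sentences with flu, while government receives 0 because it never co-occurs with flu.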
The left columns of Table 1 and Table 2 list the content words which appear most frequently in the context of flu and Africa respectively. The right columns list those which occur most frequently in the context of 流感. We can see that the context of 流感 is more similar to that of flu than to that of Africa.
Table 1: 流感 and flu have similar contexts

English       Freq.    Chinese (English gloss)   Freq.
bird          170      (virus)                   147
virus          26      (citizen)                  90
spread         17      (Hong Kong)                84
people         17      (infection)                69
government     13      (confirmed)                62
avian          11      (show)                     62
scare          10      (discover)                 56
deadly         10      (yesterday)                54
new            10      (patient)                  53
suspected       9      (suspected)                50
chickens        9      (doctor)                   49
spreading       8      (infected)                 47
prevent         8      (hospital)                 44
crisis          8      (no)                       42
health          8      (government)               41
symptoms        7      (event)                    40
Table 2: 流感 and Africa have different contexts

English       Freq.    Chinese (English gloss)   Freq.
South         109      (virus)                   147
African        32      (citizen)                  90
China          20      (Hong Kong)                84
ties           15      (infection)                69
diplomatic     14      (confirmed)                62
Taiwan         12      (show)                     62
relations       9      (discover)                 56
Test            9      (yesterday)                54
Mandela         8      (patient)                  53
Taipei          7      (suspected)                50
Africans        7      (doctor)                   49
January         7      (infected)                 47
visit           6      (hospital)                 44
tense           6      (no)                       42
survived        6      (government)               41
Beijing         6      (event)                    40
4 Bilingual lexicon as seed words 
So the first clue to the similarity between a word and its translation is the number of common words in their contexts. In a bilingual corpus, the "common word" is actually a bilingual word pair. We use the lexicon of the MT system to "bridge" all bilingual word pairs in the corpora. These word pairs are used as seed words.
We found that the contexts of flu and 流感/liougan share 233 "common" context words, whereas the contexts of Africa and 流感/liougan share only 121 common words, even though the context of flu has 491 unique words and the context of Africa has 328 words.
In the vector space model, W[flu] and W[liougan] have 233 overlapping dimensions, whereas there are 121 overlapping dimensions between W[Africa] and W[liougan].
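As a rough illustration of how overlapping dimensions could be counted, the following sketch bridges an English context and a Chinese context through a seed lexicon. The function name `shared_seed_pairs`, the tiny lexicon, and the placeholder tokens of the form `ZH:...` (standing in for Chinese words) are all hypothetical.

```python
def shared_seed_pairs(english_context, chinese_context, seed_lexicon):
    """Return the seed pairs (english, chinese) whose two halves occur
    in the English context and the Chinese context respectively."""
    return {
        (en, zh)
        for en, translations in seed_lexicon.items()
        if en in english_context
        for zh in translations
        if zh in chinese_context
    }


# Hypothetical seed lexicon; "ZH:..." tokens are placeholders for Chinese words.
seed_lexicon = {
    "virus": ["ZH:virus"],
    "hospital": ["ZH:hospital"],
    "diplomatic": ["ZH:diplomatic"],
}
flu_context = {"virus", "hospital", "bird"}
africa_context = {"diplomatic", "Mandela"}
liougan_context = {"ZH:virus", "ZH:hospital", "ZH:citizen"}

flu_overlap = shared_seed_pairs(flu_context, liougan_context, seed_lexicon)
africa_overlap = shared_seed_pairs(africa_context, liougan_context, seed_lexicon)
print(len(flu_overlap), len(africa_overlap))  # 2 0
```

Words without a seed lexicon entry, such as bird in this toy example, cannot serve as bridges and do not contribute to the overlap.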
5 Using TF/IDF of contextual seed 
words 
The flu example illustrates that the actual ranking of the context word frequencies provides a second clue to the similarity between a bilingual word pair. For example, virus ranks very high for both flu and 流感/liougan and is a strong "bridge" between this bilingual word pair. This leads us to use the term frequency (TF) measure. The TF of a context word is defined as the frequency of the word in the context of W (e.g. the TF of virus in the context of flu is 26, and in the context of 流感 it is 147).
However, the TF of a word is not indepen- 
dent of its general usage frequency. In an ex- 
treme case, the function word the appears most 
frequently in English texts and would have the 
highest TF in the context of any W. In our HK- 
Standard/Mingpao corpus, Hong Kong is the 
most frequent content word which appears ev- 
erywhere. So in the flu example, we would like 
to reduce the significance of Hong Kong's TF 
while keeping that of virus. A common way to 
account for this difference is by using the inverse 
document frequency(IDF). Among the variants 
of IDF, we choose the following representation 
from (Jones, 1979): 
$$\mathrm{IDF}_i = \log\frac{maxn}{n_i} + 1$$

where $maxn$ = the maximum frequency of any word in the corpus
      $n_i$ = the total number of occurrences of word $i$ in the corpus
In the English text, the IDF of virus is 1.81 and that of Hong Kong is 1.23. In the Chinese text, the IDF of the word for virus is 1.92 and that of the word for Hong Kong is 0.83. So in both cases, virus is a stronger "bridge" for 流感/liougan than Hong Kong.
Hence, for every context seed word $i$, we assign a word weighting factor (Salton and Buckley, 1988) $w_i = TF_{iW} \times IDF_i$, where $TF_{iW}$ is the TF of word $i$ in the context of word $W$. The updated vector space model of word $W$ has $w_i$ in its $i$-th dimension.
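The weighting can be sketched as follows. This is a minimal illustration under stated assumptions: the logarithm base is not given in the text, so base 10 is assumed here; the function names are illustrative; and the corpus counts in the example are made up (only the TF values 26 and 13 come from Table 1).

```python
import math


def idf(total_occurrences, max_frequency):
    """IDF_i = log(maxn / n_i) + 1; the base-10 logarithm is an assumption."""
    return math.log10(max_frequency / total_occurrences) + 1.0


def weight_context(tf_counts, corpus_counts):
    """Map each context seed word i to w_i = TF_iW * IDF_i."""
    max_frequency = max(corpus_counts.values())  # maxn
    return {
        word: tf * idf(corpus_counts[word], max_frequency)
        for word, tf in tf_counts.items()
    }


# TF values in the context of flu are from Table 1; corpus counts are made up.
tf_flu = {"virus": 26, "government": 13}
corpus_counts = {"virus": 100, "government": 1000, "the": 10000}
weights = weight_context(tf_flu, corpus_counts)
print(weights)
```

With these made-up counts, virus is weighted by an IDF of 3 and government by an IDF of 2, so the rarer word gains relative importance. Using the IDF reported above instead, the weight of virus in the context of flu is 26 × 1.81 ≈ 47.06, which agrees with the value for virus in Table 3 up to rounding.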
The ranking of the 20 words in the contexts of 流感/liougan is rearranged by this weighting factor as shown in Table 3.
Table 3: virus is a stronger bridge than Hong Kong

English       Weight    Chinese (English gloss)   Weight
bird          259.97    (virus)                   282.70
spread         51.41    (infection)               187.50
virus          47.07    (citizens)                163.49
avian          43.41    (confirmed)               161.89
scare          36.65    (infected)                158.43
deadly         35.15    (patient)                 132.14
spreading      30.49    (suspected)               123.08
suspected      28.83    (doctor)                  108.54
symptoms       28.43    (hospital)                102.73
prevent        26.93    (discover)                 98.09
people         23.09    (event)                    83.75
crisis         22.72    (Hong Kong)                69.68
health         21.97    (yesterday)                66.84
new            17.80    (possible)                 60.20
government     16.04    (no)                       59.76
chickens       15.12    (government)               59.41
6 Ranking translation candidates 
Next, a ranking algorithm is needed to match 
the unknown word vectors to their counterparts 
in the other language. A ranking algorithm se- 
lects the best target language candidate for a 
source language word according to direct com- 
parison of some similarity measures (Frakes and 
Baeza-Yates, 1992). 
We modify the similarity measure proposed by (Salton and Buckley, 1988) into the following S0:

$$S_0(W_c, W_e) = \frac{\sum_{i=1}^{t} (w_{ic} \times w_{ie})}{\sqrt{\sum_{i=1}^{t} w_{ic}^2 \times \sum_{i=1}^{t} w_{ie}^2}}$$

where $w_{ic} = TF_{ic}$
      $w_{ie} = TF_{ie}$
Variants of similarity measures such as the above have been used extensively in the IR community (Frakes and Baeza-Yates, 1992). They are mostly based on the Cosine Measure of two vectors. For different tasks, the weighting factor might vary. For example, if we add the IDF into the weighting factor, we get the following measure S1:

$$S_1(W_c, W_e) = \frac{\sum_{i=1}^{t} (w_{ic} \times w_{ie})}{\sqrt{\sum_{i=1}^{t} w_{ic}^2 \times \sum_{i=1}^{t} w_{ie}^2}}$$

where $w_{ic} = TF_{ic} \times IDF_i$
      $w_{ie} = TF_{ie} \times IDF_i$
In addition, the Dice and Jaccard coefficients are also suitable similarity measures for document comparison (Frakes and Baeza-Yates, 1992). We also implement the Dice coefficient as similarity measure S2:

$$S_2(W_c, W_e) = \frac{2\sum_{i=1}^{t} (w_{ic} \times w_{ie})}{\sum_{i=1}^{t} w_{ic}^2 + \sum_{i=1}^{t} w_{ie}^2}$$

where $w_{ic} = TF_{ic} \times IDF_i$
      $w_{ie} = TF_{ie} \times IDF_i$
S1 is often used in comparing a short query with a document text, whereas S2 is used in comparing two document texts. Reasoning that our objective falls somewhere in between, since we are comparing segments of a document, we also multiply the above two measures into a third similarity measure S3 = S1 × S2.
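The measures above can be sketched in a few lines. The sketch assumes that the two vectors are already aligned, i.e. that dimension i of the Chinese vector and dimension i of the English vector correspond to the same seed word pair; the function names are illustrative. Applied to raw TF vectors, `cosine` corresponds to S0; applied to TF × IDF vectors, it corresponds to S1.

```python
import math


def cosine(u, v):
    """Cosine measure: S0 on raw TF vectors, S1 on TF x IDF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dice(u, v):
    """Dice coefficient on weighted vectors (S2)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v)
    return 2.0 * dot / denom if denom else 0.0


def s3(u, v):
    """Product of the cosine and Dice measures (S3 = S1 x S2)."""
    return cosine(u, v) * dice(u, v)


# Made-up weighted vectors over three aligned seed dimensions.
w_c = [1.0, 2.0, 0.0]
w_e = [2.0, 4.0, 0.0]
print(cosine(w_c, w_e), dice(w_c, w_e), s3(w_c, w_e))
```

The two toy vectors point in the same direction, so the cosine measure is 1.0, while the Dice coefficient (0.8) also penalizes the difference in magnitude. The product S3 therefore rewards candidates whose contexts agree in both direction and scale.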
7 Confidence on seed word pairs 
In using bilingual seed words such as the Chinese-English pair for virus as "bridges" for terminology translation, the quality of the bilingual seed lexicon naturally affects the system output. In the case of European language pairs such as French-English, we can envision using words sharing common cognates as these "bridges". Most importantly, we can assume that the word boundaries are similar in French and English. However, the situation is messier with English and Chinese. First, segmentation of the Chinese text into words already introduces some ambiguity into the seed word identities. Secondly, English-Chinese translations are complicated by the fact that the two languages share few stemming properties, part-of-speech sets, or word-order conventions. As a result, every English word has many Chinese translations and vice versa. In a source-target language translation scenario, the translated text can be "rearranged" and cleaned up by a monolingual language model in the target language. However, the lexicon is not very reliable in establishing "bridges" between nonparallel English-Chinese texts. To compensate for this ambiguity in the seed lexicon, we introduce a confidence weighting for each bilingual word pair used as seed words. If a word $i_e$ is the $k$-th candidate for word $i_c$, then $w_{ie}$ is replaced by $w_{ie}/k_i$. The similarity scores then become S4, S5, and S6 = S4 × S5:

$$S_4(W_c, W_e) = \frac{\sum_{i=1}^{t} (w_{ic} \times w_{ie})/k_i}{\sqrt{\sum_{i=1}^{t} w_{ic}^2 \times \sum_{i=1}^{t} w_{ie}^2}}$$

where $w_{ic} = TF_{ic} \times IDF_i$
      $w_{ie} = TF_{ie} \times IDF_i$

$$S_5(W_c, W_e) = \frac{2\sum_{i=1}^{t} (w_{ic} \times w_{ie})/k_i}{\sum_{i=1}^{t} w_{ic}^2 + \sum_{i=1}^{t} w_{ie}^2}$$

where $w_{ic} = TF_{ic} \times IDF_i$
      $w_{ie} = TF_{ie} \times IDF_i$
We also experiment with other combinations of the similarity scores such as S7 = S0 × S5. All similarity measures S3 - S7 are used in the experiment for finding a translation for 流感.
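A minimal sketch of the confidence-weighted measures follows. It assumes aligned vectors as before, plus a list `ranks` in which `ranks[i]` is the position k_i of the English seed word among the lexicon's candidate translations for the Chinese seed word; the function names and the toy values are illustrative assumptions.

```python
import math


def s0(tf_c, tf_e):
    """Cosine measure on raw TF vectors."""
    dot = sum(c * e for c, e in zip(tf_c, tf_e))
    norm = math.sqrt(sum(c * c for c in tf_c) * sum(e * e for e in tf_e))
    return dot / norm if norm else 0.0


def s4(w_c, w_e, ranks):
    """Cosine-style measure with each product divided by its rank k_i."""
    num = sum(c * e / k for c, e, k in zip(w_c, w_e, ranks))
    norm = math.sqrt(sum(c * c for c in w_c) * sum(e * e for e in w_e))
    return num / norm if norm else 0.0


def s5(w_c, w_e, ranks):
    """Dice-style measure with each product divided by its rank k_i."""
    num = sum(c * e / k for c, e, k in zip(w_c, w_e, ranks))
    denom = sum(c * c for c in w_c) + sum(e * e for e in w_e)
    return 2.0 * num / denom if denom else 0.0


def s6(w_c, w_e, ranks):
    return s4(w_c, w_e, ranks) * s5(w_c, w_e, ranks)


def s7(tf_c, tf_e, w_c, w_e, ranks):
    return s0(tf_c, tf_e) * s5(w_c, w_e, ranks)


# Made-up example: the second seed pair is only the 2nd-ranked translation.
w_c, w_e, ranks = [1.0, 2.0], [2.0, 4.0], [1, 2]
print(s4(w_c, w_e, ranks), s5(w_c, w_e, ranks), s6(w_c, w_e, ranks))
```

In the toy example the second dimension contributes only half of its product because its seed pair is a second-ranked translation, so S4 drops from 1.0 (the unweighted cosine) to 0.6.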
8 Results 
In order to apply the above algorithm to find the translation for 流感/liougan from the HKStandard/Mingpao corpus, we first use a script to select the 118 English content words which are not in the lexicon as possible candidates. Using similarity measures S3-S7, the highest ranking candidates of 流感 are shown in Table 6. S6 and S7 appear to be the best similarity measures.
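The candidate ranking step can be sketched as below. The function `rank_candidates` and the toy vectors are illustrative assumptions; any of the similarity measures can be passed in as `similarity`, and a plain cosine measure is used here only to keep the example short.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(source_vector, candidate_vectors, similarity):
    """Score every target-language candidate against the source word vector
    and return (score, word) pairs, best first."""
    scored = [
        (similarity(source_vector, vector), word)
        for word, vector in candidate_vectors.items()
    ]
    return sorted(scored, reverse=True)


# Made-up vectors over three aligned seed dimensions.
liougan = [6.0, 2.0, 0.0]
candidates = {
    "flu": [3.0, 1.0, 0.0],
    "Africa": [0.0, 0.0, 5.0],
}
ranking = rank_candidates(liougan, candidates, cosine)
print(ranking[0][1])  # flu
```

The first element of the returned list is the proposed translation; lower-ranked elements are the alternative candidates.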
We then test the algorithm with S7 on more Chinese words which are not found in the lexicon but which occur frequently enough in the Mingpao texts. A statistical new word extraction tool can be used to find these words. The unknown Chinese words and their English counterparts, as well as the occurrence frequencies of these words in HKStandard/Mingpao, are shown in Table 4. A frequency marked with a * indicates that the word does not occur frequently enough to be found. A Chinese word marked with a * is a word with segmentation and translation ambiguities. For example, 林 (Lam) could be a family name, or part of another word meaning forest. When it is used as a family name, it could be transliterated into Lam in Cantonese or Lin in Mandarin.
Table 4: Unknown words which occur often

Freq.   Chinese (English gloss)   Freq.   English
  59    (Causeway)                 37*    Causeway
1965    (Chau)*                    49     Chau
 481    (Chee-hwa)                 77     Chee-hwa
 115    (Chek)*                    28     Chek
 164    (Diana)                   100     Diana
3164    (Fong)*                    32     Fong
2274    (HONG)                     60     HONG
1128    (Huang)*                   30     Huang
 477    (Ip)*                      32     Ip
1404    (Lam)*                    175     Lam
 687    (Lau)*                    111     Lau
 324    (Lei)                      30     Lei
 967    (Leung)                   145     Leung
 312    (Lunar)                    36     Lunar
 164    (Minister)                197     Minister
 949    (Personal)                  8*    Personal
  56    (Pornography)              13*    Pornography
 493    (Poultry)                  57     Poultry
1027    (President)               239     President
 946    (Qian)*                    62     Qian
 154    (Qichen)                   28*    Qichen
 824    (SAR)                     142     SAR
 325    (Tam)*                    154     Tam
 281    (Tang)                     80     Tang
 307    (Teng-hui)                 37     Teng-hui
 350    (Tuen)                     76     Tuen
1052    (Tung)                    274     Tung
  79    (Versace)*                 74     Versace
 107    (Yeltsin)                 100     Yeltsin
 112    (Zhuhai)                   76     Zhuhai
1171    (flu)                     491     flu

Disregarding all entries with a * in the above table, we apply the algorithm to the rest of the Chinese unknown words and the 118 English unknown words from HKStandard. The output is ranked by the similarity scores. The highest ranking translated pairs are shown in Table 5.
The only Chinese unknown words which are not correctly translated in that list are the words for Lunar and Yeltsin [1]. Tung/Chee-Hwa is a pair of collocates which is actually the full name of the Chief Executive. Poultry in Chinese is closely related to flu because the Chinese name for bird flu is poultry flu. In fact, almost all unambiguous Chinese new words find their translations in the first 100 of the ranked list. Six of the Chinese words have the correct translation as their first candidate.
Table 5: Some Chinese unknown word translation output

score      English         Chinese (English gloss)
0.008421   Teng-hui        (Teng-hui)
0.007895   SAR
0.007669   flu             (flu)
0.007588   Lei             (Lei)
0.007283   poultry         (Poultry)
0.006812   SAR             (Chee-hwa)
0.006430   hijack          (Teng-hui)
0.006218   poultry         (SAR)
0.005921   Tung            (Chee-hwa)
0.005527   Diaoyu          (Teng-hui)
0.005335   PrimeMinister   (Teng-hui)
0.005335   President       (Teng-hui)
0.005221   China           (Lam)
0.004731   Lien            (Teng-hui)
0.004470   poultry         (Chee-hwa)
0.004275   China           (Teng-hui)
0.003878   flu             (Lei)
0.003859   PrimeMinister   (Chee-hwa)
0.003859   President       (Chee-hwa)
0.003784   poultry         (Leung)
0.003686   Kalkanov        (Zhuhai)
0.003550   poultry         (Lei)
0.003519   SAR             (Yeltsin)
0.003481   Zhuhai          (Chee-hwa)
0.003407   PrimeMinister   (Lam)
0.003407   President       (Lam)
0.003338   flu             (Poultry)
0.003324   apologise       (Teng-hui)
0.003250   DPP             (Teng-hui)
0.003206   Tang            (Tang)
0.003202   Tung            (Leung)
0.003040   Leung           (Leung)
0.003033   China           (SAR)
0.002888   Zhuhai          (Lunar)
0.002886   Tung            (Tung)

[1] Lunar is not an unknown word in English; Yeltsin finds its translation in the 4th candidate.

9 Related work
Using a vector space model and similarity measures for ranking is a common approach in IR for query/text and text/text comparisons (Salton and Buckley, 1988; Salton and Yang, 1973; Croft, 1984; Turtle and Croft, 1992; Bookstein, 1983; Korfhage, 1995; Jones, 1979). This approach has also been used by (Dagan and Itai, 1994; Gale et al., 1992; Schütze, 1992; Gale et al., 1993; Yarowsky, 1995; Gale and Church, 1994) for sense disambiguation between multiple usages of the same word. Some of the early statistical terminology translation methods are (Brown et al., 1993; Wu and Xia, 1994; Dagan and Church, 1994; Gale and Church, 1991; Kupiec, 1993; Smadja et al., 1996; Kay and Röscheisen, 1993; Fung and Church, 1994; Fung, 1995b). These algorithms all require parallel, translated texts as input. Attempts at exploring nonparallel corpora for terminology translation are very few (Rapp, 1995; Fung, 1995a; Fung and McKeown, 1997). Among these, (Rapp, 1995) proposes that the association between a word and its close collocate is preserved in any language, and (Fung and McKeown, 1997) suggests that the associations between a word and many seed words are also preserved in another language. In this paper, we have demonstrated that the associations between a word and its context seed words are well-preserved in nonparallel, comparable texts of different languages.
10 Discussions 
Our algorithm is the first to have generated a 
collocation bilingual lexicon, albeit small, from 
a nonparallel, comparable corpus. We have 
shown that the algorithm has good precision, 
but the recall is low due to the difficulty in 
extracting unambiguous Chinese and English 
words. 
Better results can be obtained when the following changes are made:
• improve seed word lexicon reliability by stemming and POS tagging on both English and Chinese texts;
• improve Chinese segmentation by using a larger monolingual Chinese lexicon;
• use a larger corpus to generate more unknown words and their candidates by statistical methods.
We will test the precision and recall of the algorithm on a larger set of unknown words.
11 Conclusions 
We have devised an algorithm using context seed word TF/IDF for extracting a bilingual lexicon from a nonparallel, comparable English-Chinese corpus. This algorithm takes into account the reliability of bilingual seed words and is language independent. It can be applied to other language pairs such as English-French or English-German. In these cases, since the languages are more similar linguistically and the seed word lexicon is more reliable, the algorithm should yield better results. The algorithm can also be applied in an iterative fashion, where high-ranking bilingual word pairs are added to the seed word list, which in turn can yield more new bilingual word pairs.
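The iterative use of the algorithm has not been implemented in this paper; the following is only a sketch of how such a loop might look. The callback `find_translations`, the acceptance `threshold`, the round limit, and the placeholder tokens of the form `ZH:...` are all hypothetical.

```python
def bootstrap(seed_pairs, find_translations, threshold, max_rounds=5):
    """Grow the seed list by repeatedly adding high-scoring new pairs.

    `find_translations(seeds)` is a hypothetical callback that runs the
    ranking step with the current seeds and returns {(chinese, english): score}.
    """
    seeds = set(seed_pairs)
    for _ in range(max_rounds):
        scored = find_translations(seeds)
        new_pairs = {
            pair
            for pair, score in scored.items()
            if score >= threshold and pair not in seeds
        }
        if not new_pairs:
            break  # nothing confident enough was found; stop iterating
        seeds |= new_pairs
    return seeds


def toy_scorer(seeds):
    """Made-up scorer: a second pair only becomes findable after the first."""
    if ("ZH:flu", "flu") in seeds:
        return {("ZH:poultry", "poultry"): 0.9}
    return {("ZH:flu", "flu"): 0.8, ("ZH:lunar", "Zhuhai"): 0.1}


result = bootstrap({("ZH:virus", "virus")}, toy_scorer, threshold=0.5)
print(sorted(result))
```

In the toy run, the pair for flu is accepted in the first round and then acts as a new bridge that makes the pair for poultry findable in the second round, while the low-scoring pair is never added. A high threshold matters in such a loop, since a wrongly accepted pair would propagate errors into later rounds.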

References 
A. Bookstein. 1983. Explanation and generalization of vector 
models in information retrieval. In Proceedings of the 6th 
Annual International Conference on Research and Devel- 
opment in Information Retrieval, pages 118-132. 
P. Brown, J. Lai, and R. Mercer. 1991. Aligning sentences in
parallel corpora. In Proceedings of the 29th Annual Conference
of the Association for Computational Linguistics.
P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, and R.L. 
Mercer. 1993. The mathematics of machine transla- 
tion: Parameter estimation. Computational Linguistics, 
19(2):263-311. 
Kenneth Church. 1993. Char_align: A program for aligning
parallel texts at the character level. In Proceedings of the
31st Annual Conference of the Association for Computational
Linguistics, pages 1-8, Columbus, Ohio, June.
W. Bruce Croft. 1984. A comparison of the cosine correla- 
tion and the modified probabilistic model. In Information 
Technology, volume 3, pages 113-114. 
Ido Dagan and Kenneth W. Church. 1994. Termight: Iden- 
tifying and translating technical terminology. In Proceed- 
ings of the 4th Conference on Applied Natural Language 
Processing, pages 34-40, Stuttgart, Germany, October. 
Ido Dagan and Alon Itai. 1994. Word sense disambiguation 
using a second language monolingual corpus. In Compu- 
tational Linguistics, pages 564-596. 
William B. Frakes and Ricardo Baeza-Yates, editors. 1992.
Information Retrieval: Data Structures & Algorithms.
Prentice-Hall.
Pascale Fung and Kenneth Church. 1994. Kvec: A new approach
for aligning parallel texts. In Proceedings of COLING 94,
pages 1096-1102, Kyoto, Japan, August.
Pascale Fung and Kathleen McKeown. 1997. Finding termi- 
nology translations from non-parallel corpora. In The 5th 
Annual Workshop on Very Large Corpora, pages 192-202, 
Hong Kong, Aug. 
Pascale Fung and Dekai Wu. 1994. Statistical augmentation 
of a Chinese machine-readable dictionary. In Proceedings 
of the Second Annual Workshop on Very Large Corpora, 
pages 69-85, Kyoto, Japan, June. 
Pascale Fung. 1995a. Compiling bilingual lexicon entries from
a non-parallel English-Chinese corpus. In Proceedings of
the Third Annual Workshop on Very Large Corpora, pages
173-183, Boston, Massachusetts, June.
Pascale Fung. 1995b. A pattern matching method for finding
noun and proper noun translations from noisy parallel
corpora. In Proceedings of the 33rd Annual Conference of
the Association for Computational Linguistics, pages 236-243,
Boston, Massachusetts, June.
William Gale and Kenneth Church. 1991. Identifying word 
correspondences in parallel text. In Proceedings of the 
Fourth Darpa Workshop on Speech and Natural Language, 
Asilomar. 
William A. Gale and Kenneth W. Church. 1993. A program 
for aligning sentences in bilingual corpora. Computational 
Linguistics, 19(1):75-102. 
William A. Gale and Kenneth W. Church. 1994. Discrimination
decisions in 100,000 dimensional spaces. Current
Issues in Computational Linguistics: In honour of Don
Walker, pages 429-550.
W. Gale, K. Church, and D. Yarowsky. 1992. Estimating 
upper and lower bounds on the performance of word-sense 
disambiguation programs. In Proceedings of the 30th Con- 
ference of the Association for Computational Linguistics. 
Association for Computational Linguistics. 
W. Gale, K. Church, and D. Yarowsky. 1993. A method for 
disambiguating word senses in a large corpus. In Comput- 
ers and Humanities, volume 26, pages 415-439. 
K. Sparck Jones. 1979. Experiments in relevance weighting 
of search terms. In Information Processing and Manage- 
ment, pages 133-144. 
Martin Kay and Martin Röscheisen. 1993. Text-Translation
alignment. Computational Linguistics, 19(1):121-142.
Robert Korfhage. 1995. Some thoughts on similarity mea- 
sures. In The SIGIR Forum, volume 29, page 8. 
Julian Kupiec. 1993. An algorithm for finding noun phrase 
correspondences in bilingual corpora. In Proceedings of the 
31st Annual Conference of the Association for Computa- 
tional Linguistics, pages 17-22, Columbus, Ohio, June. 
Reinhard Rapp. 1995. Identifying word translations in
non-parallel texts. In Proceedings of the 33rd Conference of
the Association for Computational Linguistics, student
session, pages 321-322, Boston, Mass.
G. Salton and C. Buckley. 1988. Term-weighting approaches 
in automatic text retrieval. In Information Processing and 
Management, pages 513-523. 
G. Salton and C. Yang. 1973. On the specification of term 
values in automatic indexing, volume 29. 
Hinrich Schütze. 1992. Dimensions of meaning. In Proceedings
of Supercomputing '92.
M. Simard, G. Foster, and P. Isabelle. 1992. Using cognates
to align sentences in bilingual corpora. In Proceedings
of the Fourth International Conference on Theoretical and
Methodological Issues in Machine Translation, Montreal,
Canada.
Frank Smadja, Kathleen McKeown, and Vasileios Hatzivassiloglou.
1996. Translating collocations for bilingual lexicons:
A statistical approach. Computational Linguistics,
21(4):1-38.
Howard R. Turtle and W. Bruce Croft. 1992. A compari- 
son of text retrieval methods. In The Computer Journal, 
volume 35, pages 279-290. 
Dekai Wu and Xuanyin Xia. 1994. Learning an English- 
Chinese lexicon from a parallel corpus. In Proceedings 
of the First Conference of the Association for Machine 
Translation in the Americas, pages 206-213, Columbia, 
Maryland, October. 
D. Yarowsky. 1995. Unsupervised word sense disambiguation
rivaling supervised methods. In Proceedings of the 33rd
Conference of the Association for Computational Linguistics,
pages 189-196. Association for Computational Linguistics.
