AN ALGORITHM FOR FINDING 
NOUN PHRASE CORRESPONDENCES 
IN BILINGUAL CORPORA 
Julian Kupiec 
Xerox Palo Alto Research Center 
3333 Coyote Hill Road, Palo Alto, CA 94304
kupiec@parc.xerox.com
Abstract 
The paper describes an algorithm that employs 
English and French text taggers to associate noun 
phrases in an aligned bilingual corpus. The taggers
provide part-of-speech categories which are
used by finite-state recognizers to extract simple 
noun phrases for both languages. Noun phrases 
are then mapped to each other using an iterative 
re-estimation algorithm that bears similarities to 
the Baum-Welch algorithm which is used for train- 
ing the taggers. The algorithm provides an alter- 
native to other approaches for finding word cor- 
respondences, with the advantage that linguistic 
structure is incorporated. Improvements to the 
basic algorithm are described, which enable con- 
text to be accounted for when constructing the 
noun phrase mappings. 
INTRODUCTION 
Areas of investigation using bilingual corpora have 
included the following: 
• Automatic sentence alignment \[Kay and
Röscheisen, 1988, Brown et al., 1991a, Gale
and Church, 1991b\].
• Word-sense disambiguation \[Dagan et al.,
1991, Brown et al., 1991b, Church and Gale,
1991\].
• Extracting word correspondences \[Gale and 
Church, 1991a\]. 
• Finding bilingual collocations \[Smadja, 1992\]. 
• Estimating parameters for statistically-based 
machine translation \[Brown et al., 1992\]. 
The work described here makes use of the 
aligned Canadian Hansards \[Gale and Church, 
1991b\] to obtain noun phrase correspondences be- 
tween the English and French text. 
The term "correspondence" is used here to sig- 
nify a mapping between words in two aligned sen- 
tences. Consider an English sentence Ei and a 
French sentence Fi which are assumed to be ap- 
proximate translations of each other. The sub- 
script i denotes the i'th alignment of sentences in 
both languages. A word sequence in Ei is defined
here as the correspondence of another sequence in 
Fi if the words of one sequence are considered to 
represent the words in the other. 
Single word correspondences have been investi- 
gated \[Gale and Church, 1991a\] using a statistic 
operating on contingency tables. An algorithm for 
producing collocational correspondences has also 
been described \[Smadja, 1992\]. The algorithm in- 
volves several steps. English collocations are first 
extracted from the English side of the corpus. In- 
stances of the English collocation are found and 
the mutual information is calculated between the 
instances and various single word candidates in 
aligned French sentences. The highest ranking 
candidates are then extended by another word and 
the procedure is repeated until a corresponding 
French collocation having the highest mutual in- 
formation is found. 
An alternative approach is described here, 
which employs simple iterative re-estimation. It 
is used to make correspondences between simple 
noun phrases that have been isolated in corre- 
sponding sentences of each language using finite- 
state recognizers. The algorithm is applicable for 
finding single or multiple word correspondences 
and can accommodate additional kinds of phrases. 
In contrast to the other methods that have been 
mentioned, the algorithm can be extended in a 
straightforward way to enable correct correspondences
to be made in circumstances where numerous
low-frequency phrases are involved. This is an
important consideration because in large text corpora
roughly a third of the word types occur only
once.
Several applications for bilingual correspon- 
dence information have been suggested. They can 
be used in bilingual concordances, for automat- 
ically constructing bilingual lexicons, and proba- 
bilistically quantified correspondences may be use- 
ful for statistical translation methods. 
COMPONENTS 
Figure 1 illustrates how the corpus is analyzed. 
The words in sentences are first tagged with their 
corresponding part-of-speech categories. Each 
tagger contains a hidden Markov model (HMM), 
which is trained using samples of raw text from 
the Hansards for each language. The taggers are 
robust and operate with a low error rate \[Ku- 
piec, 1992\]. Simple noun phrases (excluding pro- 
nouns and digits) are then extracted from the sen- 
tences by finite-state recognizers that are specified 
by regular expressions defined in terms of part-of- 
speech categories. Simple noun phrases are iden- 
tified because they are most reliably recognized; 
it is also assumed that they can be identified un- 
ambiguously. The only embedding that is allowed 
is by prepositional phrases involving "of" in En- 
glish and "de" in French, as noun phrases involv- 
ing them can be identified with relatively low error 
(revisions to this restriction are considered later). 
Noun phrases are placed in an index to associate 
a unique identifier with each one. 
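As an illustration only, the recognizer step can be sketched as a regular expression applied to the tag sequence of a sentence. The tag names (DET, ADJ, NOUN, OF), the one-letter encoding, and the pattern itself are assumptions made for this sketch; they are not the tag set or the expressions used by the recognizers described here. Leading determiners are dropped, since a phrase is defined without them.

```python
import re

# Hypothetical one-letter codes for a toy tag set (not the paper's tag set).
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "OF": "O"}

# Simple NP: optional determiner, any adjectives, one or more nouns,
# optionally followed by "of"-phrases that are themselves simple NPs.
SIMPLE_NP = re.compile(r"D?A*N+(?:OD?A*N+)*")


def extract_noun_phrases(tagged):
    """Return simple noun phrases from a list of (word, tag) pairs."""
    # One character per token, so regex offsets equal token offsets.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in SIMPLE_NP.finditer(codes):
        span = tagged[match.start():match.end()]
        # A leading determiner is not part of the phrase.
        if span[0][1] == "DET":
            span = span[1:]
        phrases.append(" ".join(word for word, _ in span))
    return phrases


sentence = [
    ("The", "DET"), ("whole", "ADJ"), ("issue", "NOUN"), ("of", "OF"),
    ("free", "ADJ"), ("trade", "NOUN"), ("has", "VERB"), ("been", "VERB"),
    ("mentioned", "VERB"),
]
print(extract_noun_phrases(sentence))  # ['whole issue of free trade']
```

Encoding each tag as a single character keeps the pattern readable and lets match offsets index directly into the token list.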
A noun phrase is defined by its word sequence, 
excluding any leading determiners. Singular and 
plural forms of common nouns are thus distinct 
and assigned different positions in the index. For 
each sentence corresponding to an alignment, the 
index positions of all noun phrases in the sentence 
are recorded in a separate data structure, provid- 
ing a compact representation of the corpus. 
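A minimal sketch of such an index and of the per-alignment record of index positions is given below. The class and function names are hypothetical, and positions are counted from 0 here, whereas the notation of the next section counts from 1.

```python
class PhraseIndex:
    """Assigns a unique integer position to each distinct noun phrase."""

    def __init__(self):
        self.position = {}  # phrase -> index position
        self.phrases = []   # index position -> phrase

    def add(self, phrase):
        """Return the position of phrase, adding it if it is new."""
        if phrase not in self.position:
            self.position[phrase] = len(self.phrases)
            self.phrases.append(phrase)
        return self.position[phrase]


def encode_corpus(alignments):
    """alignments: one (English phrases, French phrases) pair per alignment."""
    english, french = PhraseIndex(), PhraseIndex()
    encoded = [
        ([english.add(p) for p in e_nps], [french.add(p) for p in f_nps])
        for e_nps, f_nps in alignments
    ]
    return english, french, encoded


alignments = [
    (["Government", "House"], ["gouvernement", "Chambre"]),
    (["House"], ["Chambre"]),
    (["Hon. Member", "Hon. Members"], ["député", "députés"]),
]
english, french, encoded = encode_corpus(alignments)
print(encoded)  # [([0, 1], [0, 1]), ([1], [1]), ([2, 3], [2, 3])]
```

Singular and plural forms receive different positions, as stated above, because the index is keyed on the exact word sequence.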
So far it has been assumed (for the sake of sim- 
plicity) that there is always a one-to-one mapping 
between English and French sentences. In prac- 
tice, if an alignment program produces blocks of 
several sentences in one or both languages, this 
can be accommodated by treating the block in- 
stead as a single bigger "compound sentence" in 
which noun phrases have a higher number of pos- 
sible correspondences. 
THE MAPPING ALGORITHM 
Some terminology is necessary to describe the algorithm
concisely. Let there be L total alignments
in the corpus; then Ei is the English sentence for
alignment i. Let the function φ(Ei) be the number
of noun phrases identified in the sentence. If
there are k of them, k = φ(Ei), and they can
be referenced by j = 1...k. Considering the j'th
noun phrase in sentence Ei, the function μ(Ei, j)
produces an identifier for the phrase, which is the
position of the phrase in the English index. If this
phrase is at position s, then μ(Ei, j) = s.
In turn, the French sentence Fi will contain
φ(Fi) noun phrases and given the p'th one, its position
in the French index will be given by μ(Fi, p).
It will also be assumed that there are a total of
VE and VF phrases in the English and French indexes
respectively. Finally, the indicator function
I() has the value unity if its argument is true, and
zero otherwise.

Figure 1: Component Layout. The i'th alignment of the bilingual corpus
supplies an English sentence Ei and a French sentence Fi. Ei passes through
the English Tagger and the English NP Recognizer into the English Index;
Fi passes through the French Tagger and the French NP Recognizer into the
French Index.

Assuming these definitions, the algorithm is
stated in Figure 2. The equations assume a directionality:
finding French "target" correspondences
for English "source" phrases. The algorithm is reversible,
by swapping E with F.
The model for correspondence is that a source 
noun phrase in Ei is responsible for producing the 
various different target noun phrases in Fi with 
correspondingly different probabilities. Two quantities
are calculated: Cr(s, t) and Pr(s, t). Computation
proceeds by evaluating Equation (1), Equation (2)
and then iteratively applying Equations
(3) and (2), with r increasing with each successive iteration.
The argument s refers to the English noun
phrase npE(s) having position s in the English
index, and the argument t refers to the French
noun phrase npF(t) at position t in the French
index. Equation (1) assumes that each English
noun phrase in Ei is initially equally likely to correspond
to each French noun phrase in Fi. All correspondences
are thus equally weighted, reflecting
a state of ignorance. Weights are summed over
the corpus, so noun phrases that co-occur in several
sentences will have larger sums. The weights
C0(s, t) can be interpreted as the mean number of
times that npF(t) corresponds to npE(s) given the
corpus and the initial assumption of equiprobable
correspondences.
These weights can be used to form a new estimate
of the probability that npF(t) corresponds to npE(s),
by considering the mean number of times npF(t)
corresponds to npE(s) as a fraction of the
total mean number of correspondences for npE(s),
as in Equation (2). The procedure is then iterated
using Equations (3) and (2) to obtain successively
refined, convergent estimates of the probability
that npF(t) corresponds to npE(s).

$$C_0(s,t) = \sum_{i=1}^{L} \sum_{j=1}^{\phi(E_i)} \sum_{k=1}^{\phi(F_i)} I(\mu(E_i,j) = s) \, I(\mu(F_i,k) = t) \, \frac{1}{\phi(F_i)} \qquad (1)$$

$$P_{r-1}(s,t) = \frac{C_{r-1}(s,t)}{\sum_{q=1}^{V_F} C_{r-1}(s,q)} \qquad (2)$$

$$C_r(s,t) = \sum_{i=1}^{L} \sum_{j=1}^{\phi(E_i)} \sum_{k=1}^{\phi(F_i)} I(\mu(E_i,j) = s) \, I(\mu(F_i,k) = t) \, P_{r-1}(s,t) \qquad (3)$$

$$r > 0, \qquad V_E \geq s \geq 1, \qquad V_F \geq t \geq 1$$

Figure 2: The Algorithm

The probability of correspondences can be used as a
method of ranking them (occurrence counts can
be taken into account as an indication of the reliability
of a correspondence). Although Figure 2
defines the coefficients simply, the algorithm is not
implemented literally from it. The algorithm employs
a compact representation of the correspondences
for efficient operation. An arbitrarily large
corpus can be accommodated by segmenting it appropriately.
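For illustration, the three equations of Figure 2 can be transcribed almost literally. The sketch below is such a literal reading, not the compact implementation referred to above, which is not described in detail. The function names, the dictionary-based representation, and the toy corpus are assumptions; phrases are identified here by strings rather than index positions.

```python
from collections import defaultdict


def initial_counts(corpus):
    """Equation (1): each source phrase spreads unit weight evenly over
    the target phrases of its alignment."""
    counts = defaultdict(float)
    for source_ids, target_ids in corpus:
        for s in source_ids:
            for t in target_ids:
                counts[s, t] += 1.0 / len(target_ids)
    return counts


def normalize(counts):
    """Equation (2): P(s, t) = C(s, t) / sum over q of C(s, q)."""
    totals = defaultdict(float)
    for (s, _), c in counts.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}


def reestimate(corpus, probs):
    """Equation (3): weight every co-occurrence by the previous estimate."""
    counts = defaultdict(float)
    for source_ids, target_ids in corpus:
        for s in source_ids:
            for t in target_ids:
                counts[s, t] += probs[s, t]
    return counts


def map_phrases(corpus, iterations=4):
    """Apply Equations (1) and (2), then iterate Equations (3) and (2)."""
    probs = normalize(initial_counts(corpus))
    for _ in range(iterations):
        probs = normalize(reestimate(corpus, probs))
    return probs


# Toy corpus: one (English phrases, French phrases) pair per alignment.
corpus = [
    (["House", "Government"], ["Chambre", "gouvernement"]),
    (["House"], ["Chambre"]),
    (["Government"], ["gouvernement"]),
]
probs = map_phrases(corpus)
for (s, t), p in sorted(probs.items()):
    print(f"{s} -> {t}: {p:.3f}")
```

In this toy corpus the estimate for "House" and "Chambre" starts at 0.75 after Equations (1) and (2) and moves towards 1 with each iteration, because the pair also co-occurs in an alignment with no competing phrase.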
The algorithm described here is an instance of 
a general approach to statistical estimation, rep- 
resented by the EM algorithm \[Dempster et al., 
1977\]. In contrast to reservations that have been 
expressed \[Gale and Church, 1991a\] about us- 
ing the EM algorithm to provide word correspon- 
dences, there have been no indications that pro- 
hibitive amounts of memory might be required, or 
that the approach lacks robustness. Unlike the 
other methods that have been mentioned, the ap- 
proach has the capability to accommodate more 
context to improve performance. 
RESULTS 
A sample of the aligned corpus comprising 2,600 
alignments was used for testing the algorithm (not 
all of the alignments contained sentences). 4,900 
distinct English noun phrases and 5,100 distinct 
French noun phrases were extracted from the sam- 
ple. 
When forming correspondences involving long 
sentences with many clauses, it was observed that 
the position at which a noun phrase occurred in Ei
was very roughly proportional to the position of the
corresponding noun phrase in Fi. In such cases it was not
necessary to form correspondences with all noun 
phrases in Fi for each noun phrase in Ei. Instead, 
the location of a phrase in Ei was mapped lin- 
early to a position in Fi and correspondences were 
formed for noun phrases occurring in a window 
around that position. This resulted in a total of 
34,000 correspondences. The mappings are stable 
within a few (2-4) iterations. 
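The windowing step might be sketched as follows. The text does not give the window size or say whether positions are measured in words or in noun phrases; the sketch assumes positions are noun phrase ordinals within the sentence, and `half_width` is a hypothetical parameter.

```python
def windowed_candidates(num_source, num_target, half_width=2):
    """For each source phrase ordinal j, list the target phrase ordinals
    lying within half_width of the linearly mapped position."""
    candidates = {}
    for j in range(num_source):
        if num_source > 1:
            # Map the ends of the source sentence onto the ends of the target.
            centre = round(j * (num_target - 1) / (num_source - 1))
        else:
            centre = (num_target - 1) // 2
        low = max(0, centre - half_width)
        high = min(num_target - 1, centre + half_width)
        candidates[j] = list(range(low, high + 1))
    return candidates


# Five English phrases against nine French phrases, window of +/- 1.
for j, targets in windowed_candidates(5, 9, half_width=1).items():
    print(j, targets)
```

Restricting candidates in this way reduces the number of correspondences formed for long sentences, at the cost of missing a correspondence whenever the translation reorders phrases by more than the window allows.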
In discussing results, a selection of examples will 
be presented that demonstrates the strengths and 
weaknesses of the algorithm. To give an indication 
of noun phrase frequency counts in the sample, the 
highest ranking correspondences are shown in Ta- 
ble 1. The figures in columns (1) and (3) indicate 
the number of instances of the noun phrase to their 
right. 
Count  English noun phrase  Count  French noun phrase
185    Mr. Speaker          187    M. Le Président
128    Government           141    gouvernement
60     Prime Minister       65     Premier Ministre
63     Hon. Member          66     député
67     House                68     Chambre
Table 1: Common correspondences
To give an informal impression of overall per- 
formance, the hundred highest ranking correspon- 
dences were inspected and of these, ninety were 
completely correct. Less frequently occurring 
noun phrases are also of interest for purposes of 
evaluation; some of these are shown in Table 2. 
Count  English noun phrase                   Count  French noun phrase
32     Atlantic Canada Opportunities Agency  23     Agence de promotion économique du Canada atlantique
5      DREE                                  4      MEER
1      late spring                           1      fin du printemps
1      whole issue of free trade             1      question du libre-échange
Table 2: Other correspondences
The table also illustrates an unembedded En- 
glish noun phrase having multiple prepositional 
phrases in its French correspondent. Organizational
acronyms (which may not be available in
general-purpose dictionaries) are also extracted, as 
the taggers are robust. Even when a noun phrase 
only occurs once, a correct correspondence can be 
found if there are only single noun phrases in each 
sentence of the alignment. This is demonstrated 
in the last row of Table 2, which is the result of 
the following alignment: 
Ei: "The whole issue of free trade has been men- 
tioned." 
Fi: "On a mentionné la question du libre-échange."
Table 3 shows some incorrect correspondences 
produced by the algorithm (in the table, "usine" 
means "factory"). 
Count  English noun phrase   Count  French noun phrase
1      off-the-job training  6      usine
1      mix of on-the-job     6      usine
Table 3
The sentences that are responsible for these cor- 
respondences illustrate some of the problems asso- 
ciated with the correspondence model: 
Ei: "They use what is known as the dual system 
in which there is a mix of on-the-job and off- 
the-job training." 
Fi: "Ils ont recours à une formation mixte, partie
en usine et partie hors usine." 
The first problem is that the conjunctive modifiers 
in the English sentence cannot be accommodated 
by the noun phrase recognizer. The tagger also 
assigned "on-the-job" as a noun when adjectival 
use would be preferred. If verb correspondences 
were included, there is a mismatch between the 
three that exist in the English sentence and the 
single one in the French. If the English were to 
reflect the French for the correspondence model 
to be appropriate, the noun phrases would per- 
haps be "part in the factory" and "part out of 
the factory". Considered as a translation, this 
is lame. The majority of errors that occur are 
not the result of incorrect tagging or noun phrase 
recognition, but are the result of the approximate 
nature of the correspondence model. The corre- 
spondences in Table 4 are likewise flawed (in the 
table, "souris" means "mouse" and "tigre de pa- 
pier" means "paper tiger"): 
1 toothless tiger 1 souris 
1 toothless tiger 1 tigre de papier 
1 roaring rabbit 1 souris 
1 roaring rabbit 1 tigre de papier 
Table 4 
These correspondences are the result of the fol- 
lowing sentences: 
Ei: "It is a roaring rabbit, a toothless tiger." 
Fi: "C' est un tigre de papier, un souris qui rugit." 
In the case of the alliterative English phrase "roar- 
ing rabbit", the (presumably) rhetorical aspect is 
preserved as a rhyme in "souris qui rugit"; the re- 
sult being that "rabbit" corresponds to "souris" 
(mouse). Here again, even if the best correspon- 
dence were made the result would be wrong be- 
cause of the relatively sophisticated considerations 
involved in the translation. 
EXTENSIONS 
As regards future possibilities, the algorithm lends 
itself to a range of improvements and applications, 
which are outlined next. 
Finding Word Correspondences: The algo- 
rithm finds corresponding noun phrases but pro- 
vides no information about word-level correspon- 
dences within them. One possibility is simply to 
eliminate the tagger and noun phrase recognizer 
(treating all words as individual phrases of length 
unity and having a larger number of correspon- 
dences). Alternatively, the following strategy can 
be adopted, which involves fewer total correspon- 
dences. First, the algorithm is used to build noun 
phrase correspondences, then the phrase pairs that 
are produced are themselves treated as a bilingual 
noun phrase corpus. The algorithm is then em- 
ployed again on this corpus, treating all words as 
individual phrases. This results in a set of sin- 
gle word correspondences for the internal words in 
noun phrases. 
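The second pass can be prepared as sketched below: each phrase pair is treated as one alignment whose "phrases" are its individual words, and the mapping algorithm is then run unchanged on the result. The function name and the splitting on whitespace are assumptions made for the illustration.

```python
def phrase_pairs_to_word_corpus(phrase_pairs):
    """Turn (English phrase, French phrase) pairs into alignments in which
    every word is an individual phrase of length one."""
    return [(e.split(), f.split()) for e, f in phrase_pairs]


pairs = [
    ("Prime Minister", "Premier Ministre"),
    ("late spring", "fin du printemps"),
]
word_corpus = phrase_pairs_to_word_corpus(pairs)

# Number of candidate word correspondences the second pass starts from.
total = sum(len(e_words) * len(f_words) for e_words, f_words in word_corpus)
print(word_corpus)
print(total)  # 10
```

Because candidate words are confined to a single phrase pair rather than a whole sentence pair, the number of correspondences to be considered is much smaller than when the tagger and recognizer are eliminated.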
Reducing Ambiguity: The basic algorithm 
assumes that noun phrases can be uniquely identi- 
fied in both languages, which is only true for sim- 
ple noun phrases. The problem of prepositional 
phrase attachment is exemplified by the following 
correspondences:
Count  English noun phrase  Count  French noun phrase
16     Secretary of State   20     secrétaire d' Etat
16     Secretary of State   19     Affaires extérieures
16     External Affairs     19     Affaires extérieures
16     External Affairs     20     secrétaire d' Etat
Table 5
The correct English and French noun phrases 
are "Secretary of State for External Affairs" and
"secrétaire d' Etat aux Affaires extérieures". If
prepositional phrases involving "for" and "à" were
also permitted, these phrases would be correctly 
identified; however many other adverbial preposi- 
tional phrases would also be incorrectly attached 
to noun phrases. 
If all embedded prepositional phrases were permitted
by the noun phrase recognizer, the algorithm
could be used to reduce the degree of ambiguity
between alternatives. Consider a sequence
np_e pp_e of an unembedded English noun phrase
np_e followed by a prepositional phrase pp_e, and
likewise a corresponding French sequence np_f pp_f.
Possible interpretations of this are:
1. The prepositional phrase attaches to the noun 
phrase in both languages. 
2. The prepositional phrase attaches to the noun 
phrase in one language and does not in the 
other. 
3. The prepositional phrase does not attach to 
the noun phrase in either language. 
If the prepositional phrases attach to the noun 
phrases in both languages, they are likely to be 
repeated in most instances of the noun phrase; it 
is less likely that the same prepositional phrase 
will be used adverbially with each instance of the 
noun phrase. This provides a heuristic method 
for reducing ambiguity in noun phrases that oc- 
cur several times. The only modifications required 
to the algorithm are that the additional possible 
noun phrases and correspondences between them 
must be included. Given thresholds on the num- 
ber of occurrences and the probability of the cor- 
respondence, the most likely correspondence can 
be predicted. 
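The repetition heuristic can be illustrated in a simplified, single-language form. This sketch is not the proposed modification itself, which adds the extra candidate noun phrases to the correspondence algorithm; it only measures how consistently a prepositional phrase follows a given noun phrase. The function names, the threshold values, and the toy instances are all hypothetical.

```python
from collections import Counter


def attachment_rates(instances):
    """instances: list of (noun phrase, following PP or None).
    Returns {(np, pp): (count, fraction of np instances followed by pp)}."""
    np_totals = Counter(np for np, _ in instances)
    pair_counts = Counter((np, pp) for np, pp in instances if pp is not None)
    return {
        pair: (n, n / np_totals[pair[0]]) for pair, n in pair_counts.items()
    }


def likely_attachments(instances, min_count=3, min_rate=0.8):
    """Keep (np, pp) pairs that recur often enough to suggest attachment."""
    rates = attachment_rates(instances)
    return sorted(
        pair
        for pair, (n, rate) in rates.items()
        if n >= min_count and rate >= min_rate
    )


# Toy instances, invented for the illustration.
instances = [("Secretary of State", "for External Affairs")] * 4 + [
    ("Bill", "before Christmas"),
    ("Bill", None),
    ("Bill", "in the House"),
    ("Bill", None),
]
print(likely_attachments(instances))
```

A prepositional phrase that accompanies every instance of a noun phrase passes both thresholds, whereas one that appears once among several instances is more plausibly adverbial and is rejected.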
Including Context: In the algorithm, cor- 
respondences between source and target noun 
phrases are considered irrespectively of other cor- 
respondences in an alignment. This does not make 
the best use of the information available, and can 
be improved upon. For example, consider the fol- 
lowing alignment: 
Ei: "The Bill was introduced just before
Christmas."
Fi: "Le projet de loi a été présenté juste avant le
congé des Fêtes."
Here it is assumed that there are many instances
of the correspondence "Bill" and "projet de loi",
but only one instance of "Christmas" and "congé
des Fêtes". This suggests that "Bill" corresponds
to "projet de loi" with a high probability and
that "Christmas" likewise corresponds strongly to
"congé des Fêtes". However, the model will assert
that "Christmas" corresponds to "projet de loi"
and to "congé des Fêtes" with equal probability,
no matter how likely the correspondence between
"Bill" and "projet de loi".
The model can be refined to reflect this situ- 
ation by considering the joint probability that a 
target npF(t) corresponds to a source npE(s) and
all the other possible correspondences in the align- 
ment are produced. This situation is very similar 
to that involved in training HMM text taggers, 
where joint probabilities are computed that a par- 
ticular word corresponds to a particular part-of- 
speech, and the rest of the words in the sentence 
are also generated (e.g. \[Cutting et al., 1992\]). 
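The limitation of the basic model in the "Christmas" example can be checked by hand from Equations (1) to (3) of Figure 2. The working below assumes that the recognizer finds exactly two noun phrases in the French sentence, and that "Christmas" occurs in no other alignment.

```latex
% s = "Christmas", t_1 = "projet de loi", t_2 = "congé des Fêtes".
% "Christmas" occurs in a single alignment, for which \phi(F_i) = 2.
\begin{align*}
C_0(s,t_1) &= C_0(s,t_2) = \tfrac{1}{2}
  && \text{Equation (1)} \\
P_0(s,t_1) &= P_0(s,t_2) = \frac{1/2}{1/2 + 1/2} = \tfrac{1}{2}
  && \text{Equation (2)} \\
C_r(s,t_1) &= P_{r-1}(s,t_1) = \tfrac{1}{2}, \quad
C_r(s,t_2) = P_{r-1}(s,t_2) = \tfrac{1}{2}
  && \text{Equation (3), for all } r > 0
\end{align*}
```

The two estimates therefore remain equal at every iteration, since nothing in Equations (1) to (3) refers to the strength of the correspondence between "Bill" and "projet de loi".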
CONCLUSION 
The algorithm described in this paper provides a 
practical means for obtaining correspondences be- 
tween noun phrases in a bilingual corpus. Lin- 
guistic structure is used in the form of noun phrase 
recognizers to select phrases for a stochastic model 
which serves as a means of minimizing errors due 
to the approximations inherent in the correspon- 
dence model. The algorithm is robust, and exten- 
sible in several ways. 

References 
\[Brown et al., 1991a\] P. F. Brown, J. C. Lai, and 
R. L. Mercer. Aligning sentences in parallel cor- 
pora. In Proceedings of the 29th Annual Meeting 
of the Association of Computational Linguis- 
tics, pages 169-176, Berkeley, CA., June 1991. 
\[Brown et al., 1991b\] P. F. Brown, S. A. Della 
Pietra, V. J. Della Pietra, and R. L. Mer- 
cer. Word sense disambiguation using statisti- 
cal methods. In Proceedings of the 29th Annual 
Meeting of the Association of Computational 
Linguistics, pages 264-270, Berkeley, CA., June 
1991. 
\[Brown et al., 1992\] P. F. Brown, S. A. Della 
Pietra, V. J. Della Pietra, J. D. Lafferty, and 
R. L. Mercer. Analysis, statistical transfer, and 
synthesis in machine translation. In Proceedings 
of the Fourth International Conference on The- 
oretical and Methodological Issues in Machine 
Translation, pages 83-100, Montreal, Canada., 
June 1992. 
\[Church and Gale, 1991\] K. W. Church and 
W. A. Gale. Concordances for parallel text. In 
Proceedings of the Seventh Annual Conference 
of the UW Center for the New OED and Text 
Research, pages 40-62, September 1991. 
\[Cutting et al., 1992\] D. Cutting, J. Kupiec,
J. Pedersen, and P. Sibun. A practical part- 
of-speech tagger. In Proceedings of the Third 
Conference on Applied Natural Language Pro- 
cessing, Trento, Italy, April 1992. ACL. 
\[Dagan et al., 1991\] I. Dagan, A. Itai, and 
U. Schwall. Two languages are more informa- 
tive than one. In Proceedings of the 29th Annual 
Meeting of the Association of Computational 
Linguistics, pages 130-137, Berkeley, CA., June 
1991. 
\[Dempster et al., 1977\]
A.P. Dempster, N.M. Laird, and D.B. Rubin. 
Maximum likelihood from incomplete data via 
the EM algorithm. Journal of the Royal Statis- 
tical Society, B39:1-38, 1977. 
\[Gale and Church, 1991a\] W. A. Gale and K. W. 
Church. Identifying word correspondences in 
parallel texts. In Proceedings of the Fourth 
DARPA Speech and Natural Language Work- 
shop, pages 152-157, Pacific Grove, CA., Febru- 
ary 1991. Morgan Kaufmann. 
\[Gale and Church, 1991b\] W. A. Gale and K. W. 
Church. A program for aligning sentences in 
bilingual corpora. In Proceedings of the 29th 
Annual Meeting of the Association of Compu- 
tational Linguistics, pages 177-184, Berkeley, 
CA., June 1991. 
\[Kay and Röscheisen, 1988\]
M. Kay and M. Röscheisen. Text-translation
alignment. Technical Report P90-00143, Xerox 
Palo Alto Research Center, 3333 Coyote Hill 
Rd., Palo Alto, CA 94304, June 1988. 
\[Kupiec, 1992\] J. M. Kupiec. Robust part-of-
speech tagging using a hidden Markov model.
Computer Speech and Language, 6:225-242, 
1992. 
\[Smadja, 1992\] F. Smadja. How to compile a 
bilingual collocational lexicon automatically. In 
C. Weir, editor, Proceedings of the AAAI- 
92 Workshop on Statistically-Based NLP Tech- 
niques, San Jose, CA, July 1992. 
