DUAL-CODING THEORY AND CONNECTIONIST LEXICAL 
SELECTION 
Ye-Yi Wang* 
Computational Linguistics Program 
Carnegie Mellon University 
Pittsburgh, PA 15232 
Internet: yyw@cs.cmu.edu 
Abstract 
We introduce the bilingual dual-coding theory as a 
model for bilingual mental representation. Based on 
this model, lexical selection neural networks are imple- 
mented for a connectionist transfer project in machine 
translation. 
Introduction 
We believe that psycholinguistic knowledge can greatly
help in constructing an artificial language processing
system. For machine translation, we should take
advantage of our understanding of (1) how languages
are represented in the human mind; (2) how the
representation is mapped from one language to another;
and (3) how the representation and the mapping are
acquired by humans.
The bilingual dual-coding theory (Paivio, 1986) 
partially answers the above questions. It depicts the 
verbal representations for two different languages as 
two separate but connected logogen systems, charac- 
terizes the translation process as the activation along 
the connections between the logogen systems, and at- 
tributes the acquisition of the representation to some 
unspecified statistical processes. 
We have explored an information theoretical neu- 
ral network (Gorin and Levinson, 1989) that can ac- 
quire the verbal associations in the dual-coding theory. 
It provides a learnable lexical selection sub-system for 
a connectionist transfer project in machine translation.
Dual-Coding Theory 
There is a well-known debate in psycholinguistics
concerning bilingual mental representation. The
independence position assumes that bilingual memory is
represented by two functionally independent storage
and retrieval systems, whereas the interdependence
position hypothesizes that all information about the
two languages exists in a common memory store. Studies
on cross-language transfer and cross-language priming
have provided evidence for both hypotheses (de Groot
and Nas, 1991; Lambert et al., 1958).
*This work was partly supported by ARPA and ATR
Interpreting Telephony Research Laboratories.
Dual-coding theory explains the coexistence of
independent and interdependent phenomena with separate
but connected structures. The general dual-coding
theory hypothesizes that humans represent language
with dual systems: the verbal system and the imagery
system. The elements of the verbal system are
logogens for words in a language. The elements of
the imagery system, called "imagens", are connected
to the logogens in the verbal system via referential
connections. Logogens in a verbal system are also
interconnected by associative connections. The
bilingual dual-coding theory proposes an architecture in
which a common imagery system is connected to two
verbal systems, and the two verbal systems are
connected to each other via associative connections
(Figure 1). Unlike the within-language associations,
which are rich and diverse, these between-language
associations primarily involve translation-equivalent
terms that are frequently experienced together. The
interconnections among the three systems explain the
interdependent functional behavior. On the other hand,
the different characteristics of within-language and
between-language associations account for the
independent functional behavior.
Based on the above structural assumptions,
dual-coding theory proposes a parallel set of processing
assumptions. Activation of connections between
referentially related imagens and logogens is called
referential processing. Naming objects and forming
images in response to words are prototypical examples.
Activation of associative connections between logogens
is called associative processing. Lexical translation
is an example of associative processing between two
languages.
Connectionist Lexical Selection 
Lexical Selection 
Lexical selection is the task of choosing target lan- 
guage words that accurately reflect the meaning of the 
corresponding source language words. It plays an
important role in machine translation (Pustejovsky and
Nirenburg, 1987).
[Figure 1: Bilingual Dual-Coding Representation. The L1
verbal system (V1 association network) and the L2 verbal
system (V2 association network) are linked to a common
imagery system through V1-I and V2-I connections.]
A common lexical selection practice involves
an intermediate representation. The system first
disambiguates the source language words into entities
of the intermediate representation, then maps these
entities to target lexical entries. The intermediate
representation may be a Lexical Conceptual Structure
(Dorr, 1989) or an interlingua (Nirenburg et al., 1987).
This engineering approach requires great effort in
designing the representation and the mapping rules.
There are also current efforts in statistical lexical
selection. A target language word $W_t$ can be selected
according to the posterior probability
$\Pr(W_t \mid W_s)$ given the source language word
$W_s$. Several target language lexical entries may be
selected for a single source language word; the correct
selections are then identified by the language model of
the target language (Brown et al., 1990). This approach
is learnable, but its accuracy is low. One reason is
that it does not use any structural information about
the language.
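As a rough illustration of the statistical approach just
described, the following Python sketch first proposes
candidate target words by their posterior probability given
the source word and then lets a bigram language model choose
among the candidates. The probability tables, the floor
value, and all function names are hypothetical and invented
for this example; they are not taken from Brown et al.
(1990), whose model is considerably richer.

```python
import itertools
import math

# Hypothetical translation table: Pr(target word | source word).
TRANSLATION = {
    "I": {"ich": 0.9, "mir": 0.1},
    "register": {"anmelden": 0.6, "registrieren": 0.4},
}

# Hypothetical bigram language model: Pr(word | previous word).
BIGRAM = {
    ("<s>", "ich"): 0.5,
    ("<s>", "mir"): 0.05,
    ("ich", "anmelden"): 0.2,
    ("ich", "registrieren"): 0.02,
    ("mir", "anmelden"): 0.01,
    ("mir", "registrieren"): 0.01,
}
FLOOR = 1e-6  # assumed probability for bigrams missing from the table


def candidates(source_word, k=2):
    """Return the k target words with the highest Pr(target | source)."""
    table = TRANSLATION[source_word]
    return sorted(table, key=table.get, reverse=True)[:k]


def lm_logprob(words):
    """Log-probability of a target word sequence under the bigram model."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += math.log(BIGRAM.get((prev, word), FLOOR))
        prev = word
    return total


def select(source_words, k=2):
    """Pick, among all candidate combinations, the one the LM prefers."""
    options = [candidates(word, k) for word in source_words]
    return max(itertools.product(*options), key=lm_logprob)
```

With these toy tables, `select(["I", "register"])` prefers
the pair (ich, anmelden). The sketch uses only word-level
statistics and no structural information, which is the
limitation noted above.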
In the following subsections, we propose
information-theoretical networks based on the bilingual
dual-coding theory for lexical selection.
Information-Theoretical Networks 
An information-theoretical network is a neural network
formalism that can learn associations between two
layers of representations. The associations are
obtained statistically from the network's experiences.
An information-theoretical network has two layers.
Each unit of a layer represents an element in the
input or output of a training pattern, which might be a
logogen or a word. Units in different layers are
connected. The weight of the connection between unit $i$
in one layer and unit $j$ in the other layer is set to
the mutual information between the elements
represented by the two units:
(1) $w_{ij} = I(v_i, v_j) = \log \frac{\Pr(v_j \mid v_i)}{\Pr(v_j)}$ ¹
Each layer also contains a bias unit, which is always
activated. The weight of the connection between
the bias unit in one layer and unit $j$ in the other layer is
(2) $w_{0j} = \log \Pr(v_j)$
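The two kinds of weights can be estimated from co-occurrence
counts over the training samples. The Python sketch below is
one possible way to do this, assuming plain relative-frequency
estimates without smoothing; the function name, the toy
samples, and the decision to create no connection for pairs
that never co-occur are assumptions of the sketch, not details
given in the text.

```python
import math
from collections import Counter

# Hypothetical training samples: (active input items, output item).
SAMPLES = [
    (["I", "register", "conference"], "anmelden"),
    (["I", "register"], "anmelden"),
    (["I", "pay", "fee"], "bezahlen"),
    (["you", "pay"], "bezahlen"),
]


def train_weights(samples):
    """Estimate mutual-information weights and bias weights from counts."""
    n = len(samples)
    in_count, out_count, pair_count = Counter(), Counter(), Counter()
    for inputs, output in samples:
        out_count[output] += 1
        for item in set(inputs):  # count each input once per sample
            in_count[item] += 1
            pair_count[(item, output)] += 1

    # Bias weight: log Pr(output).
    biases = {out: math.log(c / n) for out, c in out_count.items()}

    # Connection weight: log( Pr(output | input) / Pr(output) ).
    # Pairs that never co-occur get no connection (assumption).
    weights = {}
    for (item, out), c in pair_count.items():
        p_out_given_in = c / in_count[item]
        p_out = out_count[out] / n
        weights[(item, out)] = math.log(p_out_given_in / p_out)
    return weights, biases
```

Because everything is derived from counts, adding a new sample
only requires incrementing the counters and recomputing the
affected logarithms.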
Both the information-theoretical network and the
back-propagation network compute posterior
probabilities for an association task (Gorin and
Levinson, 1989; Robinson, 1992). However, only the
information-theoretical network is isomorphic to the
directly interconnected verbal systems in the
dual-coding theory. In addition, an information-theoretical
network has the following advantages: (1) it learns
fast, since the network can be trained in a single pass
without gradient descent; (2) it is adaptive, since it
can incrementally adapt to new experiences simply by
adding new data to the training samples and updating
the associations according to the changed statistics.
These properties make the network more psychologically
plausible.
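One way to see why summing mutual-information weights and a
bias weight yields a posterior is the short derivation below.
It assumes that the active input units, indexed by a set $A$,
are conditionally independent given the output event $v_j$,
and it ignores inactive units; both simplifications are
assumptions made here for illustration and are not stated in
the text.

```latex
\begin{align*}
\Pr\bigl(v_j \mid \{v_i\}_{i \in A}\bigr)
  &= \frac{\Pr(v_j)\,\prod_{i \in A} \Pr(v_i \mid v_j)}
          {\Pr\bigl(\{v_i\}_{i \in A}\bigr)} \\[4pt]
\log \Pr\bigl(v_j \mid \{v_i\}_{i \in A}\bigr)
  &= \log \Pr(v_j)
   + \sum_{i \in A} \log \frac{\Pr(v_i \mid v_j)}{\Pr(v_i)}
   + C \\
  &= w_{0j} + \sum_{i \in A} w_{ij} + C,
\qquad
C = \sum_{i \in A} \log \Pr(v_i)
    - \log \Pr\bigl(\{v_i\}_{i \in A}\bigr).
\end{align*}
```

The last step uses the symmetry of mutual information,
$\Pr(v_i \mid v_j)/\Pr(v_i) = \Pr(v_j \mid v_i)/\Pr(v_j)$.
The term $C$ does not depend on $j$, so the most active output
unit is also the one with the highest posterior under these
assumptions; $C$ vanishes if the inputs are also marginally
independent.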
Lexical Selection as an Associative Process 
We tried to map source language f-structures to target
language f-structures in a connectionist transfer project
(Wang and Waibel, 1994). Functionally, there were two
sub-tasks: (1) finding the target sub-structures, their
phrasal categories and their corresponding source
structures; (2) finding the head of a target structure.
The second sub-task is a problem of lexical selection.
It was first implemented with a back-propagation network.
We replaced the back-propagation networks for 
lexical selection with information-theoretical networks 
simulating the associative process in the dual-coding 
theory. The networks have two layers of units. Each 
source (target) language lexical item is represented by 
a unit in the input (output) layer. One network is con- 
structed for each phrasal category (NP, VP, AP, etc.). 
The networks work in the following way. For a
target-language f-structure to be generated, the transfer
system knows its phrasal category and its corresponding
source-language f-structure from the networks that
perform sub-task 1. It then activates the lexical
selection network for that phrasal category with the input
units that correspond to the heads of the source
language f-structure and its sub-structures. Through the
connections between the two layers, the output units
are activated, and the lexical item that corresponds to
the most active output unit is selected as the head of
the target f-structure. The following example
illustrates how the system selects the head anmelden for
the German XCOMP sub-structure when it performs the
transfer from
[sentence [subj I] would [xcomp [subj I] like [xcomp [subj
I] register [pp-adj for the conference]]]] to
[sentence [subj Ich] werde [xcomp [subj Ich] [adj gerne]
anmelden [pp-adj fuer der Konferenz]]] ².
¹ Here $v_i$ denotes the event that unit $i$ is activated.
Since the structure networks find that there is a
VP sub-structure of XCOMP in the target structure
whose corresponding input structure is [xcomp [subj
I] register [pp-adj for the conference]], the system
activates the VP lexical selection network's input units
for I, register and conference. After the activation
propagates via the associative connections, the unit for
anmelden is the most active output unit. Therefore,
anmelden is chosen as the head of the xcomp sub-structure.
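The selection step in this example can be sketched in a few
lines of Python. The weights and biases below are hypothetical
numbers chosen only to reproduce the outcome described above;
treating a missing connection as a weight of zero is also an
assumption of the sketch rather than something the text
specifies.

```python
# Hypothetical connection weights of a VP lexical selection network,
# keyed by (source lexical item, target lexical item).
WEIGHTS = {
    ("I", "anmelden"): 0.1,
    ("register", "anmelden"): 1.2,
    ("conference", "anmelden"): 0.8,
    ("I", "bezahlen"): 0.1,
    ("conference", "teilnehmen"): 0.9,
}

# Hypothetical bias weights, one per output unit.
BIASES = {"anmelden": -1.5, "bezahlen": -1.6, "teilnehmen": -2.0}


def select_head(active_inputs, weights, biases):
    """Propagate activation from the active input units and return the
    most active output unit together with all activation scores."""
    scores = {}
    for out, bias in biases.items():
        # Missing connections contribute 0.0 (assumption of this sketch).
        scores[out] = bias + sum(
            weights.get((item, out), 0.0) for item in active_inputs
        )
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `select_head(["I", "register", "conference"], WEIGHTS,
BIASES)` with these toy values makes the unit for anmelden the
most active one, mirroring the example in the text.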
Preliminary Result 
The domain of our work was the Conference
Registration Telephony Conversations. The lexicon for
the task contained about 500 English and 500 German
words. There were 300 English/German f-structure pairs
available from other research tasks (Osterholtz et al.,
1992). A separate set of 154 sentential f-structures was
used to test the generalization performance of the
system. The testing data were collected for an
independent task (Jain, 1991).
From the 300 sentential f-structure pairs, every
German VP sub-structure was extracted and labeled with
its English counterpart. The English counterpart's head
and its immediate sub-structures' heads serve as the
input in a sample of VP association, and the German
f-structure's head becomes the output of the association.
For the above example, the association ([input I,
register, conference] [output anmelden]) is a sample drawn
from the f-structures for the VP network. The training
samples for all the other networks were created in the
same way.
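The sample construction described above can be illustrated
with a small Python sketch. The dictionary layout used for
f-structures (keys "cat", "head" and "subs"), the function
names, and the assumption that each German sub-structure
already comes paired with its English counterpart are all
hypothetical choices made for this example.

```python
from collections import defaultdict


def make_sample(english, german):
    """Build one association sample from an aligned sub-structure pair:
    the English head plus the heads of its immediate sub-structures
    form the input; the German head is the output."""
    inputs = [english["head"]]
    inputs += [sub["head"] for sub in english.get("subs", [])]
    return inputs, german["head"]


def group_by_category(aligned_pairs):
    """Collect samples per phrasal category of the German sub-structure,
    so that one network can be trained for each category."""
    samples = defaultdict(list)
    for english, german in aligned_pairs:
        samples[german["cat"]].append(make_sample(english, german))
    return dict(samples)


# Simplified, hypothetical encoding of the XCOMP pair from the example.
EN_XCOMP = {
    "cat": "VP",
    "head": "register",
    "subs": [
        {"cat": "NP", "head": "I", "subs": []},
        {"cat": "PP", "head": "conference", "subs": []},
    ],
}
DE_XCOMP = {
    "cat": "VP",
    "head": "anmelden",
    "subs": [
        {"cat": "NP", "head": "Ich", "subs": []},
        {"cat": "ADJ", "head": "gerne", "subs": []},
        {"cat": "PP", "head": "Konferenz", "subs": []},
    ],
}
```

For this pair, `make_sample(EN_XCOMP, DE_XCOMP)` yields the
inputs register, I and conference with the output anmelden,
which matches the sample given in the text up to the order of
the inputs.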
On the training data, the accuracy of our system with
information-theoretical network lexical selection is
lower than that of the system with back-propagation
networks (around 84% versus around 92%). However, its
generalization performance on unseen inputs is better
(around 70% versus around 62%). The
information-theoretical networks do not over-learn the
way the back-propagation networks do. This is
partially due to the reduced number of free parameters
in the information-theoretical networks.
Summary 
The lexical selection approach discussed here has two
advantages. First, it is learnable: little human effort
on knowledge engineering is required. Second, it is
psycholinguistically well-founded in that the approach
adopts a local activation processing model instead of
relying on symbol passing, as symbolic systems usually
do.
² The f-structures are simplified here for the sake of
conciseness.

References 
P. F. Brown et al. A statistical approach to machine
translation. Computational Linguistics, 16(2):73-85,
1990.
A. M. de Groot and G. L. Nas. Lexical representation
of cognates and noncognates in compound bilinguals.
Journal of Memory and Language, 30(1), 1991.
B. J. Dorr. Conceptual basis of the lexicon in ma- 
chine translation. Technical Report A.I. Memo 
No. 1166, Artificial Intelligence Laboratory, MIT, 
August, 1989. 
A. L. Gorin and S. E. Levinson. Adaptive acquisition of 
language. Technical report, Speech Research De- 
partment, AT&T Bell Laboratories, Murray Hill, 
1989. 
A. N. Jain. Parsec: A connectionist learning archi- 
tecture for parsing spoken language. Technical 
Report CMU-CS-91-208, Carnegie Mellon Uni- 
versity, 1991. 
W. E. Lambert, J. Havelka and C. Crosby. The influ- 
ence of language acquisition contexts on bilingual- 
ism. Journal of Abnormal and Social Psychology, 
56, 1958. 
S. Nirenburg, V. Raskin and A. B. Tucker. The structure
of interlingua in TRANSLATOR. In S. Nirenburg,
editor, Machine Translation: Theoretical and
Methodological Issues. Cambridge University
Press, Cambridge, England, 1987.
L. Osterholtz et al. Janus: a multi-lingual speech
to speech translation system. In Proceedings of
the IEEE International Conference on Acoustics,
Speech and Signal Processing, volume 1, pages
209-212. IEEE, 1992.
A. Paivio. Mental Representations: A Dual Coding
Approach. Oxford University Press, New York,
1986.
J. Pustejovsky and S. Nirenburg. Lexical selection in
the process of language generation. In Proceedings
of the 25th Annual Conference of the Association
for Computational Linguistics, pages 201-206,
Stanford University, Stanford, CA, 1987.
A. Robinson. Practical network design and implemen- 
tation. In Cambridge Neural Network Summer 
School, 1992. 
Y. Wang and A. Waibel. Connectionist transfer in
machine translation. In preparation, 1994.
