Multilinguality and Reversibility in Computational Semantic 
Lexicons 
Evelyne Viegas and Stephen Beale 
viegas@crl.nmsu.edu and sb,.~crl.nmsu.edu 
Computing Research Laboratory 
Las Cruces, NM 88003 USA 
Abstract 
In this paper, we address the issue of generating 
multilingua.1 computational semantic lexicons from 
analysis lexicons, showing the necessity of rely- 
ing on a conceptual lexicon. We first discuss the 
type of information which should be found in NLP 
lexicons, whatever their use (analysis, generation, 
speech, robotics). We claim that we should take 
advantage of the existing large-scale analysis lex- 
icons and use tliem as the starting point in the 
process of building large-scale generation lexicons, 
by first reversing tliem and then enhancing them. 
Tliis implies having access to a conceptual lexi- 
con, which will serve as a pivot point between the 
analysis and the generation lexicons. We imple- 
mented the work reported here for Spanish and 
English MT projects, within the knowledge-based 
paradigm. From a theoretical point of view, regen- 
erating the source text with the reversed analysis 
lexicon enabled us to enhance several issues as di- 
verse as: evaluating analysis lexicons, testing the 
semantic analyser, evaluating which information 
should be added to the generation lexicon; and 
testing the grain-size of the pivot point between 
analysis and generation. 
Introduction 
There is no consensus on the type of lexicons 
which should be used for generators. It seems 
to depend on the type of generator. It also 
seems to depend on the kind of application in- 
volved: monolingual generation, naultilingual 
generation, machine translation: generation of 
sentences vs. texts vs. speech; or also genera- 
tion from raw data vs. from conceptual repre- 
sentations built with generation in mind. 
Once one has an application in mind, then 
there are three main approaches one can adopt 
to build the lexicon: 
lexicographic: very attractive for NLP ap- 
plications at first sight, as they provide a useful 
description of the vocabulary; entries are dis- 
tinguished on the basis of multiple senses and 
We would like to thank in the mikrokosmos team, 
Tom Herndon, .Jeff Longwel, Oscar Cossio and Javier 
Ochoa. 
subcategorisations. But in practice, this ap- 
proach complicates the process of lexical dis- 
ambiguation for parsing and lexical choice in 
generation by an unjustified proliferation of en- 
tries. 
statistical: very attractive for NLP appli- 
cations as it seems to replace knowledge-based 
approaches and therefore supplant the needs 
for human acquisition of large-scale seman- 
tic lexicons, which is a very time consuming 
task. However, the limits of statistical ap- 
proaches haw: been pointed out by \[Smadja, 
1993\]. Moreover, some phenomenon, such as 
event ellipsis (EE) cannot be handled by a pure 
statistical approach nor by a lexicographic ap- 
proach, as its recovery necessitates a seman- 
tic treatment, (\[Viegas and Nirenburg, 1995a\]), 
which is, in fact, handled by a computalional 
Linguistic approach making use of semantics. 
computational linguistic: the main ad- 
vantage of this approach is that it is usu- 
ally theoretically grounded, and is domain- 
and application-independent. Moreover, it can 
handle phenomena which are out of the reach 
of other approaches and yet are necessary to 
enhance lexical choice in generation. For in- 
stance, the EE triggered by enjoy as in i) \[ 
enjoyed the salmon very much, must be mod- 
eled with a semantic representation so that its 
recovery can be taken care of as in ii) I enjoyed 
eating .... This is part of lex.ical choice as one 
can choose to realise the synthetic version or 
the analytic version of the EE, as exempli- 
fied in i) and ii) respectively (cf. \[Viegas and 
Nirenburg, 1995a\]). 
However necessary, adopting a linguistic ap- 
proach is a difficult task, as one of the main 
drawbacks of this approach is that it is time 
consuming as far as the building of the lexicon 
is concerned. We address in next section how 
to bypass this drawback. 
49 
A Multi-purpose Knowledge Base 
Since building computational semantic lexi- 
coas is a very time-consuming task, we should 
aim at lexicons which conform to the three fol- 
lowing conditions: 
a multi-lingual: French, Engfish, 
Japanese, Russain, Spanish, etc..., (format of 
the lexicon) 
b - multi-rr'~d'._: .'_'" ..... :... 1~ .... ;~ 
tic information for natural language process- 
ing, phonological information, essentially for 
speech recognition and production, (structure 
of the lexicons) 
c - multi-use: so that they can be used 
for analysis, generation (mono/multi-lingual), 
MT, or speech processing. (reversibility of the 
lexicons) 
The way we organised and structured our 
lexicons directly follows these conditions. 
Large-scale computational generation lexicons 
carrying semantic information are indeed not 
that common, the obvious reason being that 
acquiring semantic information is a difficult 
and time consuming task. However, it is by 
no means an unattainable task, if we structure 
and organise our analysis lexicons in such a 
way so that the information they contain can 
be used at best for building generation lexi- 
cons. 
Acquiring a large-scale lexicon is very expen- 
sive, which is why building lexicons that are 
reusable for other domains or applications is 
recommended. It is well known in computa- 
tional lexical semantics that a sense enumer- 
ation approach only based on subcategorisa- 
tion differences is computationally expensive 
and unrealistic from a theoretical viewpoint. 
Our lexicons are composed of superentries, 
where each entry consists of a list of words, 
stored there independently of their part of 
speech (the verb and noun form of walk are 
under the same superentry), as described in 
length in \[Onyshkevych and Nirenburg, 1994\]. 
Reversing an Analysis Lexicon 
Before addressing the issue of reversing the 
analysis lexicon, we want first to show how we 
could acquire a large-scale analysis lexicon. 
Acquisition of the Analysis Lexicon 
We acquired a Spanish semantic lexicon of 
about 40,000 word meanings, for an MT 
Project, described in \[Beale et al., 1995\]. We 
automated as much as possible the task of ac- 
quisition by providing the lexicographers with 
access to on-fine dictionaries, on-fine corpora, 
and also software allowing lexicographers to ac- 
cess all this on fine information in an easy way 
(see \[Viegas and Nirenburg, 1995b\] for the task 
of acquisition). Our interfaces have been de- 
signed with respect to users needs, and con- 
tinue to evolve on a needed basis. 
We give below the example of the partial en- 
try cornpafii'a in Spanish, with two different 
marmings, represented by the following con- 
cepts in our world model (or ontology): COR- 
PORATION, INTEIrtACT-SOCIALLY. One impor- 
tant point here to notice is our transcatego- 
rim approach. There is no one-to-one map- 
ping between semantic categories or concepts 
and lexical items, and some EVENTS, (such as 
INTERACT-SOCIALLY here) can be lexicalised 
as nouns (Figure 1) 1 or verbs such as in 
acorn ~afiar: 
"cornpafi{a-N 1 
syn: 
~em: 
compafi{a-N2 
syn: 
sem: 
corpora,ion\] 
\[r°°t: @ \[::tm: ~\] \] 
\[\[~\[~\] interact-socially\] 
Figure l: Sense Entries for the Spanish lexical item 
compa~i'a. 
Let us now consider some of the entries for 
the Spanish verb adquirir with the following 
corresponding semantics: ACQUIRE, LEARN, 
displayed in (Figure 2). 
The sub-entries for adquirir have different 
selectional restrictions for the theme, OBJECT 
and INFORMATION for ACQUIRE and LEARN re- 
spectively. 
We have acquired about 1/5 of our lexi- 
con semi-automatically and have developed a 
morpho-semantic acquisition program, which 
has allowed us to acquire the remaining 4/5 
entirely automatically to create at the end a 
large-scale lexicon of about 40000 word senses. 2 
The main advantage of our approach is that it 
enabled us to economically multiply the size of 
the lexicon. The main drawback is that the en- 
tries produced automatically need some semi- 
manual checking. We bypass this drawback by 
1We use the typed feature structures (tfs) as de- 
scribed in \[Pollard and Sag, 1987)9. 
2See \[Viegas et al., 1996\] which describes the ad- 
vantages and drawbacks of using lexical rules to build 
lexicons. 
50 
"adquirir-V 1 
syn: 
sem: 
adquirir-V2 
"root: \[\] 
• \[cat: ~Ii_i\]P \] 
subj: \[B lsem: ~\]PJ ,cat: 
obj: \[\] \[sem: 
acquire 
agent: \[~\] humav 
theme: \[~\] object 
"root: \[\] 
sub j: \[\] |sere: sun: 
,obj: NLsem: 
learn 
sem: agent: ~ human 
theme: \[Z\] information 
Figure 2: Sense Entries for the Spanish lexical item 
adquirir. 
using the reversed lexicon to regenerate the en- 
try as explained below. 
The Reversed Lexicon 
The algorithm to "reverse" the analysis lexicon 
(AL) to produce the generation lexicon (GL) 
mainly involves rearranging, modifying, delet- 
ing, and adding certain items. We focus below 
on the zones which have been reversed, namely: 
SYN (subcategorisation information) and SEN\[ 
(providing the semantic information with as- 
sociated selectional restrictions), as shown in 
( Figure 3). 
"corporation-C1 
syn: \[root: 
sere: 
corporation-C2 
syn: \[root: 
sere: 
, \[cat: 
• \[cat: 1 
Figure 3: Partial Entry for the concept CORPORA- 
TION. 
Our transcategorial approach to sense dis- 
crimination is a good basis for paraphrasing, 
thus the concept ACQUIRE from the ontol- 
ogy, can be lexicafised in our Spanish lex- 
icon, at least in: adquirir, obtener, con- 
seguir (verbs), adquisicidn, obtcnciSn, en- 
riquecimiento (nouns), codicioso (adjective). 
We only show partial entries for superentry of 
the concept ACQUIIIE, as shown in (Figure 4). 
-acquire-C 1 
syn: 
sem: 
acquire-C2 
syn: 
sem: 
tEE\] \]cat: 
aspect{tefic: yes\]\] 
• root: adquirir \[C~ \[se 
• \[cat: NF 
subj: \[\] |sere: fie 
k 
• \[cat: NP\] 
°bJ: ~\[sem: ~\]J 
0~ 
agent: \[55\] human 
theme: ~ object 
Figure 4: Partial Entry for the concept ACQUIRE. 
We are now in the phase of enhancing the re- 
versed lexicon for producing the Spanish and 
Engfish generation lexicons: namely, we are en- 
coding information which is specific to the pro- 
cess of generation and which can be avoided in 
an analysis lexicon, such as word order (in Adj- 
noun constructions), and collocational infor- 
mation acquired semi-automatically fi'om cor- 
pora ( hacer una adquisici6n). 
Moreover, with this technique, we can pro- 
duce multifingual generation lexicons by lex- 
icafising the concepts of the reversed lexicons 
in different languages, this ensures that we will 
have a lexical item or phrase for lexicallsation 
available. 
Another advantage of reversing an analysis 
lexicon is using it to regenerate the same text 
that was parsed to gain some insight into the 
issue of the pivot point between parsing and 
generation, and as a resnlt of this, what is the 
best input for generation. 
Advantages of a Reversed Lexicon 
A reversed lexicon has advantages beyond its 
practical use in generation. We have identified, 
and in some cases begun work on the following 
areas: 
Evaluation of semantic analysis• With 
a reversed lexicon that is based on the original 
analysis lexicon, it is possible to take the out- 
put semantic representations from the analyser 
and submit them to a text generator. The out- 
put surface structures can then be compared 
to the input text. Apart from this, evaluation 
of semantic analyses can be difficult because 
51 
it involves reading and understanding complex 
meaning representations. 
Evaluating Text Meaning Representa- 
tion language. For example, the granularity 
of semantic representation can be studied. Is 
the representation precise enough to correctly 
translate all meaning components, or is a spe- 
cific source term mapped into a generalised one 
from which the original meaning cannot be re- 
covered? This will be especially helpful in a 
multilingual environment where meaning com- 
ponents might be bundled differently. 
Testing lexicon entries. We have devel- 
oped a suite of tools to help in testing the anal- 
ysis lexicon, to ensure the high-quality of our 
large-scale lexicon. These tools range in com- 
plexity from checking placement of parentheses 
to automatically creating sentences to test in- 
dividual lexicon entries. For the latter, having 
a reversed lexicon available is extremely help- 
ful. For example, a simple lexicon entry for the 
English word read might look like: 
READ-V 1 
syn-struc : 
root : read 
Subj : VARI 
0BJ : VAR2 
sem-struc : 
READ 
AGENT: VARi = HUMAN 
THEME: VAR2 = BOOK 
We can then use the reversed lexicon to 
generate sentences that conform to the given 
syn-struc but substitute appropriate words or 
phrases in place of the variables. Sentences 
such as the following would signal problems: 
The book read John. 
John read into the book. 
John read the cheese. 
This is especially helpful for automatically 
generated lexicon entries such as nominalisa- 
tions, which are created from verbal entries 
using lexical rules. Many thousands of such 
entries have been created; tools such as this 
provide a simple way to check their accuracy. 
Conclusion 
In this paper we argued that, although lexico- 
graphic and statistical approaches have their 
place in natural language processing, compu- 
tational semantic lexicons are necessary for a 
wide range of phenomena and are applicable 
to a number of purposes. Unfortunately, large- 
scale acquisition of computational lexicons is 
difficult. Compounding this problem is the fact 
that analysis systems require different informa- 
tion than generation lexicons. 
We have developed a method that enables 
us to take advantage of the large investment 
made in the Mikrokosmos analysis lexicon. We 
outlined a relatively simple process for revers- 
ing analysis lexicons for eventual use in gener- 
ation. This process transfers relevant informa- 
tion and re-indexes it according to the needs 
of generation. The process is not perfect; some 
information required in generation, such as col- 
locational constraints, is not typically recorded 
in analysis lexicons. Nevertheless, the methods 
described here provide a baseline to which ad- 
ditional information can be added. 
Creating these reversed lexicons has pro- 
duced a number of additional advantages be- 
yond those originally envisioned, mostly in the 
area of testing and evaluating the semantic 
analysis system. By back-translating the re- 
sults of semantic analysis, evaluation is simpli- 
fied. Testing the content of individual lexicon 
entries can also be made easier by generating 
sample sentences that conform to them. And 
finally, theoretical issues concerning the con- 
tent of analysis lexicons, generation lexicons 
and the text meaning representation language 
can be more fully investigated. 

References 
Beale, S,, Nirenburg, S. and Mahesh. K. 
(1995) Semantic Analysis in the Mikrokosmos 
Machine Translation Project. In Plvceedings of 
the 2nd Symposium on NLP, Bangkok, Thai- 
land. 
Onyshkevych, B. et S. Nirenburg (1994) The 
Lexicon in the Scheme of KBMT Things. Tech- 
nical Report MCCS-94-277, CRL, NMSU. 
Smadja, F. (1993) Retrieving Collocations 
from Texts: Xtract. In Computational Lin- 
guistics, i9(1). 
Viegas, E. and Nirenburg, S. (1995a) The 
Semantic Recovery of Event Ellipsis: its Com- 
putational Treatment. In Proceedings of the 
Workshop "Context in Natural Language Pro- 
cessing", of (IJCAI95), Montr4al, Qu4bec. 
Viegas, E. and Nirenburg, S. (1995b) Acqui- 
sition semi-automatique du lex.ique. Proceed- 
ings of LTT, Lyon 95, France. 
Viegas, E., Onyshkevych, B., Raskin, V., 
Nirenburg, S. (1996) From Submit to Submit- 
ted via Submission: on Lexical Rules in Large- 
scale Lexicon Acquisition. In ACL'96, Santa- 
Cruz, California. 
