Collocations in Multilingual Generation 
Ulrich Heid, Sybille Raab
Universität Stuttgart, Projekt Polygloss
Institut für maschinelle Sprachverarbeitung
Keplerstrasse 17 
D-7000 Stuttgart 1, West Germany 
Abstract 
We present a proposal for the structuring of collocation knowledge¹ in the lexicon of a multilingual generation system and show to what extent it can be used in the process of lexical selection. This proposal is part of Polygloss, a new research project on multilingual generation, and it has been inspired by work carried out in the SEMSYN project (see e.g. [RÖSNER 1988]²).
The descriptive approach presented in this proposal is based on a combination of results from recent lexicographical research and the application of Meaning-Text-Theory (MTT) (see e.g. [MEL'ČUK et al. 1981], [MEL'ČUK et al. 1984]). We first outline the overall structure of the dictionary system that is needed by a multilingual generator; section 2 gives an overview of the results of lexicographical work on collocations and compares them with "lexical functions" as used in Meaning-Text-Theory. Section 3 shows how we intend to integrate collocations in the generation dictionary and how "lexical functions" can be used in generation.
¹ We use the term "collocation" in the sense of [HAUSMANN 1985], referring to constraints on the cooccurrence of two lexeme words; the two elements are not completely freely combined, but one of them semantically determines the other one. Examples are solve a problem, turn dark, expose someone to a risk, etc. For a more detailed definition see section 2.
² Research reported in this paper is supported by the German Bundesministerium für Forschung und Technologie, BMFT, under grant No. 08 B 3116 3. The views and conclusions contained herein are those of the authors and should not be interpreted as positions of the project as a whole.
1 Lexical knowledge for 
multilingual generation 
Within a multilingual generation system, it 
seems necessary to keep the dictionary as 
modular as possible, separating information 
that pertains to different levels of linguistic 
description³. We assume that the system's lexical knowledge is stored in the following types
of "specialized dictionaries":
• semantic: inventory of possible lexicaliza- 
tions of a concept in a given language; 
• syntactic: one inventory of realization
classes per language, providing informa- 
tion about number, type and realization 
of the arguments of a given lexeme; 
• morphological: one inventory of inflec- 
tional classes per language. 
Since none of these levels of description
is completely independent, the dictionaries 
should be linked to each other by means of 
cross-references and reference to class mem- 
bership. Templates and mechanisms allow- 
ing for explicit inheritance of shared proper- 
ties, e.g. redundancy rules, will be used within 
³ For more details on the dictionary structure see [HEID/MOMMA 1989].
each of the layers. These mechanisms give ac- 
cess to the knowledge about the linguistic "be- 
haviour" of lexemes needed in the process of 
lexicalization⁴.
2 Approaches to the descrip- 
tion of collocations 
2.1 Contributions from lexicogra- 
phy 
The tradition of British Contextualism⁵ defines collocations on the basis of statistical assumptions about the probability of the cooccurrence of two lexemes. Particularly frequent combinations of lexical units are regarded as collocations.
A more detailed definition can be found in 
the work of Franz Josef Hausmann (1985:119): 
"One partner determines, another is 
determined. In other words: colloca- 
tions have a basis and a cooccurring 
collocate."⁶
This determination manifests itself in that a given basis does not allow all of the collocates that would be possible according to general semantic cooccurrence conditions, but only a certain subset: in French, retenir son admiration, retenir sa haine, sa joie are possible, but *retenir son désespoir is not.
The choice of collocates depends strongly 
on the lexeme that has been chosen as the ba- 
sis; knowledge about possible collocations can 
be only partly derived from knowledge about 
general semantic properties of lexemes. There- 
fore general cooccurrence rules or selectional 
⁴ Possibly including classifications according to semantically motivated lexeme classes and a modelling of paradigmatic relations between lexemes, such as hyponymy or synonymy.
⁵ The term "collocation" was introduced into linguistic discussion by John R. Firth (1951:94).
⁶ Translation by the authors. We use the terms basis and collocate in the sense of [HAUSMANN 1985]; HAUSMANN's original terms are Basis and Kollokator.
restrictions (e.g. using semantic markers) are 
not adequate for the choice of collocates in the 
process of lexicalization. 
These considerations lead to two propos- 
als for the structuring of the lexical knowledge 
used in a generator: 
• Heuristic for the lexicalization process: 
"First the basis is lexicalized, 
then the collocate, depending 
on which lexeme has been cho- 
sen as the basis." 
• Knowledge about the possibility of combining
lexemes in collocations should be
stored in the lexicalization dictionary 
(where lexicalization candidates for con- 
cepts are provided), and specifically in the 
entries for the bases. 
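The two proposals can be illustrated with a minimal Python sketch. It is not the Polygloss implementation: the dictionary layout, the function name `lexicalize_pair`, and the concept labels such as `*SOLVE*` are assumptions introduced here for illustration only. The sketch stores collocates in the entry of the basis and chooses the basis before the collocate.

```python
# Hypothetical lexicalization dictionary: concept -> candidate bases.
BASES = {"*PROBLEM*": ["problem"], "*PLAN*": ["Plan"]}

# Collocation knowledge is stored in the entries of the bases, keyed by
# the concept the collocate has to express (labels are illustrative).
COLLOCATES = {
    "problem": {"*SOLVE*": ["solve"]},
    "Plan": {"*MAKE*": ["schmieden"]},
}


def lexicalize_pair(basis_concept, collocate_concept):
    """Lexicalize the basis first, then pick a collocate licensed by it."""
    basis = BASES[basis_concept][0]
    licensed = COLLOCATES.get(basis, {}).get(collocate_concept, [])
    collocate = licensed[0] if licensed else None
    return basis, collocate
```

Because the collocate is looked up in the entry of the already chosen basis, a different basis for the same concept could license a different collocate, or none at all.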
The following table shows, in terms of categories⁷, what can be a possible collocate for a particular basis⁸:

basis       possible collocates
noun        noun, verb, adjective
verb        adverb
adjective   adverb
⁷ Unlike British Contextualism (cf. the recent [SINCLAIR 1987]) we assume that bases and collocates are of one of the following categories: noun, verb, adjective or adverb.
⁸ For substantive-verb collocations, the classification as basis and collocate is opposed to the usual syntactic description according to head and modifier; this has consequences for the lexicalization process: while it is usually possible to first lexicalize the heads of phrases, then the modifiers (e.g. substantive[head, basis] < adjective[modifier, collocate]), the choice of verbs depends on their nominal complements (which are modifiers, but which have to be considered as bases of collocations). This means that nouns have to be lexicalized before verbs (e.g. Pläne schmieden, but not *gute Vorsätze schmieden).
2.2 Lexical functions of the 
Meaning-Text-Theory as a tool 
for the description of colloca- 
tions 
In MTT, developed by Mel'čuk and co-workers, there exist about 60 "lexical functions" which describe regular dependencies between lexical units of a language. In MTT, lexical functions are understood as cross-linguistically constant operators (f), whose application to a lexeme ("keyword", L) yields other lexemes (v). Mel'čuk (1984:6), (1988:31f) uses the following notation:
f(L) = v 
The result of the application of a lexi- 
cal function to a given lexeme can be another 
"one-word" lexeme, or a collocation, an idiom 
or even an interjection. 
The parallelism between the collocation 
definition used in this paper and the notion 
of lexical function is that both start from the 
principle that collocates depend upon the re- 
spective bases (in MTT, v is a function of L). 
Therefore lexical functions seem to be a useful 
device for the description of collocations in a 
generation lexicon. 
In the following, we only consider lexical functions which, when applied to a lexeme word, yield collocations⁹; Table 1 gives some examples of such lexical functions, together with a definitional gloss, taken from [STEELE/MEYER 1988]¹⁰:
⁹ It should be investigated to what extent the category of v is predictable for every f, according to the category of L. For instance, fs of groups 1 and 2 specified in Table 1, applied to nouns, yield substantive+verb collocations, those of groups 3 and 4 yield substantive+adjective collocations, and those of groups 5 and 6 return substantive+substantive collocations.
¹⁰ Lexical functions of group 2 normally occur together with those from group 1; ABLE only occurs in combination with other lexical functions.
3 Generating Collocations 
We propose that every lexeme entry in the lex- 
icalization dictionary contains slots for lexical 
functions, whose fillers are possible collocates; 
within a slot/filler notation such as the one used
in Polygloss, a (partial) lexical entry, e.g. for 
problem, could be represented in the following 
way: 
(problem 
(...) 
(caus func (create, pose)) 
(real (solve .... )) 
(...)) 
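As a rough illustration, the same partial entry can be written as a Python dictionary. This is a sketch only: the key names `lexeme` and `lexical_functions` and the helper `collocates` are assumptions made here, not the Polygloss notation, and only the two slots shown above are filled.

```python
# Partial entry for "problem"; slots are keyed by (possibly complex)
# lexical functions, fillers are the possible collocates.
PROBLEM_ENTRY = {
    "lexeme": "problem",
    "lexical_functions": {
        ("CAUS", "FUNC"): ["create", "pose"],
        ("REAL",): ["solve"],
    },
}


def collocates(entry, *function):
    """Return the fillers of the slot for the given lexical function."""
    return entry["lexical_functions"].get(tuple(function), [])
```

A call such as `collocates(PROBLEM_ENTRY, "CAUS", "FUNC")` then returns the fillers of the CAUS FUNC slot, and an unfilled slot yields an empty list.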
It might be possible to predict the types 
of lexical functions applicable to a given lex- 
eme from its membership in a semantic class. 
Syntactic properties of bases and collocates are 
accessible through reference to the realization 
lexicon. 
[MEL'ČUK/POLGUÈRE 1987]:271f themselves stress the advantage of describing collocations with lexical functions within language generation and machine translation: they give the example of OPER(*QUESTION*), realized as
• English ask a question,
• French poser une question,
• Spanish hacer una pregunta and
• Russian zadat' vopros
respectively¹¹.
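The multilingual use of a single lexical function can be sketched as follows. The table layout, the language codes, and the function name `realize_oper` are assumptions made for this illustration; only the four lexeme pairs come from the example above.

```python
# Language-specific lexicalizations of the concept *QUESTION* (the bases).
LEXICALIZATION = {
    "*QUESTION*": {"en": "question", "fr": "question",
                   "es": "pregunta", "ru": "vopros"},
}

# Value of OPER for each basis, stored per language.
OPER = {
    ("en", "question"): "ask",
    ("fr", "question"): "poser",
    ("es", "pregunta"): "hacer",
    ("ru", "vopros"): "zadat'",
}


def realize_oper(concept, lang):
    """Choose the basis for the concept, then the OPER collocate it licenses."""
    basis = LEXICALIZATION[concept][lang]
    return OPER[(lang, basis)], basis
```

The function name OPER stays constant across languages; only the tables that supply its values are language-specific.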
3.1 Lexicon structure and possible 
generalizations 
On the basis of the analysis of some entries
in [MEL'ČUK et al. 1984] and of material we
¹¹ Here *QUESTION* refers to a concept that stands for the language-specific items.
No.  Lexical functions        Meaning             Examples
1.   OPER, FUNC, LABOR        occurrence          OPER(attention) = pay
     REAL, FACT, LABREAL      realization         REAL(promise) = keep
2.   PROX, INCEP, CONT, FIN   phases              INCEP OPER(form) = take
     CAUS, PERM, LIQU         phase + [CAUSE]     CAUS FUNC(problem) = create, pose
3.   MAGN, POS, VER           (high) degree       MAGN(eater) = big, hearty
                                                  VER(praise) = merited
4.   ABLE, QUAL               ability             ABLE2(writing) = readable
5.   MULT, SING               count ↔ mass        MULT(goose) = gaggle
6.   GERM, CULM               germ, culmination   CULM(joy) = height

Table 1: Examples of lexical functions used for the description of collocations
have analysed within Polygloss¹², it seems possible to generalize over some regularities in collocation formation for members of semantically homogeneous lexeme classes.
An example: the following default assumptions can be made for nouns expressing information handled by a computer (we assume semantic classes *I-NOUNS-G* and *I-NOUNS-F* for German and French respectively):

• *I-NOUNS-G* = { Datei, Information, Nachrichten, Verzeichnis }
• *I-NOUNS-F* = { fichier, information, messages, répertoire }

LIQU FUNC0(*I-NOUNS-G*) = löschen
LIQU FUNC0(*I-NOUNS-F*) = supprimer

Some exceptions, however, have to be stated explicitly, as illustrated by the example of French nouns expressing personal attitudes, treated in [MEL'ČUK et al. 1984]:

*PA* = { admiration, colère, désespoir, enthousiasme, envie, étonnement, haine, joie, mépris, respect }

OPER1(*PA*)        = ressentir (SUBJ OBJ), (OBJ PRED) = *PA*
Exception:
OPER1(admiration)  = nourrir (SUBJ OBJ), (OBJ PRED) = "admiration"
OPER1(haine)       = nourrir (SUBJ OBJ), (OBJ PRED) = "haine"
¹² Manuals for PC-Networks that have been provided in machine-readable form in German and French by IBM; cf. [RAAB 1988].
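Class-level defaults with lexeme-level exceptions can be sketched as a two-step lookup. The table names and the function `apply_lf` are hypothetical, and the sketch assumes that an explicitly stated exception replaces the class default rather than adding to it; the paper does not spell this out.

```python
# Semantic classes (French examples from this section).
SEMANTIC_CLASSES = {
    "*PA*": {"admiration", "colère", "désespoir", "enthousiasme", "envie",
             "étonnement", "haine", "joie", "mépris", "respect"},
    "*I-NOUNS-F*": {"fichier", "information", "messages", "répertoire"},
}

# Defaults stated once per (lexical function, semantic class).
CLASS_DEFAULTS = {
    ("OPER1", "*PA*"): ["ressentir"],
    ("LIQU FUNC0", "*I-NOUNS-F*"): ["supprimer"],
}

# Exceptions stated per (lexical function, lexeme).
LEXEME_EXCEPTIONS = {
    ("OPER1", "admiration"): ["nourrir"],
    ("OPER1", "haine"): ["nourrir"],
}


def apply_lf(function, lexeme):
    """Look up lexeme-specific exceptions first, then class defaults."""
    if (function, lexeme) in LEXEME_EXCEPTIONS:
        return LEXEME_EXCEPTIONS[(function, lexeme)]
    for cls, members in SEMANTIC_CLASSES.items():
        if lexeme in members and (function, cls) in CLASS_DEFAULTS:
            return CLASS_DEFAULTS[(function, cls)]
    return []
```

With this layout, a new member of a class inherits the default collocate without any entry of its own, and only the irregular lexemes need individual statements.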
3.2 The generation of paraphrases 
One of the aims in the development of the 
"how-to-say"-component of a generation sys- 
tem is to ensure that variants (i.e. true para- 
phrases) can be generated for one and the same 
semantic structure. 
This involves two types of knowledge: 
more 'static' knowledge about interchangeabil- 
ity of realization variants (synonymous items, 
information about paraphrase relations be- 
tween certain constructions or between col- 
locations) and more 'procedural' knowledge 
about heuristics guiding the choice between 
candidates. The 'static' knowledge should be 
represented declaratively. It can be divided 
into information about syntactic variants (e.g. 
participle form vs. relative clause) and information about lexicalization variants. In
[MEL'ČUK 1988]:38-41 rules are stated, which
express paraphrase relations between certain 
types of collocations. Ideally these rules can 
be set up for pairs of lexical functions, without 
consideration of concrete lexemes. Examples 
are: 
Jean s'est mis en colère contre Paul. (= INCEP OPER1)
John got angry with Paul.
Paul s'est attiré la colère de Jean. (= INCEP OPER2)
Paul angered John.

Jean s'est pris d'enthousiasme pour cette découverte. (= OPER)
John got enthusiastic about this discovery.
(A cause de cette découverte) l'enthousiasme s'est emparé de Jean. (= FUNC)
John was enthused by this discovery.
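Paraphrase rules of this kind, stated over pairs of lexical functions rather than over concrete lexemes, could be stored declaratively as in the following sketch. The rule list contains only the two pairs illustrated above; treating each rule as symmetric and the name `paraphrase_partners` are assumptions of this sketch.

```python
# Declarative paraphrase rules between lexical functions.
PARAPHRASE_RULES = [
    ("INCEP OPER1", "INCEP OPER2"),
    ("OPER", "FUNC"),
]


def paraphrase_partners(function):
    """Return the lexical functions related to `function` by a rule."""
    partners = []
    for left, right in PARAPHRASE_RULES:
        if function == left:
            partners.append(right)
        elif function == right:
            partners.append(left)
    return partners
```

Given a candidate realized through one lexical function, a generator could use such a table to find the functions whose values are paraphrase candidates for the same basis.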
Within a generation system, such descrip- 
tions can be used to state paraphrase rela- 
tions between collocational lexicalization can- 
didates. The choice between candidates de- 
pends on parameters, amongst which the fol- 
lowing ones seem to be essential: 
• syntactic "behaviour" of the lexemes 
building up a collocation¹³
- in relation to roles in the frame struc- 
ture to be realized; 
- in relation to the thematic structure 
of the intended utterance; 
¹³ We plan to investigate to what extent it is possible to describe the syntactic form of certain collocations with general rules. This is possible e.g. for OPER, FUNC, LABOR, i.e. for lexical functions yielding collocations of the type of "Funktionsverbgefüge":
OPER(L)  → verb (SUBJ OBJ ...), (OBJ PRED) = L
FUNC(L)  → verb (SUBJ ...), (SUBJ PRED) = L
LABOR(L) → verb (SUBJ OBJ Y), (Y PRED) = L
• markedness of lexemes (e.g. registers, 
style); 
• general heuristics for text generation (e.g. 
"avoid repetition", "avoid deep embed- 
ding" etc. ) 
In the following, we give an example for 
the lexicalization possibilities that can be de- 
scribed with the proposed device: 
given the following (rudimentary) semantic representation¹⁴:

mental process: *BE-HAPPY*
    :BEARER *PIERRE*
    :CAUSE *NEWS*,

there should be available the following information about collocations with joie as a basis¹⁵:

CAUS FUNC(joie)  = causer la joie de qn, causer de la joie chez qn
CAUS OPER(joie)  = réjouir qn, mettre qn en joie, remplir qn de joie
INCEP FUNC(joie) = la joie s'empare de qn, la joie saisit qn, la joie naît dans le coeur de qn
INCEP OPER(joie) = qn se met en joie
The choice between INCEP and CAUS depends on whether (and how) the causality is to be expressed. The choice between INCEP OPER and INCEP FUNC depends on whether the realization of *PIERRE* or of *NEWS* should become the subject.
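The two decisions can be sketched as a small selection function over the entry for joie. The parameter names `express_cause` and `bearer_is_subject` and the function name `candidates` are hypothetical; the sketch does not model how a choice is made among the remaining candidates.

```python
# Collocations with "joie" as basis, keyed by complex lexical function.
JOIE = {
    ("CAUS", "FUNC"): ["causer la joie de qn", "causer de la joie chez qn"],
    ("CAUS", "OPER"): ["réjouir qn", "mettre qn en joie",
                       "remplir qn de joie"],
    ("INCEP", "FUNC"): ["la joie s'empare de qn", "la joie saisit qn",
                        "la joie naît dans le coeur de qn"],
    ("INCEP", "OPER"): ["qn se met en joie"],
}


def candidates(express_cause, bearer_is_subject):
    """Narrow down the collocational candidates for *BE-HAPPY*."""
    if express_cause:
        # Causality is to be expressed: both CAUS variants remain candidates.
        return JOIE[("CAUS", "FUNC")] + JOIE[("CAUS", "OPER")]
    # No causative: the intended subject decides between OPER and FUNC.
    key = ("INCEP", "OPER") if bearer_is_subject else ("INCEP", "FUNC")
    return JOIE[key]
```

Further parameters listed above, such as markedness or general text generation heuristics, would then filter the returned list.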
¹⁴ mental process is meant to be a concept type; :BEARER and :CAUSE are semantic relations; *BE-HAPPY*, *PIERRE* and *NEWS* are concepts.
¹⁵ In simplified notation. The first two examples are roughly equivalent to English "make someone happy", "fill someone with joy", the latter ones to "to please someone".
Here constraints caused by the syntax of 
the utterance to be generated play an impor- 
tant role: in a relative clause e.g. the an- 
tecedent has already been introduced. This 
fact limits the choice: 
• ... et alors cette nouvelle arriva, qui ...
  - causa la joie de Pierre (= CAUS FUNC)
  - mit Pierre en joie (= CAUS OPER)
• ... et alors Marie envoya cette nouvelle à Pierre, qui ...
  - se réjouit (= CAUS FUNC)
  - se mit en joie (= INCEP OPER)
This example shows that the heuristic 
"lexicalize bases first, then collocates" inter- 
acts with constraints stemming e.g. from syn- 
tax; these constraints can also be produced by 
a text structuring component (decisions about 
topic, thematic order etc.). The modular design of the lexicon supports generation of variants by giving access to all information needed at the appropriate choice points.
4 Conclusion and directions 
for future work 
We propose a method for the description of 
knowledge about collocations in the dictionary 
of a multilingual generation system. Advan- 
tages for text generation result from the ap- 
plication of MTT's lexical functions and the 
formulation of the heuristic discussed above. 
In the generation literature, the generation of collocations is regarded as a problem (cf. [MATTHIESSEN 1988]). The only system we know of in which attempts have been made to solve it is DIOGENES, a knowledge-based generation system under development at Carnegie Mellon University¹⁶. Our approach differs from NIRENBURG's in that it introduces the distinction between basis and collocate. This leads to differences in the lexicalization strategy: within DIOGENES, heads are lexicalized before modifiers, irrespective of word classes (cf. [NIRENBURG/NIRENBURG 1988]); we have come up with data that seem to favour the distinction between basis and collocate.
Further contrastive descriptive work will 
be the basis for a prototypical implementa- 
tion within Polygloss. With respect to lexical 
functions, some questions related to defaults 
(e.g. syntactic realization defaults, inheritance 
of collocational properties within lexeme classes
etc.) should be investigated in more detail. 
4.1 Acknowledgements 
We would like to thank Sergei Nirenburg and our colleagues at the IMS for fruitful discussions of this paper. All remaining errors are of course our own.
References 
[FIRTH 1951] John Rupert Firth: "Modes of Meaning." (1951) in: Papers in Linguistics 1934-51. (London) 1957 (pp. 190-215)
[HAUSMANN 1985] Franz Josef Hausmann: "Kollokationen im deutschen Wörterbuch. Ein Beitrag zur Theorie des lexikographischen Beispiels." in: Henning Bergenholtz / Joachim Mugdan (Eds.): Lexikographie und Grammatik. Akten des Essener Kolloquiums zur Grammatik im Wörterbuch. 1985: 118-129 [= Lexicographica. Series Major 3]
[HEID/MOMMA 1989] Ulrich Heid, Stefan Momma: "Layered Lexicons for Generation", internal paper, University of Stuttgart, IMS, 1989
¹⁶ For a general overview of DIOGENES, see [NIRENBURG et al. 1988]. Questions of lexicalization and of the treatment of collocations are treated in [NIRENBURG 1988], [NIRENBURG et al. 1988], [NIRENBURG/NIRENBURG 1988].
[MATTHIESSEN 1988] Christian Matthiessen: "Lexicogrammatical Choices in Natural Language Generation", ms., paper presented at the Catalina Workshop on Natural Language Generation, (Los Angeles), June 1988
[MEL'ČUK 1988] Igor A. Mel'čuk: "Paraphrase et lexique dans la théorie linguistique Sens-Texte." in: Lexique 6, Lexique et paraphrase. Lille 1988: 13-54
[MEL'ČUK et al. 1981] Igor A. Mel'čuk et al.: "Un nouveau type de dictionnaire: le dictionnaire explicatif et combinatoire du français contemporain (six entrées de dictionnaire)." in: Cahiers de Lexicologie (28) 1981-I: 3-34
[MEL'ČUK et al. 1984] Igor A. Mel'čuk et al.: Dictionnaire explicatif et combinatoire du français contemporain. Recherches Lexico-Sémantiques. (I), Montréal 1984
[MEL'ČUK/POLGUÈRE 1987] Igor A. Mel'čuk, Alain Polguère: "A Formal Lexicon in the Meaning-Text Theory (or how to do Lexica with Words)." in: Computational Linguistics 13 3-4 1987: 261-275
[NIRENBURG 1988] Sergei Nirenburg: "Lexical selection in a blackboard-based generation system." Paper presented at the Catalina Workshop on NL generation, Los Angeles 1988, ms.
[NIRENBURG et al. 1988] Sergei Nirenburg et al.: "DIOGENES-88, CMU-CMT-88-107." Pittsburgh: CMU, 1988, ms.
[NIRENBURG et al. 1988] Sergei Nirenburg et al.: "Lexical Realization in Natural Language Generation." in: Second International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages. Pittsburgh, Pennsylvania, June 12-14, 1988, Proceedings, 1988
[NIRENBURG/NIRENBURG 1988] Sergei Nirenburg, Irene Nirenburg: "Choosing Word carefully", (Pittsburgh, Pa.: ICMT, Carnegie-Mellon University), 1988, internal paper.
[RAAB 1988] Sybille Raab: Zur Beschreibung fachsprachlicher Kollokationen, ms., University of Stuttgart, 1988
[RÖSNER 1988] Dietmar Rösner: "The SEMSYN generation system", in: Proceedings of ACL-applied, Austin, Texas, February 1988, 1988
[SINCLAIR 1987] John McH Sinclair: "Collocation. A progress report." in: Ross Steele / Terry Threadgold (Eds.): Language Topics. Essays in honour of Michael Halliday. (Amsterdam/Philadelphia) 1987, vol. 2: 319-331
[STEELE/MEYER 1988] James Steele, Ingrid Meyer: "Lexical Functions in the Explanatory Combinatorial Dictionary: Kinds and Definitions." Internal paper, Université de Montréal, 1988
