Idiomatic object usage and support verbs 
Pasi Tapanainen, Jussi Piitulainen and Timo Järvinen*
Research Unit for Multilingual Language Technology 
P.O. Box 4, FIN-00014 University of Helsinki, Finland 
http://www.ling.helsinki.fi/
1 Introduction 
Every language contains complex expressions 
that are language-specific. The general problem
when building automated translation systems or
human-readable dictionaries is first to detect
expressions that can be used idiomatically, and
then to decide whether such an expression is used
idiomatically in a particular text or whether
a literal translation would be preferred. It follows
from the definition of an idiomatic expression
that when a complex expression is used idiomat- 
ically, it contains at least one element which is 
semantically "out of context". In this paper, 
we discuss a method that finds idiomatic col- 
locations in a text corpus. The method detects 
semantic asymmetry by taking advantage of dif- 
ferences in syntactic distributions. 
We demonstrate the method using a spe- 
cific linguistic phenomenon, verb-object collo- 
cations. The asymmetry between a verb and its 
object is the focus in our work, and it makes the 
approach different from the methods that use 
e.g. mutual information, which is a symmetric 
measure. 
Our novel approach differs from mutual infor- 
mation and the so-called t-value measures that 
have been widely used for similar tasks, e.g., 
Church et al. (1994) and Breidt (1993) for Ger- 
man. The tasks where mutual information can 
be applied are very different in nature as we 
see in the short comparison at the end of this 
paper. The work reported in Grefenstette and 
Teufel (1995) for finding empty support verbs 
used in nominalisations is also related to the
present work. 
* Email: Pasi.Tapanainen@ling.helsinki.fi, Jussi.Piitulainen@ling.helsinki.fi
and Timo.Jarvinen@ling.helsinki.fi, Parsers & demos: http://www.conexor.fi
2 Semantic asymmetry 
The linguistic hypothesis that syntactic rela- 
tions, such as subject-verb and object-verb re- 
lations, are semantically asymmetric in a sys- 
tematic way (Keenan, 1979) is well-known. Mc- 
Glashan (1993, p. 213) discusses Keenan's prin- 
ciples concerning directionality of agreement re- 
lations and concludes that semantic interpreta- 
tion of functor categories varies with argument 
categories, but not vice versa. He cites Keenan 
who argues that the meaning of a transitive verb 
depends on the object, for example the mean- 
ing of the verb cut seems to vary with the direct 
object: 
• in cut finger "to make an incision on the 
surface of", 
• in cut cake "to divide into portions", 
• in cut lawn "to trim" and 
• in cut heroin "diminish the potency". 
This phenomenon is also called semantic tailor- 
ing (Allerton, 1982, p. 27). 
There are two different types of asymmetric 
expressions even if they probably form a con- 
tinuum: those in which the sense of the functor 
is modified or selected by a dependent element 
and those in which the functor is semantically 
empty. The former type is represented by the 
verb cut above: a distinct sense is selected ac- 
cording to the (type of) object. The latter type 
contains an object that forms a fixed collocation 
with a semantically empty verb. These pairings 
are usually language-specific and semantically 
unpredictable. 
Obviously, the amount of tailoring varies con- 
siderably. At one end of the continuum is id- 
iomatic usage. It is conceivable that even a 
highly idiomatic expression like taking toll can 
be used non-idiomatically. There may be texts 
where the word toll is used non-idiomatically, as 
it also may occur from time to time in any text 
as, for instance, in The Times corpus: The IRA 
could be profiting by charging a toll for cross- 
border smuggling. But when it appears in a 
sentence like Barcelona's fierce summer is tak- 
ing its toll, it is clearly a part of an idiomatic 
expression. 
3 Distributed frequency of an object 
As the discussion in the preceding section
shows, we assume that when there is a verb- 
object collocation that can be used idiomati- 
cally, it is the object that is the more interesting 
element. The objects in idiomatic usages tend 
to have a distinctive distribution. If an object 
appears only with one verb (or few verbs) in a 
large corpus we expect that it has an idiomatic 
nature. The previous example of take toll is il- 
lustrative: if the word toll appears only with the 
verb take but nothing else is done with tolls, we 
may then assume that it is not the toll in the
literal sense that the text is about.
The task is thus to collect verb-object colloca- 
tions where the object appears in a corpus with 
few verbs; then study the collocations that are 
topmost in the decreasing order of frequency. 
The restriction that the object is always at- 
tached to the same verb is too strict. When 
we applied it to ten million words of newspaper 
text, we found out that even the most frequent 
of such expressions, make amends and take 
precedence, appeared less than twenty times, 
and the expressions have temerity, go berserk 
and go ex-dividend were even less frequent. It
was hard to obtain more collocations because
their frequencies became very low. At such low
frequencies, expressions like have appendix were
ranked on a par with expressions like run errand.
Therefore, instead of taking the objects that 
occur with only one verb, we take all objects and 
distribute them over their verbs. This means 
that we are concerned with all occurrences of an 
object as a block, and give the block the score 
that is the frequency of the object divided by 
the number of different verbs that appear with 
the object. 
The formula is now as follows. Let o be an
object and let
$(F_1, V_1, o), \ldots, (F_n, V_n, o)$
be triples where $F_j > 0$ is the frequency or the
relative frequency of the collocation of o as an
object of the verb $V_j$ in a corpus. Then the score
for the object o is the sum $\sum_{j=1}^{n} F_j / n$.
The frequency of a given object is divided by 
the number of different verbs taking this given 
object. If the number of occurrences of a given 
object grows, the score increases. If the object 
appears with many different verbs, the score de- 
creases. Thus the formula favours common ob- 
jects that are used in a specific sense in a given 
corpus. 
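
The basic score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name basic_scores and the counts are invented for the example, and each (verb, object) pair is assumed to occur at most once in the input list.

```python
from collections import defaultdict

def basic_scores(triples):
    """Score each object as the sum of its collocation frequencies
    divided by the number of different verbs it occurs with.

    `triples` is a list of (frequency, verb, object) tuples; each
    (verb, object) pair is assumed to appear only once.
    """
    freqs_by_object = defaultdict(list)
    for freq, verb, obj in triples:
        freqs_by_object[obj].append(freq)
    return {obj: sum(freqs) / len(freqs)
            for obj, freqs in freqs_by_object.items()}

# Invented counts, for illustration only.
triples = [
    (30, "take", "toll"),
    (2, "charge", "toll"),
    (10, "have", "car"),
    (10, "buy", "car"),
    (10, "drive", "car"),
    (10, "sell", "car"),
]
scores = basic_scores(triples)
```

With these invented counts, toll scores (30 + 2) / 2 and car scores 40 / 4, so the object that concentrates on fewer verbs ranks higher although car is the more frequent object overall.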
This scheme still needs some parameters. 
First, the distribution of the verbs is not taken 
into account. The score is the same in the 
case where an object occurs with three different 
verbs with the frequencies, say, 100, 100, and 
100, and in the case where the frequencies of 
the three heads are 280, 10 and 10. In this case, 
we want to favour the latter object, because 
the verb-object relation seems to be more stable 
with a small number of exceptions. One way to 
do this is to sum up the squares of the frequen- 
cies instead of the frequencies themselves. 
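
The effect of squaring can be checked by hand for the two cases just mentioned. The display below is a worked calculation only; it assumes that the sum is still divided by the number of verbs (three), as in the basic score.

```latex
\[
\frac{100 + 100 + 100}{3} = \frac{280 + 10 + 10}{3} = 100
\]
\[
\frac{100^2 + 100^2 + 100^2}{3} = 10\,000
\qquad\text{but}\qquad
\frac{280^2 + 10^2 + 10^2}{3} = \frac{78\,600}{3} = 26\,200
\]
```

With plain frequencies the two objects tie; with squared frequencies the object dominated by a single verb scores higher.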
Second, it is not clear what the optimal 
penalty is for multiple verbs with a given ob- 
ject. This may be parametrised by scaling the 
denominator of the formula. Third, we intro- 
duce a threshold frequency for collocations so 
that only the collocations that occur frequently 
enough are used in the calculations. This last 
modification is crucial when an automatic pars- 
ing system is applied because it eliminates in- 
frequent parsing errors. 
The final formula for the distributed frequency
DF(o) of the object o in a corpus of
n triples $(F_j, V_j, o)$ with $F_j > C$ is the sum
$$DF(o) = \sum_{j=1}^{n} \frac{F_j^a}{n^b}$$
where a, b and C are constants that may depend
on the corpus and the parser.
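
A minimal sketch of the distributed frequency follows, reading the definition as: raise each frequency to the power a, sum over the verbs that pass the threshold C, and divide by the number of those verbs raised to the power b. The function name, the default parameter values and the data are assumptions made for the example; the paper does not prescribe them.

```python
from collections import defaultdict

def distributed_frequency(triples, a=1.0, b=1.0, c=0.0):
    """Return {object: DF(object)} with DF(o) = sum_j F_j**a / n**b.

    Only triples whose frequency is strictly greater than `c` are used,
    and n is the number of verbs left for the object after that filter.
    The defaults a=1, b=1, c=0 are illustrative assumptions.
    """
    kept = defaultdict(list)
    for freq, verb, obj in triples:
        if freq > c:
            kept[obj].append(freq)
    return {obj: sum(f ** a for f in freqs) / len(freqs) ** b
            for obj, freqs in kept.items()}

# Hypothetical objects "x" and "y" with hypothetical verbs v1..v3.
triples = [
    (280, "v1", "x"), (10, "v2", "x"), (10, "v3", "x"),
    (100, "v1", "y"), (100, "v2", "y"), (100, "v3", "y"),
]
squared = distributed_frequency(triples, a=2, b=1)
thresholded = distributed_frequency(triples, c=10)
```

In this sketch, a=2 separates the two objects that tie under a=1, and a threshold of 10 removes the two infrequent verbs of "x" altogether, so that only the dominant collocation contributes.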
4 The corpora and parsing 
4.1 The syntactic parser 
We used the Conexor Functional Dependency
Grammar (FDG) by Tapanainen and
Järvinen (1997) for finding the syntactic relations.
The new version of the syntactic parser
can be tested at http://www.conexor.fi.
4.2 Processing the corpora 
We analysed the corpora with the syntactic 
parser and collected the verb-object collocations 
from the output. The verb may be in the infini- 
tive, participle or finite form. A noun phrase in 
the object function is represented by its head. 
For instance, the sentence I saw a big black cat
generates the pair (see, cat). A verb may also
have an infinitive clause as its object. In such a
case, the object is represented by the infinitive,
with the infinitive marker if present. Naturally,
transitive nonfinite verbs can have objects of
their own. Therefore, for instance, the sentence
I want to visit Paris generates two verb-object
pairs: (want, to visit) and (visit, Paris). The
parser also recognises clauses, e.g. that-clauses,
as objects.
We collect the verbs and head words of nom- 
inal objects from the parser's output. Other 
syntactic arguments are ignored. The output 
is normalised to the baseforms so that, for in- 
stance, the clause He made only three real mis- 
takes produces the normalised pair: (make, 
mistake). The tokenisation in the lexical analysis
produces some "compound nouns" like
vice+president, which are glued together. We
regard these compounds as single tokens.
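
The collection step can be sketched as a simple count over normalised pairs. The sketch assumes that the parser output has already been reduced to (verb, object) base-form pairs; the actual output format of the parser is not described here, so the input list and the function name count_collocations are hypothetical.

```python
from collections import Counter

def count_collocations(pairs):
    """Count (verb, object) base-form pairs.

    Returns a list of (frequency, verb, object) triples, one per
    distinct pair.
    """
    counts = Counter(pairs)
    return [(freq, verb, obj) for (verb, obj), freq in counts.items()]

# Hypothetical, already normalised pairs extracted from three clauses.
pairs = [("make", "mistake"), ("see", "cat"), ("make", "mistake")]
triples = count_collocations(pairs)
```

The resulting triples have the shape assumed by the scoring formula: one frequency per distinct verb-object pair.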
The intricate borderline between an object, 
object adverbial and mere adverbial nominal is 
of little importance here, because the latter tend 
to be idiomatic anyway. More importantly, due
to the use of a syntactic parser, the presence of
other arguments, e.g. subject, predicative complement
or indirect object, does not affect the result.
5 Experiments 
In our experiment, we used some ten million
words from The Times newspaper corpus,
taken from the Bank of English corpora
(Järvinen, 1994). The overall quality of the resulting
collocations is good. The verb-object collocations
with the highest distributed object frequencies
seem to be very idiomatic (Table 1).
The collocations seem to have different status 
in different corpora. Some collocations appear 
in every corpus in a relatively high position. For 
example, collocations like take toll, give birth 
and make mistake are common English expres- 
sions. 
Some other collocations are corpus specific.
An experiment with the Wall Street
Journal corpus contains collocations like name
vice+president and file lawsuit that are rare in
the British corpora. These expressions could be
categorised as cultural or area specific. They are
likely to appear again in other issues of WSJ or
in other American newspapers.

DF(o)   F(vo)   verb + object
37.50      73   take toll
28.00      28   go bust
25.00      25   make plain
24.83      60   mark anniversary
22.00      22   finish seventh
21.00      21   make inroad
21.00      21   do homework
21.00      21   have hesitation
20.40      93   give birth
19.50      28   have a=go
19.25     128   make mistake
18.00      18   go so=far=as
18.00      18   take precaution
17.50      76   look as=though
17.50      61   commit suicide
17.25      62   pay tribute
17.04     817   take place
17.00      17   make mockery
17.00      17   make headway
16.29     152   take wicket
16.17     319   cost £
16.00      16   have qualm
16.00      16   make pilgrimage
15.69     248   take advantage
15.57      84   make debut
15.00      15   have second=thought
14.57     190   do job
14.50      27   finish sixth
14.50      16   suffer heartattack
14.47     165   decide whether
14.14     110   have impact
14.12     329   have chance
14.00     133   give warn
14.00      14   have sexual=intercourse
14.00      14   take plunge
14.00      14   have misfortune
14.00      14   thank goodness
13.90     226   have nothing
13.63     131   make money
13.50      25   strike chord
Table 1: Verb-object collocations from The
Times

 F   MI (scaled)   t-value (scaled)   Verb + object
15       9.47            3.87         wreak havoc
12       8.62            3.46         armour carrier
11       8.48            3.32         grasp nettle
14       8.42            3.74         firm lp
12       8.30            3.46         bury Edmund
13       8.21            3.60         weather storm
21       8.18            4.58         bid farewell
12       8.17            3.46         strut stuff
18       8.10            4.24         breathe sigh
10       8.10            3.16         suck toe
13       8.05            3.60         incur wrath
12       8.03            3.46         invade Kuwait
11       7.92            3.31         protest innocence
17       7.91            4.12         hole putt
13       7.91            3.60         poke fun
11       7.80            3.31         tighten belt
12       7.72            3.46         stem tide
11       7.72            3.31         heal wound
Table 2: Collocations according to mutual
information filtered with t-value of 3

position   verb + object
     124   finish seventh
     157   mark anniversary
     478   go bust
     770   do homework
     862   give birth
    1009   make inroad
    1033   take toll
    1225   make mistake
    1244   make plain
    1942   have hesitation
    2155   have a=go
Table 3: The order of top collocations according
to mutual information

frequency   verb + object
      329   have chance
      302   have it
      274   have time
      256   have effect
      247   have right
      229   have problem
      226   have nothing
      210   have little
      203   have idea
      186   have power
      164   have what
      155   have much
      142   have child
      139   have experience
      138   have some
      135   have reason
      132   have one
      123   have advantage
      122   have intention
      119   have plan
Table 4: What do we have? - Top-20

6 Mutual information 
Mutual information between a verb and its ob- 
ject was also computed for comparison with our 
method. The collocations from The Times with 
the highest mutual information and high t-value 
are listed in Table 2. See Church et al. (1994) 
for further information. We selected the t-value 
so that it does not filter out the collocations of 
Table 1. Mutual information is computed from 
a list of verb-object collocations. 
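
For orientation, the comparison measures can be sketched as follows. The sketch assumes the usual pointwise mutual information and the common t-score approximation, with probabilities estimated from the verb-object list itself; the exact formulas and the scaling behind the tables are not given in this paper, so the function below (mi_and_t, a hypothetical name) is not a reconstruction of them. The counts are invented.

```python
import math
from collections import Counter

def mi_and_t(triples):
    """Return {(verb, object): (mi, t)} from (frequency, verb, object) triples.

    Assumed definitions, with N the total number of pair occurrences:
      expected = f(verb) * f(object) / N
      mi = log2(f(verb, object) / expected)
      t  = (f(verb, object) - expected) / sqrt(f(verb, object))
    """
    pair_freq = {(verb, obj): freq for freq, verb, obj in triples}
    verb_freq, obj_freq = Counter(), Counter()
    for (verb, obj), freq in pair_freq.items():
        verb_freq[verb] += freq
        obj_freq[obj] += freq
    total = sum(pair_freq.values())
    result = {}
    for (verb, obj), freq in pair_freq.items():
        expected = verb_freq[verb] * obj_freq[obj] / total
        result[(verb, obj)] = (math.log2(freq / expected),
                               (freq - expected) / math.sqrt(freq))
    return result

# Invented counts, for illustration only.
stats = mi_and_t([(4, "wreak", "havoc"), (4, "have", "chance")])
```

Both values are unchanged if the roles of verb and object are swapped, which is the symmetry that the distributed frequency deliberately avoids.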
The first impression, when comparing Tables
1 and 2, is that the collocations in the latter
are somewhat more marginal though clearly se- 
mantically motivated. The second observation 
is that the top collocations contain mostly rare 
words and parsing errors made by the underly- 
ing syntactic parser; three out of the top five 
pairs are parsing errors. 
We tested how the top ten pairs of Table 1 are 
rated by mutual information. The result is in 
Table 3 where the position denotes the position 
when sorted according to mutual information 
and filtered by the t-value. The t-value is se- 
lected so that it does not filter out the top pairs 
in Table 1. Without filtering, the positions are
in the range between 32 640 and 158 091. The result
shows clearly how different the nature of
mutual information is. Here it seems to favour 
pairs that we would like to rule out and vice 
versa. 
frequency   verb + object
       21   have hesitation
       28   have a=go
       16   have qualm
       15   have second=thought
      110   have impact
      329   have chance
       14   have sexual=intercourse
       14   have misfortune
      226   have nothing
      135   have reason
      117   have choice
      274   have time
       41   have regard
       28   have no=doubt
      256   have effect
       18   have bedroom
       17   have regret
       10   have penchant
       10   have pedigree
       10   have clout
Table 5: The collocations of the verb have
sorted according to the DF function
7 Frequency 
In a related piece of work, Hindle (1994) used a 
parser to study what can be done with a given 
noun or what kind of objects a given verb may 
get. If we collect the most frequent objects for 
the verb have, we are answering the question: 
"What do we usually have?" (see Table 4). The 
distributed frequency of the object gives a dif- 
ferent flavour to the task: if we collect the collo- 
cations in the order of the distributed frequency 
of the object, we are answering the question: 
"What do we only have?" (see Table 5). 
8 Conclusion 
This paper was concerned with the semantic 
asymmetry which appears as syntactic asym- 
metry in the output of a syntactic parser. This 
asymmetry is quantified by the presented dis- 
tributed frequency function. The function can 
be used to collect and sort the collocations so 
that the (verb-object) collocations where the 
asymmetry between the elements is the largest 
come first. Because the semantic asymmetry is 
related to the idiomaticity of the expressions, 
we have obtained a fully automated method to 
find idiomatic expressions from large corpora. 

References 
D. J. Allerton. 1982. Valency and the English
Verb. London: Academic Press. 
Elisabeth Breidt. 1993. Extraction of V-N- 
collocations from text corpora: A feasibility 
study for German. Proceedings of the Work- 
shop on Very Large Corpora: Academic and 
Industrial Perspectives, pages 74-83, June. 
Kenneth Ward Church, William Gale, Patrick 
Hanks, Donald Hindle, and Rosamund Moon. 
1994. Lexical substitutability. In B.T.S. 
Atkins and A Zampolli, editors, Computa- 
tional Approaches to the Lexicon, pages 153- 
177. Oxford: Clarendon Press. 
Gregory Grefenstette and Simone Teufel. 1995. 
Corpus-based method for automatic identifi- 
cation of support verbs for nominalizations. 
Proceedings of the 7th Conference of the European
Chapter of the ACL, March 27-31.
Donald Hindle. 1994. A parser for text corpora. 
In B.T.S. Atkins and A Zampolli, editors, 
Computational Approaches to the Lexicon, 
pages 103-151. Oxford: Clarendon Press. 
Timo Järvinen. 1994. Annotating 200 Million
Words: The Bank of English Project.
COLING 94. The 15th International Confer- 
ence on Computational Linguistics Proceed- 
ings. pages 565-568. Kyoto: Coling94 Orga- 
nizing Committee. 
Edward L. Keenan. 1979. On surface form and 
logical form. Studies in the Linguistic Sci- 
ences, (8):163-203. Reprinted in Edward L. 
Keenan (1987). Universal Grammar: fifteen 
essays. London: Croom Helm. 375-428. 
Scott McGlashan. 1993. Heads and lexical se- 
mantics. In Greville G. Corbett, Norman M. 
Fraser, and Scott McGlashan, editors, Heads 
in Grammatical Theory, pages 204-230. Cam- 
bridge: CUP. 
Pasi Tapanainen and Timo Järvinen. 1997. A
non-projective dependency parser. In Pro- 
ceedings of the 5th Conference on Applied 
Natural Language Processing, pages 64-71, 
Washington, D.C.: ACL. 
