The Semantics of Collocational Patterns for 
Reporting Verbs 
Sabine Bergler 
Computer Science Department 
Brandeis University 
Waltham, MA 02254 
e-mail: sabine@chaos.cs.brandeis.edu 
Abstract 
One of the hardest problems for knowledge extraction 
from machine readable textual sources is distinguishing 
entities and events that are part of the main story from 
those that are part of the narrative structure, hnpor- 
tantly, however, reported sl)eech in newspaper articles ex- 
plicitly links these two levels. In this paper, we illustrate 
what the lexical semantics of reporting verbs must incor- 
porate in order to contribute to the reconstruction of story 
and context. The lexical structures proposed are derived 
from the analysis of semantic collocations over large text 
corpora. 
I Motivation 
We can distinguish two levels in newspaper articles: 
the pure information, here called primary informa- 
lion, and the meta-informati0n , which embeds the 
primary information within a perspective, a belief 
context, or a modality, which we call circumstan.- 
tim information. The distinction is not limited to, 
but is best illustrated by, reported speech sentences. 
Here the matrix clause or reporting clause corre- 
sponds to the circumstantial:information, while the 
complement (whether realized as a full clause or as 
a noun phrase) corresponds t'o primary information. 
For tasks such as knowledge extraction it is the pri- 
mary information that is of interest. For example in 
the text of Figure 1 the matrix clauses (italicized) give 
the circumstantial information of the who, when and 
how of the reporting event, while what is reported (the 
primary information) is givel~ in tile complements. 
The particular reporting verb also adds important 
information about the manner of the original utter- 
ance, the preciseness of tile quote, the temporal rela- 
I, iolJship between ,uatrix clause and e(mq~h:me,l,, aml 
more. In addition, the source of tile original infor- 
mation provides information about the reliability or 
credibility of the primary information. Because the 
individual reporting verbs differ slightly but impor- 
tantly in this respect, it is the lexicai semantics that 
must account for such knowledge. 
US Advising Third Parties on 
Hostages 
(R1) The Bush administration continued to 
insist ~esterday that (CI) it is not involved 
in negotiations over the Western hostages in 
Lebanon, (R2) but acknowledged that (C2) US 
olliciais have provided advice to and have been 
kept informed by "people at all levels" who are 
holding such talks. 
(C3) "There's a lot happening, and I don't 
want to be discouraging," (R3) Marlin Fitzwa- 
let, the president's spokesman, told reporters. 
(R4) But Fitzwater stressed that (C4) he was 
not trying to fuel speculation about any im- 
pending release, (R5) and said (C5) there 
was "no reason to believe" the situation had 
changed. 
(All Nevertheless, it appears that it has .... 
Figure 1: Boston Globe, March 6, 1990 
We describe here a characterization of influences 
which the reporting clause has on the interpretation 
of the reported clause without fully analyzing the re- 
ported clause. This approach necessarily leaves many 
questions open, because the two clauses are so inti- 
mately linked that no one can be analyzed fully in 
isolation. Our goal is, however, to show a minimal 
requirement on the lexical semantics of tile words in- 
volved, thereby enabling us to attempt a solution to 
the larger problems in text analysis. 
The lexicai semantic framework we assume ill this 
paper is that of the Generative Lexicon introduced hy 
Pustejovsky \[Pustejovsky89\]. This framework allows 
o. 216 - 
us to represent explicitly even those semantic cello- Keywords 
cations which have traditionally been assumed to be insist 
presupl)ositions and not part of the lexicon itself. 
insist on 
II Semantic Collocations 
Reporting verbs carry a varying amount of informa- 
tion regarding time, manner, factivity, reliability etc. 
of the original utterance. The most unmarked report- 
ing verb is say. The only presupposition for say is 
that there was an original utterance, the assumption 
being that this utterance is represented as closely as 
possible. In this sense say is even less marked than re. 
porl, which in addition specifies an a(Iressee (usually 
implicit from the context.) 
The other members in the semantic fieM are set 
apart through their semantic collocations. Let us 
consider in depth the case of insist. One usage cart be 
found in the first part of the first sentence in Figure 1, 
repeated here as (1). 
1 The Bush administration continued to insist yes- 
terday that it is not involved in negotiations over the 
Weslern hostages in Lebanon. 
The lexical definition of insist in the Long- 
man Dictionary of Contemporary English (LDOGE) 
\[Procter78\] is 
insist 1 to declare firmly (when opposed) 
and in the Merriam Webster Pocket Dictionary 
(MWDP) \[WooJrr4\]: 
insist to take a resolute stand: PER, SIST. 
The opposition, mentioned explicitly in LDOCE 
but only hinted at in MWDP, is an important part 
of the meaning of insisl. In a careful analysis of a 
250,000 word text base of TIME magazine articles 
from 1963 (TIMEcorpus) \[Berglerg0a\] we confirmed 
that in every sentence containing insist some kind of 
opposition could be recovered and was supported by 
some other means (such as emphasis through word 
order etc.). Tire most common form of expressing 
the opposition was through negation, as in (1) above. 
In an automatic analysis of the 7 million word 
corpus containing Wall Street Journal documents 
(WSJC) \[Berglerg0b\], we found the distribution of 
patterns of opposition reported in Figure 2. This 
analysis shows that of 586 occurrences of insist 
throughout tim VVSJC, 10O were instances of the id- 
iom insisted on which does not subcategorize for a 
clausal complement. Ignoring I.hese occurrences for 
now, of the remaining 477 occurrences, 428 cooccur 
Oct 
586 
109 
insist & 
but 117 
insist & 
negation 186 
insist & 
subjunctive 159 
insist & 
but & net. 14 
insist & 
but & on 12 
insist & 
but & subj. 
Comments 
occurrences throughout 
the corpus 
these have been cleaned by 
hand and are actually oc- 
currences of the idiom in- 
sist on rather than acciden- 
tal co-occurrences. 
occurrences of both insist 
and but in the same sen- 
tence 
includes not and n'l 
includes would, could, 
should, and be 
Figure 2: Negative markers with insist in WSJC 
with such explicit markers of opposition as but (se- 
lecting for two clauses that stand in an opposition), 
not and n't, and subjunctive markers (indicating an 
opposition to factivity). While this is a rough analy- 
sis ;rod contains some "noise", it supports the findings 
of our carefid study on the TIMEcorpus, namely the 
following: 
2 A propositional opposition is implicit in the lexical 
semantics of insist. 
This is where our proposal goes beyond tra- 
ditional colloeational information, as for exam- 
ple recently argued for by Smadja and McKeown 
\[Smadja&McKeown90\]. They argue for a flexible lex- 
icon design that can accomodate both single word eu- 
tries and collocational patterns of different strength 
and rigidity. But the collocations considered in their 
proposal are all based on word cooccurrences, not 
taking advantage of the even richer layer of semantic 
collocations made use of in this proposal. Semantic 
collocations are harder to extract than cooccurrence 
patterns--the state of the art does not enable us to 
find semantic collocations automatically t. This paper 
however argues that if we take advantage of lexicai 
paradigmatic behavior underlying the lexicon, we can 
at least achieve semi-automatic extraction of seman- 
tic collocations (see also Calzolari and Bindi (1990) 
I But note the important work by Hindle \[HindlegO\] on 
extracting semantically similar nouns based on their substi- 
tutability in certain verb contexts. We see his work as very 
similar in spirit. 
- 2!7 - 
and Pustejovsky and Anick (1990) for a description 
of tools for such a semi-automatic acquisition of se- 
mantic information from a large corpus). 
Using qualia structure as a means for structuring 
different semantic fields for a word \[Pustejovsky89\], 
we can summarize the discussion of tile lexical se- 
mantics of insist with a preliminary definition, mak- 
ing explicit tile underlying opposition to the ,xssumed 
context (here denoted by ¢) and the fact that insist 
is a reporting verb. 
3 (Preliminary Lexical l)elinition) 
insist(A,B) 
\[Form: Reporting Verb\] 
\[7'elic: utter(A,B) & :1¢: opposed(B#)\] 
\[Agentive: human(A)\] 
III Logical Metonymy 
in the previous section we argued that certain se- 
mantic collocations are part of the lexical seman- 
tics of a word. In this section we will show that 
reporting verbs as a class allow logical metonymy 
\[Pustejovsky91\] \[l'ustejovsky&Anick88\]. An example 
caLL be found in (1), where the metonymy is found in 
tile subject, NP. The Bush administration is a com- 
positional object of type administration, which is de- 
fined somewhat like (4). 
4 (Lexical l)elinition) 
administration 
\[Form: + plural 
part of: institution\] 
\[Telic: execute(x, orders(y)), 
where y is a high official 
in the specific institution\] 
\[Constitutive: + human 
executives, 
officials,...\] 
\[Aoentive: appoint(y, x)\] 
In its formal role at least, i an administration does 
not fldfill the requirements for making an utterance-- 
only in its constitutive role is there the attribute \[4_ 
human\], allowing for the metonymic use. 
Although metonymy is a general device -- in that 
it can appear in almost any context and make use 
of associations never considered before 2 -- a closer 
2As the well-known examl)h." The ham sandwich ordered an- 
other coke. illustrates. 
look at the data reveals, however, that metonymy as 
used in newspaper articles is much more restricted 
and systematic, corresponding very closely to logical 
metonymy \[Pustejovsky89\]. 
Not all reporting verbs use the same kind of 
metonymy, however. Different reporting verbs select 
for different semantic features in their source NPs. 
More precisely, they seem to distinguish between a 
single person, a group of persons, and an institution. 
We confirmed this preference on the TIMEcorpus, 
extracting automatically all tile sentences containing 
one of seven reporting verbs and analyzing these data 
by hand. While the number of occurrences of each re- 
portitLg verb was much too small to deduce tile verb's 
lexical sema,Ltics, they nevertheless exhibited inter- 
esting tendencies. 
Figure 3 shows the distribution of the degree of an- 
imacy. The numbers indicate percent of total occur- 
rence of the verb, i.e. in 100 sentences that contain 
insist as a reporting verb, 57 have a single person as 
their source. 
\]person I group I instil. \[ other 
admit 64% 19% 14% 2% 
announce .... 51% 10% 31% 8% 
claim 35% 21% 38% 6% 
denied 55% 17% 17% 11% 
insist 57% 24% 16% 3% 
said 83% 6% 4% 8% 
told 69% 7% 8% 16% 
Figure 3: Degree of Animacy in Reporting Verbs 
The significance of the results in Figure 3 is that 
semantically related words have very similar distribu- 
tions and that this distribution differs from the distri- 
bution of less related words. Admit, denied and insist 
then fall ill one category that we call call here infor- 
mally \[-inst\], said and told fan in \[+person\], and claim 
• and announce fall into a not yet clearly marked cate- 
gory \[other\]. We are currently implementing statisti- 
cal methods to perform similar analyses on WSJC. 
We hope that the impreciseness of an automated 
analysis using statistical methods will be counterbal- 
anced by very clear results. 
The TIMEcorpus also exhibited a preference for 
one particular metonymy, which is of special inter- 
est for reporting verbs, namely where the name of 
a country, of a country's citizens, of a capital, or 
even of the building in which the government resides 
stands for the government itself. Examples are Great 
Britain/ The British/London/ Buckingham Palace 
announced .... Figure 4 shows the preference of the re- 
- 218- 
I)orting verbs for tiffs metonymy in subject position. 
Again the numbers are too small to say anything 
about each lexical entry, but the difference in pref- 
erence is strong enough to suggest it is not only due 
to the specific style of the magazine, but that some 
metonymies form strong collocations that should be 
reflected in the lexicon. Such results ill addition pro- 
vide interesting data for preference driven semantic 
analysis such as Wilks' \[Wilks75\]. 
Figure 
for the 
verbs. 
Verb 
admit 
allnounce 
claim 
denied 
insist 
said 
told 
percent of all occurrences 
5% 
\]8% 
25% 
33% 9% 
3% 
0% 
4: Country, countrymen, or capital standing 
government in subject l)osition of 7 reporting 
IV A Source NP Grammar 
The analysis of the subject NPs of all occurrences of 
tile 7 verbs listed ill Figure 3 displayed great regu- 
larity in tile TIMEcorpus. Not only was the logical 
metonymy discussed in the previous section perva- 
sive, but moreover a fairly rigid semanticgrammar 
for the source NPs emerged. Two rules of this se- 
mantic grammar are listed in Figure 5. 
source 
\[quant\] \[mod\] descriptor \["," name ","\] J 
\[descriptor j((a J the) rood)\] \[mod\] name J 
\[inst's I name's\] descriptor \[name\] J 
name "," \[a j the\] \[relation prep\] descriptor J 
name "," \[a \] the\] name's (descriptor 
J relation) \] 
name "," free relative clause 
descriptor , 
role I 
\[inst\] position I 
\[position (for I of)\] \[quant\] inst 
Figure 5: Two rules in a semantic grammar for source 
NPs 
The grammar exemplified in Figure 5 is partial -- it 
only captures the regularities found in the TIMEcor- 
pus. Source NPs, like all NPs, can be adorned with 
modifiers, temporal adjuncts, appositions, and rela- 
tive clauses of any shape. Tile important observation 
is that these cases are very rare in thc corpus data 
and must be dealt with by general (i.e. syntactic) 
principles. 
The value of a specialized semantic grammar for 
source NPs is that it provides a powerful interface 
between lexical semantics, syntax, and compositional 
semantics. Our source NP grammar compiles differ- 
eat kinds of knowledge. It spells out explicitly that 
logical metonymy is to be expected in the context 
of reportiog verbs. Moreover, it restricts possible 
metonymies: the ham sandwich is not a typical source 
with reporting verbs. The source gralnmar also gives 
a likely ordering of pertinent information as roughly 
COUNTRYILOCATION ALLEGIANCE INSTITU- 
TION POSITION NAME. 
This information defines esscntially the schema for 
the rei)resentation of the source in the knowledge ex- 
I.raction domain. 
We are currently applying this grammar to the 
data i,a WSJC in order to see whether it is specific to 
the TIMEcorpus. Preliminary results were encourag- 
ing: The adjustments needed so far consisted only of 
small enhancements such as adding locative PPs at 
the end of a descriptor. 
V LCPs Lexical Conceptual 
Paradigms 
The data that lead to our source NP gratmnar was 
essentially collocational materiah We extracted tile 
sul)ject NPs for a set of verbs, analyzed the iexical- 
ization of tile source and generalized the findings a. 
In this section we will justify why we think that tile 
results can properly be generalized and what impact 
this has on tile representation in the lexicon. 
It has been noted that dictionary definitions form 
a -- usually slmllow -- hierarchy \[Amsler80\]. Un- 
fortunately explicitness is often traded in for con- 
ciseness in dictionaries, and conceptual hierarchies 
cannot be automatically extracted from dictionaries 
alone. Yet for a computational lexicon, explicit de- 
pendencies in the form of lexicai inheritance are cru- 
cial \[Briscoe&al.90\] \[Pustejovsky&Boguraev91\]. Fol- 
lowing Anick and Pustejovsky (1990), we argue that 
lexical items having related, paradigmatic syntac- 
tic behavior enter into the same iezical conceptual 
paradigm. Tiffs states that items within an LCP will 
have a set ofsyntactic realization patterns for how the 
3A detailed report on the analysis can be found in 
\[BergleJX30a\] 
- 219 - 
word and its conceptual space (e.g. presuppositions) 
are realized in a text. For example, reporting verbs 
form such a paradigm. In fact the definition of an 
individual word often stresses the difl'erence between 
it and the closest synonym rather than giving a con- 
structive (decompositioual) definition (see LDOCE). 4 
Given these assumptions, we will revise our definition 
of insist in (3). We introduce an I,CP (i.e. soma,J- 
tic type), REPOffFING VERB, which spells out the 
core semantics of reporting verbs. It also makes ex- 
plicit reference to the source NI ) grammar dist'ussed 
in Section IV as the default grammar for the subject 
NP (in active voicc). This general template allows 
us to define the individval lexical entry concisely in 
a form close to norn,al dictionary d,;li,fifions: devia- 
tions and enhancements ,as well as restrictions of the 
general pattern are expressed for the i,,dividnal en- 
try, making a COml)arison betweelt two entries focus 
on the differences in eqtailments. 
5 (Definition of Semantic Type) 
REPORTING VERB 
\[Form: :IA,B,C,D: utter(A,B) 
& hear(C,B) 
& utter(C, utter(A,B)) 
& hear(D,utter(C, utter(A,B)))\] 
\[Constitutive: SU BJ ECT: type:SourceN P, 
COMPLEMENT \] 
\[Agent|re: AGENT(C), COAGENT(A)/ 
6 (i,exical Definition) 
insist(A,B) 
\[Form: ItEI)ORTING VEI(B\] 
\[Tclic: 3¢: opposed(B,~b)\] 
\[Constitutive: MANNER: vehement\] 
\[Agent|re: \[-inst\]\] 
A related word, deny, might be defined as 7. 
7 (Lexical Definition) 
deny(A,B) 
\[Form: REPORTING VERB\] 
\[T~tic: 3q,: negate(n,q,)\] 
\[Agentive: l-instil 
(6) and (7) differ in the quality of their opposition 
to the assumed proposition in the context, tb: in- 
sist only specifies an opposition, whereas deny actu- 
ally negates that proposition. The entries also reflect 
~' ll'he notion of LCPs is of course related to the idea of 
aemanlic fields \[Trier31\]. 
their common preference not to participate in the 
metonymy that allows insiitulions to appear in sub- 
jcct position. Note t, hat opposed and negate are not 
assumed to be primitives but decompositions; these 
predicates are themselves decomposed further in the 
lexicon. 
Insist (and other reporting verbs) "inherit" much 
structural inforrnation from their semantic type, i.e, 
the LCP REPOR'I3NG VERB. It is the seman- 
tic type that actual.ly provides the constructive def- 
inition, whereas the individual entries only dclinC 
refinements on the type. This follows standard 
inheritance mechanisms for inheritance hierarchies 
\[Pustciovsky&Boguraev91\] \[Evans&Gazdar90\]. 
Among other things the I,CI ) itEPOltTING VEiLB 
specilles our specialized semantic grammar for one 
of its constituents, namely the subject NP in non- 
passive usage. This not only enhances tile tools 
available to a parser in providing semantic con- 
straints useful for constituent delimiting, but also 
provides an elegant:way to explicitly state which log- 
ical metonymies are common with a given class of 
words 5. 
VI Summary 
Reported speech is an important phenomenon that 
cannot be ignored when analyzing newspaper arti- 
cles. We argue that the lexicai semantics of reportiug 
vcrbs plays all important part in extracting informa- 
tion from large on-iiine tcxt bases. 
Based oil extensive studies of two corpora, the 
250,000 word TlMEcorpus and the 7 million word 
Wall Street Journal Corpus we identified that se- 
mantic coilocalious must be represented ill the 
lexicon, expanding thus on current trends to in- 
dude syntactic collocations in a word based lexicon 
\[Smadj~d~M cKeown90\]. 
We further discovered that logical metonymy is per- 
vasive in subject position of reporting verbs, but that 
reporting verbs differ with respect to their preference 
for different kinds of logical metonymy. A careful 
analysis of seven reporting verbs in the TIMEcor- 
pus suggested that there are three features that di- 
vide the reporting verbs into classes according to the 
preference for metonymy in subject position, namely 
whether the subject NP refers to the source as a sin- 
gle person, a group of people, or an institution. 
The analysis of the source NPs of seven reporting 
verbs further allowed us to formulate a specialized se- 
SGrimshaw \[Grimshaw79\] argues that verbs also select for 
their complements on a semantic basis. \[;'or the sake of con- 
eiscncss tim whole issue of the form of the complement and its 
semantic connection has to be omitted here. 
- 220 - 
mantic grammar for source NPs, which constitutes an 
important interface between lexical semantics, syn- 
tax, and compositional semantics used by an appli- 
cation program. We are currently testing the com- 
pleteness of this grammar on a different corpus and 
are planning to implement a noun phrase parser. 
We have imbedded the findings in the framework of 
Pustejovsky's Generative Lexicon and qualia theory 
\[Pustejovsky89\] \[Pustejovsky91\]. This rich knowi- 
' edge representation scheme allows us to represent ex- 
plicitly the underlying structure of the lexicon, in- 
eluding the clustering of entries into semant.ic types 
(i.e. I,CPs) with inheritance and the representation 
of information which wa.s previously considered pre- 
suppositional and not part of the lexicai entry itself. 
In this process we observed that the analysis of se- 
mantic collocations can serve as a measure of seman- 
tic closeness of words. 
Acknowledgements: I would like to thank 
I.ily advisor, James Pustejovsky, for inspiring discus- 
sions and irlany critical readings. 
References 
\[Amsler80\] Robert A. Amsler. The Structure of the 
Merriam-Webster Pocket Dictionary. PhD the- 
. sis, University of Texas, 1980. 
\[Anick$zPustejovsky90\] Peter-Anick and James Puste- 
jovsky. Knowledge acquisition from corpora. 
In Pracecdings of the I3th International Con- 
\]crence on Computational Linguistics, 1990. 
\[\[}riscoe&al.90\] Ted Briscoe, Ann Copestake, and Bran- 
. imir Boguraev. Enjoy the paper: Lexical seman- 
tics via lexicology. In I'ro,'ccdih!lS of lhv I.'tlh In- 
"" lernational C'oufercncc on G'omputalional Lin- 
guistics, 1990. 
\[lierglerg0a\] Sabine Bergler. Collocation patterns for 
verbs of reported speech--a corpus analysis oil 
tile time Magazine corpus. Technical: report, 
Brandeis University Computer Science,. 1990. 
\[Berglerg0b\] Sabine Bcrglcr. Collocation patterns for 
verbs of reported speech--a corpus analysis on 
The Wall Street Journal. Technical: report, 
Brandeis University Computer Science, 1990. 
\[Calzolari&Bindig0\] Nicoletta Calzolari and Reran Bindi. 
Acquisition of lexical information from a large 
textual italian corpus. In Proceedings o\] the 
13th International Conference on Computa- 
tional Linguistics, 1990. 
\[Evans&Gazdarg0\] Roger Evans and Gerald Gazdar. The 
DATR papers. Cognitive Science Research Pa- 
per CSRP 139, School of Cognitive and Com- 
puting Sciences, University of Sussex, 1990. 
\[Grimshaw79\] Jane Grimshaw. Complement selection 
and the lexicon. Linguistic Inquiry, 1979. 
\[ltindle90\] Donald Hindle. Noun classification from 
predicate-argument structures. In Proceedings 
of the Association/or Computational Linguis- 
tics, 1990. 
\[Pustejovsky&Anick88\] James Pustejovsky and Peter 
Anick. The semantic interpretation of nominals. 
In Proceedings o\] the l~th International Confer- 
ence on Computational Linguistics, 1988. 
\[Pustejovsky&Bogura~cvgl\] James Pustejovsky and Bra- 
nimir Boguraev. A richer characterization of 
dictionary entries. In B. Atkins and A. Zam- 
polli, editors, Computer Assisted Dictionary 
Compiling: Theory and Practice. Oxford Unl- 
versity Press, to appear. 
\[Pustejovsky89\] James Pustejovsky. Issues in computa- 
tional'lexical semantics. In Proceedings o\] the 
European Chapter o\] the Association for Com. 
putational Linguistics, 1989. 
\[Pustejovskygl\] James Pqstejovsky. Towards a gener- 
ative lexicon. Computational Linguistics, 17, 
1991. 
\[Procter78\] Paul Procter, editor. Longman Dictionary 
o\] Contemporary English. Longman, IIarlow, 
U.K., 1978. 
\[Smadja&McKeowng0\] Frank A. Smadja and Kathleen 
R. McKeown. Automatically extracting and 
representing lcollocations for language genera- 
tion. In Proceedings o\] the Association\]or Com- 
putational Linguistics, 1990. 
\[Trier31\] Just Trier. Der deutsche Wortschatz im 
Sinnbezirk des Verstandes: Die Geschichte: 
eines sprachlichen Feldes. Bandl, Heidelberg,, 
1931. 
\[Wilks75\] Yorick Wilks. A preferential pattern-seeking 
semantics for natural language inference. Arti- 
ficial Intelligence, 6, 1975. 
\[Woolf74\] llenry B. Woolf, editor. The Merriam-Webster. 
Dictionary.. Pocket Books, New York, 1974. 
- 221 - 
