Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 9–17,
Sydney, July 2006. c©2006 Association for Computational Linguistics
An Empirical Approach to the Interpretation of Superlatives
Johan Bos
Laboratory for Computational Linguistics
Department of Computer Science
University of Rome “La Sapienza”
bos@di.uniroma1.it
Malvina Nissim
Laboratory for Applied Ontology
Institute for Cognitive Science and Technology
National Research Council (CNR), Rome
malvina.nissim@loa-cnr.it
Abstract
In this paper we introduce an empirical
approach to the semantic interpretation of
superlative adjectives. We present a cor-
pus annotated for superlatives and pro-
pose an interpretation algorithm that uses
a wide-coverage parser and produces se-
mantic representations. We achieve F-
scores between 0.84 and 0.91 for detecting
attributive superlatives and an accuracy in
the range of 0.69–0.84 for determining the
correct comparison set. As far as we are
aware, this is the first automated approach
to superlatives for open-domain texts and
questions.
1 Introduction
Although superlative noun phrases (the nation’s
largest milk producer, the most complex arms-
control talks ever attempted, etc.) receivedconsid-
erable attention in formal linguistics (Szabolcsi,
1986; Gawron, 1995; Heim, 1999; Farkas and
Kiss, 2000), this interest is not mirrored in com-
putational linguistics and NLP. On the one hand,
thisseemsremarkable, sincesuperlativesarefairly
frequently found in natural language. On the other
hand, this is probably not that surprising, given
that their semantic complexity requires deep lin-
guistic analysis that most wide-coverage NLP sys-
tems do not provide.
But even if NLP systems incorporated linguistic
insights for the automatic processing of superla-
tives, it might not be of help: the formal semantics
literature on superlatives focuses on linguistically
challenging examples (many of them artificially
constructed) which might however rarely occur in
real data and would therefore have little impact
on the performance of NLP systems. Indeed, no
corpus-based studies have been conducted to get a
comprehensive picture of the variety of configura-
tions superlatives exhibit, and their distribution in
real occurring data.
In this paper we describe our work on the anal-
ysis of superlative adjectives, which is empiri-
cally grounded and is implemented into an exist-
ing wide-coverage text understanding system. To
get an overview of the behaviour of superlatives
in text, we annotated newswire data, as well as
queries obtained from search engines logs. On
the basis of this corpus study, we propose, imple-
ment and evaluate a syntactic and semantic analy-
sis for superlatives. To the best of our knowledge,
this is the first automated approach to the interpre-
tation of superlatives for open-domain texts that
is grounded on actual corpus-evidence and thor-
oughly evaluated. Some obvious applications that
would benefit from this work are question answer-
ing, recognition of entailment, and more generally
relation extraction systems.
2 Syntax and Semantics of Superlatives
2.1 Surface Forms
In English, superlative adjectives appear in a large
variety of syntactic and morphological forms.
One-syllable adjectives and some two-syllable ad-
jectivesaredirectlyinflectedwiththesuffix“-est”.
Somewordsoftwosyllablesandallwordsofthree
or more syllables are instead introduced by “most”
(or “least”). Superlatives can be modified by ordi-
nals, cardinals or adverbs, such as intensifiers or
modals, and are normally preceeded by the defi-
nite article or a possessive. The examples below
illustrate the wide variety and uses of superlative
adjectives.
9
the tallest woman
AS Roma’s quickest player
the Big Board’s most respected floor traders
France’s third-largest chemical group
the most-recent wave of friendly takeovers
the two largest competitors
the the southern-most tip of England
its lowest possible prices
Superlative adjectives can manifest themselves
in predicative (“Mia is the tallest.”) or attributive
form (“the tallest woman”). Furthermore, there
are superlative adverbs, such as “most recently”,
and idiomatic usages.
2.2 The Comparison Set
It is well known that superlatives can be analysed
in terms of comparative constructions (Szabolcsi,
1986; Alshawi, 1992; Gawron, 1995; Heim, 1999;
Farkas and Kiss, 2000). Accordingly, “the oldest
character” can be interpreted as the character such
that there is no older character, in the given con-
text. Therefore, a correct semantic interpretation
of the superlative depends on the correct charac-
terisation of the comparison set. The comparison
set denotes the set of entities that are compared to
each other with respect to a certain dimension (see
Section 2.3). In “the oldest character in the book”,
the members of the comparison set are characters
in the book, and the dimension of comparison is
age.
The computation of the comparison set is com-
plicated by complex syntactic structure involving
the superlative. The presence of possessives for
example, as in “AS Roma’s quickest player”, ex-
tends the comparison set to players of AS Roma.
Prepositional phrases (PPs), gerunds, and relative
clauses introduce additional complexity. PPs that
are attached to the head noun of the superlative are
part of the comparison set — those that modify
the entire NP are not. Similarly, restrictive rel-
ative clause are included in the comparison set,
non-restrictive aren’t.
We illustrate this complexity in the following
examples, taken from the Wall Street Journal,
where the comparison set is underlined:
The oldest designer got to work on the dash-
board, she recalls. (WSJ02)
A spokesman for Borden Inc., the nation’s
largestmilk producer, concedesGoyamaybeon
to something. (WSJ02)
Right now, the largest loan the FHA can
insure in high-cost housing markets is $101,250.
(WSJ03)
With newspapers being the largest single
component of solid waste in our landfills ...
(WSJ02)
... questions being raised by what gen-
erally are considered the most complex
arms-control talks ever attempted. (WSJ02)
Besides syntactic ambiguities, the determina-
tion of the comparison set can be further compli-
cated by semantic ambiguities. Some occurrences
of superlatives licence a so-called “comparitive”
reading, as in the following example discussed in
the formal semantics literature (Heim, 1999; Sz-
abolcsi, 1986):
John climbed the highest mountain.
Here, in the standard interpretion, the moun-
tain referred to is the highest available in the con-
text. However, another interpretation might arise
inasituationwhereseveralpeopleclimbedseveral
mountains, and John climbed a mountain higher
than anyone else did, but not necessarily the high-
est of all mountains in the context. Our corpus
study reveals that these readings are rare, although
they tend to be more frequent in questions than in
newspaper texts.
2.3 Dimension
Part of the task of semantically interpretating su-
perlative adjectives is the selection of the dimen-
sion on which entities are compared. In “the
highestmountain”wecomparemountainswithre-
spect to the dimension height, in “the best paper”
we compare papers with respect to the dimension
quality, and so on. A well-known problem is that
some adjectives can be ambiguous or vague in
choosing their dimension. Detecting the appropri-
ate dimension is not covered in this paper, but is
orthogonal to the analysis we provide.
2.4 Superlatives and Entailment
Superlatives exhibit a non-trivial semantics. Some
examples of textual entailment make this very ev-
ident. Consider the contrasts in the following en-
tailment tests with indefinite and universally quan-
tified noun phrases:
I bought a blue car |=I bought a car
I bought a car negationslash|=I bought a blue car
I bought every blue car negationslash|=I bought every car
I bought every car |=I bought every blue car
10
Observe that the directions of entailments are
mirrorred. Now consider a similar test with su-
perlatives, where the entailments fail in both di-
rections:
I bought the cheapest blue car negationslash|=I bought the cheapest car
I bought the cheapest car negationslash|=I bought the cheapest blue car.
These entailment tests underline the point that
the meaning of superlatives is rather complicated,
and that a shallow semantic representation, say
λx.[cheapest(x) ∧ car(x)] for “cheapest car”, sim-
ply won’t suffice. A semantic represention captur-
ing the meaning of a superlative requires a more
sophisticated analysis. In particular, it is impor-
tant to explicitly represent the comparison set of
a superlative. In “the cheapest car”, the compar-
ison set is formed by the set of cars, whereas in
“the cheapest blue car”, the comparison set is the
set of blue cars. Semantically, we can represent
“cheapest blue car” as follows, where the compar-
ison set is made explicit in the antecedent of the
conditional:
λx.[car(x) ∧ blue(x) ∧
∀y((car(y) ∧ blue(y) ∧ xnegationslash=y) → cheaper(x,y))]
ParaphrasedinEnglish, thisstipulatesthatsome
blue car is cheaper than any other blue car. A
meaning representation like this will logically pre-
dict the correct entailment relations for superla-
tives.
3 Annotated Corpus of Superlatives
In order to develop and evaluate our system we
manually annotated a collection of newspaper arti-
cle and questions withoccurrences of superlatives.
The design of the corpus and its characteristics are
described in this section.
3.1 Classification and Annotation Scheme
Instances of superlatives are identified in text and
classified into one of four possible classes: at-
tributive, predicative, adverbial, or idiomatic:
its rates will be among the highest (predicative)
the strongest dividend growth (attributive)
free to do the task most quickly (adverbial)
who won the TONY for best featured actor? (idiom)
For all cases, we annotate the span of the su-
perlative adjective in terms of the position of the
tokens in the sentence. For instance, in “its1 rates2
will3 be4 among5 the6 highest7”, the superlative
span would be 7–7.
Additional information is encoded for the at-
tributive case: type of determiner (possessive, def-
inite, bare, demonstrative, quantifier), number (sg,
pl, mass), cardinality (yes, no), modification (ad-
jective, ordinal, intensifier, none). Table 1 shows
some examples from the WSJ with annotation val-
ues.
Not included in this study are adjectives such
as “next”, “past”, “last”, nor the ordinal “first”,
although they somewhat resemble superlatives in
their semantics. Also excluded are adjectives that
lexicalise a superlative meaning but are not su-
perlatives morphologically, like “main”, “princi-
pal”, and the like. For etymological reasons we
however include “foremost” and “uttermost.”
3.2 Data and Annotation
Our corpus consists of a collection of newswire
articles from the Wall Street Journal (Sections 00,
01, 02, 03, 04, 10, and 15) and the Glasgow Her-
ald (GH950110 from the CLEF evaluation forum),
and a large set of questions from the TREC QA
evaluation exercise (years 2002 and 2003) and
natural language queries submitted to the Excite
search engine (Jansen and Spink, 2000). The data
was automatically tokenised, but all typos and
extra-grammaticalities were preserved. The cor-
pus was split into a development set used for tun-
ing the system and a test set for evaluation. The
size of each sub-corpus is shown in Table 2.
Table 2: Size of each data source (in number of
sentences/questions)
source dev test total
WSJ 8,058 6,468 14,526
GH — 2,553 2,553
TREC 1,025 — 1,025
Excite — 67,140 67,140
total 9,083 76,161 85,244
The annotation was performed by two trained
linguists. One section of the WSJ was anno-
tated by both annotators independently to calcu-
late inter-annotator agreement. All other docu-
ments were first annotated by one judge and then
checked by the second, in order to ensure max-
imum correctness. All disagreements were dis-
cussed and resolved for the creation of a gold stan-
dard corpus.
Inter-annotator agreement was assessed mainly
using f-score and percentage agreement as well as
11
Table 1: Annotation examples of superlative adjectives
example sup span det num car mod comp set
The third-largest thrift institution in Puerto Rico
also [...]
2–2 def sg no ord 3–7
The Agriculture Department reported that feedlots
in the 13 biggest ranch states held [...]
9–10 def pl yes no 11–12
The failed takeover would have given UAL em-
ployees 75 % voting control of the nation ’s
second-largest airline [...]
17–17 pos sg no ord 14–18
the kappa statistics (K), where applicable (Car-
letta, 1996). In using f-score, we arbitrarily take
one of the annotators’ decisions (A) as gold stan-
dard and compare them with the other annotator’s
decisions (B). Note that here f-score is symmetric,
sinceprecision(A,B)=recall(B,A),and(balanced)
f-score is the harmonic mean of precision and re-
call (Tjong Kim Sang, 2002; Hachey et al., 2005,
see also Section 5).
We evaluated three levels of agreement on a
sample of 1967 sentences (one full WSJ section).
The first level concerns superlative detection: to
what extent different human judges can agree on
what constitutes a superlative. For this task, f-
score was measured at 0.963 with a total of 79 su-
perlative phrases agreed upon.
Thesecondlevelofagreementisrelativetotype
identification (attributive, predicative, adverbial,
idiomatic), and is only calculated on the subset
ofcasesbothannotatorsrecognisedassuperlatives
(79 instances, as mentioned). The overall f-score
for the classification task is 0.974, with 77 cases
where both annotators assigned the same type to
a superlative phrase. We also assessed agreement
for each class, and the attributive type resulted the
most reliable with an f-score of 1 (total agree-
ment on 64 cases), whereas there was some dis-
agreement in classifying predicative and adverbial
cases (0.9 and 0.8 f-score, respectively). Idiomatic
uses where not detected in this portion of the data.
To assess this classification task we also used the
kappa statistics which yielded KCo=0.922 (fol-
lowing (Eugenio and Glass, 2004) we report K
as KCo, indicating that we calculate K `a la Co-
hen (Cohen, 1960). KCo over 0.9 is considered to
signal very good agreement (Krippendorff, 1980).
The third and last level of agreement deals with
the span of the comparison set and only concerns
attributive cases (64 out of 79). Percentage agree-
ment was used since this is not a classification task
and was measured at 95.31%.
The agreement results show that the task ap-
pears quite easy to perform for linguists. Despite
thelimitednumberofinstancescompared, thishas
also emerged from the annotators’ perception of
the difficulty of the task for humans.
3.3 Distribution
The gold standard corpus comprises a total of
3,045 superlatives, which roughly amounts to one
superlative in every 25 sentences/questions. The
overwhelming majority of superlatives are attribu-
tive (89.1%), and only a few are used in a pred-
icative way (6.9%), adverbially (3.0%), or in id-
iomatic expressions (0.9%).1 Table 3 shows the
detailed distribution according to data source and
experimental sets. Although the corpus also in-
cludes annotation about determination, modifica-
tion, grammatical number, and cardinality of at-
tributive superlatives (see Section 3.1), this infor-
mation is not used by the system described in this
paper.
Table 3: Distribution of superlative types in the
development and evaluation sets.
dev test
type WSJ TREC WSJ GH Excite total
att 240 43 218 68 2,145 2,714
pre 40 3 26 17 125 211
adv 17 2 22 9 41 91
idi 6 5 1 2 15 29
total 303 53 267 96 2,326 3,045
4 Automatic Analysis of Superlatives
The system that we use to analyse superlatives is
based on two linguistic formalisms: Combinatory
Categorial Grammar (CCG), for a theory of syn-
tax; and Discourse Representation Theory (DRT)
1Percentages are rounded to the first decimal and do not
necessarily sum up to 100%.
12
foratheoryofsemantics. Inthissectionwewillil-
lustrate how we extend these theories to deal with
superlatives and how we implemented this into a
working system.
4.1 Combinatory Categorial Grammar
(CCG)
CCG is a lexicalised theory of grammar (Steed-
man, 2001). We used Clark & Curran’s wide-
coverage statistical parser (Clark and Curran,
2004) trained on CCG-bank, which in turn is de-
rived from the Penn-Treebank (Hockenmaier and
Steedman, 2002). In CCG-bank, the majority of
superlative adjective of cases are analysed as fol-
lows:
the tallest woman
NP/N N/N N
N
NP
most devastating droughts
(N/N)/(N/N) N/N N
N/N
N
third largest bank
N/N (N/N)\(N/N) N
N/N
N
Clark & Curran’s parser outputs besides a CCG
derivation of the input sentence also a part-of-
speech (POS) tag and a lemmatised form for each
input token. To recognise attributive superla-
tives in the output of the parser, we look both
at the POS tag and the CCG-category assigned
to a word. Words with POS-tag JJS and CCG-
category N/N, (N/N)/(N/N), or (N/N)\(N/N) are
considered attributive superlatives adjectives, and
so are the words “most” and “least” with CCG cat-
egory (N/N)/(N/N).
However, most hyphenated superlatives are not
recognised by the parser as JJ instead of JJS, and
are corrected in a post-processing step.2 Examples
that fall in this category are “most-recent wave”
and “third-highest”.
4.2 Discourse Representation Theory (DRT)
The output of the parser, a CCG derivation of the
input sentence, is used to construct a Discourse
Representation Structure (DRS, the semantic rep-
resentation proposed by DRT (Kamp and Reyle,
2This is due to the fact that the Penn-Treebank annotation
guidelines prescribe that all hyphenated adjectives ought to
be tagged as JJ.
1993)). We follow (Bos et al., 2004; Bos, 2005) in
automatically building semantic representation on
the basis of CCG derivations in a compositional
fashion. We briefly summarise the approach here.
The semantic representation for a word is deter-
mined by its CCG category, POS-tag, and lemma.
Consider the following lexical entries:
the: λp.λq.( x ;p(x);q(x))
tallest: λp.λx.( ( y
ynegationslash=x ;p(y))⇒ taller(x,y)
;p(x))
man: λx. man(x)
These lexical entries are combined in a compo-
sitional fashion following the CCG derivation, us-
ing the λ-calculus as a glue language:
tallest man: λx.
man(x)
y
ynegationslash=x
man(y)
⇒ taller(x,y)
the tallest man: λq.(
x
man(x)
y
ynegationslash=x
man(y)
⇒ taller(x,y)
;q(x))
In this way DRSs can be produced in a robust
way, achieving high-coverage. An example output
representation of the complete system is shown in
Figure 1.
As is often the case, the output of the parser is
notalwayswhatoneneedstoconstructameaning-
ful semantic representation. There are two cases
where we alter the CCG derivation output by the
parserinordertoimprovetheresultingDRSs. The
first case concerns modifiers following a superla-
tive construction, that are attached to the NP node
rather than N. A case in point is
... the largest toxicology lab in New
England ...
where the PP in New England has the CCG cate-
gory NP\NP rather than N\N. This would result
in a comparison set containing of toxicology labs,
rather than a set toxicology labs in New England.
The second case are possessive NPs preceding
a superlative construction. An example here is
... Jaguar’s largest shareholder ...
13
_______________________________________________
| x0 x1 x2 x3 x4 x5 x6 |
|-----------------------------------------------|
| acquisition(x1) |
| nn(x0,x1) |
| named(x0,georgia-pacific,nam) |
| named(x2,nekoosa,loc) |
| of(x1,x2) |
| company(x5) |
| nn(x3,x5) |
| forest-product(x3) |
| nn(x4,x5) |
| named(x4,us,loc) |
| ____________________ ________________ |
| | x7 x8 x9 | | | |
| |--------------------| |----------------| |
| | company(x9) | ==> | largest(x5,x9) | |
| | nn(x7,x9) | |________________| |
| | forest-product(x7) | |
| | nn(x8,x9) | |
| | named(x8,us,loc) | |
| | _________ | |
| | | | | |
| | __ |---------| | |
| | | | x5 = x9 | | |
| | |_________| | |
| |____________________| |
| create(x6) |
| agent(x6,x1) |
| patient(x6,x5) |
| event(x6) |
|_______________________________________________|
Figure 1: Example DRS output
where a correct interpretation of the superlative
requires a comparison set of shareholders from
Jaguar, rather than just any shareholder. However,
the parser outputs a derivation where “largest” is
combined with “shareholder”, and then with the
possessive construction, yielding the wrong se-
mantic interpretation. To deal with this, we anal-
ysepossessivesthatinteractwiththesuperlativeas
follows:
Rome ’s oldest church
NP ((NP/N)/(N/N)\NP N/N N
(NP/N)/(N/N)
NP/N
NP
This analysis yields the correct comparison set for
superlative that follow a possessive noun phrase,
given the following lexical semantics for the geni-
tive:
λn.λS.λp.λq.( u ;S(λx.(p(x);n(λy. of(y,x) )(u);q(u))))
For both cases, we apply some simple post-
processing rules to the output of the parser to ob-
tain the required derivations. The effect of these
rules is reported in the next section, where we as-
sess the accuracy of the semantic representations
produced for superlatives by comparing the auto-
matic analysis with the gold standard.
5 Evaluation
The automatic analysis of superlatives we present
in the following experiments consists of two se-
quential tasks: superlative detection, and compar-
ison set determination.
The first task is concerned with finding a su-
perlative in text and its exact span (“largest”,
“most beautiful”, “10 biggest”). For a found string
to to be judged as correct, its whole span must cor-
respond to the gold standard. The task is evaluated
using precision (P), recall (R), and f-score (F), cal-
culated as follows:
P = correct assignments of ctotal assignments of c
R = correct assignments of ctotal corpus instances of c
F = 2PcRcPc+Rc
The second task is conditional on the first: once
a superlative is found, its comparison set must
also be identified (“rarest flower in New Zealand”,
“New York’s tallest building”, see Section 2.2). A
selected comparison set is evaluated as correct if
it corresponds exactly to the gold standard anno-
tation: partial matches are counted as wrong. As-
signments are evaluated using accuracy (number
of correct decisions made) only on the subset of
previously correctly identified superlatives.
For both tasks we developed simple baseline
systems based on part-of-speech tags, and a more
sophisticated linguistic analysis based on CCG
and DRT (i.e. the system described in Section 4).
In the remainder of the paper we refer to the latter
system as DLA (Deep Linguistic Analysis).
5.1 Superlative Detection
Baseline system For superlative detection we
generated a baseline that solely relies on part-of-
speech information. The data was tagged using
TnT (Brants, 2000), using a model trained on the
Wall Street Journal. In the WSJ tagset, superla-
tivescanbemarkedintwodifferentways, depend-
ing on whether the adjective is inflected or modi-
fied by most/least. So, “largest”, for instance, is
tagged as JJS, whereas “most beautiful” is a se-
quence of RBS (most) and JJ (beautiful). We also
checked that they are followed by a common or
proper noun (NN.*), allowing one word to oc-
cur in between. To cover more complex cases,
we also considered pre-modification by adjectives
(JJ), and cardinals (CD). In summary, we matched
on sequences found by the following pattern:
[(CD || JJ)* (JJS || (RBS JJ)) * NN.*]
This rather simple baseline is capable of de-
tecting superlatives such as “100 biggest banks”,
“fourth largest investors”, and “most important
14
element”, but will fail on expressions such as
“fastest growing segments” or “Scotland ’s lowest
permitted 1995-96 increase”.
DLA system For evaluation, we extrapolated
superlatives from the DRSs output by the system.
Each superlative introduces an implicational DRS
condition, but not all implicational DRS condi-
tions are introduced by superlatives. Hence, for
the purposes of this experiment superlative DRS
conditions were assigned a special mark. While
traversing the DRS, we use this mark to retrieve
superlative instances. In order to retrieve the orig-
inalstringthatgaverisetothesuperlativeinterpre-
tation, we exploit the meta information encoded in
each DRS about the relation between input tokens
and semantic information. The obtained string po-
sition can in turn be evaluated against the gold
standard.
Table 4 lists the results achieved by the base-
line system and the DLA system on the detection
task. The DLA system outperforms the baseline
system on precision in all sub-corpora. However,
the baseline achieves a higher recall on the Excite
queries. This is not entirely surprising given that
the coverage of the parser is between 90–95% on
unseen data. Moreover, Excite queries are often
ungrammatical, thus further affecting the perfor-
mance of parsing.
Table 4: Detection of Attributive Superlatives, re-
porting P (precision), R (Recall) and F-score, for
WSJ sections, extracts of the Glasgow Herald,
TREC questions, and Excite queries. D indicates
development data, T test data.
Baseline DLA
Corpus P R F P R F
WSJ (D) 0.93 0.86 0.89 0.96 0.90 0.93
WSJ (T) 0.91 0.83 0.87 0.95 0.87 0.91
GH (T) 0.80 0.76 0.78 0.87 0.81 0.84
TREC (D) 0.76 0.91 0.83 0.85 0.91 0.88
Excite (T) 0.92 0.92 0.92 0.97 0.84 0.90
5.2 Comparison Set Determination
Baseline For comparison set determination we
developed two baseline systems. Both use the
same match on sequences of part-of-speech tags
described above. For Baseline 1, the beginning
of the comparison set is the first word following
the superlative. The end of the comparison set is
the first word tagged as NN.* in that sequence (the
same word could be the beginning and end of the
comparison set, as it often happens).
The second baseline takes the first word after
the superlative as the beginning of the comparison
set, and the end of the sentence (or question) as the
end (excluding the final punctuation mark). We
expect this strategy to perform well on questions,
as the following examples show.
Where is the oldest synagogue in the United States?
What was the largest crowd to ever come see Michael Jordan?
This approach is obviously likely to generate com-
parison sets much wider than required.
More complex examples that neither baseline
can tackle involve possessives, since on the sur-
face the comparison set lies at both ends of the
superlative adjective:
The nation’s largest pension fund
the world’s most corrupt organizations
DLA1 Wefirstextrapolatesuperlativesfromthe
DRS output by the system (see procedure above).
Then,weexploitthesemanticrepresentationtose-
lect the comparison set: it is determined by the in-
formation encoded in the antecedent of the DRS-
conditional introduced by the superlative. Again,
we exploit meta information to reconstruct the
original span, and we match it against the gold
standard for evaluation.
DLA2 DLA2buildsonDLA1, towhichitadds
post-processing rules to the CCG derivation, i.e.
before the DRSs are constructed. This set of rules
deal with NP post-modification of the superlative
(see Section 4).
DLA 3 In this version we include a set of post-
processing rules that apply to the CCG derivation
to deal with possessives preceding the superlative
(see Section 4).
DLA 4 This is a combination of DLA 2 and
DLA 3. This system is clearly expected to per-
form best.
Results for both baseline systems and all versions
of DLA are shown in Table 5
On text documents, DLA 2/3/4 outperform the
baseline systems. DLA 4 achieves the best per-
formance, with an accuracy of 69–83%. On ques-
tions, however, DLA 4 competes with the base-
line: whereas it is better on TREC questions, it
performs worse on Excite questions. One of the
obvious reasons for this is that the parser’s model
15
Table 5: Determination of Comparison Set of
Attributive Superlatives (Accuracy) for WSJ sec-
tions, extracts of the Glasgow Herald, TREC and
Excite questions. D indicates development data, T
test data.
Corpus Base 1 Base 2 DLA 1 DLA 2 DLA3 DLA 4
WSJ (D) 0.29 0.17 0.29 0.52 0.53 0.78
WSJ (T) 0.31 0.22 0.32 0.59 0.53 0.83
GH (T) 0.23 0.31 0.22 0.51 0.38 0.69
TREC (D) 0.10 0.69 0.13 0.69 0.23 0.82
Excite (T) 0.23 0.90 0.32 0.82 0.33 0.84
for questions was trained on TREC data. Addi-
tionally, as noted earlier, Excite questions are of-
ten ungrammatical and make parsing less likely to
succeed. However, the baseline system, by defini-
tion, does not output semantic representations, so
that its outcome is of little use for further reason-
ing, as required by question answering or general
information extraction systems.
6 Conclusions
We have presented the first empirically grounded
study of superlatives, and shown the feasibility of
their semantic interpretation in an automatic fash-
ion. Using Combinatory Categorial Grammar and
Discourse Representation Theory we have imple-
mentedasystemthatisabletorecogniseasuperla-
tive expression and its comparison set with high
accuracy.
For developing and testing our system, we have
created a collection of over 3,000 instances of su-
perlatives, both in newswire text and in natural
language questions. This very first corpus of su-
perlativesallowsusto getacomprehensivepicture
of the behaviour and distribution of superlatives in
real occurring data. Thanks to such broad view
of the phenomenon, we were able discover issues
previously unnoted in the formal semantics liter-
ature, such as the interaction of prenominal pos-
sessives and superlatives, which cause problems
at the syntax-semantics interface in the determina-
tion of the comparison set. Similarly problematic
are hyphenated superlatives, which are tagged as
normal adjectives in the Penn Treebank.
Moreover, this work provides a concrete way
of evaluating the output of a stochastic wide-
coverage parser trained on the CCGBank (Hock-
enmaier and Steedman, 2002). With respect to
superlatives, our experiments show that the qual-
ity of the raw output is not entirely satisfactory.
However, we have also shown that some sim-
ple post-processing rules can increase the perfor-
mance considerably. This might indicate that the
way superlatives are annotated in the CCGbank,
although consistent, is not fully adequate for the
purpose of generating meaningful semantic repre-
sentations, but probably easy to amend.
7 Future Work
Given the syntactic and semantic complexity of
superlative expressions, there is still wide scope
for improving the coverage and accuracy of our
system. One obvious improvement is to amend
CCGbank in order to avoid the need for postpro-
cessing rules, thereby also allowing the creation
of more accurate language models. Another as-
pect which we have neglected in this study but
want to consider in future work is the interac-
tion between superlatives and focus (Heim, 1999;
Gawron, 1995). Also, only one of the possible
typesofsuperlativewasconsidered,namelytheat-
tributive case. In future work we will consider the
interpretationofpredicativeandadverbialsuperla-
tives, as well as comparative expressions. Finally,
we would like to investigate the extent to which
existing NLP systems (such as open-domain QA
systems) can benefit from a detailed analysis of
superlatives.
Acknowledgements
We would like to thank Steve Pulman (for in-
formation on the analysis of superlatives in the
Core Language Engine), Mark Steedman (for use-
ful suggestions on an earlier draft of this paper),
and Jean Carletta (for helpful comments on anno-
tation agreement issues), as well as three anony-
mous reviewers for their comments. We are ex-
tremely grateful to Stephen Clark and James Cur-
ran for making their parser available to us. Johan
Bos is supported by a “Rientro dei Cervelli” grant
(Italian Ministry for Research); Malvina Nissim is
supported by the EU FP6 NeOn project.

References
Hiyan Alshawi, editor. 1992. The Core Language En-
gine. The MIT Press, Cambridge, Massachusetts.
J. Bos, S. Clark, M. Steedman, J.R. Curran, and Hock-
enmaier J. 2004. Wide-Coverage Semantic Rep-
resentations from a CCG Parser. In Proceedings of
the20thInternationalConferenceonComputational
Linguistics (COLING ’04), Geneva, Switzerland.
Johan Bos. 2005. Towards wide-coverage semantic in-
terpretation. In Proceedings of Sixth International
Workshop on Computational Semantics IWCS-6,
pages 42–53.
Thorsten Brants. 2000. TnT - A Statistical Part-of-
Speech Tagger. In Proceedings of the Sixth Applied
Natural Language Processing Conference ANLP-
2000, Seattle, WA.
Jean Carletta. 1996. Assessing agreement on classi-
fication tasks: the kappa statistic. Computational
Linguistics, 22(2):249–254.
S. Clark and J.R. Curran. 2004. Parsing the WSJ using
CCG and Log-Linear Models. In Proceedings of the
42nd Annual Meeting of the Association for Compu-
tational Linguistics (ACL ’04), Barcelona, Spain.
Jacob Cohen. 1960. A coefficient of agreement
for nominal scales. Educational and Psychological
Measurements, 20:37–46.
Barbara Di Eugenio and Michael Glass. 2004. The
kappa statistic: a second look. Computational Lin-
guistics, 30(1).
Donka F. Farkas and Katalin `E. Kiss. 2000. On the
comparative and absolute readings of superlatives.
Natural Language and Linguistic Theory, 18:417–
455.
Jean Mark Gawron. 1995. Comparatives, superlatives,
andresolution. Linguistics and Philosophy, 18:333–
380.
Ben Hachey, Beatrice Alex, and Markus Becker. 2005.
Investigating the effects of selective sampling on the
annotation task. In Proceedings of the 9th Confer-
ence on Computational Natural Language Learning,
Ann Arbor, Michigan, USA.
Irene Heim. 1999. Notes on superlatives. MIT.
J. Hockenmaier and M. Steedman. 2002. Genera-
tiveModelsforStatisticalParsingwithCombinatory
CategorialGrammar. In Proceedings of 40th Annual
Meeting of the Association for Computational Lin-
guistics, Philadelphia, PA.
Bernard J. Jansen and Amanda Spink. 2000. The ex-
cite research project: A study of searching by web
users. Bulletin of the American Society for Informa-
tion Science and Technology, 27(1):5–17.
H. Kamp and U. Reyle. 1993. From Discourse to
Logic; An Introduction to Modeltheoretic Seman-
tics of Natural Language, Formal Logic and DRT.
Kluwer, Dordrecht.
Klaus Krippendorff. 1980. Content Analysis: An In-
troduction to Its Methodology. Sage Publications.
M. Steedman. 2001. The Syntactic Process. The MIT
Press.
Anna Szabolcsi. 1986. Comparative superlatives. In
N. Fukui et al., editor, Papers in Theoretical Lin-
guistics, MITWPL, volume 8. MIT.
Erik F. Tjong Kim Sang. 2002. Introduction to
the conll-2002 shared task: Language-independent
named entity recognition. In Proceedings of
CoNLL-2002, pages 155–158. Taipei, Taiwan.
