Introduction to
Frontiers in Corpus Annotation
Adam Meyers
New York University
meyers@cs.nyu.edu
A new annotated corpus can have a pivotal role in the
future of computational linguistics. Corpus annotation
can define new NLP tasks and set new standards. This
may put many of the papers presented at this workshop
on the cutting edge of our field.
A standard, however, is a double-edged sword. A standard
corpus urges users to accept the theory of how to
represent things that underlies that corpus. For example,
a Penn Treebank theory of grammar is implicit in Penn-
Treebank-based parsers. This can be a problem if one
rejects some aspects of that theory. Also one may object
to a particular system of annotation because some theo-
ries generalize to cover new ground (e.g., new languages)
better than others. Nevertheless, advantages of accepting
a corpus as standard include the following:
• It is straightforward to compare the performance of
the set of systems that produce the same form of output,
e.g., Penn Treebank-based parsers can be compared
in terms of how well they reproduce the Penn
Treebank.
• Alternative systems based on a standard are largely
interchangeable. Thus a system that uses one Penn
Treebank-based parser as a component can easily
be adapted to use another, better-performing Penn
Treebank-based parser.
• Standards can be built on. For example, if one accepts
the framework of the Penn Treebank, it is easy
to move on to representations of deeper structure
as suggested in three papers in this volume (Miltsakaki
et al., 2004; Babko-Malaya et al., 2004; Meyers
et al., 2004).
It is my view that these advantages outweigh the dis-
advantages. I propose that the papers in this volume be
viewed with the following question in mind: How can the
work covered by this collection of papers be integrated?
Put differently, to what extent are these resources
mergeable?
The first six papers describe linguistic annotation in
four languages: Spanish (Alcántara and Moreno, 2004),
English (Miltsakaki et al., 2004; Babko-Malaya et al.,
2004; Meyers et al., 2004), Czech (Sgall et al., 2004)
and German (Baumann et al., 2004). The sixth, seventh
and eighth papers (Baumann et al., 2004; Čmejrek et al.,
2004; Helmreich et al., 2004) explore questions of
multilingual annotation of syntax and semantics, beginning
to answer the question of how annotation systems can be
made compatible across languages. Indeed, Helmreich
et al. (2004) explore the question of integration across
languages, as well as across levels of annotation. Baumann
et al. (2004) also describe how different linguistic
levels (pragmatic and prosodic) can be related in
annotation across two languages (English and German).
The ninth and tenth papers (Langone et al., 2004;
Žabokrtský and Lopatková, 2004) are, respectively, about
a corpus related to a lexicon and the reverse: a lexicon
related to a corpus. This opens up the wider theme of the
integration of a number of different linguistic resources.
As the natural language community produces more and
more linguistic resources, especially corpora, it seems
important to step back and look at the larger picture. If
these resources can be fit together as part of a larger puzzle,
this could produce a sketch of the future of our field.

References
M. Alcántara and A. Moreno. 2004. Syntax to Semantics
Transformation: Application to Treebanking. In
HLT-NAACL 2004 Workshop: Frontiers in Corpus An-
notation, Boston, Massachusetts.
O. Babko-Malaya, M. Palmer, N. Xue, A. Joshi, and
S. Kulick. 2004. Proposition Bank II: Delving Deeper.
In HLT-NAACL 2004 Workshop: Frontiers in Corpus
Annotation, Boston, Massachusetts.
S. Baumann, C. Brinkmann, S. Hansen-Schirra, G. Kruijff,
I. Kruijff-Korbayová, S. Neumann, and E. Teich.
2004. Multi-dimensional annotation of linguistic cor-
pora for investigating information structure. In HLT-
NAACL 2004 Workshop: Frontiers in Corpus Annota-
tion, Boston, Massachusetts.
M. Čmejrek, J. Cuřín, and J. Havelka. 2004. Prague
Czech-English Dependency Treebank: Any Hopes for
a Common Annotation Scheme? In HLT-NAACL 2004
Workshop: Frontiers in Corpus Annotation, Boston,
Massachusetts.
S. Helmreich, D. Farwell, B. Dorr, N. Habash, L. Levin,
T. Mitamura, F. Reeder, K. Miller, E. Hovy, O. Rambow,
and A. Siddharthan. 2004. Interlingual annotation
of multilingual text corpora. In HLT-NAACL 2004
Workshop: Frontiers in Corpus Annotation, Boston,
Massachusetts.
H. Langone, B. R. Haskell, and G. A. Miller. 2004.
Annotating WordNet. In HLT-NAACL 2004 Work-
shop: Frontiers in Corpus Annotation, Boston, Mas-
sachusetts.
A. Meyers, R. Reeves, C. Macleod, R. Szekely, V. Zielin-
ska, B. Young, and R. Grishman. 2004. The NomBank
Project: An Interim Report. In HLT-NAACL 2004
Workshop: Frontiers in Corpus Annotation, Boston,
Massachusetts.
E. Miltsakaki, A. Joshi, R. Prasad, and B. Webber. 2004.
Annotating Discourse Connectives and Their Argu-
ments. In HLT-NAACL 2004 Workshop: Frontiers in
Corpus Annotation, Boston, Massachusetts.
P. Sgall, J. Panevová, and E. Hajičová. 2004. Deep Syntactic
Annotation: Tectogrammatical Representation
and Beyond. In HLT-NAACL 2004 Workshop: Fron-
tiers in Corpus Annotation, Boston, Massachusetts.
Z. Žabokrtský and M. Lopatková. 2004. Valency Frames
of Czech Verbs in VALLEX 1.0. In HLT-NAACL 2004
Workshop: Frontiers in Corpus Annotation, Boston,
Massachusetts.
