Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 80–87,
Sydney, July 2006. c©2006 Association for Computational Linguistics
An annotation scheme for citation function
Simone Teufel Advaith Siddharthan Dan Tidhar
Natural Language and Information Processing Group
Computer Laboratory
Cambridge University, CB3 0FD, UK
{Simone.Teufel,Advaith.Siddharthan,Dan.Tidhar}@cl.cam.ac.uk
Abstract
We study the interplay of the discourse struc-
ture of a scienti c argument with formal ci-
tations. One subproblem of this is to clas-
sify academic citations in scienti c articles ac-
cording to their rhetorical function, e.g., as a
rival approach, as a part of the solution, or
as a  awed approach that justi es the cur-
rent research. Here, we introduce our anno-
tation scheme with 12 categories, and present
an agreement study.
1 Scienti c writing, discourse structure
and citations
In recent years, there has been increasing interest in
applying natural language processing technologies to
scienti c literature. The overwhelmingly large num-
ber of papers published in  elds like biology, genetics
and chemistry each year means that researchers need
tools for information access (extraction, retrieval, sum-
marization, question answering etc). There is also in-
creased interest in automatic citation indexing, e.g.,
the highly successful search tools Google Scholar and
CiteSeer (Giles et al., 1998).1 This general interest in
improving access to scienti c articles  ts well with re-
search on discourse structure, as knowledge about the
overall structure and goal of papers can guide better in-
formation access.
Shum (1998) argues that experienced researchers are
often interested in relations between articles. They
need to know if a certain article criticises another and
what the criticism is, or if the current work is based
on that prior work. This type of information is hard
to come by with current search technology. Neither
the author’s abstract, nor raw citation counts help users
in assessing the relation between articles. And even
though CiteSeer shows a text snippet around the phys-
ical location for searchers to peruse, there is no guar-
antee that the text snippet provides enough information
for the searcher to infer the relation. In fact, studies
from our annotated corpus (Teufel, 1999), show that
69% of the 600 sentences stating contrast with other
work and 21% of the 246 sentences stating research
continuation with other work do not contain the cor-
responding citation; the citation is found in preceding
1CiteSeer automatically citation-indexes all scienti c ar-
ticles reached by a web-crawler, making them available to
searchers via authors or keywords in the title.
Nitta and Niwa 94
Resnik 95
Brown et al. 90a
Rose et al. 90
Church and Gale 91
Dagan et al. 94
Li and Abe 96
Hindle 93
Hindle 90
Dagan et al 93
Pereira et al. 93
Following Pereira et al, we measure
word similarity by the relative entropy
or Kulbach-Leibler (KL) distance, bet-
ween the corresponding conditional
distributions.
His notion of similarity
seems to agree with our
intuitions in many cases,
but it is not clear how it
can  be used directly to
construct word classes
and corresponding 
models of association.
Figure 1: A rhetorical citation map
sentences (i.e., the sentence expressing the contrast or
continuation would be outside the CiteSeer snippet).
We present here an approach which uses the classi ca-
tion of citations to help provide relational information
across papers.
Citations play a central role in the process of writing
a paper. Swales (1990) argues that scienti c writing
follows a general rhetorical argumentation structure:
researchers must justify that their paper makes a con-
tribution to the knowledge in their discipline. Several
argumentation steps are required to make this justi ca-
tion work, e.g., the statement of their speci c goal in
the paper (Myers, 1992). Importantly, the authors also
must relate their current work to previous research, and
acknowledge previous knowledge claims; this is done
with a formal citation, and with language connecting
the citation to the argument, e.g., statements of usage of
other people’s approaches (often near textual segments
in the paper where these approaches are described), and
statements of contrast with them (particularly in the
discussion or related work sections). We argue that the
automatic recognition of citation function is interest-
ing for two reasons: a) it serves to build better citation
indexers and b) in the long run, it will help constrain
interpretations of the overall argumentative structure of
a scienti c paper.
Being able to interpret the rhetorical status of a ci-
tation at a glance would add considerable value to ci-
tation indexes, as shown in Fig. 1. Here differences
and similarities are shown between the example paper
(Pereira et al., 1993) and the papers it cites, as well as
80
the papers that cite it. Contrastive links are shown in
grey  links to rival papers and papers the current pa-
per contrasts itself to. Continuative links are shown in
black  links to papers that are taken as starting point of
the current research, or as part of the methodology of
the current paper. The most important textual sentence
about each citation could be extracted and displayed.
For instance, we see which aspect of Hindle (1990) the
Pereira et al. paper criticises, and in which way Pereira
et al.’s work was used by Dagan et al. (1994).
We present an annotation scheme for citations, based
on empirical work in content citation analysis, which
 ts into this general framework of scienti c argument
structure. It consists of 12 categories, which allow us
to mark the relationships of the current paper with the
cited work. Each citation is labelled with exactly one
category. The following top-level four-way distinction
applies:
 Weakness: Authors point out a weakness in cited
work
 Contrast: Authors make contrast/comparison with
cited work (4 categories)
 Positive: Authors agree with/make use of/show
compatibility or similarity with cited work (6 cat-
egories), and
 Neutral: Function of citation is either neutral, or
weakly signalled, or different from the three func-
tions stated above.
We  rst turn to the point of how to classify citation
function in a robust way. Later in this paper, we will
report results for a human annotation experiment with
three annotators.
2 Annotation schemes for citations
In the  eld of library sciences (more speci cally, the
 eld of Content Citation Analysis), the use of informa-
tion from citations above and beyond simple citation
counting has received considerable attention. Biblio-
metric measures assesses the quality of a researcher’s
output, in a purely quantitative manner, by counting
how many papers cite a given paper (White, 2004;
Luukkonen, 1992) or by more sophisticated measures
like the h-index (Hirsch, 2005). But not all citations
are alike. Researchers in content citation analysis have
long stated that the classi cation of motivations is a
central element in understanding the relevance of the
paper in the  eld. Bonzi (1982), for example, points out
that negational citations, while pointing to the fact that
a given work has been noticed in a  eld, do not mean
that that work is received well, and Ziman (1968) states
that many citations are done out of  politeness (to-
wards powerful rival approaches),  policy (by name-
dropping and argument by authority) or  piety (to-
wards one’s friends, collaborators and superiors). Re-
searchers also often follow the custom of citing some
1. Cited source is mentioned in the introduction or
discussion as part of the history and state of the
art of the research question under investigation.
2. Cited source is the speci c point of departure for
the research question investigated.
3. Cited source contains the concepts, de nitions,
interpretations used (and pertaining to the disci-
pline of the citing article).
4. Cited source contains the data (pertaining to the
discipline of the citing article) which are used
sporadically in the article.
5. Cited source contains the data (pertaining to the
discipline of the citing particle) which are used
for comparative purposes, in tables and statistics.
6. Cited source contains data and material (from
other disciplines than citing article) which is
used sporadically in the citing text, in tables or
statistics.
7. Cited source contains the method used.
8. Cited source substantiated a statement or assump-
tion, or points to further information.
9. Cited source is positively evaluated.
10. Cited source is negatively evaluated.
11. Results of citing article prove, verify, substantiate
the data or interpretation of cited source.
12. Results of citing article disprove, put into ques-
tion the data as interpretation of cited source.
13. Results of citing article furnish a new interpreta-
tion/explanation to the data of the cited source.
Figure 2: Spiegel-Rcurrency1using’s (1977) Categories for Cita-
tion Motivations
particular early, basic paper, which gives the founda-
tion of their current subject ( paying homage to pio-
neers ). Many classi cation schemes for citation func-
tions have been developed (Weinstock, 1971; Swales,
1990; Oppenheim and Renn, 1978; Frost, 1979; Chu-
bin and Moitra, 1975), inter alia. Based on such an-
notation schemes and hand-analyzed data, different in-
 uences on citation behaviour can be determined, but
annotation in this  eld is usually done manually on
small samples of text by the author, and not con rmed
by reliability studies. As one of the earliest such stud-
ies, Moravcsik and Murugesan (1975) divide citations
in running text into four dimensions: conceptual or
operational use (i.e., use of theory vs. use of techni-
cal method); evolutionary or juxtapositional (i.e., own
work is based on the cited work vs. own work is an al-
ternative to it); organic or perfunctory (i.e., work is cru-
cially needed for understanding of citing article or just
a general acknowledgement); and  nally con rmative
vs. negational (i.e., is the correctness of the  ndings
disputed?). They found, for example, that 40% of the
citations were perfunctory, which casts further doubt
on the citation-counting approach.
Other content citation analysis research which is rel-
81
evant to our work concentrates on relating textual spans
to authors’ descriptions of other work. For example, in
O’Connor’s (1982) experiment, citing statements (one
or more sentences referring to other researchers’ work)
were identi ed manually. The main problem encoun-
tered in that work is the fact that many instances of cita-
tion context are linguistically unmarked. Our data con-
 rms this: articles often contain large segments, par-
ticularly in the central parts, which describe other peo-
ple’s research in a fairly neutral way. We would thus
expect many citations to be neutral (i.e., not to carry
any function relating to the argumentation per se).
Many of the distinctions typically made in content
citation analysis are immaterial to the task considered
here as they are too sociologically orientated, and can
thus be dif cult to operationalise without deep knowl-
edge of the  eld and its participants (Swales, 1986). In
particular, citations for general reference (background
material, homage to pioneers) are not part of our an-
alytic interest here, and so are citations  in passing ,
which are only marginally related to the argumentation
of the overall paper (Ziman, 1968).
Spiegel-Rcurrency1using’s (1977) scheme (Fig. 2) is an exam-
ple of a scheme which is easier to operationalise than
most. In her scheme, more than one category can apply
to a citation; for instance positive and negative evalu-
ation (category 9 and 10) can be cross-classi ed with
other categories. Out of 2309 citations examined, 80%
substantiated statements (category 8), 6% discussed
history or state of the art of the research area (cate-
gory 1) and 5% cited comparative data (category 5).
Category Description
Weak Weakness of cited approach
CoCoGM Contrast/Comparison in Goals or Meth-
ods (neutral)
CoCoR0 Contrast/Comparison in Results (neutral)
CoCo- Unfavourable Contrast/Comparison (cur-
rent work is better than cited work)
CoCoXY Contrast between 2 cited methods
PBas author uses cited work as starting point
PUse author uses tools/algorithms/data
PModi author adapts or modi es
tools/algorithms/data
PMot this citation is positive about approach or
problem addressed (used to motivate work
in current paper)
PSim author’s work and cited work are similar
PSup author’s work and cited work are compat-
ible/provide support for each other
Neut Neutral description of cited work, or not
enough textual evidence for above cate-
gories or unlisted citation function
Figure 3: Our annotation scheme for citation function
Our scheme (given in Fig. 3) is an adaptation of the
scheme in Fig. 2, which we arrived at after an analysis
of a corpus of scienti c articles in computational lin-
guistics. We tried to rede ne the categories such that
they should be reasonably reliably annotatable; at the
same time, they should be informative for the appli-
cation we have in mind. A third criterion is that they
should have some (theoretical) relation to the particu-
lar discourse structure we work with (Teufel, 1999).
Our categories are as follows: One category (Weak)
is reserved for weakness of previous research, if it is ad-
dressed by the authors (cf. Spiegel-Rcurrency1using’s categories
10, 12, possibly 13). The next three categories describe
comparisons or contrasts between own and other work
(cf. Spiegel-Rcurrency1using’s category 5). The difference be-
tween them concerns whether the comparison is be-
tween methods/goals (CoCoGM) or results (CoCoR0).
These two categories are for comparisons without ex-
plicit value judgements. We use a different category
(CoCo-) when the authors claim their approach is bet-
ter than the cited work.
Our interest in differences and similarities between
approaches stems from one possible application we
have in mind (the rhetorical citation search tool). We
do not only consider differences stated between the cur-
rent work and other work, but we also mark citations if
they are explicitly compared and contrasted with other
work (not the current paper). This is expressed in cat-
egory CoCoXY. It is a category not typically consid-
ered in the literature, but it is related to the other con-
trastive categories, and useful to us because we think
it can be exploited for search of differences and rival
approaches.
The next set of categories we propose concerns pos-
itive sentiment expressed towards a citation, or a state-
ment that the other work is actively used in the cur-
rent work (which is the ultimate praise). Like Spiegel-
Rcurrency1using, we are interested in use of data and methods
(her categories 4, 5, 6, 7), but we cluster different us-
ages together and instead differentiate unchanged use
(PUse) from use with adaptations (PModi). Work
which is stated as the explicit starting point or intellec-
tual ancestry is marked with our category PBas (her
category 2). If a claim in the literature is used to
strengthen the authors’ argument, this is expressed in
her category 8, and vice versa, category 11. We col-
lapse these two in our category PSup. We use two
categories she does not have de nitions for, namely
similarity of (aspect of) approach to other approach
(PSim), and motivation of approach used or problem
addressed (PMot). We found evidence for prototypi-
cal use of these citation functions in our texts. How-
ever, we found little evidence for her categories 12 or
13 (disproval or new interpretation of claims in cited
literature), and we decided against a  state-of-the-art 
category (her category 1), which would have been in
con ict with our PMot de nition in many cases.
Our fourteenth category,Neut, bundles truly neutral
descriptions of other researchers’ approaches with all
those cases where the textual evidence for a citation
function was not enough to warrant annotation of that
category, and all other functions for which our scheme
did not provide a speci c category. As stated above, we
do in fact expect many of our citations to be neutral.
82
Citation function is hard to annotate because it in
principle requires interpretation of author intentions
(what could the author’s intention have been in choos-
ing a certain citation?). Typical results of earlier cita-
tion function studies are that the sociological aspect of
citing is not to be underestimated. One of our most fun-
damental ideas for annotation is to only mark explicitly
signalled citation functions. Our guidelines explicitly
state that a general linguistic phrase such as  better 
or  used by us must be present, in order to increase
objectivity in  nding citation function. Annotators are
encouraged to point to textual evidence they have for
assigning a particular function (and are asked to type
the source of this evidence into the annotation tool for
each citation). Categories are de ned in terms of cer-
tain objective types of statements (e.g., there are 7 cases
for PMot). Annotators can use general text interpreta-
tion principles when assigning the categories, but are
not allowed to use in-depth knowledge of the  eld or
of the authors.
There are other problematic aspects of the annota-
tion. Some concern the fact that authors do not al-
ways state their purpose clearly. For instance, several
earlier studies found that negational citations are rare
(Moravcsik and Murugesan, 1975; Spiegel-Rcurrency1using,
1977); MacRoberts and MacRoberts (1984) argue that
the reason for this is that they are potentially politically
dangerous, and that the authors go through lengths to
diffuse the impact of negative references, hiding a neg-
ative point behind insincere praise, or diffusing the
thrust of criticism with perfunctory remarks. In our
data we found ample evidence of this effect, illustrated
by the following example:
Hidden Markov Models (HMMs) (Huang
et al. 1990) offer a powerful statistical ap-
proach to this problem, though it is unclear
how they could be used to recognise the units
of interest to phonologists. (9410022, S-24)2
It is also sometimes extremely hard to distinguish
usage of a method from statements of similarity be-
tween a method and the own method. This happens
in cases where authors do not want to admit they are
using somebody else’s method:
The same test was used in Abney and Light
(1999). (0008020, S-151)
Uni cation of indices proceeds in the same
manner as uni cation of all other typed
feature structures (Carpenter 1992).
(0008023, S-87)
In this case, our annotators had to choose between
categories PSim and PUse.
It can also be hard to distinguish between continu-
ation of somebody’s research (i.e., taking somebody’s
2In all corpus examples, numbers in brackets correspond
to the of cial Cmp lg archive number,  S- numbers to sen-
tence numbers according to our preprocessing.
research as starting point, as intellectual ancestry, i.e.
PBas) and simply using it (PUse). In principle, one
would hope that annotation of all usage/positive cate-
gories (starting withP), if clustered together, should re-
sult in higher agreement (as they are similar, and as the
resulting scheme has fewer distinctions). We would ex-
pect this to be the case in general, but as always, cases
exist where a con ict between a contrast (CoCo) and a
change to a method (PModi) occur:
In contrast to McCarthy, Kay and Kiraz,
we combine the three components into a sin-
gle projection. (0006044, S-182)
The markable units in our scheme are a) all full cita-
tions (as recognized by our automatic citation proces-
sor on our corpus), and b) all names of authors of cited
papers anywhere in running text outside of a formal
citation context (i.e., without date). Our citation pro-
cessor recognizes these latter names after parsing the
citation list an marks them up. This is unusual in com-
parison to other citation indexers, but we believe these
names function as important referents comparable in
importance to formal citations. In principle, one could
go even further as there are many other linguistic ex-
pressions by which the authors could refer to other peo-
ple’s work: pronouns, abbreviations such as  Mueller
and Sag (1990), henceforth M & S , and names of ap-
proaches or theories which are associated with partic-
ular authors. If we could mark all of these up auto-
matically (which is not technically possible), annota-
tion would become less dif cult to decide, but techni-
cal dif culty prevent us from recognizing these other
cases automatically. As a result, in these contexts it is
impossible to annotate citation function directly on the
referent, which sometimes causes problems. Because
this means that annotators have to consider non-local
context, one markable may have different competing
contexts with different potential citation functions, and
problems about which context is  stronger may oc-
cur. We have rules that context is to be constrained to
the paragraph boundary, but for some categories paper-
wide information is required (e.g., for PMot, we need
to know that a praised approach is used by the authors,
information which may not be local in the paragraph).
Appendix A gives unambiguous example cases
where the citation function can be decided on the ba-
sis of the sentence alone, but Fig. 4 shows a more typ-
ical example where more context is required to inter-
pret the function. The evaluation of the citation Hin-
dle (1990) is contrastive; the evaluative statement is
found 4 sentences after the sentence containing the ci-
tation3. It consists of a positive statement (agreement
with authors’ view), followed by a weakness, under-
lined, which is the chosen category. This is marked on
the nearest markable (Hindle, 3 sentences after the ci-
tation).
3In Fig. 4, markables are shown in boxes, evaluative state-
ments underlined, and referents in bold face.
83
S-5 Hindle (1990)/Neut proposed dealing with the
sparseness problem by estimating the likelihood of un-
seen events from that of  similar events that have been
seen.
S-6 For instance, one may estimate the likelihood of a
particular direct object for a verb from the likelihoods
of that direct object for similar verbs.
S-7 This requires a reasonable de nition of verb simi-
larity and a similarity estimation method.
S-8 In Hindle/Weak ’s proposal, words are similar
if we have strong statistical evidence that they tend to
participate in the same events.
S-9 His notion of similarity seems to agree with our in-
tuitions in many cases, but it is not clear how it can be
used directly to construct word classes and correspond-
ing models of association. (9408011)
Figure 4: Annotation example: in uence of context
A naive view on this annotation scheme could con-
sider the  rst two sets of categories in our scheme as
 negative and the third set of categories  positive .
There is indeed a sentiment aspect to the interpretation
of citations, due to the fact that authors need to make
a point in their paper and thus have a stance towards
their citations. But this is not the whole story: many
of our  positive categories are more concerned with
different ways in which the cited work is useful to the
current work (which aspect of it is used, e.g., just a
de nition or the entire solution?), and many of the con-
trastive statements have no negative connotation at all
and simply state a (value-free) difference between ap-
proaches. However, if one looks at the distribution of
positive and negative adjectives around citations, one
notices a (non-trivial) connection between our task and
sentiment classi cation.
There are written guidelines of 25 pages, which in-
struct the annotators to only assign one category per
citation, and to skim-read the paper before annotation.
The guidelines provide a decision tree and give deci-
sion aids in systematically ambiguous cases, but sub-
jective judgement of the annotators is nevertheless nec-
essary to assign a single tag in an unseen context. We
implemented an annotation tool based on XML/XSLT
technology, which allows us to use any web browser to
interactively assign one of the 12 tags (presented as a
pull-down list) to each citation.
3 Data
The data we used came from the CmpLg (Computation
and Language archive; 320 conference articles in com-
putational linguistics). The articles are in XML format.
Headlines, titles, authors and reference list items are
automatically marked up with the corresponding tags.
Reference lists are parsed, and cited authors’ names
are identi ed. Our citation parser then applies regu-
lar patterns and  nds citations and other occurrences of
the names of cited authors (without a date) in running
text and marks them up. Self-citations are detected by
overlap of citing and cited authors. The citation pro-
cessor developped in our group (Ritchie et al., 2006)
achieves high accuracy for this task (96% of citations
recognized, provided the reference list was error-free).
On average, our papers contain 26.8 citation instances
in running text4.
4 Human Annotation: results
In order to machine learn citation function, we are
in the process of creating a corpus of scienti c arti-
cles with human annotated citations, according to the
scheme discussed before. Here we report preliminary
results with that scheme, with three annotators who are
developers of the scheme.
In our experiment, the annotators independently an-
notated 26 conference articles with this scheme, on the
basis of guidelines which were frozen once annotation
started5. The data used for the experiment contained a
total of 120,000 running words and 548 citations.
The relative frequency of each category observed in
the annotation is listed in Fig. 5. As expected, the dis-
tribution is very skewed, with more than 60% of the
citations of category Neut.6 What is interesting is the
relatively high frequency of usage categories (PUse,
PModi, PBas) with a total of 18.9%. There is
a relatively low frequency of clearly negative cita-
tions (Weak, CoCoR-, total of 4.1%), whereas the
neutral contrastive categories (CoCoGM, CoCoR0,
CoCoXY) are slightly more frequent at 7.6%. This
is in concordance with earlier annotation experiments
(Moravcsik and Murugesan, 1975; Spiegel-Rcurrency1using,
1977).
We reached an inter-annotator agreement of K=.72
(n=12;N=548;k=3)7. This is comparable to aggreement
on other discourse annotation tasks such as dialogue
act parsing and Argumentative Zoning (Teufel et al.,
1999). We consider the agreement quite good, consid-
ering the number of categories and the dif culties (e.g.,
non-local dependencies) of the task.
The annotators are obviously still disagreeing on
some categories. We were wondering to what de-
gree the  ne granularity of the scheme is a prob-
lem. When we collapsed the obvious similar cat-
egories (all P categories into one category, and
all CoCo categories into another) to give four top
level categories (Weak, Positive, Contrast,
Neutral), this only raised kappa to 0.76. This
4As opposed to reference list items, which are fewer.
5The development of the scheme was done with 40+ dif-
ferent articles.
6Spiegel-Rcurrency1using found that out of 2309 citations she ex-
amined, 80% substantiated statements.
7Following Carletta (1996), we measure agreement in
Kappa, which follows the formula K = P(A)−P(E)1−P(E) where
P(A) is observed, and P(E) expected agreement. Kappa
ranges between -1 and 1. K=0 means agreement is only as
expected by chance. Generally, Kappas of 0.8 are considered
stable, and Kappas of .69 as marginally stable, according to
the strictest scheme applied in the  eld.
84
Neut PUse CoCoGM PSim Weak CoCoXY PMot PModi PBas PSup CoCo- CoCoR0
62.7% 15.8% 3.9% 3.8% 3.1% 2.9% 2.2% 1.6% 1.5% 1.1% 1.0% 0.8%
Figure 5: Distribution of the categories
Weak CoCo- CoCoGM CoCoR0 CoCoXY PUse PBas PModi PMot PSim PSup Neut
Weak 5 3
CoCo- 1 3
CoCoGM 23 3
CoCoR0 4
CoCoXY 1
PUse 86 6 2 1 12
PBas 3 2
PModi 3
PMot 13 4
PSim 3 20 5
PSup 1 2 1
Neut 6 10 6 4 17 1 6 4 287
Figure 6: Confusion matrix between two annotators
points to the fact that most of our annotators disagreed
about whether to assign a more informative category
or Neut, the neutral fall-back category. Unfortunately,
Kappa is only partially sensitive to such specialised dis-
agreements. While it will reward agreement with in-
frequent categories more than agreement with frequent
categories, it nevertheless does not allow us to weight
disagreements we care less about (Neut vs more in-
formative category) less than disagreements we do care
a lot about (informative categories which are mutually
exclusive, such as Weak and PSim).
Fig. 6 shows a confusion matrix between the two an-
notators who agreed most with each other. This again
points to the fact that a large proportion of the confu-
sion involves an informative category and Neut. The
issue with Neut and Weak is a point at hand: au-
thors seem to often (deliberately or not) mask their in-
tended citation function with seemingly neutral state-
ments. Many statements of weakness of other ap-
proaches were stated in such caged terms that our anno-
tators disagreed about whether the signals given were
 explicit enough.
While our focus is not sentiment analysis, it is pos-
sible to con ate our 12 categories into three: positive,
weakness and neutral by the following mapping:
Old Categories New Category
Weak, CoCo- Negative
PMot, PUse, PBas, PModi, PSim, PSup Positive
CoCoGM, CoCoR0, CoCoXY, Neut Neutral
Thus negative contrasts and weaknesses are grouped
into Negative, while neutral contrasts are grouped
into Neutral. All the positive classes are con ated
into Positive. This resulted in kappa=0.75 for three
annotators.
Fig. 7 shows the confusion matrix between two an-
notators for this sentiment classi cation. Fig. 7 is par-
ticularly instructive, because it shows that annotators
Weakness Positive Neutral
Weakness 9 1 12
Positive 140 13
Neutral 4 30 339
Figure 7: Confusion matrix between two annotators;
categories collapsed to re ect sentiment
have only one case of confusion between positive and
negative references to cited work. The vast majority of
disagreements re ects genuine ambiguity as to whether
the authors were trying to stay neutral or express a sen-
timent.
Distinction Kappa
PMot v. all others .790
CoCoGM v. all others .765
PUse v. all others .761
CoCoR0 v. all others .746
Neut v. all others .742
PSim v. all others .649
PModi v. all others .553
CoCoXY v. all others .553
Weak v. all others .522
CoCo- v. all others .462
PBas v. all others .414
PSup v. all others .268
Figure 8: Distinctiveness of categories
In an attempt to determine how well each cate-
gory was de ned, we created arti cial splits of the
data into binary distinctions: each category versus a
super-category consisting of all the other collapsed cat-
egories. The kappas measured on these datasets are
given in Fig. 8. The higher they are, the better the anno-
tators could distinguish the given category from all the
other categories. We can see that out of the informa-
85
tive categories, four are de ned at least as well as the
overall distinction (i.e. above the line in Fig. 8: PMot,
PUse, CoCoGM and CoCoR0. This is encouraging,
as the application of citation maps is almost entirely
centered around usage and contrast. However, the se-
mantics of some categories are less well-understood by
our annotators: in particularPSup(where the dif culty
lies in what an annotator understands as  mutual sup-
port of two theories), and (unfortunately) PBas. The
problem with PBas is that its distinction fromPUse is
based on subjective judgement of whether the authors
use a part of somebody’s previous work, or base them-
selves entirely on this previous work (i.e., see them-
selves as following in the same intellectual framework).
Another problem concerns the low distinctivity for the
clearly negative categories CoCo- and Weak. This is
in line with MacRoberts and MacRoberts’ hypothesis
that criticism is often hedged and not clearly lexically
signalled, which makes it more dif cult to reliably an-
notate such citations.
5 Conclusion
We have described a new task: human annotation of
citation function, a phenomenon which we believe to
be closely related to the overall discourse structure of
scienti c articles. Our annotation scheme concentrates
on contrast, weaknesses of other work, similarities be-
tween work and usage of other work. One of its prin-
ciples is the fact that relations are only to be marked if
they are explicitly signalled. Here, we report positive
results in terms of interannotator agreement.
Future work on the annotation scheme will concen-
trate on improving guidelines for currently suboptimal
categories, and on measuring intra-annotator agree-
ment and inter-annotator agreement with naive annota-
tors. We are also currently investigating how well our
scheme will work on text from a different discipline,
namely chemistry. Work on applying machine learning
techniques for automatic citation classi cation is cur-
rently underway (Teufel et al., 2006); the agreement
of one annotator and the system is currently K=.57,
leaving plenty of room for improvement in comparison
with the human annotation results presented here.
6 Acknowledgements
This work was funded by the EPSRC projects
CITRAZ (GR/S27832/01,  Rhetorical Citation Maps
and Domain-independent Argumentative Zoning ) and
SCIBORG (EP/C010035/1,  Extracting the Science
from Scienti c Publications ).

References
Susan Bonzi. 1982. Characteristics of a literature as predic-
tors of relatedness between cited and citing works. JASIS,
33(4):208 216.
Jean Carletta. 1996. Assessing agreement on classi cation
tasks: The kappa statistic. Computational Linguistics,
22(2):249 254.
Daryl E. Chubin and S. D. Moitra. 1975. Content analysis
of references: Adjunct or alternative to citation counting?
Social Studies of Science, 5(4):423 441.
Carolyn O. Frost. 1979. The use of citations in literary re-
search: A preliminary classi cation of citation functions.
Library Quarterly, 49:405.
C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998.
Citeseer: An automatic citation indexing system. In Pro-
ceedings of the Third ACM Conference on Digital Li-
braries, pages 89 98.
Jorge E. Hirsch. 2005. An index to quantify an individ-
ual’s scienti c research output. Proceedings of the Na-
tional Academy of Sciences of the United Stated of Amer-
ica (PNAS), 102(46).
Terttu Luukkonen. 1992. Is scientists’ publishing behaviour
reward-seeking? Scientometrics, 24:297 319.
Michael H. MacRoberts and Barbara R. MacRoberts. 1984.
The negational reference: Or the art of dissembling. So-
cial Studies of Science, 14:91 94.
Michael J. Moravcsik and Poovanalingan Murugesan. 1975.
Some results on the function and quality of citations. So-
cial Studies of Science, 5:88 91.
Greg Myers. 1992. In this paper we report... speech acts
and scienti c facts. Journal of Pragmatics, 17(4):295 
313.
John O’Connor. 1982. Citing statements: Computer recogni-
tion and use to improve retrieval. Information Processing
and Management, 18(3):125 131.
Charles Oppenheim and Susan P. Renn. 1978. Highly cited
old papers and the reasons why they continue to be cited.
JASIS, 29:226 230.
Anna Ritchie, Simone Teufel, and Steven Robertson. 2006.
Creating a test collection for citation-based IR experi-
ments. In Proceedings of HLT-06.
Simon Buckingham Shum. 1998. Evolving the web for sci-
enti c knowledge: First steps towards an  HCI knowledge
web . Interfaces, British HCI Group Magazine, 39:16 21.
Ina Spiegel-Rcurrency1using. 1977. Bibliometric and content analy-
sis. Social Studies of Science, 7:97 113.
John Swales. 1986. Citation analysis and discourse analysis.
Applied Linguistics, 7(1):39 56.
John Swales, 1990. Genre Analysis: English in Academic
and Research Settings. Chapter 7: Research articles in En-
glish, pages 110 176. Cambridge University Press, Cam-
bridge, UK.
Simone Teufel, Jean Carletta, and Marc Moens. 1999. An
annotation scheme for discourse-level argumentation in re-
search articles. In Proceedings of the Ninth Meeting of the
European Chapter of the Association for Computational
Linguistics (EACL-99), pages 110 117.
Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006.
Automatic classi cation of citation function. In Proceed-
ings of EMNLP-06.
Simone Teufel. 1999. Argumentative Zoning: Information
Extraction from Scienti c Text. Ph.D. thesis, School of
Cognitive Science, University of Edinburgh, UK.
Melvin Weinstock. 1971. Citation indexes. In Encyclopedia
of Library and Information Science, volume 5, pages 16 
40. Dekker, New York, NY.
Howard D. White. 2004. Citation analysis and discourse
analysis revisited. Applied Linguistics, 25(1):89 116.
John M. Ziman. 1968. Public Knowledge: An Essay Con-
cerning the Social Dimensions of Science. Cambridge
University Press, Cambridge, UK.
