Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 103–110,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Automatic classification of citation function
Simone Teufel Advaith Siddharthan Dan Tidhar
Natural Language and Information Processing Group
Computer Laboratory
Cambridge University, CB3 0FD, UK
{Simone.Teufel,Advaith.Siddharthan,Dan.Tidhar}@cl.cam.ac.uk
Abstract
Citation function is defined as the author’s reason for citing a given paper (e.g. acknowledgement of the use of the cited method). The automatic recognition of the rhetorical function of citations in scientific text has many applications, from improvement of impact factor calculations to text summarisation and more informative citation indexers. We show that our annotation scheme for citation function is reliable, and present a supervised machine learning framework to automatically classify citation function, using both shallow and linguistically-inspired features. We find, amongst other things, a strong relationship between citation function and sentiment classification.
1 Introduction
Why do researchers cite a particular paper? This is a question that has interested researchers in discourse analysis, sociology of science, and information sciences (library sciences) for decades (Garfield, 1979; Small, 1982; White, 2004). Many annotation schemes for citation motivation have been created over the years, and the question has been studied in detail, even to the level of in-depth interviews with writers about each individual citation (Hodges, 1972).
Part of this sustained interest in citations can be explained by the fact that bibliometric metrics are commonly used to measure the impact of a researcher’s work by how often it is cited (Borgman, 1990; Luukkonen, 1992). However, researchers from the field of discourse studies have long criticised purely quantitative citation analysis, pointing out that many citations are made out of “politeness, policy or piety” (Ziman, 1968), and that criticising citations or citations in passing should not “count” as much as central citations in a paper, or as those citations where a researcher’s work is used as the starting point of somebody else’s work (Bonzi, 1982). A plethora of manual annotation schemes for citation motivation have been invented over the years (Garfield, 1979; Hodges, 1972; Chubin and Moitra, 1975). Other schemes concentrate on citation function (Spiegel-Rüsing, 1977; O’Connor, 1982; Weinstock, 1971; Swales, 1990; Small, 1982). One of the best-known of these studies (Moravcsik and Murugesan, 1975) classifies citations in running text along four dimensions: conceptual or operational use (i.e., use of theory vs. use of technical method); evolutionary or juxtapositional (i.e., own work is based on the cited work vs. own work is an alternative to it); organic or perfunctory (i.e., work is crucially needed for understanding of citing article or just a general acknowledgement); and finally confirmative vs. negational (i.e., is the correctness of the findings disputed?). They found, for example, that 40% of the citations were perfunctory, which casts further doubt on the citation-counting approach.
Based on such annotation schemes and hand-analyzed data, different influences on citation behaviour can be determined. Nevertheless, researchers in the field of citation content analysis do not normally cross-validate their schemes with independent annotation studies with other human annotators, and usually only annotate a small number of citations (in the range of hundreds or thousands). Also, automated application of the annotation is not something that is generally considered in the field, though White (2004) sees the future of discourse-analytic citation analysis in automation.
Apart from raw material for bibliometric studies, citations can also be used for search purposes in document retrieval applications. In the library world, printed or electronic citation indexes such as ISI (Garfield, 1979) serve as an orthogonal search tool to find relevant papers, starting from a source paper of interest. With the increased availability of documents in electronic form in recent years, citation-based search and automatic citation indexing have become highly popular, cf. the successful search tools Google Scholar and CiteSeer (Giles et al., 1998).[1]
[Figure 1: A rhetorical citation map. Nodes: Nitta and Niwa 94; Resnik 95; Brown et al. 90a; Rose et al. 90; Church and Gale 91; Dagan et al. 94; Li and Abe 96; Hindle 93; Hindle 90; Dagan et al. 93; Pereira et al. 93. Example link texts: “Following Pereira et al, we measure word similarity by the relative entropy or Kullback-Leibler (KL) distance, between the corresponding conditional distributions.” and “His notion of similarity seems to agree with our intuitions in many cases, but it is not clear how it can be used directly to construct word classes and corresponding models of association.”]
But not all search needs are fulfilled by current citation indexers. Experienced researchers are often interested in relations between articles (Shum, 1998). They want to know if a certain article criticises another and what the criticism is, or if the current work is based on that prior work. This type of information is hard to come by with current search technology. Neither the author’s abstract nor raw citation counts help users in assessing the relation between articles.
Fig. 1 shows a hypothetical search tool which displays differences and similarities between a target paper (here: Pereira et al., 1993) and the papers that it cites and that cite it. Contrastive links are shown in grey: links to rival papers and papers the current paper contrasts itself to. Continuative links are shown in black: links to papers that use the methodology of the current paper. Fig. 1 also displays the most characteristic textual sentence about each citation. For instance, we can see which aspect of Hindle (1990) our example paper criticises, and in which way the example paper’s work was used by Dagan et al. (1994).
Note that not even the CiteSeer text snippet can fulfil the relation search need: it is always centered around the physical location of the citations, but the context is often not informative enough for the searcher to infer the relation. In fact, studies from our annotated corpus (Teufel, 1999) show that 69% of the 600 sentences stating contrast with other work and 21% of the 246 sentences stating research continuation with other work do not contain the corresponding citation; the citation is found in preceding sentences (which means that the sentence expressing the contrast or continuation is outside the CiteSeer snippet). A more sophisticated, discourse-aware citation indexer which finds these sentences and associates them with the citation would add considerable value to the researcher’s bibliographic search (Ritchie et al., 2006b).
[1] These tools automatically citation-index all scientific articles reached by a web-crawler, making them available to searchers via authors or keywords in the title, and displaying the citation in context of a text snippet.
Our annotation scheme for citations is based on empirical work in content citation analysis. It is designed for information retrieval applications such as improved citation indexing and better bibliometric measures (Teufel et al., 2006). Its 12 categories mark relationships with other works. Each citation is labelled with exactly one category. The following top-level four-way distinction applies:
• Explicit statement of weakness
• Contrast or comparison with other work (4 categories)
• Agreement/usage/compatibility with other work (6 categories), and
• A neutral category.
In this paper, we show that the scheme can be
reliably annotated by independent coders. We also
report results of a supervised machine learning ex-
periment which replicates the human annotation.
2 An annotation scheme for citations
Our scheme (given in Fig. 2) is adapted from that of Spiegel-Rüsing (1977) after an analysis of a corpus of scientific articles in computational linguistics. We avoid sociologically orientated distinctions (“paying homage to pioneers”), as they can be difficult to operationalise without deep knowledge of the field and its participants (Swales, 1986). Our redefinition of the categories aims at reliable annotation; at the same time, the categories should be informative enough for the document management application sketched in the introduction.
Category  Description
Weak      Weakness of cited approach
CoCoGM    Contrast/Comparison in Goals or Methods (neutral)
CoCo-     Author’s work is stated to be superior to cited work
CoCoR0    Contrast/Comparison in Results (neutral)
CoCoXY    Contrast between 2 cited methods
PBas      Author uses cited work as basis or starting point
PUse      Author uses tools/algorithms/data/definitions
PModi     Author adapts or modifies tools/algorithms/data
PMot      This citation is positive about approach used or problem addressed (used to motivate work in current paper)
PSim      Author’s work and cited work are similar
PSup      Author’s work and cited work are compatible/provide support for each other
Neut      Neutral description of cited work, or not enough textual evidence for above categories, or unlisted citation function
Figure 2: Annotation scheme for citation function.
Our categories are as follows: One category (Weak) is reserved for weakness of previous research, if it is addressed by the authors. The next four categories describe comparisons or contrasts between own and other work. The difference between them concerns whether the contrast is between methods employed or goals (CoCoGM), or results, and in the case of results, a difference is made between the cited results being worse than the current work (CoCo-), or comparable or better results (CoCoR0). As well as considering differences between the current work and other work, we also mark citations if they are explicitly compared and contrasted with other work (i.e. not the work in the current paper). This is expressed in category CoCoXY. While this is not typically annotated in the literature, we expect a potential practical benefit of this category for our application, particularly in searches for differences and rival approaches.
The next set of categories we propose concerns
positive sentiment expressed towards a citation, or
a statement that the other work is actively used
in the current work (which we consider the ulti-
mate praise). We mark statements of use of data
and methods of the cited work, differentiating un-
changed use (PUse) from use with adaptations
(PModi). Work which is stated as the explicit
starting point or intellectual ancestry is marked
with our category PBas. If a claim in the liter-
ature is used to strengthen the authors’ argument,
or vice versa, we assign the category PSup. We
also mark similarity of (an aspect of) the approach
to the cited work (PSim), and motivation of ap-
proach used or problem addressed (PMot).
Our twelfth category, Neut, bundles truly neutral descriptions of cited work with those cases where the textual evidence for a citation function was not enough to warrant annotation of that category, and all other functions for which our scheme did not provide a specific category.
Citation function is hard to annotate because it in principle requires interpretation of author intentions (what could the author’s intention have been in choosing a certain citation?). One of our most fundamental principles is thus to only mark explicitly signalled citation functions. Our guidelines explicitly state that a general linguistic phrase such as “better” or “used by us” must be present; this increases the objectivity of defining citation function. Annotators must be able to point to textual evidence for assigning a particular function (and are asked to type the source of this evidence into the annotation tool for each citation). Categories are defined in terms of certain objective types of statements (for instance, there are 7 cases for PMot, one of which is “Citation claims that or gives reasons for why problem Y is hard”). Annotators can use general text interpretation principles when assigning the categories (such as anaphora resolution and parallel constructions), but are not allowed to use in-depth knowledge of the field or of the authors.
Guidelines (25 pages, about 150 rules) describe the categories with examples, provide a decision tree and give decision aids in systematically ambiguous cases. Nevertheless, subjective judgement of the annotators is still necessary to assign a single tag in an unseen context, because of the many difficult cases for annotation. Some of these concern the fact that authors do not always state their purpose clearly. For instance, several earlier studies found that negational citations are rare (Moravcsik and Murugesan, 1975; Spiegel-Rüsing, 1977); MacRoberts and MacRoberts (1984) argue that the reason for this is that they are potentially politically dangerous. In our data we found ample evidence of the “meekness” effect. Other difficulties concern the distinction of the usage of a method from statements of similarity between a method and the own method (i.e., the choice between categories PSim and PUse). This happens in cases where authors do not want to admit (or stress) that they are using somebody else’s method. Another difficult distinction concerns the judgement of whether the authors continue somebody’s research (i.e., consider their research as intellectual ancestry, i.e. PBas), or whether they simply use the work (PUse).
The unit of annotation is a) the full citation (as recognised by our automatic citation processor on our corpus), and b) names of authors of cited papers anywhere in running text outside of a formal citation context (i.e., without date). These latter are marked up, slightly unusually in comparison to other citation indexers, because we believe they function as important referents comparable in importance to formal citations.[2] In principle, there are many other linguistic expressions by which the authors could refer to other people’s work: pronouns, abbreviations such as “Mueller and Sag (1990), henceforth M & S”, and names of approaches or theories which are associated with particular authors. The fact that in these contexts citation function cannot be annotated (because it is not technically feasible to recognise them well enough) sometimes causes problems with context dependencies.
While there are unambiguous example cases
where the citation function can be decided on the
basis of the sentence alone, this is not always the
case. Most approaches are not criticised in the
same sentence where they are also cited: it is more
likely that there are several descriptive sentences
about a cited approach between its formal cita-
tion and the evaluative statement, which is often at
the end of the textual segment about this citation.
Nevertheless, the annotator must mark the func-
tion on the nearest appropriate annotation unit (ci-
tation or author name). Our rules decree that con-
text is in most cases constrained to the paragraph
boundary. In rare cases, paper-wide information
is required (e.g., for PMot, we need to know that
a praised approach is used by the authors, infor-
mation which may not be local in the paragraph).
Annotators are thus asked to skim-read the paper
before annotation.
One possible view on this annotation scheme could consider the first two sets of categories as “negative” and the third set of categories “positive”, in the sense of Pang et al. (2002) and Turney (2002). Authors need to make a point (namely, that they have contributed something which is better or at least new (Myers, 1992)), and they thus have a stance towards their citations. But although there is a sentiment aspect to the interpretation of citations, this is not the whole story. Many of our “positive” categories are more concerned with different ways in which the cited work is useful to the current work (which aspect of it is used, e.g., just a definition or the entire solution?), and many of the contrastive statements have no negative connotation at all and simply state a (value-free) difference between approaches. However, if one looks at the distribution of positive and negative adjectives around citations, it is clear that there is a non-trivial connection between our task and sentiment classification.
[2] Our citation processor can recognise these after parsing the citation list.
The data we use comes from our corpus of 360 conference articles in computational linguistics, drawn from the Computation and Language E-Print Archive (http://xxx.lanl.gov/cmp-lg). The articles are transformed into XML format; headlines, titles, authors and reference list items are automatically marked up. Reference lists are parsed using regular patterns, and cited authors’ names are identified. Our citation parser then finds citations and author names in running text and marks them up. Ritchie et al. (2006a) report high accuracy for this task (94% of citations recognised, provided the reference list was error-free). On average, our papers contain 26.8 citation instances in running text.[3] For human annotation, we use our own annotation tool based on XML/XSLT technology, which allows us to use a web browser to interactively assign one of the 12 tags (presented as a pull-down list) to each citation.
We measure inter-annotator agreement between
three annotators (the three authors), who indepen-
dently annotated 26 articles with the scheme (con-
taining a total of 120,000 running words and 548
citations), using the written guidelines. The guide-
lines were developed on a different set of articles
from the ones used for annotation.
Inter-annotator agreement was Kappa=.72 (n=12;N=548;k=3).[4] This is quite high, considering the number of categories and the difficulties (e.g., non-local dependencies) of the task. The relative frequency of each category observed in the annotation is listed in Fig. 3. As expected, the distribution is very skewed, with more than 60% of the citations of category Neut.[5] What is interesting is the relatively high frequency of usage categories (PUse, PModi, PBas) with a total of 18.9%. There is a relatively low frequency of clearly negative citations (Weak, CoCo-, total of 4.1%), whereas the neutral contrastive categories (CoCoR0, CoCoXY, CoCoGM) are slightly more frequent at 7.6%. This is in concordance with earlier annotation experiments (Moravcsik and Murugesan, 1975; Spiegel-Rüsing, 1977).
[3] As opposed to reference list items, which are fewer.
[4] Following Carletta (1996), we measure agreement in Kappa, which follows the formula $K = \frac{P(A) - P(E)}{1 - P(E)}$, where P(A) is observed, and P(E) expected agreement. Kappa ranges between -1 and 1. K=0 means agreement is only as expected by chance. Generally, Kappas of 0.8 are considered stable, and Kappas of .69 as marginally stable, according to the strictest scheme applied in the field.
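The Kappa formula K = (P(A) − P(E)) / (1 − P(E)) used for the agreement figures in this section can be illustrated with a short computation. The following is a minimal sketch only, assuming that P(A) is the proportion of agreeing annotator pairs per item and that P(E) is estimated from the pooled category distribution of all annotators (one common choice for multi-annotator Kappa); the function name `multi_kappa` and the toy labels are hypothetical and are not our annotation data.

```python
from collections import Counter
from itertools import combinations


def multi_kappa(annotations):
    """Kappa for a list of items, each a list of k labels (one per annotator).

    Assumption: chance agreement P(E) is the sum of squared pooled
    category proportions; this is an illustrative choice.
    """
    n_items = len(annotations)
    k = len(annotations[0])
    pairs_per_item = k * (k - 1) / 2

    # Observed agreement P(A): fraction of agreeing annotator pairs, averaged over items.
    p_a = sum(
        sum(1 for a, b in combinations(labels, 2) if a == b) / pairs_per_item
        for labels in annotations
    ) / n_items

    # Expected agreement P(E) from the pooled label distribution.
    pooled = Counter(label for labels in annotations for label in labels)
    total = n_items * k
    p_e = sum((count / total) ** 2 for count in pooled.values())

    return (p_a - p_e) / (1 - p_e)


# Toy data (invented): 4 citations, 3 annotators each.
toy = [
    ["Neut", "Neut", "Neut"],
    ["PUse", "PUse", "Neut"],
    ["Weak", "Weak", "Weak"],
    ["Neut", "PUse", "Neut"],
]
print(round(multi_kappa(toy), 3))
```

On the toy data, P(A) = 2/3 and P(E) = 3/8, so the value printed is 7/15 rounded to three places. The function is undefined when all labels fall into one category, since P(E) is then 1.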
3 Features for automatic recognition of citation function
This section summarises the features we use for machine learning of citation function. Some of these features were previously found useful for a different application, namely Argumentative Zoning (Teufel, 1999; Teufel and Moens, 2002); others are specific to citation classification.
3.1 Cue phrases
Myers (1992) calls meta-discourse the set of expressions that talk about the act of presenting research in a paper, rather than the research itself (which is called object-level discourse). For instance, Swales (1990) names phrases such as “to our knowledge, no ...” or “As far as we are aware” as meta-discourse associated with a gap in the current literature. Strings such as these have been used in extractive summarisation successfully ever since Paice’s (1981) work.
We model meta-discourse (cue phrases) and treat it differently from object-level discourse. There are two different mechanisms. The first is a finite grammar over strings with a placeholder mechanism for POS and for sets of similar words which can be substituted into a string-based cue phrase (Teufel, 1999). The grammar corresponds to 1762 cue phrases. It was developed on 80 papers which are different from the papers used for our experiments here.
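One way to picture a string grammar with placeholders is as a set of templates whose elements are literal words, references to sets of interchangeable words, or POS tags. The sketch below is an illustration under assumptions, not the grammar itself: the placeholder syntax (`@SET`, `#POS`), the word sets, the two templates and the (word, POS) input format are all hypothetical.

```python
# Hypothetical word sets that an "@" placeholder can expand to.
WORD_SETS = {
    "@SELF_NOM": {"we", "i"},
    "@PRESENT": {"propose", "present", "report", "suggest"},
}

# Hypothetical cue phrase templates: literal words, @word-set or #POS placeholders.
TEMPLATES = {
    "presentation": ["in", "this", "paper", "@SELF_NOM", "@PRESENT"],
    "gap": ["to", "our", "knowledge", "no", "#NN"],
}


def element_matches(element, word, pos):
    """Check one template element against one (word, POS) token."""
    if element.startswith("@"):
        return word in WORD_SETS[element]
    if element.startswith("#"):
        return pos.startswith(element[1:])
    return word == element


def find_cues(tagged_sentence):
    """Return the names of all templates that match somewhere in the sentence."""
    tokens = [(word.lower(), pos) for word, pos in tagged_sentence]
    found = []
    for name, template in TEMPLATES.items():
        for start in range(len(tokens) - len(template) + 1):
            window = tokens[start:start + len(template)]
            if all(element_matches(e, w, p) for e, (w, p) in zip(template, window)):
                found.append(name)
                break
    return found


sentence = [("In", "IN"), ("this", "DT"), ("paper", "NN"), ("we", "PRP"),
            ("present", "VBP"), ("a", "DT"), ("method", "NN")]
print(find_cues(sentence))
```

Because each placeholder expands to several words, a small number of templates stands for a much larger number of surface cue phrases, which is the point of the placeholder mechanism.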
The other mechanism is a POS-based recogniser of agents and a recogniser for specific actions these agents perform. Two main agent types (the authors of the paper, and everybody else) are modelled by 185 patterns. For instance, in a paragraph describing related work, we expect to find references to other people in subject position more often than in the section detailing the authors’ own methods, whereas in the background section, we often find general subjects such as “researchers in computational linguistics” or “in the literature”. For each sentence to be classified, its grammatical subject is determined by POS patterns and, if possible, classified as one of these agent types. We also use the observation that in sentences without meta-discourse, one can assume that agenthood has not changed.
[5] Spiegel-Rüsing found that out of 2309 citations she examined, 80% substantiated statements.
Twenty different action types model the main verbs involved in meta-discourse. For instance, there is a set of verbs that is often used when the overall scientific goal of a paper is defined. These are the verbs of presentation, such as “propose, present, report” and “suggest”; in the corpus we found other verbs in this function, but with a lower frequency, namely “describe, discuss, give, introduce, put forward, show, sketch, state” and “talk about”. There are also specialised verb clusters which co-occur with PBas sentences, e.g., the cluster of continuation of ideas (e.g. “adopt, agree with, base, be based on, be derived from, be originated in, be inspired by, borrow, build on, ...”). On the other hand, the semantics of verbs in Weak sentences is often concerned with failing (of other researchers’ approaches), and such sentences often contain verbs such as “abound, aggravate, arise, be cursed, be incapable of, be forced to, be limited to, ...”.
We use 20 manually acquired verb clusters. Negation is recognised, but too rare to define its own clusters: out of the 20 × 2 = 40 theoretically possible verb clusters, only 27 were observed in our development corpus. We have recently automated the process of verb-object pair acquisition from corpora for two types of cue phrases (Abdalla and Teufel, 2006) and are planning on expanding this work to other cue phrases.
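The combination of action type and negation can be thought of as a feature with two parts, which is why 20 clusters give at most 20 × 2 = 40 values. The sketch below illustrates this under assumptions: the cluster names, the three abbreviated verb lists, the negator list and the fixed look-back window are hypothetical simplifications, and the input is assumed to be a list of lemmatised tokens.

```python
# Hypothetical, heavily abbreviated verb clusters (the real scheme has 20).
ACTION_CLUSTERS = {
    "PRESENTATION": {"propose", "present", "report", "suggest"},
    "CONTINUATION": {"adopt", "borrow", "base"},
    "FAILURE": {"abound", "aggravate", "arise"},
}

# Hypothetical negation markers.
NEGATORS = {"not", "n't", "never", "no"}


def action_feature(lemmas, verb_index, window=3):
    """Return (cluster, negated) for the verb at verb_index, or None.

    Assumption: negation is detected by looking for a negator in the
    `window` tokens to the left of the verb.
    """
    verb = lemmas[verb_index].lower()
    cluster = next(
        (name for name, verbs in ACTION_CLUSTERS.items() if verb in verbs), None
    )
    if cluster is None:
        return None
    left_context = lemmas[max(0, verb_index - window):verb_index]
    negated = any(token.lower() in NEGATORS for token in left_context)
    return (cluster, negated)


print(action_feature(["we", "do", "not", "adopt", "their", "model"], 3))
print(action_feature(["we", "propose", "a", "method"], 1))
```

With three clusters, this toy version can produce at most 3 × 2 = 6 distinct feature values, mirroring the 20 × 2 = 40 upper bound in the text.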
3.2 Cues identified by annotators
During the annotator training phase, the annotators were encouraged to type in the meta-description cue phrases that justify their choice of category. We went through this list by hand and extracted 892 cue phrases (around 75 per category). The files these cues came from were not part of the test corpus. We included 12 features that recorded the presence of cues that our annotators associated with a particular class.
Neut   PUse   CoCoGM  PSim  Weak  PMot  CoCoR0  PBas  CoCoXY  CoCo-  PModi  PSup
62.7%  15.8%  3.9%    3.8%  3.1%  2.2%  0.8%    1.5%  2.9%    1.0%   1.6%   1.1%
Figure 3: Distribution of citation categories
   Weak  CoCoGM  CoCoR0  CoCo-  CoCoXY  PBas  PUse  PModi  PMot  PSim  PSup  Neut
P  .78   .81     .77     .56    .72     .76   .66   .60    .75   .68   .83   .80
R  .49   .52     .46     .19    .54     .46   .61   .27    .64   .38   .32   .92
F  .60   .64     .57     .28    .62     .58   .63   .37    .69   .48   .47   .86
Percentage Accuracy         0.77
Kappa (n=12; N=2829; k=2)   0.57
Macro-F                     0.57
Figure 4: Summary of Citation Analysis results (10-fold cross-validation; IBk algorithm; k=3).
3.3 Other features
There are other features which we use for this task. We know from Teufel and Moens (2002) that verb tense and voice should be useful for recognizing statements of previous work, future work and work performed in the paper. We also recognise modality (whether or not a main verb is modified by an auxiliary, and which auxiliary it is).
The overall location of a sentence containing a reference should be relevant. We observe that more PMot categories appear towards the beginning of the paper, as do Weak citations, whereas comparative results (CoCoR0, CoCo-) appear towards the end of articles. More fine-grained location features, such as the location within the paragraph and the section, have also been implemented.
The fact that a citation points to own previous
work can be recognised, as we know who the pa-
per authors are. As we have access to the infor-
mation in the reference list, we also know the last
names of all cited authors (even in the case where
an et al. statement in running text obscures the
later-occurring authors). With self-citations, one
might assume that the probability of re-use of ma-
terial from previous own work should be higher,
and the tendency to criticise lower.
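Recognising self-citations by comparing paper authors with cited authors can be sketched in a few lines. This is a minimal illustration, assuming author names are given as strings whose final token is the last name; the function names are hypothetical, and matching on last names alone will wrongly flag different people who share a surname.

```python
def last_name(full_name):
    """Assumption: the last whitespace-separated token is the surname."""
    return full_name.strip().split()[-1].lower()


def is_self_citation(citing_authors, cited_authors):
    """True if any cited author's last name matches a citing author's last name."""
    citing = {last_name(name) for name in citing_authors}
    cited = {last_name(name) for name in cited_authors}
    return bool(citing & cited)


paper_authors = ["Simone Teufel", "Advaith Siddharthan", "Dan Tidhar"]
print(is_self_citation(paper_authors, ["Simone Teufel", "Marc Moens"]))
print(is_self_citation(paper_authors, ["Bo Pang", "Lillian Lee"]))
```

Using the full author list from the reference list, rather than the names visible in running text, is what allows a later-occurring author hidden behind an “et al.” to be matched.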
4 Results
Our evaluation corpus for citation analysis consists of 116 articles (randomly drawn from the part of our corpus which was not used for guideline development or cue phrase acquisition). The 116 articles contain 2829 citation instances. Each citation instance was manually tagged as one of {Weak, CoCoGM, CoCo-, CoCoR0, CoCoXY, PBas, PUse, PModi, PMot, PSim, PSup, Neut}. The papers are then further processed (e.g. tokenised and POS-tagged). All other features are automatically determined (e.g. self-citations are detected by overlap of citing and cited authors); then, machine learning is applied to the feature vectors.
   Weakness  Positive  Contrast  Neutral
P  .80       .75       .77       .81
R  .49       .65       .52       .90
F  .61       .70       .62       .86
Percentage Accuracy         0.79
Kappa (n=12; N=2829; k=2)   0.59
Macro-F                     0.68
Figure 5: Summary of results (10-fold cross-validation; IBk algorithm; k=3): Top level classes.
   Weakness  Positive  Neutral
P  .77       .75       .85
R  .42       .65       .92
F  .54       .70       .89
Percentage Accuracy         0.83
Kappa (n=12; N=2829; k=2)   0.58
Macro-F                     0.71
Figure 6: Summary of results (10-fold cross-validation; IBk algorithm; k=3): Sentiment Analysis.
The 10-fold cross-validation results for citation classification are given in Figure 4, comparing the system to one of the annotators. Results are given in three overall measures: Kappa, percentage accuracy, and Macro-F (following Lewis (1991)). Macro-F is the mean of the F-measures of all twelve categories. We use Macro-F and Kappa because we want to measure success particularly on the rare categories, and because Micro-averaging techniques like percentage accuracy tend to overestimate the contribution of frequent categories in heavily skewed distributions like ours.[6]
In the case of Macro-F, each category is treated as one unit, independent of the number of items contained in it. Therefore, the classification success of the individual items in rare categories is given more importance than classification success of frequent category items. However, one should keep in mind that numerical values in macro-averaging are generally lower (Yang and Liu, 1999), due to fewer training cases for the rare categories. Kappa has the additional advantage over Macro-F that it filters out random agreement (random use, but following the observed distribution of categories).
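The way Macro-F weights every category equally can be checked against the per-category values in Figure 4. The sketch below only recomputes the unweighted mean of the twelve F-measures and one F-measure from its precision and recall; the variable and function names are illustrative.

```python
# Per-category F-measures as listed in Figure 4.
F_BY_CATEGORY = {
    "Weak": 0.60, "CoCoGM": 0.64, "CoCoR0": 0.57, "CoCo-": 0.28,
    "CoCoXY": 0.62, "PBas": 0.58, "PUse": 0.63, "PModi": 0.37,
    "PMot": 0.69, "PSim": 0.48, "PSup": 0.47, "Neut": 0.86,
}


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f(f_scores):
    """Unweighted mean over categories: each category counts once."""
    return sum(f_scores.values()) / len(f_scores)


print(round(macro_f(F_BY_CATEGORY), 2))   # 0.57, as reported in Figure 4
print(round(f_measure(0.78, 0.49), 2))    # 0.6, the F value for Weak
```

The rare category CoCo- (F = .28) pulls the mean down as much as the frequent category Neut (F = .86) pulls it up, whereas in percentage accuracy the frequent Neut items dominate.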
For our task, memory-based learning outperformed other models. The reported results use the IBk algorithm with k = 3 (we used the Weka machine learning toolkit (Witten and Frank, 2005) for our experiments). Fig. 7 provides a few examples from one file in the corpus, along with the gold standard citation class, the machine prediction, and a comment.
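Memory-based learning stores all training instances and labels a new item by a vote among its nearest stored neighbours. The following is a minimal pure-Python sketch of k-nearest-neighbour classification with k = 3, assuming symbolic feature vectors compared by the number of mismatching features. It is not Weka’s IBk implementation, and the three features and all values in the toy data are invented for illustration.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions at which two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_predict(train, query, k=3):
    """Majority label among the k training items closest to the query.

    train: list of (feature_tuple, label) pairs. Ties in the vote are
    broken in favour of the label seen first among the neighbours.
    """
    neighbours = sorted(train, key=lambda ex: overlap_distance(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented toy instances: (cue class, verb tense, self-citation?).
train = [
    (("use_cue", "present", "no"), "PUse"),
    (("use_cue", "present", "yes"), "PUse"),
    (("weak_cue", "present", "no"), "Weak"),
    (("none", "past", "no"), "Neut"),
    (("none", "present", "no"), "Neut"),
]
print(knn_predict(train, ("use_cue", "past", "yes")))
print(knn_predict(train, ("none", "past", "no")))
```

No model is built in advance: all the work happens at prediction time, which is why the approach is called memory-based.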
Kappa is even higher for the top level distinction. We collapsed the obviously similar categories (all P categories into one category, and all CoCo categories into another) to give four top level categories (Weak, Positive, Contrast, Neutral; results in Fig. 5). Precision for all the categories is above 0.75, and K=0.59. By comparison, the human agreement for this situation was K=0.76 (n=3,N=548,k=3).
In a different experiment, we grouped the categories as follows, in an attempt to perform sentiment analysis over the classifications:
Old Categories                          New Category
Weak, CoCo-                             Negative
PMot, PUse, PBas, PModi, PSim, PSup     Positive
CoCoGM, CoCoR0, CoCoXY, Neut            Neutral
Thus negative contrasts and weaknesses are grouped into Negative, while neutral contrasts are grouped into Neutral. All positive classes are conflated into Positive.
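Both groupings are plain relabellings of the twelve categories and can be written down as mappings. The sketch below restates the top-level collapse (all P categories into one class, all CoCo categories into another) and the sentiment grouping described above; the Python names are illustrative only.

```python
CATEGORIES = ["Weak", "CoCoGM", "CoCo-", "CoCoR0", "CoCoXY", "PBas",
              "PUse", "PModi", "PMot", "PSim", "PSup", "Neut"]


def top_level(label):
    """Four-way grouping: Weak, Positive, Contrast, Neutral."""
    if label.startswith("CoCo"):
        return "Contrast"
    if label.startswith("P"):
        return "Positive"
    return {"Weak": "Weak", "Neut": "Neutral"}[label]


# Three-way sentiment grouping.
SENTIMENT = {
    "Weak": "Negative", "CoCo-": "Negative",
    "PMot": "Positive", "PUse": "Positive", "PBas": "Positive",
    "PModi": "Positive", "PSim": "Positive", "PSup": "Positive",
    "CoCoGM": "Neutral", "CoCoR0": "Neutral", "CoCoXY": "Neutral",
    "Neut": "Neutral",
}

for label in CATEGORIES:
    print(label, top_level(label), SENTIMENT[label])
```

Note that CoCo- lands in Contrast under the top-level grouping but in Negative under the sentiment grouping, which is the main difference between the two experiments.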
Results show that this grouping raises results to a smaller degree than the top-level distinction did (to K=.58). By comparison, the human agreement for these collapsed categories was K=.75 (n=3,N=548,k=3).
[6] This situation has parallels in information retrieval, where precision and recall are used because accuracy overestimates the performance on irrelevant items.
5 Conclusion
We have presented a new task: annotation of citation function in scientific text, a phenomenon which we believe to be closely related to the overall discourse structure of scientific articles. Our annotation scheme concentrates on weaknesses of other work, and on similarities and contrast between work and usage of other work. In this paper, we present machine learning experiments for replicating the human annotation (which is reliable at K=.72). The automatic result reached K=.57 (acc=.77) for the full annotation scheme, rising to Kappa=.58 (acc=.83) for a three-way classification (Weak, Positive, Neutral).
We are currently performing an experiment to
see if citation processing can increase perfor-
mance in a large-scale, real-world information
retrieval task, by creating a test collection of
researchers’ queries and relevant documents for
these (Ritchie et al., 2006a).
6 Acknowledgements
This work was funded by the EPSRC projects CITRAZ (GR/S27832/01, “Rhetorical Citation Maps and Domain-independent Argumentative Zoning”) and SCIBORG (EP/C010035/1, “Extracting the Science from Scientific Publications”).

References
Rashid M. Abdalla and Simone Teufel. 2006. A bootstrapping approach to unsupervised detection of cue phrase variants. In Proc. of ACL/COLING-06.
Susan Bonzi. 1982. Characteristics of a literature as predictors of relatedness between cited and citing works. JASIS, 33(4):208–216.
Christine L. Borgman, editor. 1990. Scholarly Communication and Bibliometrics. Sage Publications, CA.
Jean Carletta. 1996. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254.
Daryl E. Chubin and S. D. Moitra. 1975. Content analysis of references: Adjunct or alternative to citation counting? Social Studies of Science, 5(4):423–441.
Eugene Garfield. 1979. Citation Indexing: Its Theory and Application in Science, Technology and Humanities. J. Wiley, New York, NY.
C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. 1998. Citeseer: An automatic citation indexing system. In Proc. of the Third ACM Conference on Digital Libraries, pages 89–98.
T.L. Hodges. 1972. Citation Indexing: Its Potential for Bibliographical Control. Ph.D. thesis, University of California at Berkeley.
David D. Lewis. 1991. Evaluating text categorisation. In Speech and Natural Language: Proceedings of the ARPA Workshop of Human Language Technology.
Terttu Luukkonen. 1992. Is scientists’ publishing behaviour reward-seeking? Scientometrics, 24:297–319.
Michael H. MacRoberts and Barbara R. MacRoberts. 1984. The negational reference: Or the art of dissembling. Social Studies of Science, 14:91–94.
Michael J. Moravcsik and Poovanalingan Murugesan. 1975. Some results on the function and quality of citations. Social Studies of Science, 5:88–91.
Greg Myers. 1992. In this paper we report... speech acts and scientific facts. Journal of Pragmatics, 17(4).
John O’Connor. 1982. Citing statements: Computer recognition and use to improve retrieval. Information Processing and Management, 18(3):125–131.
Chris D. Paice. 1981. The automatic generation of literary abstracts: an approach based on the identification of self-indicating phrases. In R. Oddy, S. Robertson, C. van Rijsbergen, and P. W. Williams, editors, Information Retrieval Research. Butterworth, London, UK.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proc. of EMNLP-02.
Anna Ritchie, Simone Teufel, and Stephen Robertson. 2006a. Creating a test collection for citation-based IR experiments. In Proc. of HLT/NAACL 2006, New York, US.
Anna Ritchie, Simone Teufel, and Stephen Robertson. 2006b. How to find better index terms through citations. In Proc. of ACL/COLING workshop “Can Computational Linguistics improve IR”.
Simon Buckingham Shum. 1998. Evolving the web for scientific knowledge: First steps towards an “HCI knowledge web”. Interfaces, British HCI Group Magazine, 39.
Henry Small. 1982. Citation context analysis. In P. Dervin and M. J. Voigt, editors, Progress in Communication Sciences 3, pages 287–310. Ablex, Norwood, N.J.
Ina Spiegel-Rüsing. 1977. Bibliometric and content analysis. Social Studies of Science, 7:97–113.
John Swales. 1986. Citation analysis and discourse analysis. Applied Linguistics, 7(1):39–56.
John Swales. 1990. Genre Analysis: English in Academic and Research Settings. Chapter 7: Research articles in English, pages 110–176. Cambridge University Press, Cambridge, UK.
Simone Teufel and Marc Moens. 2002. Summarising scientific articles – experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409–446.
Simone Teufel, Advaith Siddharthan, and Dan Tidhar. 2006. An annotation scheme for citation function. In Proc. of SIGDial-06.
Simone Teufel. 1999. Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. thesis, School of Cognitive Science, University of Edinburgh, UK.
Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proc. of ACL-02.
Melvin Weinstock. 1971. Citation indexes. In Encyclopedia of Library and Information Science, volume 5. Dekker, New York, NY.
Howard D. White. 2004. Citation analysis and discourse analysis revisited. Applied Linguistics, 25(1):89–116.
Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco.
Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proc. of SIGIR-99.
John M. Ziman. 1968. Public Knowledge: An Essay Concerning the Social Dimensions of Science. Cambridge University Press, Cambridge, UK.
