Discourse-level argumentation in scientific articles: 
human and automatic annotation 
Simone Teufel and Marc Moens 
HCRC Language Technology Group 
Division of Informatics 
University of Edinburgh 
S.Teufel@ed.ac.uk, M.Moens@ed.ac.uk
Abstract 
In this paper we present a rhetorically de- 
fined annotation scheme which is part of 
our corpus-based method for the summari- 
sation of scientific articles. The annotation 
scheme consists of seven non-hierarchical 
labels which model prototypical academic 
argumentation and expected intentional 
'moves'. In a large-scale experiment with 
three expert coders, we found the scheme 
stable and reproducible. We have built a 
resource consisting of 80 papers annotated 
by the scheme, and we show that this kind 
of resource can be used to train a system 
to automate the annotation work. 
1 Introduction 
Work on summarisation has suffered from a lack of 
appropriately annotated corpora that can be used 
for building, training and evaluating summarisation 
systems. Typically, corpus work in this area has 
taken as its starting point texts with target summaries: 
abstracts written by the researchers, supplied by 
the original authors or provided by professional ab- 
stractors. Training a summarisation system then in- 
volves learning the properties of sentences in those 
abstracts and using this knowledge to extract simi- 
lar abstract-worthy sentences from unseen texts. In 
this scenario, system performance or development 
progress can be evaluated by taking texts in a test 
sample and comparing the sentences extracted from 
these texts with the sentences in the target abstract. 
But this approach has a number of shortcomings. 
First, sentence extraction on its own is a very gen- 
eral methodology, which can produce extracts that 
are incoherent or under-informative, especially when 
used for high-compression summarisation (i.e. reduc- 
ing a document to a small percentage of its orig- 
inal size). It is difficult to overcome this prob- 
lem, because once sentences have been extracted 
from the source text, the context that is needed 
for their interpretation is not available anymore and 
cannot be used to produce more coherent abstracts 
(Spärck Jones, 1998). 
Our proposed solution to this problem is to ex- 
tract sentences but also to classify them into one of 
a small number of possible argumentative roles, re- 
flecting whether the sentence expresses a main goal 
of the source text, a shortcoming in someone else's 
work, etc. The summarisation system can then use 
this information to generate template-like abstracts: 
Main goal of the text:... ; Builds on work by:... ; 
Contrasts with:... ; etc. 
Second, the question of what constitutes a use- 
ful gold standard has not yet been solved satisfac- 
torily. Researchers developing corpus resources for 
summarisation work have often defined their own 
gold standard, relying on their own intuitions (see, 
e.g. Luhn, 1958; Edmundson, 1969) or have used 
abstracts supplied by authors or by professional ab- 
stractors as their gold standard (e.g. Kupiec et al., 
1995; Mani and Bloedorn, 1998). Neither approach 
is very satisfactory. Relying only on your own intu- 
itions inevitably creates a biased resource; indeed, 
Rath et al. (1961) report low agreement between 
human judges carrying out this kind of task. On 
the other hand, using abstracts as targets is not 
necessarily a good gold standard for comparison of 
the systems' results, although abstracts are the only 
kind of gold standard that comes for free with the 
papers. Even if the abstracts are written by pro- 
fessional abstractors, there are considerable differ- 
ences in length, structure, and information content. 
This is due to differences in the common abstract 
presentation style in different disciplines and to the 
projected use of the abstracts (cf. Liddy, 1991). In 
the case of our corpus, an additional problem was 
the fact that the abstracts are written by the au- 
thors themselves and thus susceptible to differences 
in individual writing style. 
For the task of summarisation and relevance deci- 
sion between similar papers, however, it is essential 
that the information contained in the gold standard 
is comparable between papers. In our approach, the 
vehicle for comparability of information is similarity 
in argumentative roles of the associated sentences. 
We argue that it is more difficult to find the kind of 
information that preserves similarity of argumenta- 
tive roles, and that it is not guaranteed that it will 
occur in the abstract. 
A related problem concerns fair evaluation of 
the extraction methodology. The evaluation of ex- 
tracted material necessarily consists of a comparison 
of sentences, whereas one would really want to com- 
pare the informational content of the extracted sen- 
tences and the target abstract. Thus it will often be 
the case that a system extracts a sentence which in 
that form does not appear in the supplied abstract 
(resulting in a low performance score) but which is 
nevertheless an abstract-worthy sentence. The mis- 
match often arises simply because a similar idea is 
expressed in the supplied abstract in a very differ- 
ent form. But comparison of content is difficult to 
perform: it would require sentences to be mapped 
into some underlying meaning representations and 
then comparing these to the representations of the 
sentences in the gold standard. As this is techni- 
cally not feasible, system performance is typically 
measured against a fixed gold standard (e.g. the 
aforementioned abstracts), which is ultimately un- 
desirable. 
Our proposed solution to this problem is to build 
a corpus which details not only what the abstract- 
worthy sentences are but also what their argumen- 
tative role is. This corpus can then be used as a 
resource to build a system to similarly classify sen- 
tences in unseen texts, and to evaluate that system. 
This paper reports on the development of a set of 
such argumentative roles that we have been using in 
our work. 
In particular, we employ human intuition to an- 
notate argumentatively defined information. We 
ask our annotators to classify every sentence in the 
source text in terms of its argumentative role (e.g. 
that it expresses the main goal of the source text, or 
identifies open problems in earlier work, etc). Under 
this scenario, system evaluation is no longer a com- 
parison of extracted sentences against a supplied ab- 
stract, or against a single sentence that was chosen 
as expressing (e.g.) the main goal of the source text. 
Instead, every sentence in the source text which ex- 
presses the main goal will have been identified, and 
the system's performance is evaluated against that 
classification. 
Of course, having someone annotate text in this 
way may still lead to a biased or careless annotation. 
We therefore needed an annotation scheme which is 
simple enough to be usable in a stable and intu- 
itive way for several annotators. This paper also 
reports on how we tested the stability of the anno- 
tation scheme we developed. A second design crite- 
rion for our annotation scheme was that we wanted 
the roles to be annotated automatically. This paper 
reports on preliminary results which show that the 
annotation process can indeed be automated. 
To summarise, we have argued that discourse 
structure information will improve summarisation. 
Other researchers (Ono et al., 1994; Marcu, 1997) 
have argued similarly, although most previous work 
on discourse-based summarisation follows a different 
discourse model, namely Rhetorical Structure The- 
ory (Mann and Thompson, 1987). In contrast to 
RST, we stress the importance of rhetorical moves 
which are global to the argumentation of the paper, 
as opposed to more local RST-type relations. Our 
categories are not hierarchical, and they are much 
less fine-grained than RST-relations. As mentioned 
above, we wanted them to a) provide context in- 
formation for flexible summarisation, b) provide a 
higher degree of comparability between papers, and 
c) provide a fairer evaluation of superficially differ- 
ent sentences. 
In the rest of this paper, we will first describe how 
we chose the categories (section 2). Second, we had 
to construct training and evaluation material such 
that we could be sure that the proposed categorisa- 
tion yielded a reliable resource of annotated text to 
train a system against, a gold standard. The human 
annotation experiments are reported in section 3. 
Finally, in section 4, we describe some of the auto- 
mated annotation work which we have started re- 
cently and which uses a corpus annotated according 
to our scheme as its training material. 
2 The annotation scheme 
The domain in which we work is that of scientific re- 
search articles, in particular computational linguis- 
tics articles. We settled on this domain for a num- 
ber of reasons. One reason is that it is a domain 
we are familiar with, which helps for intermediate 
evaluation of the annotation work. The other rea- 
son is that computational linguistics is also a rather 
heterogeneous domain: the papers in our collection 
cover a wide range of subject matters, such as logic 
programming, statistical language modelling, theo- 
retical semantics and computational psycholinguis- 
tics. This makes it a challenging test bed for our 
BASIC SCHEME 
  BACKGROUND  Sentences describing some (generally accepted) background 
              knowledge 
  OTHER       Sentences describing aspects of some specific other research 
              in a neutral way (excluding contrastive or BASIS statements) 
  OWN         Sentences describing any aspect of the own work presented in 
              this paper - except what is covered by AIM or TEXTUAL, e.g. 
              details of solution (methodology), limitations, and further 
              work. 

FULL SCHEME (adds the following categories) 
  AIM         Sentences best portraying the particular (main) research 
              goal of the article 
  TEXTUAL     Explicit statements about the textual section structure of 
              the paper 
  CONTRAST    Sentences contrasting own work to other work; sentences 
              pointing out weaknesses in other research; sentences stating 
              that the research task of the current paper has never been 
              done before; direct comparisons 
  BASIS       Statements that the own work uses some other work as its 
              basis or starting point, or gets support from this other work 

Figure 1: Overview of the annotation scheme 
scheme which we hope to be applicable in a range of 
disciplines. 
Despite its heterogeneity, our collection of papers 
does exhibit predictable rhetorical patterns of sci- 
entific argumentation. To analyse these patterns 
we used Swales' (1990) CARS (Creating a Research 
Space) model as our starting point. 
The annotation scheme we designed is sum- 
marised in Figure 1. The seven categories describe 
argumentative roles with respect to the overall com- 
municative act of the paper. They are to be read as 
mutually exclusive labels, one of which is attributed 
to each sentence in a text. There are two kinds of 
categories in this scheme: basic categories and non- 
basic categories. Basic categories are defined by at- 
tribution of intellectual ownership; they distinguish 
between: 
• statements which are presented as generally ac- 
cepted (BACKGROUND); 
• statements which are attributed to other, spe- 
cific pieces of research outside the given pa- 
per, including the authors' own previous work (OTHER); 
• statements which describe the authors' own new 
contributions (OWN). 
The four additional (non-basic) categories are 
more directly based on Swales' theory. The most 
important of these is AIM, as this move on its 
own is already a good characterisation of the en- 
tire paper, and thus very useful for the generation 
of abstracts. Another category is TEXTUAL, 
which provides information about section structure 
that might prove helpful for subsequent search steps. 
There are two moves having to do with the author's 
attitude towards previous research, namely BASIS 
and CONTRAST. We expect this kind of information 
to be useful for the creation of typed links for biblio- 
metric search tools and for the automatic determi- 
nation of rival approaches in the field and intellec- 
tual ancestry of methodologies (cf. Garfield's (1979) 
classification of the function of citation within re- 
searchers' papers). 
The structure in Figure 2, for example, displays 
a common rhetorical pattern of scientific argumen- 
tation which we found in many introductions. A 
BACKGROUND segment, in which the history and the 
importance of the task is discussed, is followed by a 
longer sequence of OTHER sentences, in which spe- 
cific prior work is described in a neutral way. This 
discussion usually terminates in a criticism of the 
prior work, thus giving a motivation for the own 
work presented in the paper. The next sentence typ- 
ically states the specific goal or contribution of the 
paper, often in a formulaic way (Myers, 1992). 
Such regularities, where the segments are contigu- 
ous, non-overlapping and non-hierarchical, can be 
[Diagram: a BACKGROUND segment, then an OTHER segment ("Recently, new 
methods of ..." <REFERENCE> <REFERENCE>), then a CONTRAST statement, 
then an AIM sentence.] 

Figure 2: Typical rhetorical pattern in a research 
paper introduction 
expressed well with our category labels. Whereas 
non-basic categories are typically short segments of 
one or two sentences, the basic categories form much 
larger segments of sentences with the same rhetorical 
role. 
3 Human Annotation 
3.1 Annotating full texts 
To ensure that our coding scheme leads to less bi- 
ased annotation than some of the other resources 
available for building summarisation systems, and to 
ensure that other researchers besides ourselves can 
use it to replicate our results on different types of 
texts, we wanted to examine two properties of our 
scheme: stability and reproducibility (Krippendorff, 
1980). Stability is the extent to which an annota- 
tor will produce the same classifications at different 
times. Reproducibility is the extent to which differ- 
ent annotators will produce the same classification. 
We use the Kappa coefficient (Siegel and Castellan, 
1988) to measure stability and reproducibility. The 
rationale for using Kappa is explained in (Carletta, 
1996). 
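To make the measure concrete, the Kappa computation can be sketched as follows. This is an illustrative sketch, not the code used in our experiments; it shows the two-coder case, with chance agreement estimated from pooled category proportions in the style of Siegel and Castellan:

```python
from collections import Counter

def kappa(coder_a, coder_b):
    """Two-coder Kappa, K = (P(A) - P(E)) / (1 - P(E)), where chance
    agreement P(E) is estimated from pooled category proportions."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # P(A): observed proportion of items both coders label identically.
    p_a = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # P(E): expected chance agreement from the pooled label distribution.
    pooled = Counter(coder_a) + Counter(coder_b)
    p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (p_a - p_e) / (1 - p_e)
```

Perfect agreement yields K=1, and agreement no better than chance yields K=0, which is what makes Kappa preferable to raw percentage agreement.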
The studies used to evaluate stability and repro- 
ducibility we describe in more detail in (Teufel et 
al., To Appear). In brief, 48 papers were annotated 
by three extensively trained annotators. The train- 
ing period was four weeks consisting of 5 hours of 
annotation per week. There were written instruc- 
tions (guidelines) of 17 pages. Skim-reading and 
annotation of an average-length (3,800 word) pa- 
per typically took 20-30 minutes. The studies show 
that the training material is reliable. In particu- 
lar, the basic annotation scheme is stable (K=.82, 
.81, .76; N=1220; k=2 for all three annotators) and 
reproducible (K=.71, N=4261, k=3), where k de- 
notes the number of annotators, N the number of 
sentences annotated, and K gives the Kappa value. 
The full annotation scheme is stable (K=.83, .79, 
.81; N=1248; k=2 for all three annotators) and re- 
producible (K=.78, N=4031, k=3). Overall, repro- 
ducibility and stability for trained annotators does 
not quite reach the levels found for, for instance, 
the best dialogue act coding schemes, which typi- 
cally reach Kappa values of around K=.80 (Carletta 
et al., 1997; Jurafsky et al., 1997). Our annotation 
requires more subjective judgements and is possibly 
more cognitively complex. Our reproducibility and 
stability results are in the range which Krippendorff 
(1980) describes as giving marginally significant re- 
sults for reasonable size data sets when correlating 
two coded variables which would show a clear cor- 
relation if there were perfect agreement. As our re- 
quirements are less stringent than Krippendorff's, 
we find the level of agreement which we achieved 
acceptable. 
OWN         69.4% 
OTHER       15.8% 
BACKGROUND   5.7% 
CONTRAST     4.4% 
AIM          2.4% 
BASIS        1.4% 
TEXTUAL      0.9% 

Figure 3: Distribution of categories 
[Bar chart of Kappa values per category.] 

Figure 4: Reproducibility diagnostics: non-basic 
categories 
Figure 3, which gives the overall distribution of 
categories, shows that OWN is by far the most fre- 
quent category. Figure 4 reports how well the four 
non-basic categories could be distinguished from all 
other categories, measured by Krippendorff's diag- 
nostics for category distinctions (i.e. collapsing all 
other distinctions). When compared to the over- 
all reproducibility of .71, we notice that the anno- 
tators were good at distinguishing AIM and TEX- 
TUAL, and less good at determining BASIS and CON- 
TRAST. This might have to do with the location of 
those types of sentences in the paper: AIM and TEX- 
TUAL are usually found at the beginning or end of 
the introduction section, whereas CONTRAST, and 
even more so BASIS, are usually interspersed within 
longer stretches of OWN. As a result, these cate- 
gories are more exposed to lapses of attention during 
annotation. 
The fact that the annotators are good at deter- 
mining AIM sentences is an important result: as AIM 
sentences constitute the best characterisation of the 
research paper for the summarisation task at a very 
high compression to 1.8% of the original text length, 
we are particularly interested in having them anno- 
tated consistently in our training material. This re- 
sult is clearly in contrast to studies which conclude 
that humans are not very reliable at this kind of task 
(Rath et al., 1961). We attribute this difference to a 
difference in our instructions. Whereas the subjects 
in Rath et al.'s experiment were asked to look for 
the most relevant sentences, our annotators had to 
look for specific argumentative roles which seems to 
have eased the task. In addition, our guidelines give 
very specific instructions for ambiguous cases. 
These reproducibility values are important be- 
cause Kappa can act as a good evaluation measure: 
it factors random agreement out, unlike percentage 
agreement. It also provides a realistic upper bound 
on performance: if the machine is treated as another 
coder, and if reproducibility does not decrease, then 
the machine has reached the theoretically best re- 
sult, considering the cognitive difficulty of the task. 
3.2 Annotating parts of texts 
Annotating texts with our scheme is time- 
consuming, so we wanted to determine if there was a 
more efficient way of obtaining hand-coded training 
material, namely by annotating only parts of the 
source texts. For example, the abstract, introduc- 
tions and conclusions of source texts are often like 
"condensed" versions of the contents of the entire pa- 
per and might be good areas to restrict annotation 
to. Alternatively, it might be a good idea to restrict 
annotation to the first 20% or the last 10% of any 
given text. Yet another possibility for restricting the 
range of sentences to be annotated is based on the 
'alignment' idea introduced in (Kupiec et al., 1995): 
a simple surface measure determines sentences in the 
document that are maximally similar to sentences in 
the abstract. 
Obviously, any of these strategies of area restric- 
tion would give us fewer gold standard sentences per 
paper, so we would have to make sure that we still 
had enough candidate sentences for all seven cate- 
gories. On the other hand, because these areas could 
well be the most clearly written and informationally 
rich sections, it might be the case that the qual- 
ity of the resulting gold standard is higher. In this 
case we would expect the reliability of the coding in 
these areas to be higher in comparison to the reli- 
ability achieved overall, which in turn would result 
in higher accuracy when this task is done automat- 
ically. 
[Bar chart of Kappa values per annotated area.] 

Figure 5: Reproducibility by annotated area 

[Stacked bar chart of category percentages per annotated area.] 

Figure 6: Label distribution by annotated area 
We did extensive experiments on this. Figure 5 
shows reliability values for each of the annotated 
portions of text, and Figure 6 shows the composi- 
tion in terms of our labels for each of the annotated 
portions of text. The implications for corpus prepa- 
ration for abstract generation experiments can be 
summarised as follows. If one wants to avoid manu- 
ally annotating entire papers but still make all argu- 
mentative distinctions, one can restrict the annota- 
tion to sentences appearing in the introduction sec- 
tion, even though annotators will find them slightly 
harder to classify (K=.69), or to all alignable ab- 
stract sentences, even if there are not many alignable 
abstract sentences detectable overall (around 50% of 
the sentences in the abstract), or to conclusion sen- 
tences, even if the coverage of argumentative cate- 
gories is very restricted in the conclusions (mostly 
AIM and OWN sentences). 
We also examined a fall-back option of just anno- 
tating the first 10% or last 5% of a paper (as not all 
papers in our collection have an explicitly marked 
introduction and conclusion section), but the relia- 
bility results of this were far less good (K=.66 and 
K=.63, respectively). 
4 Automatic annotation 
All the annotation work is obviously in aid of de- 
velopment work, in particular for the training of a 
system. We will provide a brief description of train- 
ing results so as to show the practical viability of the 
proposed corpus preparation method. 
4.1 Data 
Our training material is a collection of 80 con- 
ference papers and their summaries, taken from 
the Computation and Language E-Print Archive 
(http://xxx.lanl.gov/cmp-lg/). The training 
material contains 330,000 word tokens. 
The data is automatically preprocessed into XML 
format, and the following structural information is 
marked up: title, summary, headings, paragraph 
structure and sentences, citations in running text, 
and reference list at the end of the paper. If one 
of the paper's authors also appears on the author 
list of a cited paper, then that citation is marked 
as self citation. Tables, equations, figures, captions, 
and cross-references are removed and replaced by place- 
holders. Sentence boundaries are automatically de- 
tected, and the text is POS-tagged according to the 
UPenn tagset. 
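The self-citation check described above amounts to an author-overlap test; a minimal sketch (function and variable names are our own illustration, not the actual preprocessor):

```python
def self_citations(paper_authors, citations):
    """citations: list of (citation_id, cited_author_list) pairs.
    A citation counts as a self citation if it shares at least one
    author (compared case-insensitively) with the citing paper."""
    own = {name.lower() for name in paper_authors}
    return [cid for cid, cited in citations
            if own & {name.lower() for name in cited}]
```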
Annotation of rhetorical roles for all 80 papers 
(around 12,000 sentences) was provided by one of 
our human judges during the annotation study men- 
tioned above. 
4.2 The method 
(Kupiec et al., 1995) use supervised learning to au- 
tomatically adjust feature weights. Each document 
sentence receives scores for each of the features, re- 
sulting in an estimate for the sentence's probability 
of also occurring in the summary. This probability is 
calculated for each feature value as a combination 
of the probability of the feature-value pair occurring 
in a sentence which is in the summary (successful 
case) and the probability that the feature-value pair 
occurs unconditionally. 
We extend Kupiec et al.'s estimation of the proba- 
bility that a sentence is contained in the abstract, to 
the probability that it has rhetorical role R (cf. Fig- 
ure 7). 
P(s ∈ R | F1, ..., Fk) ≈ P(s ∈ R) · Π(j=1..k) P(Fj | s ∈ R) / Π(j=1..k) P(Fj) 

where 
P(s ∈ R | F1, ..., Fk): probability that sentence s 
                        in the source text has rhetorical 
                        role R, given its feature values; 
P(s ∈ R):               relative frequency of role R (con- 
                        stant); 
P(Fj | s ∈ R):          probability of feature-value pair 
                        Fj occurring in a sentence which is 
                        in rhetorical class R; 
P(Fj):                  probability that the feature-value 
                        pair occurs unconditionally; 
k:                      number of feature-value pairs; 
Fj:                     j-th feature-value pair. 

Figure 7: Naive Bayesian classifier 
Evaluation of the method relies on cross- 
validation: the model is trained on a training set 
of documents, leaving one document out at a time 
(the test document). The model is then used to as- 
sign each sentence a probability for each category 
R, and the category with the highest probability is 
chosen as answer for the sentence. 
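A minimal sketch of this classifier follows. It is illustrative only: the feature names and the add-one smoothing are our simplifications, not the exact system. Each training example pairs a dictionary of feature values with a role, and classification picks the role maximising P(R) · Π P(Fj|R) / P(Fj):

```python
from collections import defaultdict

def train(examples):
    """examples: list of (feature_dict, role) pairs.
    Collects the counts needed for P(R), P(Fj | R) and P(Fj)."""
    role_count = defaultdict(int)
    pair_in_role = defaultdict(int)   # counts of (role, feature, value)
    pair_count = defaultdict(int)     # counts of (feature, value)
    for feats, role in examples:
        role_count[role] += 1
        for f, v in feats.items():
            pair_in_role[(role, f, v)] += 1
            pair_count[(f, v)] += 1
    return role_count, pair_in_role, pair_count, len(examples)

def classify(model, feats):
    """Return the role R maximising P(R) * prod_j P(Fj|R) / P(Fj),
    with add-one smoothing against unseen feature-value pairs."""
    role_count, pair_in_role, pair_count, n = model
    best_role, best_score = None, -1.0
    for role, rc in role_count.items():
        score = rc / n                                  # prior P(R)
        for f, v in feats.items():
            p_fv_given_r = (pair_in_role[(role, f, v)] + 1) / (rc + 2)
            p_fv = (pair_count[(f, v)] + 1) / (n + 2)
            score *= p_fv_given_r / p_fv
        if score > best_score:
            best_role, best_score = role, score
    return best_role
```

In leave-one-out cross-validation, `train` is simply re-run on all papers except the held-out one before `classify` is applied to its sentences.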
4.3 Features 
The features we use in training (see Figure 8) are 
different from Kupiec et al.'s because we do not es- 
timate overall importance in one step, but instead 
guess argumentative status first and determine im- 
portance later. 
Many of our features can be read off directly from 
the way the corpus is encoded: our preprocessors 
determine sentence-boundaries and parse the refer- 
ence list at the end. This gives us a good handle 
on structural and locational features, as well as on 
features related to citations. 
Type of feature     Name      Feature description                             Feature values 
Explicit structure  Struct-1  Type of headline of current section             8 prototypical headlines or 'non-prototypical' 
                    Struct-2  Relative position of sentence within paragraph  initial, medial, final 
                    Struct-3  Relative position of sentence within section    first, second or last third 
Relative location   Loc       Paper is segmented into 10 equally-sized        1-10 
                              segments 
Citations           Cit-1     Does the sentence contain a citation or the     Full Citation, Author Name or None 
                              name of an author contained in the reference 
                              list? 
                    Cit-2     Does the sentence contain a self citation?      Yes or No 
Syntactic features  Syn-1     Tense (associated with first finite verb        Present, Past, Present Perfect, Past Perfect, 
                              in sentence)                                    Future or Nothing 
                    Syn-2     Modal auxiliaries                               Present or Not 
                    Syn-3     Voice                                           Active or Passive 
                    Syn-4     Negation                                        Present or Not 
Semantic features   Sem-1     Action type of first verb in sentence           20 different Action Types (cf. Figure 9) 
                                                                              or Nothing 
                    Sem-2     Type of agent                                   Authors or Others or Nothing 
                    Sem-3     Type of formulaic expression occurring in       18 different types of Formulaic Expressions 
                              sentence                                        (cf. Figure 9) or Nothing 
Content features    Cont-1    Does the sentence contain keywords as           Yes or No 
                              determined by the tf/idf measure? 
                    Cont-2    Does the sentence contain words also            Yes or No 
                              occurring in the title or headlines? 

Figure 8: Features for supervised learning 
The syntactic features rely on determining the 
first finite verb in the sentence, which is done sym- 
bolically using POS-information. Heuristics are used 
to determine the tense and possible negation. 
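The tense heuristic can be sketched roughly as follows, over (word, POS) pairs in the UPenn tagset. This is a hypothetical simplification that only illustrates the idea; the actual rules are more elaborate:

```python
def tense_of_first_finite_verb(tagged):
    """tagged: list of (word, POS) pairs, UPenn tagset.
    A rough Syn-1-style heuristic for the tense of the first finite verb."""
    for i, (word, pos) in enumerate(tagged):
        if pos in ("VBP", "VBZ"):
            # 'has/have' followed by a past participle: present perfect
            if word.lower() in ("has", "have") and i + 1 < len(tagged) \
                    and tagged[i + 1][1] == "VBN":
                return "Present Perfect"
            return "Present"
        if pos == "VBD":
            return "Past"
        if pos == "MD" and word.lower() in ("will", "shall"):
            return "Future"
    return "Nothing"
```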
The semantic features rely on template matching. 
In the feature Sem-1, a hand-crafted lexicon is used 
to classify the verb into one of 20 Action Classes 
(cf. Figure 9, left half), if it is one of the 388 verbs 
contained in the lexicon. The feature Sem-2 encodes 
whether the agent of the action is most likely to re- 
fer to the authors, or to other agents, e.g. other 
researchers (177 templates). Heuristic rules deter- 
mine that the agent is the subject in an active sen- 
tence, or the head of the by-phrase (if present) in a 
passive sentence. Sem-3 encodes various other for- 
mulaic expressions (indicator phrases (Paice, 1981), 
meta-comments (Zukerman, 1991)) in order to ex- 
ploit explicit rhetoric phrases the authors might have 
used, cf. Figure 9, right half (414 templates). 
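The agent templates can be pictured as a small pattern lexicon applied to the candidate agent phrase. The patterns below are invented examples, not the 177 actual templates:

```python
import re

# Invented example patterns; the system's lexicon contains 177 templates.
AGENT_PATTERNS = [
    (re.compile(r"\b(we|our (?:system|approach|method))\b", re.I), "Authors"),
    (re.compile(r"\b(they|previous (?:work|approaches))\b", re.I), "Others"),
    (re.compile(re.escape("<REF/>")), "Others"),
]

def agent_type(subject_phrase):
    """Sem-2-style feature: classify the grammatical agent, i.e. the
    subject of an active sentence or the by-phrase head of a passive one."""
    for pattern, label in AGENT_PATTERNS:
        if pattern.search(subject_phrase):
            return label
    return "Nothing"
```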
The content features use the tf/idf method and 
title and header information for finding contentful 
words or phrases. In contrast to all other features 
they do not attempt to model the form or meta- 
discourse contained in the sentences but instead 
model their domain (object-level) contents. 
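A sketch of the Cont-1 computation (illustrative only; the real feature operates over the full document collection and uses thresholds we do not reproduce here):

```python
import math
from collections import Counter

def tfidf_keywords(documents, doc_index, top_n=5):
    """Rank the words of one document by tf * idf over the collection.
    `documents` is a list of token lists."""
    tf = Counter(documents[doc_index])
    df = Counter()
    for doc in documents:
        df.update(set(doc))           # document frequency per word
    n_docs = len(documents)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

def cont1(sentence_tokens, keywords):
    """Binary Cont-1-style feature: Yes if a keyword occurs in the sentence."""
    return "Yes" if set(sentence_tokens) & set(keywords) else "No"
```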
4.4 Results 
When the Naive Bayesian Model is added to the 
pool of coders, the reproducibility drops from K=.71 
to K=.55. This reproducibility value is equivalent 
to the value achieved by 6 human annotators with 
no prior training, as found in an earlier experiment 
(Teufel et al., To Appear). Compared to one of the 
annotators, Kappa is K=.37, which corresponds to 
percentage accuracy of 71.2%. This number cannot 
be directly compared to experiments like Kupiec et 
al.'s because in their experiment a compression of 
around 3% was achieved whereas we classify each 
sentence into one of the categories. 
Further analysis of our results shows the system 
performs well on the frequent category OWN, cf. the 
confusion matrix in Figure 10. Indeed, 
as Figure 3 shows, OWN is so frequent that choos- 
ing OWN all the time gives us a seemingly hard- 
to-beat baseline with a high percentage agreement 
of 69% (Baseline 1). However, the Kappa statistic, 
which controls for expected random agreement, re- 
veals just how bad that baseline really is: Kappa 
is K=-.12 (machine vs. one annotator). Random 
choice of categories according to the distribution of 
categories (Baseline 2) is a better baseline; Kappa 
Action Types 
AFFECT            we hope to improve these results 
ARGUMENTATION     we argue against an application of 
AWARENESS         we know of no other attempts... 
BETTER_SOLUTION   our system outperforms that of ... 
CHANGE            we extend <CITE/>'s algorithm 
COMPARISON        we tested our system against... 
CONTINUATION      we follow X in postulating that 
CONTRAST          our approach differs from X's ... 
FUTURE_INTEREST   we intend to improve our results... 
INTEREST          we are concerned with ... 
NEED              this approach, however, lacks... 
PRESENTATION      we present here a method for... 
PROBLEM           this raises the problem of how to... 
RESEARCH          we collected our data from... 
SIMILAR           our approach resembles that of X... 
SOLUTION          we solve this problem by... 
TEXTSTRUCTURE     the paper is organized as follows... 
USE               we employ X's method... 
COPULA            our goal is to... 
POSSESSION        our approach has three advantages... 

Formulaic Expression Types 
GENERAL-AGENT       linguists 
SPECIFIC-AGENT      according to <REF/> 
GAP-INTRODUCTION    to our knowledge 
AIM                 main contribution of this 
TEXTSTRUCTURE       in section <CREF/> 
DEIXIS              in this paper 
CONTINUATION        following the argument in 
SIMILARITY          bears similarity to 
COMPARISON          when compared to our 
CONTRAST            however 
METHOD              a novel method for XX-ing 
PREVIOUS_CONTEXT    elsewhere, we have 
FUTURE              avenue for improvement 
AFFECT              hopefully 
PROBLEM             drawback 
SOLUTION            insight 
POSITIVE_ADJECTIVE  appealing 
NEGATIVE_ADJECTIVE  unsatisfactory 

Figure 9: Types of actions and formulaic expressions 
                                   MACHINE 
HUMAN        AIM  CONTRAST  TEXTUAL   OWN  BACKGROUND  BASIS  OTHER  Total 
AIM          115         4       10    46          15     13      4    207 
CONTRAST      11        79        5   280          92     40     89    596 
TEXTUAL       13         4      115    71           5      3     12    223 
OWN           75        61       61  7666         168    125    279   8435 
BACKGROUND    11        20        3   286         295     21     84    720 
BASIS         10        10        5    40           4    102     55    226 
OTHER          7        35       10  1120         203    173    466   2014 
Total        242       213      209  9509         782    477    989  12421 

Figure 10: Confusion matrix: human vs. automatic annotation 
for this baseline is K=0. 
AIM categories can be determined with a preci- 
sion of 48% and a recall of 56% (cf. Figure 11). 
These values are more directly comparable to Ku- 
piec et al.'s results of 44% co-selection of extracted 
sentences with alignable summary sentences. We 
assume that most of the sentences extracted by 
their method would have fallen into the AIM cate- 
gory. The other easily determinable category for the 
automatic method is TEXTUAL (p=55%; r=52%), 
whereas the results for the other non-basic categories 
are relatively lower - mirroring the results for hu- 
mans. 
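Precision and recall per category follow directly from the confusion matrix in Figure 10: precision divides the diagonal cell by the machine's column total, recall by the human's row total. A sketch, using a toy two-category matrix whose AIM counts mirror the AIM row and column totals of Figure 10:

```python
def precision_recall(confusion, category):
    """confusion[human_label][machine_label] = sentence count.
    Precision: diagonal cell over machine column total;
    recall: diagonal cell over human row total."""
    tp = confusion[category].get(category, 0)
    machine_total = sum(row.get(category, 0) for row in confusion.values())
    human_total = sum(confusion[category].values())
    return tp / machine_total, tp / human_total
```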
As far as the individual features are concerned, we 
found the strongest heuristics to be location, type of 
header, citations, and the semantic classes (indicator 
phrases, agents and actions); syntactic and content- 
based heuristics are the weakest. The first column 
in Figure 12 gives the predictiveness of the feature 
AIM 
CONTRAST 
TEXTUAL 
OwN 
BACKGROUND 
BASIS 
OTHER 
Precision Recall 
48% 56% 
37% 13% 
55% 52% 
81% 91% 
38% 41% 
21% 45% 
47% 23% 
Figure 11: Precision and recall per category 
on its own, in terms of kappa between machine and 
one annotator. Some of the weaker features are not 
predictive enough on their own to break the domi- 
nance of the prior; in that case, they behave just like 
Baseline 1 (K=-.12). 
The second column gives kappa for experiments using all features except the given feature, i.e. the results if this feature is left out of the pool of features. These numbers show that some of the weaker features contribute some predictive power in combination with others.

Feature      Alone  Left out
Struct-1      -.12       .37
Struct-2      -.12       .36
Struct-3       .16       .36
Struct-1-3     .18       .34
Loc            .17       .34
Cit-1          .18       .37
Cit-2          .13       .37
Cit-1-2        .18       .36
Syn-1         -.12       .37
Syn-2         -.12       .37
Syn-3         -.12       .37
Syn-4         -.12       .37
Syn-1-4       -.12       .37
Sem-1         -.12       .36
Sem-2          .07       .35
Sem-3         -.03       .36
Sem-1-3        .13       .31
Cont-1        -.12       .37
Cont-2        -.12       .37
Cont-1-2      -.12       .37
Baseline 1 (all OWN): K=-.12
Baseline 2 (random by distr.): K=0
Figure 12: Disambiguation potential of individual heuristics
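The two ablation settings behind Figure 12 can be sketched as a pair of loops: each feature evaluated alone, and the full pool evaluated with one feature left out. The loop structure is the point here; `kappa_for` is a hypothetical placeholder standing in for training the classifier on a feature subset and measuring kappa against one human annotator.

```python
# Sketch of the "Alone" and "Left out" ablation settings in Figure 12.
FEATURES = ["Struct-1-3", "Loc", "Cit-1-2", "Syn-1-4", "Sem-1-3", "Cont-1-2"]

def kappa_for(feature_subset):
    # Hypothetical placeholder scorer: the real system would train and
    # evaluate the classifier on `feature_subset` here.  We return the
    # reported all-features value for the full pool and a dummy score
    # otherwise, purely so the loops below are runnable.
    return 0.37 if set(feature_subset) == set(FEATURES) else 0.0

alone = {f: kappa_for([f]) for f in FEATURES}
left_out = {f: kappa_for([g for g in FEATURES if g != f]) for f in FEATURES}
```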
While not entirely satisfactory, these results might be taken as an indication that we have indeed managed to identify the right kinds of features for argumentative sentence classification. Taking context into account should further improve results, as preliminary experiments with n-gram modelling have shown. In these experiments, we replaced the prior P(s ∈ R) in Figure 7 with an n-gram-based probability of that role occurring in the given context.
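The n-gram replacement of the prior can be sketched as follows: estimate the probability of a rhetorical role given the role of the preceding sentence from labelled training sequences, and use that in place of the global prior. The role inventory is the paper's own scheme, but the add-one smoothing choice and the toy training sequences are illustrative assumptions, not the actual experimental setup.

```python
# Bigram contextual prior over rhetorical roles: P(role | previous role),
# estimated with add-one smoothing from labelled sentence sequences.
from collections import Counter

ROLES = ["AIM", "CONTRAST", "TEXTUAL", "OWN", "BACKGROUND", "BASIS", "OTHER"]

def bigram_prior(sequences):
    counts = Counter()    # (previous role, current role) pair counts
    context = Counter()   # previous-role counts
    for seq in sequences:
        for prev, cur in zip(["<START>"] + seq, seq):
            counts[(prev, cur)] += 1
            context[prev] += 1
    def p(role, prev_role):
        # add-one smoothed P(role | prev_role)
        return (counts[(prev_role, role)] + 1) / (context[prev_role] + len(ROLES))
    return p

# Toy training data, purely for illustration.
train = [["BACKGROUND", "AIM", "OWN", "OWN"],
         ["BACKGROUND", "BACKGROUND", "AIM", "TEXTUAL", "OWN"]]
p = bigram_prior(train)
# In this toy sample, OWN is the most likely role to follow OWN.
```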
5 Conclusions 
In this paper we have presented an annotation scheme for corpus-based summarisation. In tests, we have found this annotation scheme to be stable and reproducible. On the basis of this scheme, we have created a new kind of resource for training summarisation systems: a corpus annotated with labels which indicate the argumentative role of each sentence in the text. Results of our training work show that the annotation can be automated.

References 
Jean Carletta, Amy Isard, Stephen Isard, Jacqueline C. 
Kowtko, Gwyneth Doherty-Sneddon, and Anne H. 
Anderson. 1997. The reliability of a dialogue 
structure coding scheme. Computational Linguistics,
23(1):13-31. 

Jean Carletta. 1996. Assessing agreement on classifica- 
tion tasks: the kappa statistic. Computational Lin- 
guistics, 22(2):249-254. 

H. P. Edmundson. 1969. New methods in automatic 
extracting. Journal of the Association for Computing 
Machinery, 16(2):264-285. 

E. Garfield. 1979. Citation indexing: its theory and application in science, technology and humanities. Wiley, New York.

Daniel Jurafsky, Elizabeth Shriberg, and Debra Bi- 
asca, 1997. Switchboard SWBD-DAMSL Shallow- 
Discourse-Function Annotation Coders Manual. Uni- 
versity of Colorado, Institute of Cognitive Science. 
TR-97-02. 

Klaus Krippendorff. 1980. Content analysis: an intro- 
duction to its methodology. Sage Commtext series; 5. 
Sage, Beverly Hills London. 

Julian Kupiec, Jan O. Pedersen, and Francine Chen. 
1995. A trainable document summarizer. In Pro- 
ceedings of the 18th ACM-SIGIR Conference, pages 
68-73. 

Elizabeth DuRoss Liddy. 1991. The discourse-level 
structure of empirical abstracts: an exploratory study. 
Information Processing and Management, 27(1):55-81. 

H. P. Luhn. 1958. The automatic creation of literature 
abstracts. IBM Journal of Research and Development, 
2(2):159-165. 

Inderjeet Mani and Eric Bloedorn. 1998. Machine learn- 
ing of generic and user-focused summarization. In 
Proceedings of the Fifteenth National Conference on 
AI (AAAI-98), pages 821-826. 

William C. Mann and Sandra A. Thompson. 1987. 
Rhetorical structure theory: description and construc- 
tion of text structures. In G. Kempen, editor, Natural 
Language Generation: New Results in Artificial In-
telligence, Psychology and Linguistics, pages 85-95, 
Dordrecht. Nijhoff. 

Daniel Marcu. 1997. From discourse structures to text 
summaries. In Proceedings of the ACL/EACL Work- 
shop on Intelligent Scalable Text Summarization. 

Greg Myers. 1992. In this paper we report... - speech 
acts and scientific facts. Journal of Pragmatics, 
17(4):295-313. 

Kenji Ono, Kazuo Sumita, and Seiji Miike. 1994. Ab-
stract generation based on rhetorical structure extrac- 
tion. In Proceedings of the 15th International confer- 
ence on Computational Linguistics (COLING-94). 

Chris D. Paice. 1981. The automatic generation of lit- 
erary abstracts: an approach based on the identifi- 
cation of self-indicating phrases. In Robert Norman 
Oddy, S. E. Robertson, C. J. van Rijsbergen, and 
P. W. Williams, editors, Information Retrieval Re- 
search, pages 172-191. Butterworth, London. 

G. J. Rath, A. Resnick, and T. R. Savage. 1961. The
formation of abstracts by the selection of sentences. 
American Documentation, 12(2):139-143. 

Sidney Siegel and N. J. Castellan, Jr. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, second edition.

Karen Spärck Jones. 1998. Automatic summarising:
factors and directions. In AAAI Spring Symposium 
on Intelligent Text Summarization. 

John Swales. 1990. Genre analysis: English in academic 
and research settings. Cambridge University Press. 

Simone Teufel, Jean Carletta, and Marc Moens. To Ap- 
pear. An annotation scheme for discourse-level argu- 
mentation in research articles. In Proceedings of the 
Ninth Conference of the European Chapter of the As- 
sociation for Computational Linguistics (EACL-99).

Ingrid Zukerman. 1991. Using meta-comments to gener- 
ate fluent text in a technical domain. Computational 
Intelligence: Special Issue on Natural Language Gen- 
eration, 7(4):276. 
