What's yours and what's mine: Determining Intellectual 
Attribution in Scientific Text 
Simone Teufel t 
Computer Science Department 
Columbia University 
t eufel©cs, columbia, edu 
Marc Moens 
HCRC Language Technology Group 
University of Edinburgh 
Marc. Moens©ed. ac. uk 
Abstract 
We believe that identifying the structure of scien- 
tific argumentation in articles can help in tasks 
such as automatic summarization or the auto- 
mated construction of citation indexes. One par- 
ticularly important aspect of this structure is the 
question of who a given scientific statement is at- 
tributed to: other researchers, the field in general, 
or the authors themselves. 
We present the algorithm and a systematic eval- 
uation of a system which can recognize the most 
salient textual properties that contribute to the 
global argumentative structure of a text. In this 
paper we concentrate on two particular features, 
namely the occurrences of prototypical agents and 
their actions in scientific text. 
1 Introduction 
When writing an article, one does not normally 
go straight to presenting the innovative scien- 
tific claim. Insteacl, one establishes other, well- 
known scientific facts first, which are contributed 
by other researchers. Attribution of ownership of- 
ten happens explicitly, by phrases such as "Chom- 
sky (1965) claims that". The question of intel- 
lectual attribution is important for researchers: 
not understanding the argumentative status of 
part of the text is a common problem for non- 
experts reading highly specific texts aimed at ex- 
perts (Rowley, 1982). In particular, after reading 
an article, researchers need to know who holds the 
"knowledge claim" for a certain fact that interests 
them. 
We propose that segmentation according to in- 
tellectual ownership can be done automatically, 
and that such a segmentation has advantages for 
various shallow text understanding tasks. At the 
heart of our classification scheme is the following 
trisection: 
* BACKGROUND (generally known work) 
* OWN, new work and 
. specific OTHER work. 
The advantages of a segmentation at a rhetori- 
cal level is that rhetorics is conveniently constant 
tThis work was done while the first author was at the 
HCRC Language Technology Group, Edinburgh. 
BACKGROUND: 
Researchers in knowledge representa- 
tion agree that one of the hard problems of 
understanding narrative is the representation 
of temporal information. Certain facts of nat- 
ural  make it hard to capture tempo- 
ral information \[...\] 
OTHER WORK: 
Recently, Researcher-4 has suggested the 
following solution to this problem \[...\]. 
WEAKNESS/CONTRAST: 
But this solution cannot be used to inter- 
pret the following Japanese examples: \[...\] 
OWN CONTRIBUTION: 
We propose a solution which circumvents 
this prow while retaining the explanatory 
power of Researcher-4's approach. 
Figure h Fictional introduction section 
across different articles. Subject matter, on the 
contrary, is not constant, nor are writing style and 
other factors. 
We work with a corpus of scientific pa- 
pers (80 computational linguistics conference ar- 
ticles (ACL, EACL, COLING or ANLP), de- 
posited on the CMP_LG archive between 1994 
and 1996). This is a difficult test bed due to 
the large variation with respect to different fac- 
tors: subdomain (theoretical linguistics, statisti- 
cal NLP, logic programming, computational psy- 
cholinguistics), types of research (implementa- 
tion, review, evaluation, empirical vs. theoreti- 
cal research), writing style (formal vs. informal) 
and presentational styles (fixed section structure 
of type Introduction-Method-Results-Conclusion 
vs. more idiosyncratic, problem-structured presen- 
tation). 
One thing, however, is constant across all arti- 
cles: the argumentative aim of every single article 
is to show that the given work is a contribution to 
science (Swales, 1990; Myers, 1992; Hyland, 1998). 
Theories of scientific argumentation in research ar- 
ticles stress that authors follow well-predictable 
stages of argumentation, as in the fictional intro- 
duction in figure 1. 
9 
Are the scientific statements expressed 
in this sentence attributed to the 
authors, the general field, or specific other 
n work / Other Work 
Does this sentence contain material 
that describes the specific aim 
of the paper? 
Does this sentence make reference to the external 
structure of the paper? 
I SACKCRO D I 
D.~s it describe.a negative aspect of me omer worK, or a contzast 
or comparison of the own work to it? 
I CONTRAST I Does this sentence mention the other work as basis of 
or support for own work? 
Figure 2: Annotation Scheme for Argumentative Zones 
Our hypothesis is that a segmentation based on 
regularities of scientific argumentation and on at- 
tribution of intellectual ownership is one of the 
most stable and generalizable dimensions which 
contribute to the structure of scientific texts. In 
the next section we will describe an annotation 
scheme which we designed for capturing these ef- 
fects. Its categories are based on Swales' (1990) 
CARS model. 
1.1 The scheme 
As our corpus contains many statements talking 
about relations between own and other work, we 
decided to add two classes ("zones") for express- 
ing relations to the core set of OWN, OTHER 
and BACKGROUND, namely contrastive statements 
(CONTRAST; comparable to Swales' (1990) move 
2A/B) and statements of intellectual ancestry 
(BAsis; Swales' move 2D). The label OTHER is 
thus reserved for neutral descriptions of other 
work. OWN segments are further subdivided to 
mark explicit aim statements (AIM; Swales' move 
3.1A/B), and explicit section previews (TEXTUAL; 
Swales' move 3.3). All other statements about the 
own work are classified as OwN. Each of the seven 
category covers one sentence. 
Our classification, which is a further develop- 
ment of the scheme in Teufel and Moens (1999), 
can be described procedurally as a decision tree 
(Figure 2), where five questions are asked about 
each sentence, concerning intellectual attribution, 
author stance and continuation vs. contrast. Fig- 
ure 3 gives typical example sentences for each zone. 
The intellectual-attribution distinction we make 
is comparable with Wiebe's (1994) distinction into 
subjective and objective statements. Subjectivity 
is a property which is related to the attribution of 
authorship as well as to author stance, but it is 
just one of the dimensions we consider. 
1.2 Use of Argumentative Zones 
Which practical use would segmenting a paper into 
argumentative zones have? 
Firstly, rhetorical information as encoded in 
these zones should prove useful for summariza- 
tion. Sentence extracts, still the main type of 
summarization around, are notoriously context- 
insensitive. Context in the form of argumentative 
relations of segments to the overall paper could 
provide a skeleton by which to tailor sentence ex- 
tracts to user expertise (as certain users or certain 
tasks do not require certain types of information). 
A system which uses such rhetorical zones to pro- 
duce task-tailored extracts for medical articles, al- 
beit on the basis of manually-segmented texts, is 
given by Wellons and Purcell (1999). 
Another hard task is sentence extraction from 
long texts, e.g. scientific journal articles of 20 
pages of length, with a high compression. This 
task is hard because one has to make decisions 
about how the extracted sentences relate to each 
other and how they relate to the overall message 
of the text, before one can further compress them. 
Rhetorical context of the kind described above is 
very likely to make these decisions easier. 
Secondly, it should also help improve citation 
indexes, e.g. automatically derived ones like 
Lawrence et al.'s (1999) and Nanba and Oku- 
mura's (1999). Citation indexes help organize sci- 
entific online literature by linking cited (outgoing) 
and citing (incoming) articles with a given text. 
But these indexes are mainly "quantitative", list- 
ing other works without further qualifying whether 
a reference to another work is there to extend the 
10 
AIM "We have proposed a method of clustering words based on large corpus data." 
TEXTUAL "Section $ describes three unification-based parsers which are... " 
OWN "We also compare with the English  and draw some conclusions on the benefits 
of our approach." 
BACKGROUND "Part-of-speech tagging is the process of assigning grammatical categories to individual 
words in a corpus." 
CONTRAST "However, no method for extracting the relationships from superficial linguistic ex- 
pressions was described in their paper." 
BASIS "Our disambiauation method is based on the similaritu of context vectors, which was 
OTHER 
C :g method is based on the similarity of context vectors, which was 
originated by Wilks et al. 1990." 
"Strzalkowski's Essential Arguments Approach (EAA) is a top-down approach to gen- 
eration... " 
Figure 3: Examples for Argumentative Zones 
earlier work, correct it, point out a weakness in 
it, or just provide it as general background. This 
"qualitative" information could be directly con- 
tributed by our argumentative zones. 
In this paper, we will describe the algorithm of 
an argumentative zoner. The main focus of the 
paper is the description of two features which are 
particularly useful for attribution determination: 
prototypical agents and actions. 
2 Human Annotation of 
Argumentative Zones 
We have previously evaluated the scheme empiri- 
cally by extensive experiments with three subjects, 
over a range of 48 articles (Teufel et al., 1999). 
We measured stability (the degree to which the 
same annotator will produce an annotation after 
6 weeks) and reproducibility (the degree to which 
two unrelated annotators will produce the same 
annotation), using the Kappa coefficient K (Siegel 
and Castellan, 1988; Carletta, 1996), which con- 
trols agreement P(A) for chance agreement P(E): 
K = P{A)-P(E) 1-P(Z) 
Kappa is 0 for if agreement is only as would be 
expected by chance annotation following the same 
distribution as the observed distribution, and 1 for 
perfect agreement. Values of Kappa surpassing 
.8 are typically accepted as showing a very high 
level of agreement (Krippendorff, 1980; Landis and 
Koch, 1977). 
Our experiments show that humans can distin- 
guish own, other specific and other general work 
with high stability (K=.83, .79, .81; N=1248; k=2, 
where K stands for the Kappa coefficient, N for 
the number of items (sentences) annotated and k 
for the number of annotators) and reproducibil- 
ity (K=.78, N=4031, k=3), corresponding to 94%, 
93%, 93% (stability) and 93% (reproducibility) 
agreement. 
The full distinction into all seven categories of 
the annotation scheme is slightly less stable and 
reproducible (stability: K=.82, .81, .76; N=1220; 
k=2 (equiv. to 93%, 92%, 90% agreement); repro- 
ducibility: K=.71, N=4261, k=3 (equiv. to 87% 
agreement)), but still in the range of what is gener- 
ally accepted as reliable annotation. We conclude 
from this that humans can distinguish attribution 
and full argumentative zones, if trained. Human 
annotation is used as trMning material in our sta- 
tistical classifier. 
3 Automatic Argumentative 
Zoning 
As our task is not defined by topic coherence 
like the related tasks of Morris and Hirst (1991), 
Hearst (1997), Kan et al. (1998) and Reynar 
(1999), we predict that keyword-based techniques 
for automatic argumentative zoning will not work 
well (cf. the results using text categorization as 
described later). We decided to perform machine 
learning, based on sentential features like the ones 
used by sentence extraction. Argumentative zones 
have properties which help us determine them on 
the surface: 
• Zones appear in typical positions in the article 
(Myers, 1992); we model this with a set of 
location features. 
• Linguistic features like tense and voice cor- 
relate with zones (Biber (1995) and Riley 
(1991) show correlation for similar zones like 
"method" and "introduction"). We model 
this with syntactic features. 
• Zones tend to follow particular other zones 
(Swales, 1990); we model this with an ngram 
model operating over sentences. 
• Beginnings of attribution zones are linguisti- 
cally marked by meta-discourse like "Other 
researchers claim that" (Swales, 1990; Hy- 
land, 1998); we model this with a specialized 
agents and actions recognizer, and by recog- 
nizing formal citations. 
• Statements without explicit attribution are 
interpreted as being of the same attribution 
as previous sentences in the same segment of 
attribution; we model this with a modified 
agent feature which keeps track of previously 
recognized agents. 
11 
3.1 Recognizing Agents and Actions 
Paice (1981) introduces grammars for pattern 
matching of indicator phrases, e.g. "the 
aim/purpose of this paper/article/study" and "we 
conclude/propose". Such phrases can be useful 
indicators of overall importance. However, for 
our task, more flexible meta-diiscourse expressions 
need to be determined. The ,description of a re- 
search tradition, or the stateraent that the work 
described in the paper is the continuation of some 
other work, cover a wide range of syntactic and 
lexical expressions and are too hard to find for a 
mechanism like simple pattern matching. 
Agent Type Example 
US-AGENT 
THEM_AGENT 
GENERAL_AGENT 
US_PREVIOUS. AGENT 
OUR_AIM_AGENT 
REF_US_AGENT 
REF._AGENT 
THEM_PRONOUN_AGENT 
AIM_I:LEF_AGENT 
GAP_AGENT 
PROBLEM_AGENT 
SOLUTION_AGENT 
TEXTSTRUCTURE_AGENT 
we 
his approach 
traditional methods 
the approach given in 
X (99) 
the point o\] this study 
thia paper 
the paper 
they 
its goal 
none of these papers 
these drawbacks 
a way out o\] this 
dilemma 
the concluding chap- 
ter 
Figure 4: Agent Lexicon: 168 Patterns, 13 Classes 
We suggest that the robust recognition of pro- 
totypical agents and actions is one way out of this 
dilemma. The agents we propose to recognize de- 
scribe fixed role-players in the argumentation. In 
Figure 1, prototypical agents are given in bold- 
face ("Researchers in knowledge representation, 
"Researcher-4" and "we"). We also propose pro- 
totypical actions frequently occurring in scientific 
discourse (shown underlined in Figure 1): the re- 
searchers "agree", Researcher-4 "suggested" some- 
thing, the solution "cannot be used". 
We will now describe an algorithm which rec- 
ognizes and classifies agents and actions. We 
use a manually created lexicon for patterns for 
agents, and a manually clustered verb lexicon for 
the verbs. Figure 4 lists the agent types we dis- 
tinguish. The main three types are US_aGENT, 
THEM-AGENT and GENERAL.AGENT. A fourth 
type is US.PREVIOUS_AGENT (the authors, but in 
a previous paper). 
Additional agent types include non-personal 
agents like aims, problems, solutions, absence of 
solution, or textual segments. There are four 
equivalence classes of agents with ambiguous 
reference ("this system"), namely REF_US_AGENT, 
THEM-PRONOUN_AGENT, AIM.-REF-AGENT, 
REF_AGENT. The total of 168 patterns in the 
lexicon expands to many more as we use a replace 
mechanism (@WORK_NOUN is expanded to 
"paper, article, study, chapter" etc). 
For verbs, we use a manually created the ac- 
tion lexicon summarized in Figure 6. The verb 
classes are based on semantic concepts such as 
similarity, contrast, competition, presentation, ar- 
gumentation and textual structure. For ex- 
ample, PRESENTATION..ACTIONS include commu- 
nication verbs like "present", "report", "state" 
(Myers, 1992; Thompson and Yiyun, 1991), RE- 
SEARCH_ACTIONS include "analyze", "conduct" 
and "observe", and ARGUMENTATION_ACTIONS 
"argue", "disagree", "object to". Domain-specific 
actions are contained in the classes indicating 
a problem ( ".fail", "degrade", "overestimate"), 
and solution-contributing actions (" "circumvent', 
solve", "mitigate"). 
The main reason for using a hand-crafted, genre-- 
specific lexicon instead of a general resource such 
as WordNet or Levin's (1993) classes (as used in 
Klavans and Kan (1998)), was to avoid polysemy 
problems without having to perform word sense 
disambiguation. Verbs in our texts often have a 
specialized meaning in the domain of scientific ar- 
gumentation, which our lexicon readily encodes. 
We did notice some ambiguity problems (e.g. "fol- 
low" can mean following another approach, or it 
can mean follow in a sense having nothing to do 
with presentation of research, e.g. following an 
arc in an algorithm). In a wider domain, however, 
ambiguity would be a much bigger problem. 
Processing of the articles includes transforma- 
tion from I~TEX into XML format, recognition 
of formal citations and author names in running 
text, tokenization, sentence separation and POS- 
tagging. The pipeline uses the TTT software pro- 
vided by the HCRC Language Technology Group 
(Grover et al., 1999). The algorithm for deter- 
mining agents in subject positions (or By-PPs in 
passive sentences) is based on a finite automaton 
which uses POS-input; cf. Figure 5. 
In the case that more than one finite verb is 
found in a sentence, the first finite verb which has 
agents and/or actions in the sentences is used as 
a value for that sentence. 
4 Evaluation 
We carried out two evaluations. Evaluation A 
tests whether all patterns were recognized as in- 
tended by the algorithm, and whether patterns 
were found that should not have been recognized. 
Evaluation B tests how well agent and action 
recognition helps us perform argumentative zon- 
ing automatically. 
4.1 Evaluation A: Correctness 
We first manually evaluated the error level of the 
POS-Tagging of finite verbs, as our algorithm cru- 
cially relies on finite verbs. In a random sample of 
100 sentences from our corpus (containing a total 
of 184 finite verbs), the tagger showed a recall of 
12 
1. Start from the first finite verb in the sentence. 
2. Check right context of the finite verb for verbal forms of interest which might make up more 
complex tenses. Remain within the assumed clause boundaries; do not cross commas or other 
finite verbs. Once the main verb of that construction (the "semantic" verb) has been found, 
a simple morphological analysis determines its lemma; the tense and voice of the construction 
follow from the succession of auxiliary verbs encountered. 
3. Look up the lemma of semantic verb in Action Lexicon; return the associated Action Class if 
successful. Else return Action 0. 
4. Determine if one of the 32 fixed negation words contained in the lexicon (e.g. "not, don't, 
neither") is present within a fixed window of 6 to the right of the finite verb. 
5. Search for the agent either as a by-PP to the right, or as a subject-NP to the left, depending on 
the voice of the construction as determined in step 2. Remain within assumed clause boundaries. 
6. If one of the Agent Patterns matches within that area in the sentence, return the Agent Type. 
Else return Agent 0. 
7. Repeat Steps 1-6 until there are no more finite verbs left. 
Figure 5: Algorithm for Agent and Action Detection 
Action Type Example Action Type Example 
AFFECT 
ARGUMENTATION 
AWARENESS 
BETTER_SOLUTION 
CHANGE 
COMPARISON 
CONTINUATION 
CONTRAST 
FUTURE_INTEREST 
INTEREST 
we hope to improve our results 
we argue against a model of 
we are not aware of attempts 
our system outperforms ... 
we extend <CITE/>'s algo- 
rithm 
we tested our system against... 
we follow <REF/> ... 
our approach differs from ... 
we intend to improve ... 
we are concerned with ... 
NEED 
PRESENTATION 
PROBLEM 
RESEARCH 
SIMILAR 
SOLUTION 
TEXTSTRUCTURE 
USE 
COPULA 
POSSESSION 
this approach, however, lacks... 
we present here a method for.. . 
this approach fails... 
we collected our data from... 
our approach resembles that of 
we solve this problem by... 
the paper is organize&.. 
we employ <REF/> 's method... 
our goal ~ to... 
we have three goals... 
Figure 6: Action Lexicon: 366 Verbs, 20 Classes 
95% and a precision of 93%. 
We found that for the 174 correctly determined 
finite verbs (out of the total 184), the heuristics for 
negation worked without any errors (100% accu- 
racy). The correct semantic verb was determined 
in 96% percent of all cases; errors are mostly due 
to misrecognition of clause boundaries. Action 
Type lookup was fully correct, even in the case 
of phrasal verbs and longer idiomatic expressions 
("have to" is a NEED..ACTION; "be inspired by" is 
a, CONTINUE_ACTION). There were 7 voice errors, 
2 of which were due to POS-tagging errors (past 
participle misrecognized). The remaining 5 voice 
errors correspond to a 98% accuracy. Figure 7 
gives an example for a voice error (underlined) in 
the output of the action/agent determination. 
Correctness of Agent Type determination was 
tested on a random sample of 100 sentences con- 
taining at least one agent, resulting in 111 agents. 
No agent pattern that should have been identi- 
fied was missed (100% recall). Of the 111 agents, 
105 cases were completely correct: the agent pat- 
tern covered the complete grammatical subject or 
by-PP intended (precision of 95%). There was one 
complete error, caused by a POS-tagging error. In 
5 of the 111 agents, the pattern covered only part 
At the point where John <ACTION 
TENSE=Pi~SENT VOICE=ACTIVE 
MODAL=NOMODAL NEGATION=0 
ACTIONTYPE=0> knows </ACTION> the truth 
has been <FINITE TENSE=PRESENT_PERFECT 
VOICE=PASSIVE MODAL=NOMODAL NEGA- 
TION=0 ACTIONTYPE=0> processed 
</ACTION> , a complete clause will have 
been <ACTION TENSE=FUTURE.PERFECT 
VOICE=ACTIVE MODAL=NOMODAL NEGA- 
TION=0 ACTIONTYPE=0> built </ACTION> 
Figure 7: Sample Output of Action Detection 
of a subject NP (typically the NP in a postmodify- 
ing PP), as in the phrase "the problem with these 
approaches" which was classified as REF_AGENT. 
These cases (counted as errors) indeed constitute 
no grave errors, as they still give an indication 
which type of agents the nominal phrase is associ- 
ated with. 
13 
4.2 Evaluation B: Usefulness for 
Argumentative Zoning 
We evaluated the usefulness of the Agent and Ac- 
tion features by measuring if they improve the 
classification results of our stochastic classifier for 
argumentative zones. 
We use 14 features given in figure 8, some of 
which are adapted from sentence extraction tech- 
niques (Paice, 1990; Kupiec et eL1., 1995; Teufel and 
Moens, 1999). 
. 
2. 
3. 
4. 
5. 
6. 
7. 
8. 
9. 
10. 
11. 
12. 
13. 
14. 
Absolute location of sentence in document 
Relative location of sentence in section 
Location of a sentence in paragraph 
Presence of citations 
Location of citations 
Type of citations (self citation or not) 
Type of headline 
Presence of tf/idf key words 
Presence of title words 
Sentence length 
Presence of modal auxiliaries 
Tense of the finite verb 
Voice of the finite verb 
Presence of Formulaic Expressions 
Figure 8: Other features used 
All features except Citation Location and 
Citation Type proved helpful for classification. 
Two different statistical models were used: a Naive 
Bayesian model as in Kupiec et al.'s (1995) exper- 
iment, cf. Figure 9, and an ngram model over sen- 
tences, cf. Figure 10. Learning is supervised and 
training examples are provided by our previous hu- 
man annotation. Classification preceeds sentence 
by sentence. The ngram model combines evidence 
from the context (Cm-1, Cm-2) and from I senten- 
tiai features (F,~,o...Fmj-t), assuming that those 
two factors are independent of each other. It uses 
the same likelihood estimation as the Naive Bayes, 
but maximises a context-sensitive prior using the 
Viterbi algorithm. We received best results for 
n=2, i.e. a bigram model. 
The results of stochastic classification (pre- 
sented in figure 11) were compiled with a 10-fold 
cross-validation on our 80-paper corpus, contain- 
ing a total of 12422 sentences (classified items). 
As the first baseline, we use a standard text cat- 
egorization method for classification (where each 
sentence is considered as a document*) Baseline 1 
has an accuracy of 69%, which is low considering 
that the most frequent category (OWN) also coy- 
errs 69% of all sentences. Worse still, the classifier 
classifies almost all sentences as OWN and OTHER 
segments (the most frequent categories). Recall on 
the rare categories but important categories AIM, 
TEXTUAL, CONTRAST and BASIS is zero or very 
low. Text classification is therefore not a solution. 
*We used the Rainbow implementation of a Naive Bayes 
tf/idf method, 10-fold cross-validation. 
Baseline 2, the most frequent category (OWN), 
is a particularly bad baseline: its recall on all cate- 
gories except OWN is zero. We cannot see this bad 
performance in the percentage accuracy values, 
but only in the Kappa values (measured against 
one human annotator, i.e. k=2). As Kappa takes 
performance on rare categories into account more, 
it is a more intuitive measure for our task. 
In figure 11, NB refers to the Naive Bayes model, 
and NB+ to the Naive Bayes model augmented 
with the ngram model. We can see that the 
stochastic models obtain substantial improvement 
over the baselines, particularly with respect to pre- 
cision and recall of the rare categories, raising re- 
call considerably in all cases, while keeping preci- 
sion at the same level as Baseline 1 or improving 
it (exception: precision for BASIS drops; precision 
for AIM is insignificantly lower). 
If we look at the contribution of single features 
(reported for the Naive Bayes system in figure 12), 
we see that Agent and Action features improve 
the overall performance of the system by .02 and 
.04 Kappa points respectively (.36 to .38/.40). 
This is a good performance for single features. 
Agent is a strong feature beating both baselines. 
Taken by itself, its performance at K=.08 is still 
weaker than some other features in the pool, e.g. 
the Headline feature (K=.19), the Citation fea- 
ture (K=.I8) and the Absolute Location Fea- 
ture (K=.17). (Figure 12 reports classification re- 
sults only for the stronger features, i.e. those who 
are better than Baseline 2). The Action feature, 
if considered on its own, is rather weak: it shows 
a slightly better Kappa value than Baseline 2, but 
does not even reach the level of random agreement 
(K=0). Nevertheless, if taken together with the 
other features, it still improves results. 
Building on the idea that intellectual attribu- 
tion is a segment-based phenomena, we improved 
the Agent feature by including history (feature 
SAgent). The assumption is that in unmarked sen- 
tences the agent of the previous attribution is still 
active. Wiebe (1994) also reports segment-based 
agenthood as one of the most successful features. 
SAgent alone achieved a classification success of 
K=.21, which makes SAgent the best single fea- 
tures available in the entire feature pool. Inclusion 
of SAgent to the final model improved results to 
K=.43 (bigram model). 
Figure 12 also shows that different features are 
better at disambiguating certain categories. The 
Formulaic feature, which is not very strong on 
its own, is the most diverse, as it contributes to 
the disambiguation of six categories directly. Both 
Agent and Action features disambiguate cate-, 
gories which many of the other 12 features cannot 
disambiguate (e.g. CONTRAST), and SAgent addi- 
tionally contributes towards the determination of 
BACKGROUND zones (along with the Fo~ulaic 
and the Absolute Location feature). 
14 
P(CIFo, ..., F,~_,) ~ P(C) Nj~---°l P(FyIC) 
n--1 I'Ij=o P(Fj) 
P(CIFo .... , F.-i ): 
P(C): P(FjIC): 
P(FA: 
Probability that a sentence has target category C, given its feature values F0, ..., 
F.-i; 
(OveraU) probability of category C); 
Probability of feature-value pair Fj, given that the sentence is of target category C; 
Probability of feature value Fj; 
Figure 9: Naive Bayesian Classifier 
I--I F C P(CmlFm,o,. .,F~,~-i,C0,. .,6~-1) ~ P(V,~lCm-l,C~-2) l-I~=°P( ~,~1 ,~) 
• " l--1 FI~=o P(Fm,~) 
m: 
l: 
P( C,~IF~,o, . . . , F,~,~-t, Co,..., C,~-l ): 
P(C,~IC~-,,C~-2): 
P(F,~j\[C,~): 
P(F~,j): 
index of sentence (ruth sentence in text) 
number of features considered 
target category associated with sentence at index m 
Probability that sentence rn has target category Cm, given its 
feature values Fro,o, ..., Fmj-1 and given its context Co, ...C,~-1; 
Probability that sentence rn has target category C, given the cat- 
egories of the two previous sentences; 
Probability of feature-value pair Fj occu~ing within target cate- 
gory C at position m; 
Probability of feature value Fmj; 
Figure 10: Bigram Model 
5 Discussion 
The result for automatic classification is in agree- 
ment with our previous experimental results for 
human classification: humans, too, recognize the 
categories AIM and TEXTUAL most robustly (cf. 
Figure 11). AIM and TEXTUAL sentences, stating 
knowledge claims and organizing the text respec- 
tively, are conventionalized to a high degree. The 
system's results for AIM sentences, for instance, 
compares favourably to similar sentence extraction 
experiments (cf. Kupiec et al.'s (1995) results of 
42%/42% recall and precision for extracting "rel- 
evant" sentences from scientific articles). BASIS 
and CONTRAST sentences have a less prototypical 
syntactic realization, and they also occur at less 
predictable places in the document. Therefore, it 
is far more difficult for both machine and human 
to recognize such sentences. 
While the system does well for AIM and TEX- 
TUAL sentences, and provides substantial improve- 
ment over both baselines, the difference to human 
performance is still quite large (cf. figure 11). We 
attribute most of this difference to the modest size 
of our training corpus: 80 papers are not much for 
machine learning of such high-level features. It is 
possible that a more sophisticated model, in com- 
bination with more training material, would im- 
prove results significantly. However, when we ran 
them on our data as it is now, different other sta- 
tistical models, e.g. Ripper (Cohen, 1996) and a 
Maximum Entropy model, all showed similar nu- 
merical results. 
Another factor which decreases results are in- 
consistencies in the training data: we discovered 
that 4% of the sentences with the same features 
were classified differently by the human annota- 
tion. This points to the fact that our set of fea- 
tures could be made more distinctive. In most 
of these cases, there were linguistic expressions 
present, such as subtle signs of criticism, which 
humans correctly identified, but for which the fea- 
tures are too coarse. Therefore, the addition of 
"deeper" features to the pool, which model the se- 
mantics of the meta-discourse shallowly, seemed 
a promising avenue. We consider the automatic 
and robust recognition of agents and actions, as 
presented here, to be the first incarnations of such 
features. 
6 Conclusions 
Argumentative zoning is the task of breaking a 
text containing a scientific argument into linear 
zones of the same argumentative status, or zones 
of the same intellectual attribution. We plan to 
use argumentative zoning as a first step for IR and 
shallow document understanding tasks like sum- 
marization. In contrast to hierarchical segmenta- 
tion (e.g. Marcu's (1997) work, which is based on 
RST (Mann and Thompson, 1987)), this type of 
segmentation aims at capturing the argumentative 
status of a piece of text in respect to the overall 
argumentative act of the paper. It does not deter- 
15 
I Method Acc. K Precision/recall per category (in %) I 
(~) AIM CONTR. TXT. OWN BACKG. BASIS OTHER 
I Human Performance 87 .71 72/56 50/55 79/79 94/92 68/75 82/34 74/83 \] 
I NB+ (best results) 71 .43 40/53 33/20 62/57 85/85 30/58 28/31 50/38 I 
I NB (best results) 7'2 .41 42/60 34/22 61/60 82/90 40/43 27/41 53/29 I . I BasoL 1: Text catog 69 13 44/9 32/42 58/14 77/90 20/5 47/12 31/16 I 
I Basel. 2: Most freq. cat. 69 -.12 0/0 0/0 0/0 69/100 0/0 0/0 0/0 I 
Figure 11: Accuracy, Kappa, Precision and Recall of Human and Automatic Processing, in comparison 
to baselines 
Features used Acc. K Precision/recallper category(in%) 
(Naive Bayes System) (%) AIM CONTR. TXT. OWN BACKG. BASIS OTHER 
Action alone 68 -.II 0/0 43/1 0/0 68/99 0/0 0/0 0/0 
Agent alone 67 .08 0/0 0/0 0/0 71/93 0/0 0/0 36/23 
Shgent alone 70 .21 0/0 17/0 0/0 74/94 53/16 0/0 46/33 
Abs. Locationalone 70 .17 0/0 0/0 0/0 74/97 40/36 0/0 28/9 
Headlinesalone 69 .19 0/0 0/0 0/0 75/95 0/0 0/0 29/25 
CitaCionalone 70 .18 0/0 0/0 0/0 73/96 0/0 0/0 43/30 
Citat2on Type alone 70 .13 0/0 0/0 0/0 72/98 0/0 0/0 43/24 
Citation Locat. alone 70 .13 0/0 0/0 0/0 72/97 0/0 0/0 43/24 
Foz~mlaicalone 70 .07 40/2 45/2 75/39 71/98 0/0 40/1 47/13 
12 other features 71 .36 37/53 32/17 54/47 81/91 39/41 22/32 45/22 
12 fea.+hction 71 .38 38/57 34/22 58/59 81/91 39/40 25/38 48/22 
12fea.+hgent 72 .40 40/57 35/18 59/51 82/91 39/43 25/34 52/29 
12fea.+SAgent 73 .40 39/57 33/19 61/51 81/91 42/43 25/33 52/29 
12 ~a.+Action+hgent 71 .43 40/53 33/20 62/57 85/85 30/58 28/31 50/38 
12 fea.+Action+Shgen~ 73 .41 41/59 34/22 62/61 82/91 41/42 27/39 51/29 
Figure 12: Accuracy, Kappa, 
individual features 
Precision and Recall of Automatic Processing (Naive Bayes system), per 
mine the rhetorical structure within zones. Sub- 
zone structure is most likely related to domain- 
specific rhetorical relations which are not directly 
relevant to the discourse-level relations we wish to 
recognize. 
We have presented a fully implemented proto- 
type for argumentative zoning. Its main inno- 
vation are two new features: prototypical agents 
and actions -- semi-shallow representations of the 
overall scientific argumentation of the article. For 
agent and action recognition, we use syntactic 
heuristics and two extensive libraries of patterns. 
Processing is robust and very low in error. We 
evaluated the system without and with the agent 
and action features and found that the features im- 
prove results for automatic argumentative zoning 
considerably. History-aware agents are the best 
single feature in a large, extensively tested feature 
pool. 

References 
Biber, Douglas. 1995. Dimensions of Register Varia- 
tion: A Cross-linguistic Comparison. Cambridge, 
England: Cambridge University Press. 
Carletta, Jean. 1996. Assessing agreement on classi- 
fication tasks: The kappa statistic. Computational 
Linguistics 22(2): 249-.-254. 
Cohen, William W. 1996. Learning trees and rules 
with set-valued features. In Proceedings of AAAL 
96. 
Grocer, Claire, Andrei Mikheev, and Colin Mathe- 
son. 1999. LT TTT Version 1.0: Text Tokenisa- 
tion Software. Technical report, Human Commu- 
nication Research Centre, University of Edinburgh. 
http ://~w. ltg. ed. ac. uk/software/ttt/. 
Hearst, Marti A. 1997. TextTiling: Segmenting text 
into multi-paragraph subtopic passages. Computa- 
tional Linguistics 23(1): 33---64. 
Hyland, Ken. 1998. Persuasion and context: The prag- 
matics of academic metadiscourse. Journal o\] Prag- 
matics 30(4): 437-455. 
Kan, Min-Yen, Judith L. Klavans, and Kathleen R. 
McKeown. 1998. Linear Segmentation and Segment 
Significance. In Proceedings o~ the Sixth Workshop 
on Very Large Corpora (COLIN G/ACL-98), 197- 
205. 
Klavans, Judith L., and Min-Yen Kan. 1998. Role 
of verbs in document analysis. In Proceedings 
of 36th Annual Meeting o\] the Association /or 
Computational Linguistics and the 17th Interna- 
tional Conference on Computational Linguistics 
(,4 CL/COLING-gS), 68O--686. 
Krippendorff, Klaus. 1980. Content Analysis: An In- 
troduction to its Methodology. Beverly Hills, CA: 
Sage Publications. 
Kupiee, Julian, Jan O. Pedersen, and Franeine Chela. 
1995. A trainable document summarizer. In Pro- 
ceedings of the 18th Annual International Confer- 
ence on Research and Development in Information 
Retrieval (SIGIR-95), 68--73. 
Landis, J.R., and G.G. Koch. 1977. The Measurement 
of Observer Agreement for Categorical Data. Bio- 
metrics 33: 159-174. 
Lawrence, Steve, C. Lee Giles, and Ku_t Bollaeker. 
1999. Digital libraries and autonomous citation in- 
dexing. IEEE Computer 32(6): 67-71. 
Levin, Beth. 1993. English Verb Classes and Alterna- 
tions. Chicago, IL: University of Chicago Press. 
Mann, William C., and Sandra A. Thompson. 1987. 
Rhetorical Structure Theory: Description and Con- 
struction of text structures. In Gerard Kempen, 
ed., Natural Language Generation: New Results in 
Artificial Intelligence, Psychology, and Linguistics, 
85-95. Dordrecht, NL: Marinus Nijhoff Publishers. 
Marcu, Daniel. 1997. From Discourse Structures to 
Text Summaries. In Inderjeet Mani and Mark T. 
Maybury, eds., Proceedings of the ACL/EACL-97 
Workshop on Intelligent Scalable Text Summariza- 
tion, 82-88. 
Morris, Jane, and Graeme Hirst. 1991. Lexical cohe- 
sion computed by thesau.ral relations as an indicator 
of the structure of text. Computational Linguistics 
17: 21-48. 
Myers, Greg. 1992. In this paper we report...---speech 
acts and scientific facts. Journal of Pragmatics 
17(4): 295-313. 
:Nanba, I:Iidetsugu, and Manabu Okumura. 1999. To- 
wards multi-paper summarization using reference 
in.formation. In Proceedings of IJCAI-99, 926- 
931. http://galaga, jaist, ac. jp: 8000/'nanba/ 
study/papers .html. 
Paice, Chris D. 1981. The automatic generation of 
literary abstracts: an approach based on the iden- 
tification of self-indicating phrases. In Robert Nor- 
man Oddy, Stephen E. Robertson, Cornelis Joost 
van Pdjsbergen, and P. W. Williams, eds., Infor- 
mation Retrieval Research, 172-191. London, UK: 
Butterworth. 
Paice, Chris D. 1990. Constructing literature abstracts 
by computer: techniques and prospects. Informa- 
tion Processing and Management 26: 171-186. 
Reynar, Jeffrey C. 1999. Statistical models for topic 
segmentation. In Proceedings of the 37th Annual 
Meeting of the Association for Computational Lin- 
guistics (A CL-99), 357-364. 
Riley, Kathryn. 1991. Passive voice and rhetorical role 
in scientific writing. Journal of Technical Writing 
and Communication 21(3): 239--257. 
Rowley, Jennifer. 1982. Abstracting and Indexing. 
London, UK: Bingley. 
Siegel, Sidney, and N. John Jr. CasteUan. 1988. Non- 
parametric Statistics for the Behavioral Sciences. 
Berkeley, CA: McGraw-Hill, 2nd edn. 
Swales, John. 1990. Genre Analysis: English in Aca- 
demic and Research Settings. Chapter 7: Research 
articles in English, 110-.-176. Cambridge, UK: Cam- 
bridge University Press. 
Teufel, Simone, Jean Carletta, and Marc Moens. 1999. 
An annotation scheme for discourse-level argumen- 
tation in research articles. In Proceedings of the 8th 
Meeting of the European Chapter of the Association 
for Computational Linguistics (EA CL-99), 110-117. 
Teufel, Simone, and Marc Moens. 1999. Argumenta- 
tive classification of extracted sentences as a first 
step towards flexible abstracting. In Inderjeet Mani 
and Mark T. Maybury, eds., Advances in Auto- 
matic Text Summarization, 155-171. Cambridge, 
MA: MIT Press. 
Thompson, Geoff, and Ye Yiyun. 1991. Evaluation in 
the reporting verbs used in academic papers. Ap- 
plied Linguistics 12(4): 365-382. 
Wellons, M. E., and G. P. Purcell. 1999. Task-specific 
extracts for using the medical literature. In Pro- 
ceedings of the American Medical Informatics Sym- 
posium, 1004-1008. 
Wiebe, Janyce. 1994. Tracking point of view in narra- 
tive. Computational Linguistics 20(2): 223-287. 
