Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 65–72,
New York City, June 2006. c©2006 Association for Computational Linguistics
Generative Content Models for Structural Analysis of Medical Abstracts
Jimmy Lin1,2, Damianos Karakos3, Dina Demner-Fushman2, and Sanjeev Khudanpur3
1College of Information Studies 3Center for Language and
2Institute for Advanced Computer Studies Speech Processing
University of Maryland Johns Hopkins University
College Park, MD 20742, USA Baltimore, MD 21218, USA
jimmylin@umd.edu, demner@cs.umd.edu (damianos, khudanpur)@jhu.edu
Abstract
The ability to accurately model the con-
tent structure of text is important for
many natural language processing appli-
cations. This paper describes experi-
ments with generative models for analyz-
ing the discourse structure of medical ab-
stracts, which generally follow the pattern
of “introduction”, “methods”, “results”,
and “conclusions”. We demonstrate that
Hidden Markov Models are capable of ac-
curately capturing the structure of such
texts, and can achieve classification ac-
curacy comparable to that of discrimina-
tive techniques. In addition, generative
approaches provide advantages that may
make them preferable to discriminative
techniques such as Support Vector Ma-
chines under certain conditions. Our work
makes two contributions: at the applica-
tion level, we report good performance
on an interesting task in an important do-
main; more generally, our results con-
tribute to an ongoing discussion regarding
the tradeoffs between generative and dis-
criminative techniques.
1 Introduction
Certain types of text follow a predictable structure,
the knowledge of which would be useful in many
natural language processing applications. As an
example, scientific abstracts across many different
fields generally follow the pattern of “introduction”,
“methods”, “results”, and “conclusions” (Salanger-
Meyer, 1990; Swales, 1990; Or˘asan, 2001). The
ability to explicitly identify these sections in un-
structured text could play an important role in ap-
plications such as document summarization (Teufel
and Moens, 2000), information retrieval (Tbahriti
et al., 2005), information extraction (Mizuta et al.,
2005), and question answering. Although there is
a trend towards analysis of full article texts, we
believe that abstracts still provide a tremendous
amount of information, and much value can still be
extracted from them. For example, Gay et al. (2005)
experimented with abstracts and full article texts in
the task of automatically generating index term rec-
ommendations and discovered that using full article
texts yields at most a 7.4% improvement in F-score.
Demner-Fushman et al. (2005) found a correlation
between the quality and strength of clinical conclu-
sions in the full article texts and abstracts.
This paper presents experiments with generative
content models for analyzing the discourse struc-
ture of medical abstracts, which has been con-
firmed to follow the four-section pattern discussed
above (Salanger-Meyer, 1990). For a variety of rea-
sons, medicine is an interesting domain of research.
The need for information systems to support physi-
cians at the point of care has been well studied (Cov-
ell et al., 1985; Gorman et al., 1994; Ely et al.,
2005). Retrieval techniques can have a large im-
pact on how physicians access and leverage clini-
cal evidence. Information that satisfies physicians’
needscanbefoundintheMEDLINEdatabasemain-
tained by the U.S. National Library of Medicine
65
(NLM), which also serves as a readily available
corpus of abstracts for our experiments. Further-
more, the availability of rich ontological resources,
in the form of the Unified Medical Language Sys-
tem (UMLS) (Lindberg et al., 1993), and the avail-
ability of software that leverages this knowledge—
MetaMap(Aronson,2001)forconceptidentification
and SemRep (Rindflesch and Fiszman, 2003) for re-
lation extraction—provide a foundation for studying
the role of semantics in various tasks.
McKnight and Srinivasan (2003) have previously
examined the task of categorizing sentences in med-
ical abstracts using supervised discriminative ma-
chine learning techniques. Building on the work of
Ruch et al. (2003) in the same domain, we present a
generative approach that attempts to directly model
the discourse structure of MEDLINE abstracts us-
ing Hidden Markov Models (HMMs); cf. (Barzilay
and Lee, 2004). Although our results were not ob-
tained from the same exact collection as those used
byauthorsofthesetwopreviousstudies,comparable
experiments suggest that our techniques are compet-
itive in terms of performance, and may offer addi-
tional advantages as well.
Discriminative approaches (especially SVMs)
have been shown to be very effective for many
supervised classification tasks; see, for exam-
ple, (Joachims, 1998; Ng and Jordan, 2001). How-
ever, theirhighcomputationalcomplexity(quadratic
inthenumberoftrainingsamples)rendersthempro-
hibitive for massive data processing. Under certain
conditions, generative approaches with linear com-
plexity are preferable, even if their performance is
lower than that which can be achieved through dis-
criminative training. Since HMMs are very well-
suited to modeling sequences, our discourse model-
ingtasklendsitselfnaturallytothisparticulargener-
ative approach. In fact, we demonstrate that HMMs
are competitive with SVMs, with the added advan-
tageoflowercomputationalcomplexity. Inaddition,
generative models can be directly applied to tackle
certain classes of problems, such as sentence order-
ing, in ways that discriminative approaches cannot
readily. In the context of machine learning, we see
our work as contributing to the ongoing debate be-
tween generative and discriminative approaches—
weprovideacasestudyinaninterestingdomainthat
begins to explore some of these tradeoffs.
2 Methods
2.1 Corpus and Data Preparation
Our experiments involved MEDLINE, the biblio-
graphicaldatabaseofbiomedicalarticlesmaintained
by the U.S. National Library of Medicine (NLM).
We used the subset of MEDLINE that was extracted
for the TREC 2004 Genomics Track, consisting of
citations from 1994 to 2003. In total, 4,591,008
records (abstract text and associated metadata) were
extracted using the Date Completed (DCOM) field
for all references in the range of 19940101 to
20031231.
Viewing structural modeling of medical abstracts
as a sentence classification task, we leveraged the
existence of so-called structured abstracts (see Fig-
ure 1 for an example) in order to obtain the appro-
priate section label for each sentence. The use of
section headings is a device recommended by the
Ad Hoc Working Group for Critical Appraisal of the
Medical Literature (1987) to help humans assess the
reliability and content of a publication and to facil-
itate the indexing and retrieval processes. Although
structured abstracts loosely adhere to the introduc-
tion, methods, results, and conclusions format, the
exact choice of section headings varies from ab-
stract to abstract and from journal to journal. In our
test collection, we observed a total of 2688 unique
section headings in structured abstracts—these were
manually mapped to the four broad classes of “intro-
duction”, “methods”, “results”, and “conclusions”.
All sentences falling under a section heading were
assigned the label of its appropriately-mapped head-
ing (naturally, the actual section headings were re-
moved in our test collection). As a concrete exam-
ple, in the abstract shown in Figure 1, the “OBJEC-
TIVE” section would be mapped to “introduction”,
the “RESEARCH DESIGN AND METHODS” sec-
tion to “methods”. The “RESULTS” and “CON-
CLUSIONS” sections map directly to our own la-
bels. In total, 308,055 structured abstracts were ex-
tracted and prepared in this manner, serving as the
complete dataset. In addition, we created a reduced
collection of 27,075 abstracts consisting of only
Randomized Controlled Trials (RCTs), which rep-
resent definitive sources of evidence highly-valued
in the clinical decision-making process.
Separately, we manually annotated 49 unstruc-
66
Integrating medical management with diabetes self-management training: a randomized control trial of the Diabetes
Outpatient Intensive Treatment program.
OBJECTIVE– This study evaluated the Diabetes Outpatient Intensive Treatment (DOIT) program, a multiday group educa-
tion and skills training experience combined with daily medical management, followed by case management over 6 months.
Using a randomized control design, the study explored how DOIT affected glycemic control and self-care behaviors over a
short term. The impact of two additional factors on clinical outcomes were also examined (frequency of case management
contacts and whether or not insulin was started during the program). RESEARCH DESIGN AND METHODS– Patients
with type 1 and type 2 diabetes in poor glycemic control (A1c ¿8.5%) were randomly assigned to DOIT or a second con-
dition, entitled EDUPOST, which was standard diabetes care with the addition of quarterly educational mailings. A total
of 167 patients (78 EDUPOST, 89 DOIT) completed all baseline measures, including A1c and a questionnaire assessing
diabetes-related self-care behaviors. At 6 months, 117 patients (52 EDUPOST, 65 DOIT) returned to complete a follow-up
A1c and the identical self-care questionnaire. RESULTS– At follow-up, DOIT evidenced a significantly greater drop in A1c
than EDUPOST. DOIT patients also reported significantly more frequent blood glucose monitoring and greater attention to
carbohydrate and fat contents (ACFC) of food compared with EDUPOST patients. An increase in ACFC over the 6-month
period was associated with improved glycemic control among DOIT patients. Also, the frequency of nurse case manager
follow-up contacts was positively linked to better A1c outcomes. The addition of insulin did not appear to be a significant
contributor to glycemic change. CONCLUSIONS– DOIT appears to be effective in promoting better diabetes care and posi-
tively influencing glycemia and diabetes-related self-care behaviors. However, it demands significant time, commitment, and
careful coordination with many health care professionals. The role of the nurse case manager in providing ongoing follow-up
contact seems important.
Figure 1: Sample structured abstract from MEDLINE.
tured abstracts of randomized controlled trials re-
trieved to answer a question about the manage-
ment of elevated low-density lipoprotein cholesterol
(LDL-C). We submitted a PubMed query (“elevated
LDL-C”) and restricted results to English abstracts
of RCTs, gathering 49 unstructured abstracts from
26 journals. Each sentence was annotated with its
section label by the third author, who is a medical
doctor—this collection served as our blind held-out
testset. Note that the annotation process preceded
our experiments, which helped to guard against
annotator-introduced bias. Of 49 abstracts, 35 con-
tained all four sections (which we refer to as “com-
plete”), while 14 abstracts were missing one or more
sections (which we refer to as “partial”).
Two different types of experiments were con-
ducted: the first consisted of cross-validation on the
structured abstracts; the second consisted of train-
ing on the structured abstracts and testing on the
unstructured abstracts. We hypothesized that struc-
tured and unstructured abstracts share the same un-
derlying discourse patterns, and that content models
trained with one can be applied to the other.
2.2 Generative Models of Content
Following Ruch et al. (2003) and Barzilay and
Lee (2004), we employed Hidden Markov Models
to model the discourse structure of MEDLINE ab-
stracts. The four states in our HMMs correspond
to the information that characterizes each section
(“introduction”, “methods”, “results”, and “conclu-
sions”) and state transitions capture the discourse
flow from section to section.
Using the SRI language modeling toolkit, we
first computed bigram language models for each
of the four sections using Kneser-Ney discounting
and Katz backoff. All words in the training set
were downcased, all numbers were converted into
a generic symbol, and all singleton unigrams and bi-
grams were removed. Using these results, each sen-
tence was converted into a four dimensional vector,
where each component represents the log probabil-
ity, divided by the number of words, of the sentence
under each of the four language models.
We then built a four-state Hidden Markov Model
that outputs these four-dimensional vectors. The
transition probability matrix of the HMM was ini-
tialized with uniform probabilities over a fully
connected graph. The output probabilities were
modeled as four-dimensional Gaussians mixtures
with diagonal covariance matrices. Using the sec-
tion labels, the HMM was trained using the HTK
toolkit (Young et al., 2002), which efficiently per-
forms the forward-backward algorithm and Baum-
Welch estimation. For testing, we performed a
Viterbi (maximum likelihood) estimation of the la-
bel of each test sentence/vector (also using the HTK
toolkit).
67
In an attempt to further boost performance, we
employed Linear Discriminant Analysis (LDA) to
find a linear projection of the four-dimensional vec-
tors that maximizes the separation of the Gaussians
(corresponding to the HMM states). Venables and
Ripley (1994) describe an efficient algorithm (of lin-
ear complexity in the number of training sentences)
for computing the LDA transform matrix, which en-
tails computing the within- and between-covariance
matricesoftheclasses,andusingSingularValueDe-
composition (SVD) to compute the eigenvectors of
the new space. Each sentence/vector is then mul-
tiplied by this matrix, and new HMM models are
re-computed from the projected data.
An important aspect of our work is modeling con-
tent structure using generative techniques. To as-
sess the impact of taking discourse transitions into
account, we compare our fully trained model to
one that does not take advantage of the Markov
assumption—i.e., it assumes that the labels are in-
dependently and identically distributed.
To facilitate comparison with previous work, we
also experimented with binary classifiers specifi-
cally tuned to each section. This was done by creat-
ing a two-state HMM: one state corresponds to the
label we want to detect, and the other state corre-
sponds to all the other labels. We built four such
classifiers, one for each section, and trained them in
the same manner as above.
3 Results
We report results on three distinct sets of experi-
ments: (1) ten-fold cross-validation (90/10 split) on
all structured abstracts from the TREC 2004 MED-
LINE corpus, (2) ten-fold cross-validation (90/10
split) on the RCT subset of structured abstracts from
the TREC 2004 MEDLINE corpus, (3) training on
the RCT subset of the TREC 2004 MEDLINE cor-
pus and testing on the 49 hand-annotated held-out
testset.
The results of our first set of experiments are
shown in Tables 1(a) and 1(b). Table 1(a) reports
the classification error in assigning a unique label to
every sentence, drawn from the set {“introduction”,
“methods”, “results”, “conclusions”}. For this task,
we compare the performance of three separate mod-
els: one that does not make the Markov assumption,
Model Error
non-HMM .220
HMM .148
HMM + LDA .118
(a)
Section Acc Prec Rec F
Introduction .957 .930 .840 .885
Methods .921 .810 .875 .843
Results .921 .898 .898 .898
Conclusions .963 .898 .896 .897
(b)
Table 1: Ten-fold cross-validation results on all
structured abstracts from the TREC 2004 MED-
LINE corpus: multi-way classification on complete
abstract structure (a) and by-section binary classifi-
cation (b).
the basic four-state HMM, and the improved four-
state HMM with LDA. As expected, explicitly mod-
eling the discourse transitions significantly reduces
the error rate. Applying LDA further enhances clas-
sification performance. Table 1(b) reports accuracy,
precision, recall, and F-measure for four separate bi-
nary classifiers specifically trained for each of the
sections (one per row in the table). We only dis-
play results with our best model, namely HMM with
LDA.
The results of our second set of experiments (with
RCTs only) are shown in Tables 2(a) and 2(b).
Table 2(a) reports the multi-way classification er-
ror rate; once again, applying the Markov assump-
tion to model discourse transitions improves perfor-
mance, and using LDA further reduces error rate.
Table 2(b) reports accuracy, precision, recall, and F-
measure for four separate binary classifiers (HMM
with LDA) specifically trained for each of the sec-
tions (one per row in the table). The table also
presents the closest comparable experimental re-
sults reported by McKnight and Srinivasan (2003).1
McKnight and Srinivasan (henceforth, M&S) cre-
ated a test collection consisting of 37,151 RCTs
from approximately 12 million MEDLINE abstracts
dated between 1976 and 2001. This collection has
1After contacting the authors, we were unable to obtain the
same exact dataset that they used for their experiments.
68
Model Error
non-HMM .238
HMM .212
HMM + LDA .209
(a)
Present study McKnight and Srinivasan
Section Acc Prec Rec F Acc Prec Rec F
Introduction .931 .898 .715 .807 .967 .920 .970 .945
Methods .904 .812 .847 .830 .895 .810 .830 .820
Results .902 .902 .831 .867 .860 .810 .830 .820
Conclusions .929 .772 .790 .781 .970 .880 .910 .820
(b)
Table 2: Ten-fold cross-validation results on the structured RCT subset of the TREC 2004 MEDLINE
corpus: multi-way classification (a) and binary classification (b). Table (b) also reproduces the results from
McKnight and Srinivasan (2003) for a comparable task on a different RCT-subset of structured abstracts.
Model Complete Partial
non-HMM .247 .371
HMM .226 .314
HMM + LDA .217 .279
(a)
Complete Partial McKnight and Srinivasan
Section Acc Prec Rec F Acc Prec Rec F Acc Prec Rec F
Introduction .923 .739 .723 .731 .867 .368 .636 .502 .896 .630 .450 .524
Methods .905 .841 .793 .817 .859 .958 .589 .774 .897 .880 .730 .799
Results .899 .913 .857 .885 .892 .942 .830 .886 .872 .840 .880 .861
Conclusions .911 .639 .847 .743 .884 .361 .995 .678 .941 .830 .750 .785
(b)
Table 3: Training on the structured RCT subset of the TREC 2004 MEDLINE corpus, testing on corpus of
hand-annotated abstracts: multi-way classification (a) and binary classification (b). Unstructured abstracts
with all four sections (complete), and with missing sections (partial) are shown. Table (b) again repro-
duces the results from McKnight and Srinivasan (2003) for a comparable task on a different subset of 206
unstructured abstracts.
69
significantlymoretrainingexamplesthanourcorpus
of 27,075 abstracts, which could be a source of per-
formance differences. Furthermore, details regard-
ing their procedure for mapping structured abstract
headings to one of the four general labels was not
discussed in their paper. Nevertheless, our HMM-
based approach is at least competitive with SVMs,
perhaps better in some cases.
The results of our third set of experiments (train-
ing on RCTs and testing on a held-out testset of
hand-annotated abstracts) is shown in Tables 3(a)
and 3(b). Mirroring the presentation format above,
Table 3(a) shows the classification error for the four-
way label assignment problem. We noticed that
some unstructured abstracts are qualitatively differ-
ent from structured abstracts in that some sections
are missing. For example, some unstructured ab-
stracts lack an introduction, and instead dive straight
into methods; other unstructured abstracts lack a
conclusion. As a result, classification error is higher
in this experiment than in the cross-validation ex-
periments. We report performance figures for 35 ab-
stracts that contained all four sections (“complete”)
and for 14 abstracts that had one or more miss-
ing sections (“partial”). Table 3(b) reports accu-
racy, precision, recall, and F-measure for four sep-
arate binary classifiers (HMM with LDA) specifi-
cally trained for each section (one per row in the
table). The table also presents the closest compa-
rable experimental results reported by M&S—over
206 hand-annotated unstructured abstracts. Interest-
ingly, M&S did not specifically note missing sec-
tions in their testset.
4 Discussion
An interesting aspect of our generative approach
is that we model HMM outputs as Gaussian vec-
tors (log probabilities of observing entire sentences
based on our language models), as opposed to se-
quences of terms, as done in (Barzilay and Lee,
2004). This technique provides two important ad-
vantages. First, Gaussian modeling adds an ex-
tra degree of freedom during training, by capturing
second-order statistics. This is not possible when
modeling word sequences, where only the probabil-
ity of a sentence is actually used in the HMM train-
ing. Second, using continuous distributions allows
ustoleverageavarietyoftools(e.g., LDA)thathave
been shown to be successful in other fields, such as
speech recognition (Evermann et al., 2004).
Table 2(b) represents the closest head-to-head
comparison between our generative approach
(HMM with LDA) and state-of-the-art results
reported by M&S using SVMs. In some ways, the
results reported by M&S have an advantage because
they use significantly more training examples. Yet,
we can see that generative techniques for the model-
ing of content structure are at least competitive—we
even outperform SVMs on detecting “methods”
and “results”. Moreover, the fact that the training
and testing of HMMs have linear complexity (as
opposed to the quadratic complexity of SVMs)
makes our approach a very attractive alternative,
given the amount of training data that is available
for such experiments.
Although exploration of the tradeoffs between
generative and discriminative machine learning
techniques is one of the aims of this work, our ul-
timate goal, however, is to build clinical systems
that provide timely access to information essential
to the patient treatment process. In truth, our cross-
validation experiments do not correspond to any
meaningful naturally-occurring task—structured ab-
stracts are, after all, already appropriately labeled.
The true utility of content models is to struc-
ture abstracts that have no structure to begin with.
Thus, our exploratory experiments in applying con-
tent models trained with structured RCTs on un-
structured RCTs is a closer approximation of an
extrinsically-valid measure of performance. Such a
component would serve as the first stage of a clin-
ical question answering system (Demner-Fushman
and Lin, 2005) or summarization system (McKe-
own et al., 2003). We chose to focus on randomized
controlled trials because they represent the standard
benchmark by which all other clinical studies are
measured.
Table 3(b) shows the effectiveness of our trained
content models on abstracts that had no explicit
structure to begin with. We can see that although
classification accuracy is lower than that from our
cross-validation experiments, performance is quite
respectable. Thus, our hypothesis that unstructured
abstracts are not qualitatively different from struc-
tured abstracts appears to be mostly valid.
70
5 Related Work
Although not the first to employ a generative ap-
proach to directly model content, the seminal work
of Barzilay and Lee (2004) is a noteworthy point
of reference and comparison. However, our study
differs in several important respects. Barzilay and
Lee employed an unsupervised approach to building
topic sequence models for the newswire text genre
using clustering techniques. In contrast, because
the discourse structure of medical abstracts is well-
defined and training data is relatively easy to ob-
tain, we were able to apply a supervised approach.
Whereas Barzilay and Lee evaluated their work in
the context of document summarization, the four-
part structure of medical abstracts allows us to con-
duct meaningful intrinsic evaluations and focus on
the sentence classification task. Nevertheless, their
work bolsters our claims regarding the usefulness of
generative models in extrinsic tasks, which we do
not describe here.
Although this study falls under the general topic
of discourse modeling, our work differs from previ-
ous attempts to characterize text in terms of domain-
independent rhetorical elements (McKeown, 1985;
Marcu and Echihabi, 2002). Our task is closer to the
work of Teufel and Moens (2000), who looked at the
problem of intellectual attribution in scientific texts.
6 Conclusion
We believe that there are two contributions as a re-
sult of our work. From the perspective of machine
learning, the assignment of sequentially-occurring
labels represents an underexplored problem with re-
spect to the generative vs. discriminative debate—
previous work has mostly focused on stateless clas-
sification tasks. This paper demonstrates that Hid-
den Markov Models are capable of capturing dis-
course transitions from section to section, and are
at least competitive with Support Vector Machines
from a purely performance point of view.
The other contribution of our work is that it con-
tributes to building advanced clinical information
systems. From an application point of view, the abil-
ity to assign structure to otherwise unstructured text
represents a key capability that may assist in ques-
tion answering, document summarization, and other
natural language processing applications.
Much research in computational linguistics has
focused on corpora comprised of newswire articles.
We would like to point out that clinical texts provide
another attractive genre in which to conduct experi-
ments. Such texts are easy to acquire, and the avail-
ability of domain ontologies provides new opportu-
nities for knowledge-rich approaches to shine. Al-
though we have only experimented with lexical fea-
tures in this study, the door is wide open for follow-
on studies based on semantic features.
7 Acknowledgments
The first author would like to thank Esther and Kiri
for their loving support.

References
Ad Hoc Working Group for Critical Appraisal of the
Medical Literature. 1987. A proposal for more infor-
mative abstracts of clinical articles. Annals of Internal
Medicine, 106:595–604.
Alan R. Aronson. 2001. Effective mapping of biomed-
ical text to the UMLS Metathesaurus: The MetaMap
program. In Proceeding of the 2001 Annual Sympo-
sium of the American Medical Informatics Association
(AMIA 2001), pages 17–21.
Regina Barzilay and Lillian Lee. 2004. Catching the
drift: Probabilistic content models, with applications
to generation and summarization. In Proceedings
of the 2004 Human Language Technology Confer-
ence and the North American Chapter of the Associ-
ation for Computational Linguistics Annual Meeting
(HLT/NAACL 2004).
David G. Covell, Gwen C. Uman, and Phil R. Manning.
1985. Information needs in office practice: Are they
being met? Annals of Internal Medicine, 103(4):596–
599, October.
Dina Demner-Fushman and Jimmy Lin. 2005. Knowl-
edge extraction for clinical question answering: Pre-
liminary results. In Proceedings of the AAAI-05 Work-
shop on Question Answering in Restricted Domains.
Dina Demner-Fushman, Susan E. Hauser, and George R.
Thoma. 2005. The role of title, metadata and ab-
stract in identifying clinically relevant journal arti-
cles. In Proceeding of the 2005 Annual Symposium of
the American Medical Informatics Association (AMIA
2005), pages 191–195.
John W. Ely, Jerome A. Osheroff, M. Lee Chambliss,
Mark H. Ebell, and Marcy E. Rosenbaum. 2005. An-
swering physicians’ clinical questions: Obstacles and
potential solutions. Journal of the American Medical
Informatics Association, 12(2):217–224, March-April.
Gunnar Evermann, H. Y. Chan, Mark J. F. Gales, Thomas
Hain, Xunying Liu, David Mrva, Lan Wang, and Phil
Woodland. 2004. Development of the 2003 CU-HTK
Conversational Telephone Speech Transcription Sys-
tem. In Proceedings of the 2004 International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP04).
Clifford W. Gay, Mehmet Kayaalp, and Alan R. Aronson.
2005. Semi-automatic indexing of full text biomedi-
cal articles. In Proceeding of the 2005 Annual Sympo-
sium of the American Medical Informatics Association
(AMIA 2005), pages 271–275.
Paul N. Gorman, Joan S. Ash, and Leslie W. Wykoff.
1994. Can primary care physicians’ questions be an-
swered using the medical journal literature? Bulletin
of the Medical Library Association, 82(2):140–146,
April.
Thorsten Joachims. 1998. Text categorization with Sup-
port Vector Machines: Learning with many relevant
features. In Proceedings of the European Conference
on Machine Learning (ECML 1998).
Donald A. Lindberg, Betsy L. Humphreys, and Alexa T.
McCray. 1993. The Unified Medical Language Sys-
tem. Methods of Information in Medicine, 32(4):281–
291, August.
Daniel Marcu and Abdessamad Echihabi. 2002. An
unsupervised approach to recognizing discourse rela-
tions. In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics (ACL
2002).
Kathleen McKeown, Noemie Elhadad, and Vasileios
Hatzivassiloglou. 2003. Leveraging a common rep-
resentation for personalized search and summarization
in a medical digital library. In Proceedings of the
3rd ACM/IEEE Joint Conference on Digital Libraries
(JCDL 2003).
Kathleen R. McKeown. 1985. Text Generation: Using
Discourse Strategies and Focus Constraints to Gen-
erate Natural Language Text. Cambridge University
Press, Cambridge, England.
Larry McKnight and Padmini Srinivasan. 2003. Catego-
rization of sentence types in medical abstracts. In Pro-
ceeding of the 2003 Annual Symposium of the Ameri-
can Medical Informatics Association (AMIA 2003).
Yoko Mizuta, Anna Korhonen, Tony Mullen, and Nigel
Collier. 2005. Zone analysis in biology articles as a
basisforinformationextraction. InternationalJournal
of Medical Informatics, in press.
Andrew Y. Ng and Michael Jordan. 2001. On discrim-
inative vs. generative classifiers: A comparison of lo-
gisticregressionandnaiveBayes. InAdvancesinNeu-
ral Information Processing Systems 14.
Constantin Or˘asan. 2001. Patterns in scientific abstracts.
In Proceedings of the 2001 Corpus Linguistics Confer-
ence.
Thomas C. Rindflesch and Marcelo Fiszman. 2003. The
interaction of domain knowledge and linguistic struc-
ture in natural language processing: Interpreting hy-
pernymic propositions in biomedical text. Journal of
Biomedical Informatics, 36(6):462–477, December.
Patrick Ruch, Christine Chichester, Gilles Cohen, Gio-
vanni Coray, Fr´ed´eric Ehrler, Hatem Ghorbel, Hen-
ning M¨uller, and Vincenzo Pallotta. 2003. Report
on the TREC 2003 experiment: Genomic track. In
Proceedings of the Twelfth Text REtrieval Conference
(TREC 2003).
Franc¸oise Salanger-Meyer. 1990. Discoursal movements
in medical English abstracts and their linguistic expo-
nents: A genre analysis study. INTERFACE: Journal
of Applied Linguistics, 4(2):107–124.
John M. Swales. 1990. Genre Analysis: English in Aca-
demic and Research Settings. Cambridge University
Press, Cambridge, England.
Imad Tbahriti, Christine Chichester, Fr´ed´erique Lisacek,
and Patrick Ruch. 2005. Using argumentation to re-
trieve articles with similar citations: An inquiry into
improving related articles search in the MEDLINE
digital library. International Journal of Medical In-
formatics, in press.
Simone Teufel and Marc Moens. 2000. What’s yours
and what’s mine: Determining intellectual attribu-
tion in scientific text. In Proceedings of the Joint
SIGDAT Conference on Empirical Methods in Nat-
ural Language Processing and Very Large Corpora
(EMNLP/VLC-2000).
William N. Venables and Brian D. Ripley. 1994. Modern
Applied Statistics with S-Plus. Springer-Verlag.
Steve Young, Gunnar Evermann, Thomas Hain, Dan Ker-
shaw, Gareth Moore, Julian Odell, Dave Ollason, Dan
Povey, Valtcho Valtchev, and Phil Woodland. 2002.
The HTK Book. Cambridge University Press.
