Analysis of Link Grammar on Biomedical Dependency Corpus
Targeted at Protein-Protein Interactions
Sampo Pyysalo, Filip Ginter, Tapio Pahikkala,
Jorma Boberg, Jouni J˜arvinen, Tapio Salakoski
Turku Centre for Computer Science (TUCS)
and Dept. of computer science, University of Turku
Lemmink˜aisenkatu 14A
20520 Turku, Finland,
flrst name.last name@it.utu.fl
Jeppe Koivula
MediCel Ltd.,
Haartmaninkatu 8
00290 Helsinki, Finland,
jeppe.koivula@medicel.com
Abstract
In this paper, we present an evaluation of the
Link Grammar parser on a corpus consisting
of sentences describing protein-protein interac-
tions. We introduce the notion of an interac-
tion subgraph, which is the subgraph of a de-
pendencygraphexpressingaprotein-proteinin-
teraction. We measure the performance of the
parserforrecoveryofdependencies,fullycorrect
linkages and interaction subgraphs. We analyze
the causes of parser failure and report speciflc
causes of error, and identify potential modiflca-
tions to the grammar to address the identifled
issues. We also report and discuss the efiect of
an extension to the dictionary of the parser.
1 Introduction
Thechallengesofprocessingthevastamountsof
biomedical publications available in databases
such as MEDLINE have recently attracted a
considerable interest in the Natural Language
Processing (NLP) research community. The
task of information extraction, commonly tar-
geting entity relationships, such as protein-
protein interactions, is an often studied prob-
lem to which various NLP methods have been
applied, ranging from keyword-based methods
(see, e.g., Ginter et al. (2004)) to full syntactic
analysis as employed, for example, by Craven
and Kumlien (1999), Temkin and Gilder (2003)
and Daraselia et al. (2004).
In this paper, we focus on the syntactic anal-
ysis component of an information extraction
system targeted to flnd protein-protein inter-
actions from the dependency output produced
by the Link Grammar1 (LG) parser of Sleator
andTemperley(1991). Tworecentpapersstudy
LGinthecontextofbiomedicalNLP.Thework
by Szolovits (2003) proposes a fully automated
method to extend the dictionary of the LG
1http://www.link.cs.cmu.edu/link/
parser with the UMLS Specialist2 lexicon, and
Ding et al. (2003) perform a basic evaluation
ofLGperformanceonbiomedicaltext. Asboth
paperssuggest, LGwillrequiremodiflcationsin
order to provide a correct analysis of grammat-
ical phenomena that are rare in general English
text, but common in biomedical language. Im-
plementing such modiflcations is a major efiort
that requires a careful analysis of the perfor-
mance of the LG parser to identify the most
common causes of parsing failures and to target
modiflcation efiorts.
While Szolovits (2003) does not attempt to
evaluate parser performance at all and Ding
et al. (2003) provide only an informal evalua-
tion on manually simplifled sentences, we focus
on a more formal evaluation of the LG parser.
For the purpose of this study and also for sub-
sequent research of biomedical information ex-
traction with the LG parser, we have developed
a hand-annotated corpus consisting of unmod-
ifled sentences from publications. We use this
corpus to evaluate the performance of the LG
parser and to identify problems and potential
improvements to the grammar and parser.
2 Link Grammar and parser
The Link Grammar and its parser represent an
implementation of a dependency-based compu-
tationalgrammar. TheresultofLGanalysisfor
a sentence is a labeled undirected simple graph,
whosenodesrepresentthewordsofthesentence
and whose edges and their labels express the
grammatical relationships between the words.
In LG terminology, the graph is called a link-
age, and its edges are called links. The linkage
must be planar (i.e., links must not cross) when
drawn above the words of the sentence, and the
labels of the links must satisfy the linking con-
straintsspecifledforeachwordinthegrammar.
A connected linkage is termed complete.
2http://www.nlm.nih.gov/research/umls/
15
findings suggest that PIP2 binds to proteins such as profilin
Figure 1: Annotation example. The interaction of two proteins, PIP2 and profllin, is stated by
the words binds to. The links joining these words form the interaction subgraph (drawn with solid
lines).
Due to the structural ambiguity of natural
language, several linkages can typically be con-
structed for an input sentence. In such cases,
the LG parser enumerates all linkages allowed
by the grammar. Apost-processing step isthen
employedtoenforceanumberofadditionalcon-
straints. The number of linkages for some sen-
tencescanbeveryhigh,makingpost-processing
and storage prohibitively expensive. This prob-
lem is addressed in the LG parser by deflning
kmax, the maximal number of linkages to be
post-processed. If the parsing algorithm pro-
duces more than kmax linkages, the output is
reduced to kmax linkages by random sampling.
Thelinkagesarethenorderedfrombesttoworst
using heuristic goodness criteria.
In order to be usable in practice, a parser is
typically required to provide a partial analysis
ofasentenceforwhichitcannotconstructafull
analysis. If the LG parser cannot construct a
complete linkage for a sentence, the connected-
ness requirement is relaxed so that some words
do not belong to the linkage at all. The LG
parser is also time-limited. If the full set of
linkages cannot be constructed in a given time
tmax, theparserentersa panic mode,inwhichit
performsane–cientbutconsiderablyrestricted
parse, resulting in reduced performance. The
parameters tmax and kmax set the trade-ofi be-
tween the qualitative performance and the re-
source e–ciency of the parser.
3 Corpus annotation and interaction
subgraphs
To compile a corpus of sentences describing
protein-protein interactions, we flrst selected
pairs of proteins that are known to interact
from the Database of Interacting Proteins3. We
entered these pairs as search terms into the
PubMed retrieval system. We then split the
publication abstracts returned by the searches
into sentences and included titles. These were
again searched for the protein pairs. This gave
us a set of 1927 sentences that contain the
3http://dip.doe-mbi.ucla.edu/
names of at least two proteins that are known
to interact. A domain expert annotated these
sentences for protein names and for words stat-
ing their interactions. Of these sentences, 1114
described at least one protein-protein interac-
tion.
Thereafter, weperformedadependencyanal-
ysis and produced annotation of dependencies.
To minimize the amount of mistakes, each sen-
tence was independently annotated by two an-
notators and difierences were then resolved by
discussion. The assigned dependency structure
was produced according to the LG linkage con-
ventions. Link types were not included in the
annotation, and no cycles were introduced in
the dependency graphs. All ambiguities where
the LG parser is capable of at least enumerat-
ingallalternatives(suchasprepositionalphrase
attachment) were enforced in the annotation.
A random sample consisting of 300 sentences,
including 28 publication titles, has so far been
fully annotated, giving 7098 word-to-word de-
pendencies. This set of sentences is the corpus
we refer to in the following sections.
An information extraction system targeted
at protein-protein interactions and their types
needstoidentifythreeconstituentsthatexpress
an interaction in a sentence: the proteins in-
volved and the word or phrase that states their
interaction and suggests the type of this inter-
action. To extract this information from a LG
linkage, the links connecting these items must
be recovered correctly by the parser. The fol-
lowing deflnition formalizes this notion.
Deflnition 1 (Interaction subgraph) The
interaction subgraph for an interaction between
two proteins A and B in a linkage L is the
minimal connected subgraph of L that contains
A, B, and the word or phrase that states their
interaction.
The recovery of a connected component con-
taining the protein names and the interaction
word is not su–cient: by the deflnition of a
complete linkage, such a component is always
present. Consequently, the exact set of links
16
thatformstheinteractionsubgraphmustbere-
covered.
For each interaction stated in a sentence,
the corpus annotation specifles the proteins in-
volved and the interaction word. The interac-
tion subgraph for each interaction can thus be
extracted automatically from the corpus. Be-
cause the corpus does not contain cyclic depen-
dencies, the interaction subgraphs are unique.
366 interaction subgraphs were identifled from
the corpus, one for each described interaction.
Theinteractionsubgraphscanbepartiallyover-
lapping, because a single link can be part of
more than one interaction subgraph. Figure 1
shows an example of an annotated text frag-
ment.
4 Evaluation criteria
We evaluated the performance of the LG parser
accordingtothefollowingthreequantitativecri-
teria:
† Number of dependencies recovered
† Number of fully correct linkages
† Numberofinteractionsubgraphsrecovered
The number of recovered dependencies gives an
estimate of the probability that a dependency
will be correctly identifled by the LG parser
(this criterion is also employed by, e.g., Collins
etal. (1999)). Thenumberoffullycorrectlink-
ages, i.e. linkages where all annotated depen-
dencies are recovered, measures the fraction of
sentences that are parsed without error. How-
ever, a fully correct linkage is not necessary to
extract protein-protein interactions from a sen-
tence; to estimate how many interactions can
potentially be recovered, we measure the num-
ber of interaction subgraphs for which all de-
pendencies were recovered.
For each criterion, we measure the perfor-
mance for the flrst linkage returned by the
parser. However, the flrst linkage as ordered
by the heuristics of the LG parser was often not
the best (according to the criteria above) of the
linkages returned by the parser. To separate
the efiect of the heuristics from overall LG per-
formance, we identify separately for each of the
three criteria the best linkage among the link-
ages returned by the parser, and we also report
performance for the best linkages.
We further divide the parsed sentences into
three categories: (1) sentences for which the
time tmax for producing a normal parse was ex-
hausted and the parser entered panic mode, (2)
sentences where linkages were sampled because
more than kmax linkages were produced, and
(3) stable sentences for which neither of these
occurred. A full analysis of all linkages that
the grammar allows is only possible for stable
sentences. For sentences in the other two cat-
egories, random efiects may afiect the results:
sentencesforwhichmorethan kmax linkagesare
produced are subject to randomness in sam-
pling, and sentences where the parser enters
panic mode were always subject to subsequent
sampling in our experiments.
5 Evaluation
To evaluate the ability of the LG parser to pro-
duce correct linkages, we increased the number
of stable sentences by setting the tmax param-
eter to 10 minutes and the kmax parameter to
10000 instead of using the defaults tmax = 30
seconds and kmax = 1000. When parsing the
corpus using these parameters, 28 sentences fell
into the panic category, 61 into the sampled
category, and 211 were stable. The measured
parser performance for the corpus is presented
in Table 1.
While the fraction of sentences that have a
fully correct linkage as the flrst linkage is quite
low (approximately 7%), for 28% of sentences
the parser is capable of producing a fully cor-
rect linkage. Performance was especially poor
for the publication titles in the corpus. Because
titles are typically fragments not containing a
verb, and LG is designed to model full clauses,
the parser failed to produce a fully correct link-
age for any of the titles.
The performance for recovered interaction
subgraphs is more encouraging, as 25% of the
subgraphswererecoveredintheflrstlinkageand
morethanhalfinthebestlinkage. Yetmanyin-
teraction subgraphs remain unrecovered by the
parser: the results suggest an upper limit of
approximately 60% to the fraction of protein-
protein interactions that can be recovered from
any linkage produced by the unmodifled LG.
In the following sections we further analyze the
reasons why the parser fails to recover all de-
pendencies.
5.1 Panics
No fully correct linkages and very few interac-
tion subgraphs were found in the panic mode.
This efiect may be partly due to the complex-
ity of the sentences for which the parser en-
17
Category
Criterion Linkage Stable Sampled Panic Overall
Dependency First linkage 3242 (80.0%) 1376 (74.3%) 569 (52.3%) 5187 (73.1%)
Best linkage 3601 (86.6%) 1576 (85.0%) 620 (57.0%) 5797 (81.7%)
Total 4157 1853 1088 7098
Fully correct First linkage 22 (10.4%) 0 (0.0%) 0 (0.0%) 22 (7.3%)
Best linkage 79 (37.4%) 6 (9.8%) 0 (0.0%) 85 (28.3%)
Total 211 61 28 300
Interaction First linkage 75 (30.5%) 16 (20.2%) 0 (0.0%) 91 (24.9%)
subgraph Best linkage 156 (63.4%) 49 (62.0%) 4 (9.8%) 209 (57.1%)
Total 246 79 41 366
Table1: Parserperformance. Thefractionoffulfllledcriteriaisshownbycategory(thecriteriaand
categories are explained in Section 4). The total rows give the number of criteria for each category,
and the overall column gives combined results for all categories.
tered panic mode. The efiect of panics can
be better estimated by forcing the parser to
bypass standard parsing and to directly apply
panic options. For the 272 sentences where the
parser did not enter the panic mode, 77% of
dependencies were recovered in the flrst link-
age. Whenthesesentenceswereparsedinforced
panic mode, 67% of dependencies were recov-
ered, suggesting that on average parses in panic
mode recover approximately 10% fewer depen-
dencies than in standard parsing mode. Simi-
larly, the number of fully correct flrst linkages
decreased from 22 to 6 and the number of inter-
action subgraphs recovered in the flrst linkage
from 91 to 65. These numbers indicate that
panics are a signiflcant cause of error.
Experiments indicate than on a 1GHz ma-
chine approximately 40% of sentences can be
fully parsed in under a second, 80% in under 10
secondsand90%within10minutes;yetapprox-
imately5%ofsentencestakemorethananhour
to fully parse. With tmax set to 10 minutes, the
total parsing time was 165 minutes.
Long parsing times are caused by ambiguous
sentencesforwhichtheparsercreatesthousands
or even millions of alternative linkages. In ad-
dition to simply increasing the time limit, the
fractionofsentenceswheretheparserentersthe
panic mode could therefore be reduced by re-
ducing the ambiguity of the sentences, for ex-
ample,byextendingthedictionaryoftheparser
(see Section 7).
5.2 Heuristics
When several linkages are produced for a sen-
tence, the LG parser applies heuristics to or-
der the sentences so that linkages that are more
likely to be correct are presented flrst. The
heuristics are based on examination and intu-
itions on general English, and may not be opti-
mal for biomedical text. Note in Table 1 that
both for recovered full linkages and interaction
subgraphs,thenumberofitemsthatwererecov-
ered in the best linkage is more than twice the
numberrecoveredintheflrstlinkage,suggesting
that a better ordering heuristic could dramat-
ically improve the performance of the parser.
Such improvements could perhaps be achieved
by tuning the heuristics to the domain or by
adopting a probabilistic ordering model.
6 Failure analysis
A signiflcant fraction of dependencies were not
recovered in any linkage, even in sentences
whereresourceswerenotexhausted. Inorderto
identify reasons for the parser failing to recover
the correct dependencies, we analyze sentences
for which it is certain that the grammar cannot
produce a fully correct linkage. We thus ana-
lyzed the 132 stable sentences for which some
dependencies were not recovered.
For each sentence, we attempt to identify the
reason for the failure of the parser. For each
identifledreason, wemanuallyeditthesentence
to remove the source of failure. We repeat this
procedure until the parser is capable of pro-
ducing a correct parse for the sentence. Note
that this implies that also the interaction sub-
graphs in the sentence are correctly recovered,
and therefore the reasons for failures to recover
interaction subgraphs are a subset of the iden-
tifled issues. The results of the analysis are
18
Reason for failure Cases
Unknown grammatical structure 72(34.4%)
Dictionary issue 54(25.8%)
Unknown word handling 35(16.7%)
Sentence fragment 27(12.9%)
Ungrammatical sentence 17(8.1%)
Other 4(1.9%)
Table 2: Results of failure analysis
summarized in Table 2. In many of the sen-
tences, more than one reason for parser failure
was found; in total 209 issues were identifled in
the 132 sentences. The results are described in
more detail in the following sections.
6.1 Fragments and ungrammatical
sentences
As some of the analyzed sentences were taken
from publication titles, not all of them were
full clauses. To identify further problems when
parsing fragments not containing a verb, the
phrase \is explained" and required determiners
wereaddedtothesefragments,atechniqueused
also by Ding et al. (2003). The completed frag-
ments were then analyzed for potential further
problems.
A number of other ungrammatical sentences
were also encountered. The most common
problem was the omission of determiners, but
some other issues such as missing possessive
markers and errors in agreement (e.g., \expres-
sions...has") were also encountered.
Ungrammatical sentences pose interesting
challenges for parsing. Because many authors
are not native English speakers, a greater toler-
ance for grammatical mistakes should allow the
parser to identify the intended parse for more
sentences. Similarly, the ability to parse publi-
cation titles would extend the applicability of
the parser; in some cases it may be possible
to extract information concerning the key flnd-
ings of a publication from the title. However,
while relaxing completeness and correctness re-
quirements,suchasmandatorydeterminersand
subject-predicate agreement, would allow the
parsertocreateacompletelinkageformoresen-
tences, it would also be expected to lead to in-
creased ambiguity for all sentences, and subse-
quent di–culties in identifying the correct link-
age. If the ability to parse titles is considered
important, a potential solution not incurring
this cost would be to develop a separate version
of the grammar for parsing titles.
capping protein and actin genes
capping protein and actin genes
Figure 2: Multiple modifler coordination prob-
lem. Above: correct linkage disallowed by the
LG parser. Below: solution by chaining modi-
flers.
6.2 Unknown grammatical structures
ThemethodoftheLGimplementationforpars-
ing coordinations was found to be a frequent
cause of failures. A speciflc coordination prob-
lem occurs with multiple noun-modiflers: the
parser assumes that coordinated constituents
can be connected to the rest of the sentence
through exactlyoneword, andthegrammarat-
taches all noun-modiflers to the head. Biomed-
ical texts frequently contain phrases that cause
these requirements to con ict: for example, in
the phrase \capping protein and actin genes"
(where \capping protein genes" and \actin
genes" is the intended parse), the parser allows
only one of the words \capping" and \protein"
to connect to the word \genes", and is thus un-
able to produce the correct linkage (for illustra-
tion, see Figure 2(a)).
This multiple modifler coordination issue
could be addressed by modifying the grammar
to chain modiflers (Figure 2(b)). This alterna-
tive model is adopted by another major depen-
dency grammar, the EngCG-based Connexor
Machinese. The problem could also be ad-
dressed by altering the coordination handling
system in the parser.
Other identifled grammatical structures not
known to the parser were number postmodiflers
to nouns (e.g., \serine 38"), speciflers in paren-
theses (e.g., \profllin mutant (H119E)"), coor-
dinationwiththephrase\butnot", andvarious
unknown uses of colons and quotes. Single in-
stances of several distinct unknown grammati-
cal structures were also noted (e.g., \5 to 10",
\as expected from", \most concentrated in").
Most of these issues can be addressed by local
modiflcations to the grammar.
6.3 Unknown word handling
The LG parser assigns unknown words to cate-
gories based on morphological or other surface
clues when possible. For remaining unknown
19
words, parses are attempted by assigning the
words to the generic noun, verb and adjective
types in all possible combinations.
Some problems with the unknown word pro-
cessing method were encountered during analy-
sis; for example, the assumption that unknown
capitalizedwordsarepropernounsoftencaused
failures, especially in sentences beginning with
an unknown word. Similarly, the assumption
that words containing a hyphen behave as ad-
jectives was violated by a number of unknown
verbs (e.g., \cross-links").
Another problem that was noted occurred
with lowercase unknown words that should be
treated as proper nouns: because LG does not
allowunknownlowercasewordstoactasproper
nouns, the parser assigns incorrect structure to
a number of phrases containing words such as
\actin". Improving unknown word handling re-
quires some modiflcations to the LG parser.
6.4 Dictionary issues
CaseswheretheLGdictionarycontainsaword,
but not in the sense in which it appears in a
sentence, almost always lead to errors. For ex-
ample, the LG dictionary does not contain the
word \assembly" in the sense \construction",
causing the parser to erroneously require a de-
terminer for \protein assembly"4. A related
frequent problem occurred with proper names
headedbyacommonnoun,wheretheparserex-
pectsadeterminerforsuchnames(e.g.,\myosin
heavychain"),andfailswhenoneisnotpresent.
These issues are mostly straightforward to ad-
dress in the grammar, but di–cult to identify
automatically.
6.5 Biomedical entity names
Many of the causes for parser failure discussed
above are related to the presence of biomed-
ical entity names. While the causes for fail-
ures relating to names can be addressed in the
grammar,theexistenceofbiomedicalnameden-
tity (NE) recognition systems (for a recent sur-
vey, see, e.g., Bunescu et al. (2004)) suggests
an alternative solution: NEs could be identifled
in preprocessing, and treated as single (proper
noun) tokens during the parse. During failure
analysis, 59 cases (28% of all cases) were noted
wherethisprocedurewouldhaveeliminatedthe
error, assuming that no errors are made in NE
430 distinct problematic word deflnitions were iden-
tifled, including \breakdown", \composed", \factor",
\half", \independent", \localized", \parallel", \pro-
moter", \segment", \upstream" and \via".
recognition. However, the performance of cur-
rent NE recognition systems is not perfect, and
it is not clear what the efiect of adopting such
a method would be on parser performance.
7 Dictionary extension
Szolovits(2003)describesanautomaticmethod
for mapping lexical information from one lexi-
con to another, and applies this method to aug-
menttheLGdictionarywithtermsfromtheex-
tensiveUMLSSpecialistlexicon. Theextension
introduces more than 125,000 new words into
the LG dictionary, more than tripling its size.
Weevaluatedtheefiectofthisdictionaryexten-
siononLGparserperformanceusingthecriteria
describedabove. Thefractionofdistincttokens
in the corpus found in the parser dictionary in-
creased from 52% to 72% with the dictionary
extension, representing a signiflcant reduction
in uncertainty. This decrease was coupled with
a 32% reduction in total parsing time.
Because the LG parser is unable to produce
any linkage for sentences where it cannot iden-
tifyaverb(evenincorrectly),extendingthedic-
tionary signiflcantly reduced the ability of LG
toextractdependenciesintitles,wherethefrac-
tion of recovered dependencies fell from the al-
ready low value of 67% to 55%.
Forthesentencesexcludingtitles,thebeneflts
ofthedictionaryextensionweremostsigniflcant
for sentences that were in the panic category
when using the unextended LG dictionary; 12
of these 28 sentences could be parsed without
panic with the dictionary extension. In the flrst
linkage of these sentences, the fraction of re-
covered dependencies increased by 8%, and the
fraction of recovered interaction subgraphs in-
creased from zero to 15% with the dictionary
extension.
The overall efiect of the dictionary extension
was positive but modest, with no more than
2.5% improvement for either the flrst or best
linkages for any criterion, despite the threefold
increase in dictionary size. This result agrees
with the failure analysis: most problems can-
notberemovedbyextendingthedictionaryand
must instead be addressed by modiflcations of
the grammar or parser.
8 Conclusion
We have presented an analysis of Link Gram-
mar performance using a custom dependency
corpus targeted at protein-protein interactions.
We introduced the concept of the interaction
20
subgraph and reported parser performance for
three criteria: recovery of dependencies, in-
teraction subgraphs and fully correct linkages.
WhileLGwasabletorecover73%ofdependen-
ciesintheflrstlinkage,only7%ofsentenceshad
a fully correct flrst linkage. However, fully cor-
rectlinkagesarenotrequiredforinformationex-
traction, and we found that 25% of interaction
subgraphs were recovered in the flrst linkage.
Resourceexhaustionwasfoundtobeasignif-
icant cause of poor performance. Furthermore,
an evaluation of performance in the case when
optimal heuristics for ordering linkages are ap-
pliedindicatedthatthefractionofrecoveredin-
teractionsubgraphscouldbemorethandoubled
(to 57%) by optimal heuristics.
To further analyze the cases where the parser
cannot produce a correct linkage, we carefully
examined the sentences and were able to iden-
tify flve problem types. For each identifled
case, we discussed potential modiflcations for
addressing the problems. We also considered
the possibility of using a named entity recogni-
tion system to improve parser performance and
foundthat28%ofLGfailureswouldbeavoided
by a  awless named entity recognition system.
We evaluated the efiect of the dictionary ex-
tensionproposedbySzolovits(2003),andfound
that while it signiflcantly reduced ambiguity
andimprovedperformanceforthemostambigu-
ous sentences, overall improvement was only
2.5%. This indicates that extending the dic-
tionary is not su–cient to address the perfor-
mance problems and that modiflcations to the
grammar and parser are necessary.
The quantitative analysis of LG performance
conflrmsthat,initscurrentstate,LGisnotwell
suitedtotheIEtaskdiscussed. However, inthe
failure analysis we have identifled a number of
speciflc issues and problematic areas for LG in
parsing biomedical publications, and suggested
improvements for adapting the parser to this
domain. The examination and implementation
of these improvements is a natural follow-up of
this study. Our initial experiments suggest that
it is indeed possible to implement general so-
lutions to many of the discussed problems, and
suchmodiflcationswouldbeexpectedtoleadto
improved applicability of LG to the biomedical
domain.
9 Acknowledgments
This work has been supported by Tekes, the
Finnish National Technology Agency.

References
Razvan Bunescu, Ruifang Ge, Rohit J. Kate,
Edward M. Marcotte, Raymond J. Mooney,
Arun Kumar Ramani, and Yuk Wah Wong. 2004
(to appear). Comparative experiments on learn-
ing information extractors for proteins and their
interactions. Artiflcial Intelligence in Medicine.
Special Issue on Summarization and Information
Extraction from Medical Documents.
Michael Collins, Jan Hajic, Lance Ramshaw, and
Christoph Tillmann. 1999. A statistical parser
for Czech. In 37th Annual Meeting of the Associ-
ation for Computational Linguistics, pages 505{
512. Association for Computational Linguistics,
Somerset, New Jersey.
Mark Craven and Johan Kumlien. 1999. Construct-
ing biological knowledge bases by extracting in-
formation from text sources. In T. Lengauer,
R. Schneider, P. Bork, D. Brutlag, J. Glasgow,
H HW Mewes, and Zimmer R., editors, Proceed-
ings of the 7th International Conference on Intel-
ligent Systems in Molecular Biology, pages 77{86.
AAAI Press, Menlo Park, CA.
Nikolai Daraselia, Anton Yuryev, Sergei Egorov,
Svetalana Novichkova, Alexander Nikitin, and
Ilya Mazo. 2004. Extracting human protein in-
teractions from MEDLINE using a full-sentence
parser. Bioinformatics, 20(5):604{611.
Jing Ding, Daniel Berleant, Jun Xu, and Andy W.
Fulmer. 2003. Extracting biochemical interac-
tions from medline using a link grammar parser.
In B. Werner, editor, Proceedings of the 15th
IEEE International Conference on Tools with Ar-
tiflcial Intelligence, pages 467{471. IEEE Com-
puter Society, Los Alamitos, CA.
Filip Ginter, Tapio Pahikkala, Sampo Pyysalo,
Jorma Boberg, Jouni J˜arvinen, and Tapio
Salakoski. 2004. Extracting protein-protein inter-
action sentences by applying rough set data anal-
ysis. In S. Tsumoto, R. Slowinski, J. Komorowski,
and J.W. Grzymala-Busse, editors, Lecture Notes
in Computer Science 3066. Springer, Heidelberg.
Daniel D. Sleator and Davy Temperley. 1991. Pars-
ing english with a link grammar. Technical Re-
port CMU-CS-91-196, Department of Computer
Science, Carnegie Mellon University, Pittsburgh,
PA.
Peter Szolovits. 2003. Adding a medical lexicon to
an english parser. In Mark Musen, editor, Pro-
ceedings of the 2003 AMIA Annual Symposium,
pages 639{643. American Medical Informatics As-
sociation, Bethesda, MD.
Joshua M. Temkin and Mark R. Gilder. 2003. Ex-
traction of protein interaction information from
unstructured text using a context-free grammar.
Bioinformatics, 19(16):2046{2053.
