Combining labelled and unlabelled data: a case study on Fisher
kernels and transductive inference for biological entity recognition
Cyril Goutte, HerveDejean, Eric Gaussier,
Nicola Cancedda and Jean-Michel Renders
Xerox Research Center Europe
6, chemin de Maupertuis
38240 Meylan, France
Abstract
We address the problem of using partially la-
belled data, eg large collections were only little
data is annotated, for extracting biological en-
tities. Our approach relies on a combination of
probabilistic models, whichwe use to model the
generation ofentities and their context, and ker-
nel machines, which implementpowerful cate-
gorisers based on a similarity measure and some
labelled data. This combination takes the form
of the so-called Fisher kernels which implement
asimilarity based on an underlying probabilistic
model. Suchkernels are compared with trans-
ductive inference, an alternative approachto
combining labelled and unlabelled data, again
coupled with Support Vector Machines. Exper-
iments are performed on a database of abstracts
extracted from Medline.
1 Introduction
The availability of electronic databases of ra-
pidly increasing sizes has encouraged the de-
velopmentofmethodsthatcantapinto these
databases to automatically generate knowledge,
for example by retrieving relevant information
or extracting entities and their relationships.
Machine learning seems especially relevantin
this context, because it helps performing these
tasks with a minimum of user interaction.
Anumber of problems likeentity extraction
or ltering can be mapped to supervised tech-
niques like categorisation. In addition, modern
supervised classication methods like Support
Vector Machines haveproven to be ecientand
versatile. They do, however, rely on the avail-
ability of labelled data, where labels indicate
eg whether a document is relevant or whether
a candidate expression is an interesting entity.
This causes two important problems that mo-
tivate our work: 1) annotating data is often a
dicult and costly task involving a lot of hu-
man work
1
, such that large collections of la-
belled data are dicult to obtain, and 2) inter-
annotator agreement tends to be lowinegge-
nomics collections (Krauthammer et al., 2000),
thus calling for methods that are able to deal
with noise and incomplete data.
On the other hand, unsupervised techniques
do not require labelled data and can thus be
applied regardless of the annotation problems.
Unsupervised learning, however, tend to be less
data-ecient than its supervised counterpart,
requiring many more examples to discover sig-
nicant features in the data, and is incapable
of solving the same kinds of problems. For ex-
ample, an ecient clustering technique maybe
able to distribute documents in a number of
well-dened clusters. However, it will be unable
to decide which clusters are relevant without a
minimum of supervision.
This motivates our study of techniques that
rely on a combination of supervised and unsu-
pervised learning, in order to leverage the avail-
ability oflargecollections ofunlabelled dataand
use a limited amount of labelled documents.
The focus of this study is on a particular
application to the genomics literature. In ge-
nomics, a vast amountofknowledge still resides
in large collections of scientic papers suchas
Medline, and several approaches have been pro-
posed toextract,(semi-)automatically, informa-
tion from such papers. These approaches range
from purely statistical ones to symbolic ones
relying on linguistic and knowledge processing
tools (Ohta et al., 1997;; Thomas et al., 2000;;
Proux et al., 2000, for example). Furthermore,
due to the nature of the problem at hand, meth-
1
If automatic annotation was available, wewould ba-
sically have solved our Machine Learning problem
ods derived from machine learning are called
for, (Craven and Kumlien, 1999), whether su-
pervised, unsupervised or relying on a combi-
nation of both.
Let us insist on the fact that our work is pri-
marily concerned with combining labelled and
unlabelled data, and entity extraction is used
as an application in this context. As a conse-
quence, it is not our purpose at this pointto
compare our experimental results to those ob-
tained by specic machine learning techniques
applied to entity extraction (Cali, 1999). Al-
though we certainly hope that our work can be
useful for entity extraction, we rather think of
it as a methodological study which can hope-
fully be applied to dierent applications where
unlabelled data may be used to improve the re-
sults of supervised learning algorithms. In addi-
tion, performing a fair comparison of our work
on standard information extraction benchmarks
is not straightforward: either wewould need to
obtain a large amount of unlabelled data that is
comparable tothe benchmark, orwewould need
to \un-label" a portion of the data. In both
cases, comparing to existing results is dicult
as the amount of information used is dierent.
2 Classication for entity extraction
We formulate the following (binary) classica-
tion problem: given an input space X, and from
a dataset of N input-output pairs (x
k
;;y
k
) 2
Xf;1;;+1g,wewant to learn a classier
h : X!f;1;;+1g so as to maximise the proba-
bility P (h(x)=y)over the xed but unknown
joint input-output distribution of (x;;y) pairs.
In this setting, binary classication is essentially
a supervised learning problem.
In order to map this to the biological en-
tity recognition problem, we consider for each
candidate term, the following binary decision
problem: is the candidate a biological entity
2
(y =1)ornot(y = ;1). The input space is a
high dimensional feature space containing lexi-
cal, morpho-syntactic and contextual features.
In order to assess the validityofcombining
labelled and unlabelled data for the particular
task of biological entity extraction, we use the
following tools. First we rely on Suport Vec-
tor Machines together with transductive infer-
2
In our case, biological entities are proteins, genes
and RNA, cf. section 6.
ence (Vapnik, 1998;; Joachims, 1999), a train-
ing technique that takes both labelled and unla-
belled data into account. Secondly,we develop
a Fisher kernel (Jaakkola and Haussler, 1999),
whichderives the similarity from an underlying
(unsupervised) model ofthe data, used as asim-
ilarity measure (akakernel) within SVMs. The
learning process involves the following steps:
 Transductive inference: learn aSVM classi-
er h(x) using the combined (labelled and
unlabelled) dataset, using traditional ker-
nels.
 Fisher kernels:
1. Learn aprobabilistic model of the data
P(xj) using combined unlabelled and
labelled data;;
2. Derive the Fisher kernel K(x;;z) ex-
pressing the similarityinX-space;;
3. Learn a SVM classier h(x) using this
Fisher kernel and inductive inference.
3 Probabilistic models for
co-occurence data
In (Gaussier et al., 2002) we presented a gen-
eral hierarchical probabilistic model whichgen-
eralises several established models likeNave
Bayes (Yang and Liu, 1999),probabilistic latent
semantic analysis (PLSA) (Hofmann, 1999) or
hierarchical mixtures (Toutanova et al., 2001).
In this model, data result from the observation
of co-occuring objects. For example, a docu-
ment collection is expressed as co-occurences
between documents and words;; in entity extrac-
tion, co-occuring objects may be potential en-
tities and their context, for example. For co-
occuring objects i and j, the model is expressed
as follows:
P(i;;j)=
X

P()P(ij)
X

P(j)P(jj)
(1)
where  are latent classes for co-occurrences
(i;;j)and arelatentnodes in ahierarchy gener-
ating objects j. In the case where no hierarchy
is needed (ie P(j)=( = )), the model
reduces to PLSA:
P(i;;j)=
X

P()P(ij)P(jj) (2)
where  are now latent concepts overboth i and
j.Parameters of the model (class probabilities
P() and class-conditional P(ij)andP(jj))
are learned using a deterministic annealing ver-
sion of the expectation-maximisation (EM) al-
gorithm (Hofmann, 1999;; Gaussier et al., 2002).
4 Fisher kernels
Probabilistic generative models like PLSA and
hierarchical extensions (Gaussier et al., 2002)
provide a natural way to model the generation
of the data, and allow the use of well-founded
statistical tools to learn and use the model.
In addition, they may be used to derivea
model-based measure of similaritybetween ex-
amples, using the so-called Fisher kernels pro-
posed byJaakkola and Haussler (1999). The
idea behind this kernel is that using the struc-
ture implied by the generative model will give
a more relevant similarity estimate, and allow
kernel methods like the support vectormachines
or nearest neighbours to leverage the probabilis-
tic model and yield improved performance (Hof-
mann, 2000).
The Fisher kernel is obtained using the log-
likelihood of the model and the Fisher informa-
tion matrix. Let us consider our collection of
documents fx
k
g
k=1:::N
, and denote by `(x)=
logP(xj) the log-likelihood of the model for
data x. The expression of the Fisher kernel
(Jaakkola and Haussler, 1999) is then:
K(x
1
;;x
2
)=r`(x
1
)
>
I
F
;1
r`(x
2
) (3)
The Fisher information matrix I
F
can be seen
as a waytokeep the kernel expression inde-
pendent of parameterisation and is dened as
I
F
= E

r`(x)r`(x)
>

, where the gradient
is w.r.t.  and the expectation is taken over
P(xj). With a suitable parameterization, the
information matrixIis usually approximated by
the identity matrix (Hofmann, 2000), leading
to the simpler kernel expression: K(x
1
;;x
2
)=
r`(x
1
)
>
r`(x
2
).
Depending on the model, the various log-
likelihoods and their derivatives will yield dif-
ferent Fisher kernel expressions. For PLSA (2),
the parameters are  =[P();;P(ij);;P(jj)].
From the derivatives of the likelihood `(x)=
P
(i;;j)2x
logP(i;;j), we derive the following sim-
ilarity (Hofmann, 2000):
K(x
1
;;x
2
)=
X

P(jd
i
)P(jd
j
)
P()
(4)
+
X
w
b
P
wd
i
b
P
wd
j
X

P(jd
i
;;w)P(jd
j
;;w)
P(wj)
with
b
P
wd
i
,
b
P
wd
j
theempirical worddistributions
in documents d
i
, d
j
.
5 Transductive inference
In standard, inductive SVM inference, the an-
notated data is used to infer a model, whichis
then applied to unannotated test data. The in-
ference consists in a trade-o between the size
of the margin (linked to generalisation abilities)
and the number of training errors. Transductive
inference (Gammerman et al., 1998;; Joachims,
1999) aims at maximising the margin between
positives and negatives, while minimising not
only the actual number of incorrect predictions
on labelled examples, but also the expected
number of incorrect predictions on the set of
unannotated examples.
This is done by including the unknown la-
bels as extra variables in the original optimisa-
tion problem. In the linearly separable case, the
new optimisation problem amounts now to nd
a labelling of the unannotated examples and a
hyperplane which separates all examples (anno-
tatedand unannotated) with maximum margin.
In the non-separable case, slackvariables are
also associated to unannotated examples and
the optimisation problem is now to nd a la-
belling and a hyperplane which optimally solves
the trade-o between maximising the margin
and minimising the number of misclassied ex-
amples (annotated and unannotated).
With the introduction of unknown labels as
supplementary optimisation variables, the con-
straints of the quadratic optimisation problem
are now nonlinear, whichmakes solving more
dicult. However, approximated iterative algo-
rithms exist which can eciently train Trans-
ductive SVMs. They are based on the principle
ofgradually improving the solution byswitching
the labels of unnannotated examples whichare
misclassied at the current iteration, starting
from an initial labelling given by the standard
(inductive) SVM.
WUp Is the word capitalized?
WAllUp Is the word alls capitals?
WNum Does the word contain digits?
Table 1: Spelling features
6 Experiments
Forourexperiments, weused 184abstractsfrom
the Medline site. In these articles, genes, pro-
teins and RNAs were manually annotated bya
biologist as part of the BioMIRE project. These
articles contain 1405occurrences of gene names,
792 of protein names and 81 of RNA names. All
these entities are considered relevant biological
entities. We focus here on the task of identify-
ing names corresponding to suchentities in run-
ning texts, without dierentiating genes from
proteins or RNAs. Once candidates for bio-
logical entity names have been identied, this
task amounts to a binary categorisation, rele-
vant candidates corresponding to biological en-
tity names. We divided these abstracts in a
training and development set (122 abstracts),
and a test set (62 abstracts). We then retained
dierent portions of the training labels, to be
used as labelled data, whereas the rest of the
data is considered unlabelled.
6.1 Denition of features
First of all, the abstracts are tokenised, tagged
and lemmatized. Candidates for biological en-
tity names are then selected on the basis of the
following heuristics: a token is consideredacan-
didate if it appears in one of the biological lexicons
we have at our diposal, or if it does not belong to
our general English lexicon. This simple heuris-
tics allows us to retain 93% (1521 out of 1642)
of biological names in the training set (90% in
the test set), while considering only 21% of all
possible candidates (5845 out of 27350 tokens).
It thus provides a good pre-lter which signif-
icantly improves the performance, in terms of
speed, of our system. The biological lexicons
we use were provided by the BioMIRE project,
and were derived from the resources available
at: http://iubio.bio.indiana.edu/.
For each candidate, three types of features
were considered. We rst retained the part-of-
speech and some spelling information (table 1).
These features were chosen based on the inspec-
tion of gene and protein names in our lexicons.
LexPROTEIN Protein lexicon
LexGENE Gene lexicon
LexSPECIES Biological species lexicon
LEXENGLISH General English lexicon
Table 2: Features provided by lexicons.
The second type of features relates to the pres-
ence of the candidate in our lexical resources
3
(table 2). Lastly, the third type of features de-
scribes contextual information. The context we
consider contains the four preceding and the
four following words. However, we did not take
into account the position of the words in the
context, but only their presence in the rightor
left context, and in addition we replaced, when-
ever possible, eachword by a feature indicating
(a) whether the word was part of the gene lex-
icon, (b) if not whether it was part of the pro-
tein lexicon, (c) if not whether it was part of
the species lexicon, (d) and if not, whenever the
candidate was neither a noun, an adjective nor
averb, we replaced it by its part-of-speech.
For example, the word hairless is associated
with the features given in Table 3, when en-
countered in the following sentence: Inhibition
of the DNA-binding activity of Drosophila sup-
pressor of hairless and of its human homolog,
KBF2/RBP-J kappa, by direct protein{protein
interaction with Drosophila hairless. The word
hairless appears in the gene lexicon and is
wrongly recognized as an adjectiveby our tag-
ger.
4
The word human, the fourth word of
the rightcontext of hairless, belongs to the
species lexicon, ans is thus replaced by the fea-
ture RC SPECIES.Neither Drosophila nor sup-
pressor belong to the specialized lexicons we
use, and, since they are both tagged as nouns,
they are left unchanged. Prepositions and con-
junctions are replaced by their part-of-speech,
and prexes LC and RC indicate whether they
were found in left or rightcontext. Note that
since twoprepositions appear in the left context
of hairless, the value of the LC PREP feature
is 2.
Altogether, this amounts to a total of 3690
possible features in the input space X.
3
Using these lexicons alone, the same task with the
same test data, yields: precision = 22%, recall = 76%.
4
Note that no adaptaion work has been conducted on
our tagger, which explains this error.
Feature Value
LexGENE 1
ADJ 1
LC drosophila 1
LC suppressor 1
LC PREP 2
RC CONJ 1
RC SPECIES 1
RC PRON 1
RC PREP 1
Table 3: Featuresof hairless in \...of Drosophila
suppressor of hairless and of its human...".
6.2 Results
In our experiments, wehave used the following
methods:
 SVM trained with inductive inference, and
using a linear kernel, a polynomial kernel of
degree d = 2 and the so-called \radial ba-
sis function" kernel (Scholkopf and Smola,
2002).
 SVM trained with transductive inference,
and using a linear kernel or a polynomial
kernel of degree d =2.
 SVM trained with inductive inference us-
ing Fisher kernels estimated fromthe whole
training data (without using labels), with
dierentnumber of classes c in the PLSA
model (4).
The proportion of labelled data is indicated
in the tables of results. For SVM with induc-
tive inference, only the labelled portion is used.
For transductive SVM (TSVM), the remaining,
unlabelled portion is used (without the labels).
For the Fisher kernels (FK), an unsupervised
model is estimated on the full dataset using
PLSA, and a SVM is trained with inductive
inference on the labelled data only, using the
Fisher kernel as similarity measure.
6.3 Transductive inference
Table 4 gives interesting insightinto the ef-
fect of transductive inference. As expected, in
the limit where little unannotated data is used
(100% in the table), there is little to gain from
using transductive inference. Accordingly,per-
formance is roughly equivalent
5
for SVM and
% annotated: 1.5% 6% 24% 100%
SVM (lin) 41.22 45.34 49.67 62.97
SVM (d=2) 40.97 46.78 52.12 62.69
SVM (rbf) 42.51 49.53 51.11 63.96
TSVM (lin) 38.63 51.64 61.84 62.91
TSVM (d=2) 43.88 52.38 55.36 62.72
Table 4: F
1
scores(in %) using dierent propor-
tionsofannotateddataforthefollowing models:
SVM with inductive inference (SVM) and lin-
ear (lin) kernel, second degree polynomial ker-
nel (d=2), and RBF kernel (rbf);; SVM with
transductive inference (TSVM) and linear (lin)
kernel or second degree polynomial (d=2) ker-
nel.
TSVM, with a slightadvantage for RBF kernel
trained with inductive inference. Interestingly,
in the other limit, ie when very little annotated
data is used, transductive inference does not
seem to yield a marked improvementover in-
ductive learning. This nding seems somehow
at odds with the results reported by Joachims
(1999) on a dierent task (text categorisation).
Weinterpret this result as a side-eect of the
search strategy, where one tries to optimise
both the size of the margin and the labelling
of the unannotated examples. In practice, an
exact optimisation over this labelling is imprac-
tical, and when a large amount of unlabelled
data is used, there is a risk that the approxi-
mate, sub-optimal search strategy described by
Joachims (1999)mayfail toyield a solution that
is markedly better that the result of inductive
inference.
For the twointermediate situation, however,
transductive inference seems to provide a size-
able performanceimprovement. Using only 24%
of annotated data, transductive learning is able
totrain alinear kernel SVM thatyields approxi-
mately the same performance as inductive infer-
ence on the full annotated dataset. This means
that we get comparable performance using only
what corresponds to about 30 abstracts, com-
pared to the 122 of the full training set.
6.4 Fisher kernels
The situation is somewhat dierent for SVM
trained with inductive inference, but using
5
Performance is not strictly equivalent because SVM
and TSVM use the data dierently when optimising the
trade-o parameter C over a validation set.
% annotated: 1.5% 6% 24% 100%
SVM (lin) 41.22 45.34 49.67 62.97
SVM (d=2) 40.97 46.78 52.12 62.69
lin+FK8 46.08 42.83 54.59 63.92
lin+FK16 44.43 40.92 55.70 63.76
lin+combi 46.38 38.10 52.74 63.08
Table 5: F
1
scores(in %) using dierent propor-
tions of annotated data for the following mod-
els: standard SVM with linear (lin) and second
degree polynomial kernel (d=2);; Combination
of linear kernel and Fisher kernel obtained from
a PLSA with 4 classes (lin+FK4) or 8 classes
(lin+FK8), and combination of linear and all
Fisher kernels obtained from PLSA using 4, 8,
12 and 16 classes (lin+combi).
Fisher kernels obtained from a model of the
entire (non-annotated) dataset. As the use
of Fisher kernels alone was unable to consis-
tently achieve acceptable results, the similarity
we used is a combination of the standard lin-
ear kernel and the Fisher kernel (a similar solu-
tion was advocate by Hofmann (2000)). Table 5
summarises the results obtained using several
types of Fisher kernels, depending on howmany
classes were used in PLSA. FK8 (resp. FK16)
indicates the model using 8 (resp. 16) classes,
while combi is a combination of the Fisher ker-
nels obtained using 4, 8, 12 and 16 classes.
The eect of Fisher kernels is not as clear-cut
as that of transductive inference. For fully an-
notated data, we obtain results that are similar
to the standard kernels, although often better
than the linear kernel. Results obtained using
1.5%and 6%annotateddataseem somewhatin-
consistent, whith a large improvement for 1.5%,
but a marked degradation for 6%, suggesting
that in that case, adding labels actually hurts
performance. We conjecture that this maybe
an artifact of the specic annotated set we se-
lected. For 24% annotated data, the Fisher ker-
nel provides results that are inbetween induc-
tive and transductive inference using standard
kernels.
7 Discussion
The results of our experiments are encouraging
in that they suggest that both transductive in-
ference and the use of Fisher kernels are poten-
tially eectiveway of taking unannotated data
into account to improve performance.
These experimental results suggestthe follow-
ing remark. Note that Fisher kernels can be
implemented by a simple scalar product (lin-
ear kernel) between Fisher scores r`(x) (equa-
tion 3). The question arises naturally as to
whether using non-linear kernels may improve
results. One one hand, Fisher kernels are
derived from information-geometric arguments
(Jaakkola and Haussler, 1999) which require
that the kernel reduces to an inner-product of
Fisher scores. On the other hand, polynomial
and RBF kernels often display better perfor-
mance than a simple dot-product. In order to
test this, wehave performed experiments using
the same features as in section 6.4, but with a
second degree polynomial kernel. Overall, re-
sults are consistently worse than before, which
suggest that the expression of the Fisher kernel
as the inner product of Fisher scores is theoret-
ically well-founded and empirically justied.
Among possible future work, let us mention
the following technical points:
1. Optimising the weight of the contributions
of the linear kernel and Fisher kernel, eg
as K(x;;y)=hx;;yi +(1; )FK(x;;y),
 2 [0;;1].
2. Understanding why the Fisher kernel alone
(ie without interpolation with the linear
kernel) is unable to provide a performance
boost, despite attractive theoretical prop-
erties.
In addition, the performance improvement
obtained by both transductive inference and
Fisher kernels suggest to use both in cunjunc-
tion. To our knowledge, the question of whether
this would allow to \bootstrap" the unlabelled
data by using them twice (once for estimating
the kernel, once in transductive learning) is still
an open research question.
Finally, regarding the application that we
have targeted, namely entity recognition, the
use of additional unlabelled data may help us
to overcome the current performance limit on
our database. None of the additional experi-
ments conducted internally using probabilisitc
models and symbolic, rule-based methods have
been able to yield F
1
scores higher than 63-64%
on the same data. In order to improve on this,
wehave collected several hundred additional
abstracts by querying the MedLine database.
After pre-processing, this yields more than a
hundred thousand (unlabelled) candidates that
wemay use with transductive inference and/or
Fisher kernels.
8 Conclusion
In this paper, we presented a comparison be-
tween two state-of-the-art methods to combine
labelled and unlabelled data: Fisher kernels and
transductive inference. Our experimental re-
sults suggest that both method are able to yield
a sizeable improvement in performance. For ex-
ample transductive learning yields performance
similar to inductive learning with only about a
quarter of the data. These results are very en-
couraging for tasks where annotation is costly
while unannotated data is easy to obtain, like
our task of biological entity recognition. In ad-
dition, it provides a way to benet from the
availability of large electronic databases in or-
der to automatically extract knowledge.
9 Acknowledgement
We thank Anne Schiller,

Agnes Sandor and Vi-
olaine Pillet for help with the data and re-
lated experimental results. This researchwas
supported by the European Commission un-
der the KerMIT project no. IST-2001-25431
and the French Ministry of Research under the
BioMIRE project, grant 00S0356.

References
M. E. Cali, editor. 1999. Proc. AAAI Work-
shop on Machine Learning for Information
Extraction. AAAI Press.
M. Craven and J. Kumlien. 1999. Construct-
ing biological knowledge bases by extract-
ing information from text sources. In Proc.
ISMB'99.
A.Gammerman, V.Vovk, and V. Vapnik. 1998.
Learning by transduction. In Cooper and
Morla, eds, Proc. Uncertainty in Articial In-
telligence,pages145{155.MorganKaufmann.
Eric Gaussier, Cyril Goutte, Kris Popat, and
Francine Chen. 2002. A hierarchical model
for clustering and categorising documents.
In Crestani, Girolami, and van Rijsbergen,
eds, Advances in Information Retrieval|
Proc. ECIR'02, pages 229{247. Springer.
Thomas Hofmann. 1999. Probabilistic latent
semantic analysis. In Proc. Uncertainty in
Articial Intelligence,pages289{296.Morgan
Kaufmann.
Thomas Hofmann. 2000. Learning the similar-
ity of documents: An information-geometric
approach to document retrieval and catego-
rization. In NIPS*12, page 914. MIT Press.
Tommi S. Jaakkola and David Haussler. 1999.
Exploiting generative models in discrimina-
tive classiers. In NIPS*11, pages 487{493.
MIT Press.
Thorsten Joachims. 1999. Transductive in-
ference for text classication using support
vector machine. In Bratko and Dzeroski,
eds, Proc. ICML'99, pages 200{209. Morgan
Kaufmann.
M. Krauthammer, A. Rzhetsky,P. Morozov,
and C. Friedman. 2000. Using blast for iden-
tifying gene and protein names in journal ar-
ticles. Gene.
Y. Ohta, Y. Yamamoto, T. Okazaki,
I. Uchiyama, andT.Takagi. 1997. Au-
tomatic constructing of knowledge base from
biological papers. In Proc. ISMB'97.
D. Proux, F. Reichemann, and L. Julliard.
2000. A pragmatic information extraction
strategy for gathering data on genetic inter-
actions. In Proc. ISMB'00.
Bernhard Scholkopf and Alexander J. Smola.
2002. Learning with Kernels. MIT Press.
J. Thomas, D. Milward, C. Ouzounis, S. Pul-
man, and M. Caroll. 2000. Automatic ex-
traction of protein interactions from scientic
abstracts. In Proc. PSB 2000.
Kristina Toutanova,Francine Chen, Kris Popat,
and Thomas Hofmann. 2001. Text classica-
tion in a hierarchical mixture model for small
training sets. In Proc. ACM Conf. Informa-
tion and Knowledge Management.
Vladimir N. Vapnik. 1998. Statistical Learning
Theory.Wiley.
Yiming Yang and Xin Liu. 1999. A re-
examination of text categorization methods.
In Proc. 22nd ACM SIGIR, pages 42{49.
