Term Distillation in Patent Retrieval
Hideo Itoh Hiroko Mano Yasushi Ogawa
Software R&D Group, RICOH Co., Ltd.
1-1-17 Koishikawa, Bunkyo-ku, Tokyo 112-0002, JAPAN
fhideo,mano,yogawag@src.ricoh.co.jp
Abstract
In cross-database retrieval, the domain
of queries diers from that of the re-
trieval target in the distribution of
term occurrences. This causes incor-
rect term weighting inthe retrieval sys-
tem which assigns to each term a re-
trieval weight based on the distribu-
tion of term occurrences. To resolve
the problem, we propose \term distil-
lation", a framework for query term
selection in cross-database retrieval.
The experiments using the NTCIR-3
patent retrieval test collection demon-
strate that term distillation is eective
for cross-database retrieval.
1 Introduction
For the mandatory runs of NTCIR-3 patentre-
trieval task
1
, participants are required to con-
struct a search query from a news article and
retrieve patents whichmightberelevant to the
query. This is a kind of cross-database retrieval
in that the domain of queries (news article) dif-
fers from that of the retrieval target (patent)
(Iwayama et al., 2001).
Because in the distribution of term occur-
rences the query domain diers from the tar-
get domain, some query terms are given very
large weights (importance) by the retrieval sys-
tem even if the terms are not appropriate for
1
http://research.nii.ac.jp/ntcir/ntcir-ws3/patent/
retrieval. For example, the query term \presi-
dent" in a news article might not be eective for
patent retrieval. However, the retrieval system
gives the term a large weight, because the docu-
ment frequency of the term in the patent genre
is very low. We think these problematic terms
are so many that the terms cannot be eliminated
using a stop word dictionary.
In order to resolve the problem mentioned
above, we propose \term distillation" which is
a framework for query term selection in cross-
database retrieval. The experiments using the
NTCIR patent retrieval test collection demon-
strate that term distillation is eective for cross-
database retrieval.
2 System description
Before describing our approach, we give a short
description on our retrieval system. For the
NTCIR-3 experiments, we revised query pro-
cessing although the framework is the same as
that of NTCIR-2 (Ogawa and Mano, 2001). The
basic features of the system are as follows :
 Eective document ranking with pseudo-
relevance feedback based on Okapi's ap-
proach (Robertson and Walker, 1997) with
some improvements.
 Scalable and ecient indexing and search
based on the inverted le system (Ogawa
and Matsuda, 1999)
 Originally developed Japanese morpholog-
ical analyzer and normalizer for document
indexing and query processing.
The inverted le was constructed for the re-
trieval target collection whichcontains full texts
of two years' Japanese patents. We adopted
character n-gram indexing because it might be
dicult for Japanese morphological analyzer to
correctly recognize technical terms which are
crucial for patent retrieval.
Inwhat follows, we describethe fullautomatic
process of document retrieval in the NTCIR-3
patent retrieval task.
1. Query term extraction
Input query string is transformed into a se-
quence of words using the Japanese mor-
phological analyzer. Query terms are ex-
tracted by matching patterns against the
sequence. We can easily specify term ex-
traction using some patterns which are de-
scribed in regular expression on each word
form or tag assigned by the analyzer. Stop
words are eliminated using a stop word dic-
tionary. For initial retrieval, both \single
terms" and \phrasal terms" are used. A
phrasal term consists of two adjacentwords
in the query string.
2. Initial retrieval
Each query term is submitted one byoneto
the ranking search module, which assigns a
weight to the term andscores documents in-
cludingit. Retrieved documents are merged
and sorted on the score in the descending
order.
3. Seed document selection
As a result of the initial retrieval, top
ranked documents are assumed to be
pseudo-relevant to the query and selected
as a \seed" of query expansion. The maxi-
mum number of seed documents is ten.
4. Query expansion
Candidates of expansion terms are ex-
tracted from the seed documents by pattern
matching as in the query term extraction
mentioned above.
Phrasal terms are not used for query ex-
pansion because phrasal terms may be less
eectivetoimprove recall and risky in case
of pseudo-relevance feedback.
The weight of initial query term is
re-calculated with the Robertson/Spark-
Jones formula (Robertson and Sparck-
Jones, 1976) if the term is found in the can-
didate pool.
The candidates are ranked on the Robert-
son's SelectionValue (Robertson, 1990) and
top-ranked terms are selected as expansion
terms.
5. Final retrieval
Each query and expansion term is submit-
ted one by one to the rankingsearch module
as in the initial retrieval.
3 Term distillation
In cross-database retrieval, the domain of
queries (news article) diers from that of the re-
trievaltarget (patent) inthe distributionof term
occurrences. This causes incorrect term weight-
ing in the retrieval system which assigns to each
term a retrieval weight based on the distribution
of term occurrences. Moreover, the terms which
mightbegiven an incorrect weight are too many
to be collected in a stop word dictionary.
For these reasons, we nd it necessary to
have a query term selection stage specially de-
signed for cross-database retrieval. We dene
\term distillation" as a general framework for
the query term selection.
More specically, the termdistillationconsists
of the following steps :
1. Extraction of query term candidates
Candidates of query terms are extracted
from the query string (news articles) and
pooled.
2. Assignment of TDV (Term Distillation
Value)
Each candidate in the pool is givenaTDV
which represents \goodness" of the term to
retrieve documents in the target domain.
3. Selection of query terms
The candidates are ranked on the TDVand
top-ranked n terms are selected as query
terms, where n is an unknown constant
and treated as a tuning parameter for full-
automatic retrieval.
The term distillation seems appropriate to
avoid falling foul of the \curse of dimensional-
ity" (Robertson, 1990) incase that a given query
is very lengthy.
In what follows in this section, we explain
a generic model to dene the TDV. Thereafter
some instances of the model whichembody the
term distillation are introduced.
3.1 Generic Model
In order to dene the TDV, we give a generic
model with the following formula.
TDV = QV  TV
where QV and TV represent the importance of
the term in the query and the target domain
respectively. QV seems to be commonly used
for query term extraction in ordinary retrieval
systems, however, TV is newly introduced for
cross-database retrieval. A combination of QV
and TV embodies a term distillation method.
We instance them separately as bellow.
3.2 Instances of TV
We give some instances of TV using two prob-
abilities p and q, where p is a probability that
the term occurs in the target domain and q is
a probability that the term occurs in the query
domain. Because the estimation method of p
and q is independent on the instances of TV,it
is explained later. Weshoweach instance of TV
with the id-tag as follows:
TV0 : Zero model
TV = constant =1
TV1 : Swet model (Robertson, 1990)
TV = p ; q
TV2 : NaiveBayes model
TV =
p
q
TV3 : Bayesian classication model
TV =
p
p+(1;;)q+
where  and  are unknown constants.
TV4 : Binary independence model (Robertson
and Sparck-Jones, 1976)
TV =log
p(1;q)
q(1;p)
TV5 : Target domain model
TV = p
TV6 : Query domain model
TV =1; q
TV7 : Binary model
TV =1 (p>0) or 0 (p =0)
TV8 : Joint probabilitymodel
TV = p  (1 ; q)
TV9 : Decision theoretic model (Robertson
and Sparck-Jones, 1976)
TV =log(p) ; log(q)
3.3 Instances of QV
We showeach instance of QV with the id-tag as
follows:
QV0 : Zero model
QV = constant =1
QV1 : Approximated 2-poisson model
(Robertson and Walker, 1994)
QV =
tf
tf+
where tf isthe within-queryterm frequency
and  is an unknown constant.
QV2 : Term frequency model
QV = tf
QV3 : Term weight model
QV = weight
where weight is the retrieval weight given
by the retrieval system.
QV4 : Combination of QV1 and QV3
QV =
tf
tf+
 weight
QV5 : Combination of QV2 and QV3
QV = tf  weight
4 Experiments on term distillation
Using the NTCIR-3 patent retrieval test collec-
tion, we conducted experiments to evaluate the
eect of term distillation.
For query construction, weusedonlynewsar-
ticle elds in the 31 topics for the formal run.
The number of query terms selected by term
distillation was just eight in each topic. As
described in the section 2, retrieval was full-
automatically executed with pseudo-relevance
feedback.
The evaluation results for some combinations
of QV and TV are summarizedinTable 1, where
the documents judged to be \A" were taken as
relevant ones. The combinations were selected
on the results in our preliminary experiments.
Each of \t", \i", \a" and \w" in the columns
\p" or \q" represents a certain method for esti-
mation of the probability p or q as follows :
t : estimate p by the probability that the term
occurs in titles of patents. More specically
p =
n
t
N
p
, where n
t
is the number of patent
titles includingthe term and N
p
is the num-
ber of patents in the NTCIR-3 collection.
i : estimate q by the probability that the term
occurs in news articles. More specically
q =
n
i
N
i
, where n
i
is the number of articles
including the term and N
i
is the number of
news articles inthe IREX collection ('98-'99
MAINICHI news article).
a : estimate p by the probability that the term
occurs in abstracts of patents. More specif-
ically p =
n
a
N
p
, where n
a
is the number of
patent abstracts in which the term occurs.
w : estimate q by the probability that the term
occurs in the whole patent. More specif-
ically q =
n
w
N
p
, where n
w
is the number
of patents in which the term occurs. We
tried to approximate the dierence in term
statistics between patents and news articles
using the conbination of "a" and "w" in the
term distillation.
In Table 1, the combination of QV2 and TV0
corresponds to query term extraction without
QV TV p q AveP P@10
QV2 TV4 t i 0.1953 0.2645
QV2 TV9 t i 0.1948 0.2677
QV5 TV3 t i 0.1844 0.2355
QV2 TV3 t i 0.1843 0.2645
QV0 TV3 t i 0.1816 0.2452
QV2 TV6 t i 0.1730 0.2258
QV2 TV2 t i 0.1701 0.2194
QV2 TV3 a w 0.1694 0.2355
QV2 TV0 - - 0.1645 0.2226
QV2 TV7 t i 0.1597 0.2065
Table 1: Results using article eld
term distillation. Comparing with the combina-
tion, retrieval performances are improved using
instances of TV except for TV7. This means the
term distillation produces a positive eect.
The best performance in the table is pro-
duced by the combination of QV2 (raw term
frequency) and TV4 (BIM).
While the combination of \a" and \w" for es-
timation of probabilities p and q has the virtue
in that the estimation requires only target docu-
ment collection, the performance is poor in com-
parison with the combination of \t" and \i".
Although the instances of QV can be com-
pared each other by focusing on TV3, it is un-
clear whether QV5 is superiorto QV2. We think
it is necessary to proceed to the evaluation in-
cluding the other combinations of TV and QV .
5 Results in NTCIR-3 patent task
We submitted four mandatory runs. The evalu-
ation results of our submitted runs are summa-
rizedinTable 2, where the documents judged to
be \A" were taken as relevant ones.
These runs were automatically produced us-
ing both article and supplement elds, where
each supplement eld includes a short descrip-
tionon thecontent of the newsarticle. Termdis-
tillation using TV3 (Bayes classication model)
and query expansion by pseudo-relevance feed-
backwere applied to all runs.
The retrieval performances are remarkable
among all submitted runs. However, the eect
QV TV p q AveP P@10
QV2 TV3 t i 0.2794 0.3903
QV0 TV3 t i 0.2701 0.3484
QV2 TV3 a w 0.2688 0.3645
QV5 TV3 t i 0.2637 0.3613
Table 2: Results in the NTCIR-3 patenttask
of term distillation is somewhat unclear, com-
paring with the run with only supplementelds
in Table 3 (the average precision is 0.2712). We
think supplement elds supply enough terms so
that it is dicult to evaluate the performance of
cross-database retrieval in the mandatory runs.
6 Results on ad-hoc patent retrieval
In Table 3, we show evaluation results corre-
sponding to various combinations of topic elds
in use. The documents judged to be \A" were
taken as relevant ones.
elds AveP P@10 Rret
t,d,c 0.3262 0.4323 1197
t,d,c,n 0.3056 0.4258 1182
d 0.3039 0.4032 1133
t,d 0.2801 0.3581 1100
t,d,n 0.2753 0.4000 1140
d,n 0.2750 0.4323 1145
s 0.2712 0.3806 991
t 0.1283 0.1968 893
Table 3: Results on ad-hoc patent retrieval
Inthe table, the elds\t", \d", \c", \n" or\s"
correspond to title, description, concept, nar-
rative or supplement respectively. As a result,
the combination of \t,d,c" produced the best re-
trieval performance for a set of the formal run
topics. Pseudo-relevance feedback had a posi-
tive eect except for the case using a title eld
only.
7 Conclusions
We proposed term distillationfor cross-database
retrieval. Using NTCIR-3 test collection, we
evaluated this technique in patent retrieval and
found a positive eect. We think cross-database
retrieval can be applied to various settings in-
cluding personalized retrieval, similar document
retrieval and so on.
For the future work, we hope to apply term
distillation to cope with vocabulary gap prob-
lems in these new settings. In addition, we think
term distillation can be used to present query
terms to users in reasonable order in interactive
retrieval systems.

References
M. Iwayama, A. Fujii, A. Takano, and N. Kando.
2001. Patent retrieval challenge in NTCIR-3.
IPSJ SIG Notes, 2001-FI-63:49{56.
Y. Ogawa and H. Mano. 2001. RICOH at NTCIR-
2. Proc. of NTCIR Workshop 2 Meeting, pages
121{123.
Y. Ogawa and T. Matsuda. 1999. An ecientdoc-
ument retrieval method using n-gram indexing.
Trans. of IEICE, J82-D-I(1):121{129.
S. E. Robertson and K. Sparck-Jones. 1976. Rele-
vance weighting of search terms. Journal of ASIS,
27:129{146.
S. E Robertson and S. Walker. 1994. Some simple
eective approximations to the 2-poisson model
for probabilistic weighted retrieval. Proc. of 17th
ACM SIGIR Conf., pages 232{241.
S. E. Robertson and S. Walker. 1997. On relevance
weights with little relevance information. Proc. of
20th ACM SIGIR Conf., pages 16{24.
S. E. Robertson. 1990. On term selection for query
expansion. Journal of Documentation, 46(4):359{
364.
