A Patent Document Retrieval System Addressing
Both Semantic and Syntactic Properties
Liang Chen
Computer Science Department
University of Northern British Columbia
Prince George, BC, Canada V2N 4Z9
chenl@unbc.ca
Naoyuki Tokuda Hisahiro Adachi
R & D Center, Sunflare Company
Shinjuku-Hirose Bldg., 4-7 Yotsuya
Sinjuku-ku, Tokyo, Japan 160-0004
CUtokuda n,adachi hCV@sunflare.co.jp
Abstract
Combining the principle of Differential
Latent Semantic Index (DLSI) (Chen et
al., 2001) and the Template Matching
Technique (Tokuda and Chen, 2001), we
propose a new user queries-based patent
document retrieval system by NLP tech-
nology. The DLSI method first narrows
down the search space of a sought-after
patent document by content search and
the template matching technique then
pins down the documents by exploit-
ing the words-based template matching
scheme by syntactic search. Compared
with the synonymous search scheme by
thesaurus dictionaries, the new method
results in an improved overall retrieval
efficiency of patent documents.
1 Introduction
Information (document) retrieval systems resort to
two classes of approaches; the first makes use of the
form-based or words-based approach addressing
the exact syntactic properties of documents, while
the second makes use of the content-based ap-
proach which exploits the semantic connection be-
tween documents and queries. While most of com-
mercial systems adopt the form-based approach ex-
ploiting the simple string matching algorithm or the
weighted matching algorithm, the approach needs a
thesaurus dictionary to resolve the synonym-related
problem. Some research works have now been un-
derway from the content-based approach focusing
the dimension reduction scheme.
The content-based approach is motivated by
semantics-based search schemes. Assuming that
the content of a document is closely related to the
tf-idf of the words used (Zobel and Moffat, 1998),
we first represent documents as term vectors. One
of the immediate difficulties we encounter in deal-
ing with document vector spaces lies in its too
high a dimensionality of the vector spaces which
is particularly true in document analysis largely
due to a large variety of synonyms and polysemic
words used in natural language. In image recog-
nition field (Turk and Pentland, 1991; Chen and
Tokuda, 2003b), a so-called PCA (principal com-
ponent analysis) principle has been used success-
fully in facial recognition problems as a most ef-
fective scheme of dimension reduction. The LSI
(latent semantic indexing) technique (Berry et al.,
1999; Littman et al., 1998) is a counterpart of the
PCA in text document processing.
We have recently extended the LSI to a DLSI
(differential latent semantic indexing) method
(Chen et al., 2001), where in the DLSI scheme, we
improve the robustness of the LSI scheme by in-
troducing and making use of projections of, inte-
rior as well as exterior differential document vec-
tors (see Section 2 for detailed discussions). Our
present study shows how we can make use of the
characteristics in improving the IR performance in
patent document search. In patent retrieval applica-
tion, we are fortunate because all the patent docu-
ments are well structured with very precise, human
generated abstracts attached so that two interior and
exterior documents are automatically provided, fa-
cilitating the application of the DLSI method in de-
veloping a patent document retrieval system.
Despite the improved superiority of the DLSI
technique over the LSI technique (see Section 2 for
detailed discussions), the system still has a problem
of instability when used as an NLP-oriented query-
based commercial product due to content search’s
inherent poor precision and recall rate. A content-
based information retrieval system is still far be-
yond our research ability to be implemented into a
coding system. Some syntactic properties seeking
the “form” or ”word” similarity must be introduced
if the LSI/DLSI based system can be used with ro-
bustness. This is so because we have to resolve
some conflicting factors here. The content based IR
system tries to search the document in accordance
with the similarity of “meaning” of a query, which
captures the abstraction of the exact words used.
For example, we believe that the LSI/DLSI based
system should be able to retrieve a similar set of
documents to a query “Information Processing De-
vices” and “Computing Machinery”, where prob-
ably some of documents obtained might not con-
tain even the phrases “Information Processing De-
vices” or “Computing Machinery”, or even neither
of these words at all. Form based systems, on the
other hand, have to depend on the exact words used;
in other words, unless a “perfect” thesaurus dictio-
nary is used, we may not capture the correct doc-
uments. Unfortunately we know of no such com-
plete thesaurus dictionary, and even if there is such
a dictionary, the matching or collating method will
be still too complex with respect to computing re-
sources.
To solve “form” similarity problems encoun-
tered in a DLSI/LSI approach, we introduce the
template-automaton method which has been orig-
inally developed for the language tutoring system
(Tokuda and Chen, 2001). The template method
sets up a variety of expected patterns of patent doc-
ument abstracts whereby we want to match a query
against a multitude of template paths by pinning
down a path having the highest similarity mea-
sure to the query from among the documents pre-
selected by the DLSI method. All we have to do
here is to maintain the template structure contain-
ing the possible candidates of the abstracts of patent
documents in natural language, and maintain the
template structures in the database. A DP(dynamic
programming) based-template matching method is
very efficient in finding a best matched path to a
query facilitating the final location of the patent
document.
The rest of the paper is organized as follows. The
scheme of the DLSI method is introduced in Sec-
tion 2 while the template structure will be explained
in Section 3. The Flow of the entire search process
and concluding remarks will be given in Sections 4
&5.
2 Differential Latent Semantic Indexing
Method
A term is defined as a word or a phrase that appears
at least in two documents. We exclude the so-called
stop words such as ”a”, ”the” in English which are
used most frequently in any topics, but remain ir-
relevant to our purpose of document search.
Suppose we select and list the terms that appear
in the documents as D8
BD
BND8
BE
BNBMBMBMBND8
D1
. For each patent
document in collection, we preprocess it and assign
it with a document vector as B4CP
BD
BNCP
BE
BNBMBMBMBNCP
D1
B5, where
CP
CX
BP CU
CX
A1 CV
CX
; here CU
CX
denotes the number of times
the term D8
CX
appears in an expression of the docu-
ment, and CV
CX
denotes the global weight over all the
documents; the weight denotes a parameter indicat-
ing the relative importance of the term in represent-
ing the document abstracts. Local weights could
be either raw occurrence counts, boolean, or loga-
rithms of occurrence count. Global weights could
be no weighting (uniform), domain specific, or en-
tropy weighting. The document vector is normal-
ized as B4CQ
BD
BNCQ
BE
BNBMBMBMBNCQ
D1
B5. Since all the patent docu-
ments are provided with a formal abstract, we sup-
pose the abstracts be equivalent to their documents
in content so that the abstract and the document
should both be retrieved as part of the similar doc-
uments to the query supplied. We will show be-
low how we can set up the DLSI technique lead-
ing to an improved robust scheme below. We have
shown how the shortcoming of a global projection-
based LSI scheme can be improved by making a
best use of differences of two vectors in adapting to
the unique characteristics of each document (Chen
et al., 2001).
A Differential Document Vector is defined as
C1
BD
A0 C1
BE
where C1
BD
and C1
BE
are normalized document
vectors satisfying particular types of documents.
An Exterior Differential Document Vector in par-
ticular is defined as the Differential Document Vec-
tor C1 BP C1
BD
A0 C1
BE
,ifC1
BD
and C1
BE
constitute two nor-
malized document vectors of any two different doc-
uments. An Interior Differential Document Vec-
tor is defined by the Differential Document Vector
C1 BP C1
BD
A0 C1
BE
, where C1
BD
and C1
BE
constitute two differ-
ent normalized document vectors of the same doc-
ument. The different document vectors of the same
documents may be taken from parts of documents
including abstracts, or may be produced by differ-
ent schemes of summaries, or from the querries.
The Exterior Differential Term-Document Matrix
is defined as a matrix, each column of which is set
to an Exterior Differential Document Vector. The
Interior Differential Term-Document Matrix is de-
fined as a matrix, each column of which comprises
an interior Differential Document Vector.
2.1 Details of a DLSI Model
Any differential term-document matrix, say, of m-
by-n matrix D of rank D6 AK D5 BP D1CXD2B4D1BND2B5,
can be decomposed into a product of three ma-
trices, namely BW BP CDCBCE
CC
, such that CD and CE
are an D1-by-D5 and D5-by-D2 unitary matrices respec-
tively, where the first D6 columns of CD and CE are
the eigenvectors of BWBW
CC
and BW
CC
BW respectively.
CB BP diagB4Æ
BD
BNÆ
BE
BNA1A1A1BNÆ
D5
B5, where Æ
CX
are nonnega-
tive square roots of eigen values of BWBW
CC
, Æ
C1
BQ BC
for CX AK D6 and Æ
CX
BP BC for CX BQ D6. By convention,
the diagonal elements of S are sorted in decreasing
order of magnitude. To obtain a new reduced ma-
trix CB
CZ
, we simply keep the CZ-by-CZ leftmost-upper
corner matrix B4CZ BO D6B5 of CB , other terms being
deleted; we similarly obtain the two new matrices
CD
CZ
and CE
CZ
by keeping the leftmost CZ columns of
CD and CE respectively. The product of CD
CZ
, CB
CZ
and
CE
CC
CZ
provides a matrix BW
CZ
which is approximately
equal to BW. Each of differential document vec-
tor D5 could find a projection on the CZ dimensional
differential latent semantic fact space spanned by
the k columns of CD
CZ
. The projection can easily
be obtained by CD
CC
CZ
D5. Note that, the mean AMDC of the
exterior-(interior-)differential document vectors are
approximately 0. Thus,
C8
BP
BD
D2
BWBW
CC
, where
C8
is
the covariance of the distribution computed from
the training set. Assuming that the differential doc-
ument vectors formed follow a high-dimensional
Gaussian distribution, the likelihood of any differ-
ential document vector DC will be given by
C8B4DCCYBWB5BP
CTDCD4
CW
A0
BD
BE
CSB4DCB5
CX
B4BEAPB5
D2BPBE
CYA6CY
BDBPBE
BN
where CSB4DCB5 BP DC
CC
A6
A0BD
DC. Since Æ
BE
CX
are eigenvalues
of BWBW
CC
,wehaveCB
BE
BP CD
CC
BWBW
CC
CD, and thus
CSB4DCB5BPD2DC
CC
B4BWBW
CC
B5
A0BD
DC BP D2DD
CC
CB
A0BE
DDBN
where DD BP CD
CC
DC BPB4DD
BD
BNDD
BE
BNA1A1A1BNDD
D2
B5
CC
.
Because CB is a diagonal matrix, CSB4DCB5 BP
D2
C8
D6
CXBPBD
DD
BE
CX
BPÆ
BE
CX
.
It is convenient to estimate the quantity by
CM
CSB4DCB5BPD2B4
CZ
CG
CXBPBD
DD
BE
CX
BPÆ
CX
BE
B7
BD
AQ
D6
CG
CXBPCZB7BD
DD
BE
CX
B5BM
where AQ BP
BD
D6A0CZ
C8
D6
CXBPCZB7BD
Æ
BE
CX
.
Because the columns of CD are orthonormal vec-
tors,
C8
D6
CXBPCZB7BD
DD
BE
CX
could be estimated by CYCYDCCYCY
BE
A0
C8
CZ
CXBPBD
DD
CX
BE
. Thus, the likelihood function C8B4DCCYBWB5
could be estimated by
CM
C8B4DCCYBWB5BP
D2
BDBPBE
CTDCD4
AI
A0
D2
BE
C8
CZ
CXBPBD
DD
BE
CX
Æ
BE
CX
AJ
A1 CTDCD4
AG
A0
D2AF
BE
B4DCB5
BEAQ
AH
B4BEAPB5
D2BPBE
C9
CZ
CXBPBD
Æ
CX
A1 AQ
B4D6A0CZB5BPBE
BN (1)
where DD BP CD
CC
CZ
DC, AF
BE
B4DCB5 BP CYCYDCCYCY
BE
A0
C8
CZ
CXBPBD
DD
BE
CX
, AQ BP
BD
D6A0CZ
C8
D6
CXBPCZB7BD
Æ
BE
CX
, D6 is the rank of matrix BW. In prac-
tical cases, AQ may be approximated by Æ
BE
CZB7BD
BPBE, and
D6 by D2.
2.2 Algorithm
2.2.1 Setting Up Retrieval System
1. Text preprocessing: Identify words and noun
phrases as well as stop words.
2. System term construction: Set up the term list as
well as the global weights.
3. Set up the document vectors of all the collected
documents in normalized form .
4. Construct interior differential term-document
matrix BW
D1A2D2
BD
C1
, such that each of its column is an
interior differential document vector.
5. Construct an exterior differential term-document
matrix BW
D1A2D2
BE
BX
, such that each of its column is an
exterior differential document vector.
6. Decompose BW
C1
and BW
BX
by CBCEBW (singular value
decomposition) algorithm into CDCBCE form. Find
proper values of CZ’s to define the likelihood func-
tions C8B4DCCYBW
C1
B5 and C8B4DCCYBW
BX
B5 as Equition (1).
7. C8B4BW
C1
CYDCB5BP
C8B4DCCYBW
C1
B5C8B4BW
C1
B5
C8B4DCCYBW
C1
B5C8B4BW
C1
B5B7C8B4DCCYBW
BX
B5C8B4BW
BX
B5
BN
where C8B4BW
C1
B5 is set to an average number of re-
calls divided by the number of documents in the
data base and C8B4BW
BX
B5 is set to BD A0 C8B4BW
C1
B5.
2.2.2 Patent Document Search
1. A query is treated as a document; a document
vector is set up by generating the terms as well as
their frequency of occurrence, and thus a normal-
ized document vector is obtained for the query .
Each document in the data base are processed by
the procedures in items 2-5 below.
2. Given a query, construct a differential document
vector DC .
3. Calculate the interior document likelihood func-
tion C8B4DCCYBW
C1
B5, and calculate the exterior document
likelihood function C8B4DCCYBW
BX
B5 for the document.
4. Calculate the Bayesian posteriori probability
function C8B4BW
C1
CYDCB5 .
5. Select those documents whose C8B4BW
C1
CYDCB5 exceeds
a given threshold (say, 0.5), or choose N documents
having the first C6 largest C8B4BW
C1
CYDCB5.
3 Template Structure for Storing Patent
Abstracts
Each patent document is usually provided with an
abstract. The abstract can be used for content-based
information retrieval by using DLSI method as de-
scribed above. As we have mentioned before, the
content-based information retrieval system by LSI
analysis is not robust enough to be directly applica-
ble to a real system. We will use the DLSI method
only to narrow down the search space at a first stage
of filtering in information retrieval. We will resort
to a form based searching strategy to pin down the
patent document.
Now that the content-based DLSI search scheme
has narrowed down the search space in content, the
form based search strategy we now employ need
not to pay attention to the synonymous expressions
of the searching terms or sentences.
This first stage of filtering is now implemented
without going through the tedious process of deal-
ing with the synonymous expressions by synonym
dictionaries which are hard to develop and to use.
Even if we succeeded in treating the synomyms, we
also have to realize that the polynonym of a nat-
ural language will reduce the advantage of using
synonym dictionary further, because two words are
synonymous in one situation but might not be so in
other situations, depending on context.
In view of lengthy sentences used in patent docu-
ments including their abstracts, we want to empha-
size that automaton-based template structure is an
extremely efficient way of expressing lengthy sen-
tences with their synonymous expressions.
We will demonstrate this point by way of exam-
ples below. For a sentence, “There are beautiful
parks in Japan across the nation”, we can use a tem-
plate as of figure 1 where a variety of synonymous
expressions are explicitly represented.
The problem here is, how we could get the tem-
plate for an abstract of patent document? Firstly,
we regard the original abstract of patent itself as a
simplest template. Then, we register queries into
the matched template structures by combining each
pair of matched terms into one node. This is illus-
trated by an example procedure in figures 1-3. The
original template of an abstract is indicated by fig-
ure 1, but when a query of figure 2, namely, “There
are lovely parks across Japan”, is matched to the
template of figure 1, the template could be modi-
fied to a new structure of figure3.
Suppose that the query sentence is, “There are
ugly streets in Japan”. Now although we could lo-
cate a matching pattern similar to that of figure 2,
we will have to rule it out so that we will not come
up with a template which include the above sen-
tence as a path, or part of a path . This mecha-
nism should be established from users’ response.
We will explain it in Section 4.1.
across
all  over
the
country
nation
in  Japan
nationwide
nationwide
in  Japan
beautiful
There are
pretty
parks
Figure 1: Template Example Indicating a set of Semantically Similar Patent Abstracts
across
all over
the
countr y
nation
in Japan
nationwide
nationwide
in
Japan
beautiful
There are
pretty
parks
lovelyThere are parks across Japan
Figure 2: Query Template to be matched with Abstract Template
across
all over
the
country
nation
in
nationwide
nationwide
in
Japan
beautiful
There are
pretty
parks
lovely across
Japan
Figure 3: Modified Template
4 The Flow of the Search Process
4.1 The Entire Flow of the Complete Search
Process
Before starting the search process, we should set up
the DLSI for all the patent documents.
1. Locate the query in the DLSI space.
2. Find and select those patent documents whose
abstracts’ vector space lie in a neighborhood of the
query vector space having semantic similarity to
sentences of figure 1 by the DLSI matching algo-
rithm.
3. For each of the abstracts obtained by step 4.1,
use the template matching algorithm of (Chen and
Tokuda, 2003a) to calculate the similarity of the
summary and the query, select the documents of
which the abstracts have a highest similarity to the
query.
4. Show the result to the user.
5. Modify the abstracts in the database by users’
responses.
5 Concluding Remarks
We have proposed a new IR method for patent
documents addressing both semantic and syntactic
properties by combining a mixed model of content
and form based methods; the first stage of DLSI
method narrows down the search space by content
and the second template method pins down the doc-
ument by syntactic search on words. We are able to
do so, mainly because the DLSI matching in the
first stage captures those documents based on con-
tent while the template method can now pin down
the patent documents having a highest similarity in
form with the query. An experimental verification
of the present approach is now underway.

References
M. W. Berry, Z. Drmac, and E. R. Jessup. 1999. Matri-
ces, vector spaces, and information retrieval. SIAM Rev.,
41(2):335–362.
L. Chen and N. Tokuda. 2003a. Bug diagnosis by string
matching: Application to ILTS for translation. CALICO
Journal, 20(2):227–244.
L. Chen and N. Tokuda. 2003b. Robustness of regional match-
ing scheme over global matching scheme. Artificial Intelli-
gence, 144(1-2):213–232.
L. Chen, N. Tokuda, and A. Nagai. 2001. Probabilistic In-
formation Retrieval Method Based on Differential Latent
Semantic Index Space. IEICE Trans. on Information and
Systems, E84-D(7):910–914.
M. L. Littman, Fan Jiang, and Greg A. Keim. 1998. Learn-
ing a language-independent representation for terms from a
partially aligned corpus. In Proceedings of the Fifteenth In-
ternational Conference on Machine Learning, pages 314–
322.
N. Tokuda and L. Chen. 2001. An online tutoring system for
language translation. IEEE Multimedia, 8(3):46–55.
M. Turk and A. Pentland. 1991. Eigenfaces for recognition.
Journal of Cognitive Neuroscience, 3(1):71–86.
Justin Zobel and Alistair Moffat. 1998. Exploring the similar-
ity space. ACM SIGIR FORUM, 32(1):18–34.
