A Robust Retrieval Engine for Proximal and Structural Search
Katsuya Masuda† Takashi Ninomiya†‡ Yusuke Miyao† Tomoko Ohta†‡ Jun’ichi Tsujii†‡
† Department of Computer Science, Graduate School of Information Science and Technology,
University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan
‡ CREST, JST (Japan Science and Technology Corporation)
Honcho 4-1-8, Kawaguchi-shi, Saitama 332-0012, Japan
{kmasuda,ninomi,yusuke,okap,tsujii}@is.s.u-tokyo.ac.jp
1 Introduction
In the area of text retrieval, including XML retrieval and Region
Algebra, many researchers have pursued models for specifying what
kinds of information should appear in specified structural and
linear positions (Chinenyanga and Kushmerick, 2001; Wolff et al.,
1999; Theobald and Weikum, 2000; Clarke et al., 1995). These models
have attracted many researchers because they are considered to be
basic frameworks for retrieving or extracting complex information
such as events. However, unlike keyword-based IR, these models are
not robust: they support only exact matching of queries, whereas we
would like to know to what degree the contents in specified
structural positions are relevant to those in the query even
when the structure does not exactly match the query.
This paper describes a new ranked retrieval model
that enables proximal and structural search for structured
texts. We extend the model proposed in Region Alge-
bra to be robust by i) incorporating the idea of ranked-
ness in keyword-based search, and ii) expanding queries.
While in ordinary ranked retrieval models relevance mea-
sures are computed in terms of words, our model assumes
that they are defined in more general structural fragments,
i.e., extents (continuous fragments in a text) proposed in
Region Algebra. We decompose queries into subqueries
to allow the system not only to retrieve exactly matched
extents but also to retrieve partially matched ones. Our
model is robust like keyword-based search, and also en-
ables us to specify the structural and linear positions in
texts as done by Region Algebra.
The significance of this work is not in the development
of a new relevance measure nor in showing superiority
of structure-based search over keyword-based search, but
in the proposal of a framework for integrating proximal
and structural ranking models. The model treats all types of
structure in texts: not only ordinary text structures such as
“title,” “abstract,” and “authors,” but also semantic tags
corresponding to recognized named entities or events can be used
for indexing text fragments and contribute to the relevance
measure. Since extents are treated similarly to keywords in
traditional models, our model can be integrated with any ranking
and scalability techniques used by keyword-based models.
We have implemented the ranking model in our retrieval engine and
conducted preliminary experiments to evaluate it. The corpus used
in the experiments is rather small, mainly because no test
collection of structured queries and tag-annotated texts is
available. We therefore used the GENIA corpus (Ohta et al., 2002)
as structured text; it is an XML document annotated with semantic
tags in the field of biomedical science. The experiments show that
our model succeeded in retrieving relevant answers that an
exact-matching model fails to retrieve for lack of robustness, as
well as relevant answers that a non-structured model fails to
retrieve for lack of structural specification.
2 A Ranking Model for Structured
Queries and Texts
This section describes the definition of the relevance be-
tween a document and a structured query represented by
the region algebra. The key idea is that a structured query
is decomposed into subqueries, and the relevance of the
whole query is represented as a vector of relevance mea-
sures of subqueries.
The region algebra (Clarke et al., 1995) is a set of operators
that represent relations between extents (i.e., regions in texts).
In this paper, we suppose the region algebra has seven operators:
four containment operators (▷, ◁, ⋫, ⋪) representing the
containment relation between extents, two combination operators
(△, ▽) corresponding to the “and” and “or” operators of the boolean
model, and an ordering operator (◇) representing the order of words
or structures in the texts. For convenience of explanation, we
represent a query as a tree structure, as shown in Figure 1.¹ This
query represents ‘Retrieve the books whose title has the word
“retrieval.”’

Figure 1: Subqueries of the query ‘[book] ▷ ([title] ▷ “retrieval”)’
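To make the containment operators concrete, the following is a minimal sketch rather than the engine's actual implementation. It assumes that an extent is encoded as a pair of inclusive token offsets, and the function names `containing`, `contained_in`, and `not_containing` are hypothetical names chosen here for ▷, ◁, and ⋫.

```python
from typing import List, Tuple

# Assumed encoding: an extent is (start, end) in token offsets, inclusive.
Extent = Tuple[int, int]


def contains(outer: Extent, inner: Extent) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def containing(a: List[Extent], b: List[Extent]) -> List[Extent]:
    """A ▷ B: extents of A that contain at least one extent of B."""
    return [x for x in a if any(contains(x, y) for y in b)]


def contained_in(a: List[Extent], b: List[Extent]) -> List[Extent]:
    """A ◁ B: extents of A that are contained in some extent of B."""
    return [x for x in a if any(contains(y, x) for y in b)]


def not_containing(a: List[Extent], b: List[Extent]) -> List[Extent]:
    """A ⋫ B: extents of A that contain no extent of B."""
    return [x for x in a if not any(contains(x, y) for y in b)]


# Toy index: two books, each with a title; "retrieval" occurs once inside
# the first title and once in the body of the second book.
books = [(0, 9), (10, 19)]
titles = [(1, 3), (11, 13)]
retrieval = [(2, 2), (17, 17)]

# '[book] ▷ ([title] ▷ "retrieval")'
matched = containing(books, containing(titles, retrieval))
unmatched = not_containing(books, containing(titles, retrieval))
```

Under exact matching only the first book is returned; the second book contains the word but not inside its title, which is the kind of near miss the ranking model is meant to score rather than discard.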
Our model assigns a relevance measure of the struc-
tured query as a vector of relevance measures of the sub-
queries. In other words, the relevance is defined by the
number of portions matched with subqueries in a docu-
ment. If an extent matches a subquery of query q, the
extent will be somewhat relevant to q even when the extent does
not exactly match q. Figure 1 shows an example of a query and its
subqueries. In this example, even when an extent does not match
the whole query exactly, if the extent matches “retrieval” or
‘[title] ▷ “retrieval”’, the extent is considered to be relevant
to the query. Subqueries are formally defined as follows.
Definition 1 (Subquery) Let $q$ be a given query and
$n_1, \ldots, n_m$ be the nodes of $q$. The subqueries
$q_1, \ldots, q_m$ of $q$ are the subtrees of $q$, where each
$q_i$ has node $n_i$ as its root.
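Definition 1 can be illustrated with a short sketch that enumerates one subtree per node of a query tree. The `Node` class and the helper names are hypothetical, and treating ‘[book]’ and ‘[title]’ as single leaf nodes (instead of expanding the syntactic sugar) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A query-tree node: an operator symbol, a tag, or a quoted word."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_string(node: Node) -> str:
    """Render a subtree in infix form (binary operators only)."""
    if not node.children:
        return node.label
    left, right = node.children
    return f"({to_string(left)} {node.label} {to_string(right)})"


def subqueries(root: Node) -> List[Node]:
    """Return the subtree rooted at every node, in pre-order."""
    result = [root]
    for child in root.children:
        result.extend(subqueries(child))
    return result


# The query of Figure 1: [book] ▷ ([title] ▷ "retrieval")
query = Node("▷", [
    Node("[book]"),
    Node("▷", [Node("[title]"), Node('"retrieval"')]),
])

subs = [to_string(s) for s in subqueries(query)]
for s in subs:
    print(s)
```

With this simplification the query has five nodes and therefore five subqueries, including the whole query itself.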
When a relevance $\sigma(q_i, d)$ between a subquery $q_i$ and
a document $d$ is given, the relevance of the whole query
is defined as follows.
Definition 2 (Relevance of the whole query) Let $q$ be a
given query, $d$ be a document, and $q_1, \ldots, q_m$ be the
subqueries of $q$. The relevance vector $\Sigma(q, d)$ of $d$ is
defined as follows:
$$\Sigma(q, d) = \langle \sigma(q_1, d), \sigma(q_2, d), \ldots, \sigma(q_m, d) \rangle$$
The relevance of a subquery should be defined similarly to that of
keyword-based queries in traditional ranked retrieval. For example,
TFIDF, which is used in our experiments in Section 3, is the
simplest and most straightforward measure, while other, more
recently proposed relevance measures (Robertson and Walker, 2000)
can also be applied. The TF value is calculated from the number of
extents matching the subquery, and the IDF value is calculated from
the number of documents that include extents matching the subquery.
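As a rough illustration of how such a relevance vector could be computed, the sketch below takes precomputed match counts per document and subquery. The exact TF and IDF formulas are not given in the text, so the raw count for TF and `log(N / df)` for IDF are assumptions, as are the function name `relevance_vectors` and the toy counts.

```python
import math
from typing import Dict, List


def relevance_vectors(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """counts[d][i] is the number of extents in document d matching subquery i.

    Returns one TF-IDF vector per document (assumed weighting:
    tf = raw count, idf = log(N / df)).
    """
    n_docs = len(counts)
    m = len(next(iter(counts.values())))
    # df[i]: number of documents with at least one extent matching subquery i
    df = [sum(1 for c in counts.values() if c[i] > 0) for i in range(m)]
    idf = [math.log(n_docs / df[i]) if df[i] > 0 else 0.0 for i in range(m)]
    return {d: [c[i] * idf[i] for i in range(m)] for d, c in counts.items()}


# Toy counts for three subqueries, ordered from the whole query (index 0)
# down to a single word (index 2).
counts = {
    "d1": [1, 1, 1],  # matches the whole query
    "d2": [0, 1, 2],  # partial match only
    "d3": [0, 0, 1],  # matches only the word
    "d4": [0, 0, 0],  # no match
}
vectors = relevance_vectors(counts)
```

Document `d2` has a zero in the first component but non-zero components for the smaller subqueries, so it remains visible to the ranking even though it fails exact matching.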
¹ In this query, ‘[x]’ is syntactic sugar for ‘<x> ◇ </x>’.

We have defined the relevance of a structured query as a vector,
but we still need to sort the documents according to their
relevance vectors. In this paper, we first map each vector to a
scalar value and then sort the documents according to this scalar
measure. We introduce three methods for mapping the relevance
vector to the scalar measure. The first simply takes the sum of
the elements of the relevance vector.
Definition 3 (Simple Sum)
$$\rho_{sum}(q, d) = \sum_{i=1}^{m} \sigma(q_i, d)$$
The second represents the rareness of the structures.
When the query is A ▷ B or A ◁ B, if the number of extents
matching the query is close to the number of extents matching A,
matching the query does not seem to be very important, because it
means that most extents that match A also match A ▷ B or A ◁ B.
The same reasoning applies to the other operators.
Definition 4 (Structure Coefficient) When the operator
op is △, ▽, or ◇, the structure coefficient of the query
A op B is:
$$\mathit{sc}_{A\,\mathrm{op}\,B} = \frac{C(A) + C(B) - C(A\,\mathrm{op}\,B)}{C(A) + C(B)}$$
and when the operator op is ▷ or ◁, the structure coefficient
of the query A op B is:
$$\mathit{sc}_{A\,\mathrm{op}\,B} = \frac{C(A) - C(A\,\mathrm{op}\,B)}{C(A)}$$
where A and B are queries and $C(A)$ is the number
of extents that match A in the document collection.
The scalar measure $\rho_{sc}(q, d)$ is then defined as
$$\rho_{sc}(q, d) = \sum_{i=1}^{m} \mathit{sc}_{q_i} \cdot \sigma(q_i, d)$$
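The two cases of Definition 4 can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, the counts are supplied by the caller, and returning 0.0 when the denominator is zero is a choice made here, since the definition does not cover that case.

```python
def structure_coefficient(op: str, c_a: int, c_b: int, c_ab: int) -> float:
    """Structure coefficient of 'A op B' following Definition 4.

    c_a, c_b, c_ab are the collection-wide extent counts C(A), C(B),
    and C(A op B).
    """
    if op in ("△", "▽", "◇"):
        denom = c_a + c_b
        # Assumption: treat an empty denominator as coefficient 0.0.
        return (denom - c_ab) / denom if denom else 0.0
    if op in ("▷", "◁"):
        return (c_a - c_ab) / c_a if c_a else 0.0
    raise ValueError(f"operator not covered by Definition 4: {op}")


# Nearly every A also matches A ▷ B: the structure carries little information.
common = structure_coefficient("▷", c_a=100, c_b=400, c_ab=95)

# Few extents of A match A ▷ B: the structure is rare and weighted highly.
rare = structure_coefficient("▷", c_a=100, c_b=400, c_ab=5)

# A combination operator uses both counts in the denominator.
both = structure_coefficient("△", c_a=100, c_b=100, c_ab=50)
```

The values come out near 0.05, 0.95, and 0.75, so a match on a rare structure contributes far more to $\rho_{sc}$ than a match on a structure that almost always holds.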
The third is a combination of the measure of the query itself and
the measures of its subqueries. Although we calculate the score of
extents from subqueries instead of using only the whole query, the
scores of different subqueries cannot be compared with one another
directly. We therefore assume a normalized weight for each subquery
and interpolate between the weight of a parent node and the weights
of its child nodes.
Definition 5 (Interpolated Coefficient) The interpolated
coefficient of the query $q_i$ is recursively defined as
follows:
$$\rho_{ic}(q_i, d) = \lambda \cdot \sigma(q_i, d) + (1 - \lambda) \frac{\sum_{c_i} \rho_{ic}(q_{c_i}, d)}{l}$$
where $c_i$ ranges over the children of node $n_i$, $l$ is the
number of children of node $n_i$, and $0 \le \lambda \le 1$.
This formula means that the weight of each node is defined by a
weighted average of the weight of the query and the weights of its
subqueries. When $\lambda = 1$, the weight of each query is the
normalized weight of the query itself. When $\lambda = 0$, the
weight of each query is calculated from the weights of its
subqueries, i.e., only from the weights of the words used in the
query.
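The recursion of Definition 5 can be sketched as follows. The `Node` class and the σ values are hypothetical, and letting a leaf return its own σ is an assumption: the definition divides by the number of children, which is zero for a leaf, and the remark about $\lambda = 0$ suggests that word nodes keep their own weight.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A subquery with its (normalized) relevance sigma(q_i, d)."""
    sigma: float
    children: List["Node"] = field(default_factory=list)


def rho_ic(node: Node, lam: float) -> float:
    """Interpolated coefficient of the subquery rooted at `node`."""
    if not node.children:
        # Assumption: a leaf (a word) contributes its own relevance.
        return node.sigma
    child_avg = sum(rho_ic(c, lam) for c in node.children) / len(node.children)
    return lam * node.sigma + (1 - lam) * child_avg


# Hypothetical sigma values for a query shaped like the one in Figure 1.
tree = Node(0.2, [
    Node(0.4),                           # e.g. [book]
    Node(0.6, [Node(0.8), Node(1.0)]),   # e.g. [title] ▷ "retrieval"
])

whole_only = rho_ic(tree, 1.0)   # only the whole query counts
words_only = rho_ic(tree, 0.0)   # only the leaves count
balanced = rho_ic(tree, 0.5)
```

With these numbers the three settings give 0.2, 0.65, and 0.3875, showing how $\lambda$ moves the score between the weight of the whole structure and the weights of the words alone.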
1  ‘([cons] ▷ ([sem] ▷ “G#DNA_domain_or_region”)) △ (“in” ◇ ([cons] ▷ ([sem] ▷ (“G#tissue” ▽ “G#body_part”))))’
2  ‘([event] ▷ ([obj] ▷ “gene”)) △ (“in” ◇ ([cons] ▷ ([sem] ▷ (“G#tissue” ▽ “G#body_part”))))’
3  ‘([event] ▷ ([obj] ◇ ([sem] ▷ “G#DNA_domain_or_region”))) △ (“in” ◇ ([cons] ▷ ([sem] ▷ (“G#tissue” ▽ “G#body_part”))))’
4  ‘([event] ▷ ([dummy] ▷ “G#DNA_domain_or_region”)) △ (“in” ◇ ([cons] ▷ ([sem] ▷ (“G#tissue” ▽ “G#body_part”))))’
Table 1: Queries submitted in the experiments
3 Experiments
In this section, we show the results of our preliminary
experiments of text retrieval using our model. Because
there is no test collection of the structured query and tag-
annotated text, we used the GENIA corpus (Ohta et al.,
2002) as a structured text, which was an XML document
composed of paper abstracts in the field of biomedical
science. The corpus consisted of 1,990 articles, 873,087
words (including tags), and 16,391 sentences.
We compared three retrieval models: i) our model, ii) exact
matching of the region algebra (exact), and iii) a non-structured
flat model (flat). In the flat model, each query was submitted as
the words of the corresponding query in Table 1 connected by the
“and” operator (△). The queries submitted to our system are shown
in Table 1, and each document was a sentence delimited by
“<sentence>” tags. Queries 1, 2, and 3 are real queries made by an
expert in the field of biomedicine. Query 4 is a toy query that we
made to demonstrate robustness in comparison with the exact model.
For each model, the system output the ten results with the highest
relevance.²
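One way to derive a flat query from a structured one is sketched below. It assumes that the "words" of a query are its quoted strings, that duplicates are dropped, and that queries are written with ASCII double quotes; the function name `flat_query` and the shortened example query are illustrative and not taken from Table 1 verbatim.

```python
import re


def flat_query(structured: str) -> str:
    """Join the quoted words of a structured query with the 'and' operator."""
    words = re.findall(r'"[^"]*"', structured)
    # Keep the first occurrence of each word, preserving order.
    unique = list(dict.fromkeys(words))
    return " △ ".join(unique)


# A simplified variant of Query 2, written with ASCII quotes.
structured = '([event] ▷ ([obj] ▷ "gene")) △ ("in" ◇ ([cons] ▷ ([sem] ▷ "G#tissue")))'
flat = flat_query(structured)
print(flat)  # "gene" △ "in" △ "G#tissue"
```

All tag names and containment constraints are discarded, so the flat query matches any sentence that contains the three strings anywhere.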
Table 2 shows the number of results judged relevant among the top
ten results when ranking was done using $\rho_{sum}$. The results
show that our model was superior to the exact and flat models for
Queries 1, 2, and 3. Compared to the exact model, our model output
more relevant documents because it allows partial matching of the
query, which shows the robustness of our model. In addition, our
model outperforms the flat model, which means that the structural
specification of the query was effective for finding relevant
documents. For Query 4, our model succeeded in finding relevant
results, whereas the exact model found no results because Query 4
includes a tag that does not occur in the text (the “<dummy>” tag).
This result also shows the robustness of our model.
Although we omit the detailed results for $\rho_{sc}$ and
$\rho_{ic}$ for lack of space, we summarize them here. The number
of relevant results using $\rho_{sc}$ was the same as that using
$\rho_{sum}$, but irrelevant results were ranked lower with
$\rho_{sc}$ than with $\rho_{sum}$. The results using $\rho_{ic}$
varied between the results of the flat model and those of
$\rho_{sum}$, depending on the value of $\lambda$.

Query   our model   exact   flat
1       10/10       9/10    9/10
2        6/10       5/5     3/10
3       10/10       9/9     8/10
4        7/10       0/0     9/10
Table 2: (The number of relevant results) / (the number
of all results) in top 10 results.

² For the exact model, ten results were selected randomly
from the exactly matched results if the total number of results
was more than ten. After obtaining the results for each model,
we shuffled them randomly for each query, and an expert in the
field of biomedicine judged whether each shuffled result was
relevant.
4 Conclusions
We proposed a ranked retrieval model for structured
queries and texts by extending the region algebra to be
ranked. Our model achieved robustness by extending the
concept of words to extents and by matching with sub-
queries decomposed from a given query instead of match-
ing the entire query or words.
References
T. Chinenyanga and N. Kushmerick. 2001. Expressive
and efficient ranked querying of XML data. In
Proceedings of WebDB-2001.
C. L. A. Clarke, G. V. Cormack, and F. J. Burkowski.
1995. An algebra for structured text search and a
framework for its implementation. The Computer
Journal, 38(1):43–56.
T. Ohta, Y. Tateisi, H. Mima, and J. Tsujii. 2002.
GENIA corpus: an annotated research abstract corpus in
molecular biology domain. In Proceedings of HLT
2002.
S. E. Robertson and S. Walker. 2000. Okapi/Keenbow at
TREC-8. In TREC-8, pages 151–161.
A. Theobald and G. Weikum. 2000. Adding relevance
to XML. In Proceedings of WebDB’00.
J. Wolff, H. Flörke, and A. Cremers. 1999. XPRES:
a Ranking Approach to Retrieval on Structured
Documents. Technical Report IAI-TR-99-12, University of
Bonn.
