Lexical Query Paraphrasing for Document Retrieval
Ingrid Zukerman
School of Computer Science and Software Eng.
Monash University
Clayton, VICTORIA 3800
AUSTRALIA
Bhavani Raskutti
Telstra Research Laboratories
770 Blackburn Road
Clayton, VICTORIA 3168
AUSTRALIA
Abstract
We describe a mechanism for the generation of
lexical paraphrases of queries posed to an Inter-
net resource. These paraphrases are generated us-
ing WordNet and part-of-speech information to pro-
pose synonyms for the content words in the queries.
Statistical information, obtained from a corpus, is
then used to rank the paraphrases. We evaluated
our mechanism using 404 queries whose answers
reside in the LA Times subset of the TREC-9 cor-
pus. There was a 14% improvement in perfor-
mance when paraphrases were used for document
retrieval.
1 Introduction
The vocabulary of users of domain-specific retrieval
systems often differs from the vocabulary within a
particular resource, leading to retrieval failure. In
this research, we address this problem by submit-
ting multiple paraphrases of a query to a retrieval
system, in the hope that one or more of the posited
paraphrases will match a relevant document.
We focus on the generation of lexical paraphrases
for queries posed to the Internet. These are para-
phrases where content words are replaced with syn-
onyms. We use WordNet (Miller et al., 1990) and
part-of-speech information to propose these syn-
onyms, and build candidate paraphrases from com-
binations of these synonyms. The resultant para-
phrases are then scored using word co-occurrence
information obtained from a corpus, and the high-
est scoring paraphrases are retained. Our evaluation
shows a 14% improvement in retrieval performance
as a result of query paraphrasing.
In the next section we describe related research.
In Section 3, we discuss the resources used by our
mechanism. The paraphrase generation and docu-
ment retrieval processes are described in Section 4.
Section 5 presents sample paraphrases, followed by
our evaluation and concluding remarks.
This research was supported in part by Australian Research
Council grant DP0209565.
2 Related Research
The vocabulary mis-match between user queries
and indexed documents is often addressed through
query expansion. Two common techniques for
query expansion are blind relevance feedback
(Buckley et al., 1995; Mitra et al., 1998) and
word sense disambiguation (WSD) (Mihalcea and
Moldovan, 1999; Lytinen et al., 2000; Schütze and
Pedersen, 1995; Lin, 1998). Blind relevance feed-
back consists of retrieving a small number of docu-
ments using a query given by a user, and then con-
structing an expanded query that includes content
words that appear frequently in these documents.
This expanded query is used to retrieve a new set of
documents. WSD often precedes query expansion
to avoid retrieving irrelevant information. Mihalcea
and Moldovan (1999) and Lytinen et al. (2000) used
a machine readable thesaurus, specifically WordNet
(Miller et al., 1990), to obtain the sense of a word,
while Schütze and Pedersen (1995) and Lin (1998)
used automatically constructed thesauri.
The improvements in retrieval performance re-
ported in (Mitra et al., 1998) are comparable to
those reported here (note that these researchers
consider precision, while we consider recall). The
results obtained by Schütze and Pedersen (1995) and
by Lytinen et al. (2000) are encouraging. However,
experimental results reported in (Sanderson, 1994;
Gonzalo et al., 1998) indicate that the improvement
in IR performance due to WSD is restricted to short
queries, and that IR performance is very sensitive to
disambiguation errors.
Our approach to document retrieval differs from
the above approaches in that the expansion of a
query takes the form of alternative lexical para-
phrases. Like Harabagiu et al. (2001), we use
WordNet to propose synonyms for the words in
a query. However, they apply heuristics to select
which words to paraphrase. In contrast, we use
corpus-based information in the context of the en-
tire query to calculate the score of a paraphrase
and select which paraphrases to retain, and then use
the paraphrase scores to influence the document re-
trieval process.
3 Resources
Our system uses syntactic, semantic and statistical
information for paraphrase generation. Syntactic in-
formation for each query was obtained from Brill’s
part-of-speech (PoS) tagger (Brill, 1992). Seman-
tic information consisting of different types of syn-
onyms for the words in each query was obtained
from WordNet (Miller et al., 1990).
The corpus used for information retrieval and for
the collection of statistical information was the LA
Times portion of the NIST Text Research Collec-
tion (//trec.nist.gov). This corpus was small
enough to satisfy our disk space limitations, and suf-
ficiently large to yield statistically significant results
(131,896 documents). Full-text indexing was per-
formed for the documents in the LA Times collec-
tion, using lemmas (rather than words). The lemmas
for the words in the LA Times collection were also
obtained from WordNet (Miller et al., 1990).
The statistical information was used to assign a
score to the paraphrases generated for a query (Sec-
tion 4.4). This information was stored in a lemma
dictionary (202,485 lemmas) and a lemma-pair dic-
tionary (37,341,156 lemma-pairs). The lemma dic-
tionary associates with each lemma the number of
times it appears in the corpus. The lemma-pair
dictionary associates with each ordered lemma-pair
$l_i$-$l_j$ the number of times $l_i$ appears before $l_j$ in
a five-word window in the corpus (not counting
stop words and closed-class words). The dictionary
maintains a different entry for the lemma-pair
$l_j$-$l_i$. Lemma-pairs which appear only once constitute
64% of the pairs, and were omitted from our
dictionary owing to disk space limitations.
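The two dictionaries described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stop list, the function name `build_dictionaries`, and the reading of "five-word window" as a lemma plus the four content lemmas that follow it are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop list; the paper does not enumerate its stop words.
STOP_WORDS = {"the", "of", "be", "a", "in", "who", "to", "?"}


def build_dictionaries(documents, window=5, min_pair_freq=2):
    """Count lemma frequencies and ordered lemma-pair frequencies.

    documents: iterable of lists of lemmas (already lemmatized).
    Assumption: the window is measured over content lemmas only, so
    stop words are removed before pairs are formed.
    """
    lemma_freq = Counter()
    pair_freq = Counter()
    for lemmas in documents:
        content = [lemma for lemma in lemmas if lemma not in STOP_WORDS]
        lemma_freq.update(content)
        for i, first in enumerate(content):
            # Pair `first` with each later lemma inside the window.
            for second in content[i + 1 : i + window]:
                pair_freq[(first, second)] += 1
    # Pairs seen only once are dropped, mirroring the disk-space cut-off.
    kept_pairs = {p: c for p, c in pair_freq.items() if c >= min_pair_freq}
    return dict(lemma_freq), kept_pairs
```

Because the pairs are ordered, `("greek", "god")` and `("god", "greek")` are separate keys, which is what allows the word-order weighting of Section 4.4.1 to be applied later.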
4 Paraphrasing and Retrieval Procedure
The procedure for paraphrasing a query consists of
the following steps:
1. Tokenize, tag and lemmatize the query.
2. Generate synonyms for each content lemma in
the query (stop words are ignored).
3. Propose paraphrases for the query using differ-
ent synonym combinations, compute a score for
each paraphrase, and rank the paraphrases ac-
cording to their score. The lemmatized query
plus the 19 top paraphrases are retained.
Documents are then retrieved for the query and
its paraphrases.
4.1 Tagging and lemmatizing the queries
We used the part-of-speech (PoS) of a word to con-
strain the number of synonyms generated for it.
Brill’s tagger correctly tagged 84% of the queries.
In order to determine the effect of tagging errors
on retrieval performance, we manually corrected
the wrong tags, and ran our system with both
automatically-obtained and manually-corrected tags
(Section 6). After tagging, each query was lemmatized
(using WordNet). This was done since the index
used for document retrieval is lemma-based.
4.2 Proposing synonyms for each word
The following types of WordNet synonyms were
generated for each content lemma in a query:
synonyms, attributes, pertainyms and
seealsos (Miller et al., 1990).1 For example,
according to WordNet, a synonym for “high” is
“steep”, an attribute is “height”, and a seealso
is “tall”; a pertainym for “chinese” is “China”.
In order to curb the combinatorial explosion, we
do not allow multiple-word synonyms for a lemma,
and do not generate synonyms for proper nouns or
stop words.
4.3 Paraphrasing queries
Query paraphrases are generated by an iterative pro-
cess which considers each content lemma in a query
in turn, and proposes a synonym from those col-
lected from WordNet (Section 4.2). Queries which
do not have sufficient context are not paraphrased.
These are queries where all the words except one
are stop words or closed-class words.
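The generation step can be illustrated with a small sketch. It assumes a toy synonym table in place of WordNet lookups and enumerates all synonym combinations at once, which is a simplification of the iterative process described above; the names `SYNONYMS`, `STOP_WORDS` and `candidate_paraphrases` are hypothetical, and proper-noun handling is left out.

```python
from itertools import product

# Hypothetical stop list and synonym table (entries borrowed from the
# paper's examples); a real system would query WordNet instead.
STOP_WORDS = {"who", "be", "the", "of", "how", "?"}
SYNONYMS = {
    "greek": ["greece"],
    "god": ["deity", "divinity"],
    "sea": ["ocean"],
}


def candidate_paraphrases(lemmas, synonyms=SYNONYMS, stop_words=STOP_WORDS):
    """Return every synonym combination except the original query."""
    content = [lemma for lemma in lemmas if lemma not in stop_words]
    if len(content) < 2:
        # Insufficient context: at most one content lemma in the query.
        return []
    options = [
        [lemma] if lemma in stop_words else [lemma] + synonyms.get(lemma, [])
        for lemma in lemmas
    ]
    candidates = [list(combo) for combo in product(*options)]
    # The first combination keeps every original lemma, so skip it.
    return candidates[1:]
```

For the lemmatized query "who be the greek god of the sea ?" this toy table gives 2 x 3 x 2 = 12 combinations, of which 11 are paraphrases. The exhaustive product is what makes the combinatorial explosion mentioned in Section 4.2 visible.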
4.4 Computing paraphrase scores
The score of a paraphrase is based on how common
the lemma combinations in it are. Ideally, this score
would be represented by $\Pr(l_1, \ldots, l_n)$, where $n$ is
the number of lemmas in the paraphrase. However,
in the absence of sufficient information to compute
this joint probability, approximations based on
conditional probabilities are often used, e.g.,
\[ \Pr(l_1, \ldots, l_n) \approx \Pr(l_n \mid l_{n-1}) \times \cdots \times \Pr(l_2 \mid l_1) \times \Pr(l_1) \]
Unfortunately, this approximation yielded poor
paraphrases in preliminary trials. We postulate that
this is due to two reasons: (1) it takes into account
the interaction between a lemma $l_i$ and only one
other lemma (without considering the rest of the
lemmas in the query), and (2) relatively infrequent
lemma combinations involving one frequent lemma
are penalized (which is correct for conditional
probabilities). For instance, if $l_j$ appears 10 times in the
corpus and $l_i$-$l_j$ appears 4 times, $\Pr(l_i \mid l_j) = 0.4 \times \alpha$
(where $\alpha$ is a normalizing constant). In contrast,
if $l'_j$ appears 200 times in the corpus and $l'_i$-$l'_j$
appears 30 times, $\Pr(l'_i \mid l'_j) = 0.15 \times \alpha$. However, $l'_i$-$l'_j$
is a more frequent lemma combination, and should
contribute a higher score to the paraphrase.
1 In preliminary experiments we also generated hypernyms
and hyponyms. However, this increased the number of
alternative paraphrases exponentially, without improving the
quality of the results in most cases.
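The arithmetic in the example above can be checked directly. The sketch below is only an illustration: the function name is hypothetical and the normalizing constant is omitted, since it does not affect the comparison.

```python
def conditional_estimate(pair_count, conditioning_lemma_count):
    """freq(l_i-l_j) / freq(l_j); the normalizing constant is omitted."""
    return pair_count / conditioning_lemma_count


# Counts taken from the example in the text.
rare_pair = {"pair_count": 4, "lemma_count": 10}
frequent_pair = {"pair_count": 30, "lemma_count": 200}

rare_conditional = conditional_estimate(
    rare_pair["pair_count"], rare_pair["lemma_count"]
)
frequent_conditional = conditional_estimate(
    frequent_pair["pair_count"], frequent_pair["lemma_count"]
)

# The conditional estimate ranks the rarer pair higher ...
conditional_prefers_rare = rare_conditional > frequent_conditional
# ... whereas the raw pair counts rank the more frequent pair higher.
counts_prefer_frequent = frequent_pair["pair_count"] > rare_pair["pair_count"]
```

Both flags evaluate to true, which is the mismatch the text describes: the pair seen 30 times receives the lower conditional estimate.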
To address these problems, we propose using the
joint probability of a pair of lemmas instead of their
conditional probability. In the above example, this
yields $\Pr(l_i, l_j) = 4 \times \beta$ and $\Pr(l'_i, l'_j) = 30 \times \beta$ (where $\beta$
is a normalizing constant). These probabilities
reflect more accurately the goodness of paraphrases
containing these lemma-pairs. The resulting
approximation of the probability of a paraphrase
composed of lemmas $l_1, \ldots, l_n$ is as follows:
\[ \Pr(l_1, \ldots, l_n) \approx \prod_{i=1}^{n} \prod_{j=i+1}^{n} \Pr(l_i, l_j) \tag{1} \]
$\Pr(l_i, l_j)$ is obtained directly from the lemma-pair
frequencies, yielding
\[ \Pr(l_1, \ldots, l_n) \approx \prod_{i=1}^{n} \prod_{j=i+1}^{n} \beta \times \mathrm{freq}(l_i, l_j) \]
where $\beta$ is a normalizing constant.2 Since this
constant is not informative with respect to the
relative scores of the paraphrases for a particular query,
we drop it from consideration, and use only the
frequencies to calculate the score of a paraphrase.
Thus, our paraphrase scoring function is
\[ \mathit{PS}(l_1, \ldots, l_n) = \prod_{i=1}^{n} \prod_{j=i+1}^{n} \mathrm{freq}(l_i, l_j) \tag{2} \]
4.4.1 Experimental parameters
When calculating the score of a paraphrase using
Equation 2, the following aspects regarding
$\mathrm{freq}(l_i, l_j)$ must be specified: (1) the extent to which
the order of $l_i$ and $l_j$ (as it appears in the paraphrase)
should be enforced; and (2) how to handle $l_i$-$l_j$ pairs
in the paraphrase that are absent from the
lemma-pair dictionary. To illustrate these aspects, consider
the candidate paraphrase “who is the greek deity of
the ocean?” (proposed for “who is the greek god of
the sea?”). The first aspect determines whether the
frequency of only “greek deity” should be used, or
whether “deity greek” should also be taken into
account. The second aspect determines how to score
the paraphrase if “greek ocean” is absent from the
lemma-pair dictionary. These aspects are specified
as experimental parameters of the system.
2 [Footnote 2 defines the normalizing constant in terms of the
number of lemma-pairs; the formula is not recoverable from the
source.]
Relative word order. The extent to which we
enforce the order of $l_i$-$l_j$ when calculating $\mathrm{freq}(l_i, l_j)$
is determined by the weight $w_{\mathit{order}}$ as follows:
\[ \mathrm{freq}(l_i, l_j) = \mathrm{freq}(l_i \rightarrow l_j) + w_{\mathit{order}} \times \mathrm{freq}(l_j \rightarrow l_i) \tag{3} \]
where $\mathrm{freq}(l_i \rightarrow l_j)$ is the frequency of the
lemma-pair $(l_i, l_j)$ when $l_i$ is followed by $l_j$. Setting
$w_{\mathit{order}} = 0$ allows only the word order in the paraphrase, while
$w_{\mathit{order}} = 1$ counts equally the order in the
paraphrase and the reverse order. We experimented with
weights of 0, 1 and 0.5 for $w_{\mathit{order}}$ (Section 6).
Absent lemma-pairs. When a lemma-pair is not
in the dictionary, a frequency of 0 is returned. Us-
ing this frequency is too strict, because it invali-
dates an entire paraphrase on account of one cul-
prit which may actually be innocent (recall that 64%
of the lemma-pairs in the corpus – approximately
66 million pairs – had a frequency of 1 but were
not stored). To address this problem, we assigned a
penalty frequency of AbsFreq = 0.1 to a lemma-pair
in a paraphrase that does not appear in the dictio-
nary. That is, the score of a paraphrase is divided by
10 for each of its lemma-pairs that is absent from
the dictionary.
In addition, we defined the experimental parame-
ter AbsAdjDiv, which models the impact of adjacent
lemma-pairs on paraphrasing and retrieval perfor-
mance. This parameter takes the form of a divisor
for AbsFreq: it stipulates by how much to divide Ab-
sFreq for a lemma-pair that is adjacent in the para-
phrase but absent from the dictionary. In the above
example, AbsAdjDiv=10 would cause an absent “de-
ity ocean” to receive a penalty of 0.01 (=0.1/10)
compared to an absent “greek ocean”, which would
receive a penalty of 0.1. We experimented with four
values for AbsAdjDiv: 1, 2, 10 and 20 (Section 6).
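The scoring function and its two experimental parameters can be put together as follows. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, a pair counts as absent when its order-weighted frequency is zero, and adjacency is judged over content lemmas only (so "deity" and "ocean" are adjacent in "greek deity of the ocean", as in the example above).

```python
from itertools import combinations

ABS_FREQ = 0.1  # penalty frequency for pairs absent from the dictionary


def pair_frequency(pair_freq, first, second, adjacent,
                   w_order=1.0, abs_adj_div=10):
    """Order-weighted pair frequency (Equation 3) with absent-pair penalty."""
    forward = pair_freq.get((first, second), 0)
    reverse = pair_freq.get((second, first), 0)
    freq = forward + w_order * reverse
    if freq > 0:
        return freq
    # Absent pair: adjacent pairs are penalized more heavily.
    return ABS_FREQ / abs_adj_div if adjacent else ABS_FREQ


def paraphrase_score(content_lemmas, pair_freq, w_order=1.0, abs_adj_div=10):
    """Product of pair frequencies over all i < j (Equation 2)."""
    score = 1.0
    indexed = enumerate(content_lemmas)
    for (i, first), (j, second) in combinations(indexed, 2):
        score *= pair_frequency(
            pair_freq, first, second, j == i + 1, w_order, abs_adj_div
        )
    return score
```

With a toy dictionary containing only `("greek", "deity"): 5`, the content lemmas `["greek", "deity", "ocean"]` score 5 x 0.1 x 0.01: the non-adjacent absent pair "greek ocean" contributes 0.1, and the adjacent absent pair "deity ocean" contributes 0.1/10.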
4.5 Retrieving documents for each query
Our retrieval process differs from the standard one
in that for each query $Q$, we adjust the scores of the
retrieved documents according to the scores of the
paraphrases of $Q$ (obtained from Equation 2). Our
retrieval process consists of the following steps:
1. For each paraphrase $P_i$ of $Q$ ($i = 0, \ldots, \#\mathit{para}_Q$),
where $P_0$ is the lemmatized query:
(a) Extract the content lemmas from $P_i$:
$l_{i,1}, \ldots, l_{i,m}$, where $m$ is the number of
content lemmas in paraphrase $P_i$.
(b) For each lemma, compute a score for the
retrieved documents using a standard IR measure,
e.g., Term Frequency Inverse Document
Frequency (TFIDF) (Salton and McGill,
1983). Let $\mathrm{tfidf}(D_k, l_{i,j})$ be the score of
document $D_k$ retrieved for lemma $l_{i,j}$
($j = 1, \ldots, m$). When a document $D_k$ is retrieved
by more than one lemma in a paraphrase
$P_i$, its TFIDF scores are added, yielding the
score $\sum_{j=1}^{m} \mathrm{tfidf}(D_k, l_{i,j})$. This score
indicates how well $D_k$ matches the lemmas in
paraphrase $P_i$. In order to take into account
the plausibility of $P_i$, this score is multiplied
by $\mathit{PS}(P_i)$ – the score of $P_i$ obtained from
Equation 2. This yields $\mathit{DS}_{k,i}$, the score of
document $D_k$ for paraphrase $P_i$.
\[ \mathit{DS}_{k,i} = \mathit{PS}(P_i) \times \sum_{j=1}^{m} \mathrm{tfidf}(D_k, l_{i,j}) \tag{4} \]
2. For each document $D_k$, add the scores from each
paraphrase (Equation 4), yielding
\[ \mathit{DS}_k = \sum_{i=0}^{\#\mathit{para}_Q} \mathit{PS}(P_i) \times \sum_{j=1}^{m} \mathrm{tfidf}(D_k, l_{i,j}) \tag{5} \]
An outcome of this method is that lemmas
which appear in several paraphrases receive a higher
weight. This indirectly identifies the important
words in a query, which positively affects retrieval
performance (Section 6).
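The document-scoring steps above amount to a weighted sum, sketched below. The index layout (a dictionary from lemma to per-document TFIDF scores) and the function name `score_documents` are assumptions made for the illustration; the paper does not describe its retrieval engine at this level.

```python
from collections import defaultdict


def score_documents(paraphrases, tfidf):
    """Combine per-paraphrase TFIDF sums into one score per document.

    paraphrases: list of (paraphrase_score, content_lemmas) pairs, with
        the lemmatized query included alongside its paraphrases.
    tfidf: dict mapping lemma -> {doc_id: tfidf score} (hypothetical layout).
    """
    totals = defaultdict(float)
    for paraphrase_score, lemmas in paraphrases:
        per_doc = defaultdict(float)
        for lemma in lemmas:
            for doc_id, value in tfidf.get(lemma, {}).items():
                per_doc[doc_id] += value  # inner sum of Equation 4
        for doc_id, value in per_doc.items():
            totals[doc_id] += paraphrase_score * value  # Equation 5
    # Highest-scoring documents first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A lemma that occurs in many retained paraphrases is added once per paraphrase, which is how the method ends up giving such lemmas a higher effective weight.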
5 Sample Results
Table 1 shows the top 10 paraphrases generated by
our system for three sample queries, and the 7 para-
phrases generated for a fourth query (the lemma-
tized query is listed first). These paraphrases were
obtained with $w_{\mathit{order}} = 1$, AbsAdjDiv = 10, and
manually-corrected tagging (Section 4). The third
column contains the paraphrase, the first column
contains its score, and the second column contains
the number of lemma-pairs in the paraphrase which
were not found in the dictionary.
These examples illustrate the combined effect of
contextual information and WordNet senses. The
first query yields mostly felicitous paraphrases, de-
spite their low overall score and absent lemma-
pairs. This outcome may be attributed to the gen-
erally appropriate synonyms returned by WordNet
for the lemmas in this query. The second query
produces a mixed paraphrasing performance. The
problematic paraphrases are generated because our
corpus-based information supports WordNet’s inap-
propriate suggestions of “manufacture” as a syn-
onym for “invent” and “video” as a synonym for
“television”, thus yielding highly-ranked but incorrect
paraphrases. The third query is an extreme
example of this behaviour, where WordNet synonyms
conspire with contextual information to steer
the paraphrasing process toward inappropriate synonyms
of “bear”. The final example illustrates the
opposite case, where the corpus information overcomes
the effect of WordNet’s less appropriate suggestions,
which yield low-scoring paraphrases.

Table 1: Sample query paraphrases
Score     #Abs  Paraphrase
Who is the Greek God of the Sea ?
9.20E+02  0     who be the greek god of the sea ?
6.90E+00  1     who be the greek god of the ocean ?
5.00E-01  1     who be the greece god of the sea ?
1.00E-02  2     who be the greece deity of the sea ?
1.00E-02  2     who be the greece divinity of the sea ?
1.00E-02  2     who be the greece immortal of the sea ?
1.00E-02  2     who be the greece idol of the sea ?
8.00E-03  2     who be the greek deity of the sea ?
8.00E-03  2     who be the greek divinity of the sea ?
8.00E-03  2     who be the greek immortal of the sea ?
8.00E-03  2     who be the greek idol of the sea ?
Who invented television ?
7.00E+00  0     who invent television ?
1.60E+01  0     who manufacture television ?
1.60E+01  0     who manufacture video ?
1.10E+01  0     who manufacture tv ?
9.00E+00  0     who invent tv ?
2.00E+00  0     who devise television ?
2.00E+00  0     who forge tv ?
1.00E-02  1     who invent video ?
1.00E-02  1     who invent telly ?
1.00E-02  1     who contrive television ?
1.00E-02  1     who contrive tv ?
When was Babe Ruth born ?
6.06E+03  0     when be babe ruth bear ?
3.39E+04  0     when be babe ruth pay ?
1.97E+04  0     when be babe ruth stand ?
1.09E+04  0     when be babe ruth hold ?
2.42E+03  0     when be babe ruth carry ?
1.21E+03  0     when be babe ruth have ?
4.24E+02  1     when be babe ruth support ?
9.09E+01  1     when be babe ruth expect ?
6.06E+00  1     when be babe ruth brook ?
6.06E+00  1     when be babe ruth wear ?
3.03E-01  2     when be babe ruth deliver ?
How tall is the giraffe ?
4.00E+00  0     how tall be the giraffe ?
2.00E+00  0     how large be the giraffe ?
2.00E+00  0     how big be the giraffe ?
2.00E+00  0     how high be the giraffe ?
1.00E-01  1     how grandiloquent be the giraffe ?
1.00E-01  1     how magniloquent be the giraffe ?
1.00E-01  1     how improbable be the giraffe ?
1.00E-01  1     how marvelous be the giraffe ?
6 Evaluation
For our evaluation, we performed two retrieval tasks
on the TREC LA Times collection, using TREC
judgments to identify the queries that had relevant
documents in this collection. Our main evaluation
was performed for the TREC-9 question-answering
task, since our ultimate goal is to answer ques-
tions posed to an Internet resource. From a total
of 131,896 documents in the collection, 1211 doc-
uments contained the correct answer for 404 of the
693 TREC-9 queries. An additional evaluation was
performed for the TREC-6 ad-hoc retrieval task,
where 1105 documents were judged relevant to 48
of the 50 TREC-6 keyword-based queries.
Our results show that query paraphrasing im-
proves overall retrieval performance. For the ad-hoc
task, when 20 retrieved documents were retained for
each query, 22 correct documents in total were re-
trieved without paraphrasing, while a maximum of
20 paraphrases per query yielded 35 correct docu-
ments (only 18 of the 48 queries were paraphrased).
For the question answering task, under the same
retrieval conditions, recall improved from 294 cor-
rect documents without paraphrasing to 337 with a
maximum of 20 paraphrases per query. Specifically,
the number of queries for which correct documents
were retrieved improved from 169 to 182.
In addition, we tested the effect of the following
factors on retrieval performance.
• WordNet co-locations – three usages of word
co-locations (none, for scoring only, for scoring and
paraphrase generation).
• Tagging accuracy – manually-corrected tagging
versus automatic PoS tagging (Brill, 1992),
which correctly tagged 84% of the queries.
• Out-of-order weight ($w_{\mathit{order}}$) – how much we
should take into account the word order in a
query (strict consideration, ignore word order,
intermediate).
• Absent adjacent-pair divisor (AbsAdjDiv) – how
much we should penalize lemma-pairs that are
adjacent in the query but absent from the corpus
(same penalty as non-adjacent absent
lemma-pairs, a little higher, a lot higher).
• Query length – how the number of words in the
query affects retrieval performance.
For each run, we submitted to the retrieval engine
increasing sets of paraphrases as follows: first the
lemmatized query alone (Set 0), next the query plus
1 paraphrase (Set 1), then the query plus 2 para-
phrases (Set 2), and so on, up to a maximum of
19 paraphrases (Set 19). For each submission, we
varied the number of documents returned by the re-
trieval engine from 1 to 20 documents.
6.1 WordNet Co-locations
[Figure 1: Effect of word co-location and number of
paraphrases (20 retrieved documents). Plot “Correct Documents
Vs Number of Paraphrases”: total number of correct documents
(y-axis) against number of paraphrases (x-axis) for the Col,
ColScore and NoCol settings.]
As indicated above, we considered three usages of
WordNet with respect to word co-locations: Col,
NoCol and ColScore. Under the Col setting, our
mechanism checked whether a lemma-pair in the
input query corresponds to a WordNet co-location,
and if so, generated synonyms for the pair, instead
of the individual lemmas. For instance, given the
lemma-pair “folic acid”, the Col setting yielded syn-
onyms such as “folate” and “vitamin m” for the
lemma-pair. During paraphrase scoring, these co-
locations were assigned a high frequency score, cor-
responding to the 999th percentile of pair frequen-
cies in the corpus. In contrast, the NoCol setting
did not take into account WordNet co-locations at
all. For instance, one of the paraphrases gener-
ated by this method for “folic acid” was “folic lsd”.
ColScore is a hybrid setting, where WordNet was
used for scoring lemma-pairs in the proposed para-
phrases, but not for generating them.
Figure 1 depicts the total number of correct doc-
uments retrieved (for 20 retrieved documents per
query), for each of the three co-location settings,
as a function of the number of paraphrases in a
set (from 0 to 19). The values for the other
factors were: $w_{\mathit{order}} = 1$, AbsAdjDiv=2, and
manually-corrected tagging. When only the lemmatized
query was submitted for retrieval (0 paraphrases),
294 correct documents were retrieved. This number
increases dramatically for the first few paraphrases,
and eventually levels out for about 12 paraphrases.
In order to compare queries that had different num-
bers of paraphrases, when the maximum number of
paraphrases for a query was less than 19, the results
obtained for this maximum number were replicated
for the paraphrase sets of higher cardinality. For
instance, if only 6 paraphrases were generated for
a query, the number of correct documents retrieved
for the 6 paraphrases was replicated for Sets 7 to 19.
[Figure 2: Effect of word co-location and number of
retrieved documents (maximum paraphrases). Plot “Correct
Documents Vs Number of Retrieved Documents”: total number of
correct documents (y-axis) against number of retrieved
documents (x-axis) for the NoPara, Col, ColScore and NoCol
settings.]
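The replication rule used to compare queries with different numbers of paraphrases can be sketched as follows. The data layout (one list of correct-document counts per query, indexed by paraphrase set) and both function names are assumptions made for the illustration.

```python
def pad_results(correct_per_set, max_sets=20):
    """Replicate the last available count for the missing paraphrase sets.

    correct_per_set[k] is the number of correct documents retrieved for
    Set k (Set 0 is the lemmatized query alone).
    """
    if not correct_per_set:
        raise ValueError("at least the Set 0 result is required")
    padded = list(correct_per_set)
    padded.extend([padded[-1]] * (max_sets - len(padded)))
    return padded[:max_sets]


def total_correct_per_set(per_query_results, max_sets=20):
    """Sum the padded counts over all queries, one total per set."""
    padded = [pad_results(counts, max_sets) for counts in per_query_results]
    return [sum(column) for column in zip(*padded)]
```

A query with only 6 paraphrases supplies counts for Sets 0 to 6, and `pad_results` repeats the Set 6 count for Sets 7 to 19, so every query contributes to every point on the curve.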
Figure 2 depicts the total number of correct doc-
uments retrieved (for 19 paraphrases or maximum
paraphrases), for each of the three co-location set-
tings, as a function of the number of documents re-
trieved per query (from 1 to 20). As for Figure 1,
paraphrasing improves retrieval performance. In ad-
dition, as expected, recall performance improves as
more documents are retrieved.
The Col setting generally yielded fewer and more
felicitous paraphrases than those generated without
considering co-locations (for the 118 queries where
co-locations were identified). Surprisingly however,
this effect did not transfer to the retrieval process,
as the NoCol setting yielded a marginally better per-
formance. This difference in performance may be
attributed to whether a lemma or lemma-pair that
was important for retrieval was retained in enough
paraphrases. This happened in 9 instances of the
NoCol setting and 2 instances of the Col setting,
yielding a slightly better performance for the NoCol
setting overall. For example, the identification of
“folic acid” as a co-location led to synonyms such
as “vitamin m” and “vitamin bc”, which appeared
in most of the paraphrases. As a result, the effect of
the lemma-pair “folic acid”, which was actually re-
sponsible for retrieving the correct document, was
obscured. In contrast, the recognition of “major
league” as a co-location (which was paraphrased to
“big league” in only 3 of the 19 paraphrases) en-
abled the retrieval of the correct document. Since
the performance under the ColScore condition was
consistently worse than the performance under the
other two conditions, we do not consider it in the
rest of our evaluation.
6.2 Tagging accuracy
The PoS-tagger incorrectly tagged 64 of the 404
queries in our corpus (usually, one word was mis-
tagged in each of these queries). The instances of
mis-tagging which had the largest impact on the
quality of the generated paraphrases occurred when
nouns were mis-tagged as verbs and vice versa (18
cases). In addition, proper nouns were mis-tagged
as other PoS and vice versa in 24 cases, and the
verb “name” (e.g., “Name the highest mountain”)
was mis-tagged as a noun in 17 instances. Surpris-
ingly, retrieval performance was affected only in 5
instances both for the Col and the NoCol settings: 3
of these instances had a mis-tagged “name”, and 2
had a noun mis-tagged as another PoS.
6.3 Out-of-order weight
We considered three settings for the out-of-order
weight, $w_{\mathit{order}}$ (Equation 3): 1, 0 and 0.5. The
first setting ignores word order. For instance, given
the query “how many dogs pull a sled in the Idi-
tarod?” the frequency of the lemma-pair “dog-pull”
is added to that of the pair “pull-dog”. The second
setting enforces a strict word order, e.g., only “dog-
pull” is considered. The third setting considers out-
of-order lemma-pairs, but gives their frequency half
the weight of the ordered pairs.
Interestingly, this factor had no effect on retrieval
performance. This may be explained by the obser-
vation that the lemma order in the queries reflects
their order in the corpus. Thus, when an ordered
lemma-pair in a query matches a dictionary entry,
the additional frequency count contributed by the
reverse lemma order is often insufficient to affect
significantly the relative score of the paraphrases.
6.4 Penalty for absent adjacent lemma-pairs
We considered four settings for the penalty assigned
to lemma-pairs that are adjacent in a paraphrase but
absent from the dictionary. These settings are repre-
sented by the values 1, 2, 10 and 20 for the divisor
AbsAdjDiv. For instance, a value of 10 means that
the score for an absent adjacent lemma-pair is 1/10
of the score of an absent non-adjacent lemma-pair.
That is, the score of a paraphrase is divided by 100
for each absent adjacent lemma-pair.
This factor had only a marginal effect on retrieval
performance, with the best performance being ob-
tained for AbsAdjDiv = 10.
6.5 Query Length
Our investigation of the effect of query length on
retrieval performance indicates that better
performance is obtained for shorter queries. Figure 3
shows the percentage of queries where at least one
correct document was retrieved, as a function of
query length in words (20 documents were retrieved
using 19 or maximum paraphrases). These results
were obtained for the settings Col, $w_{\mathit{order}} = 1$
and AbsAdjDiv=10, with manually-corrected
tagging. As seen in Figure 3, there is a drop in retrieval
performance for queries with more than 5 words.
These results generally concur with the
observations in (Sanderson, 1994; Gonzalo et al., 1998).
Nonetheless, on average we returned a correct
document for 42% of the queries which had 6 to 11
words.
[Figure 3: Effect of query length (20 retrieved documents
and maximum paraphrases). Plot “Percentage of successful
queries Vs query length”: percentage of successful queries
(y-axis, 0 to 80) against query length in words (x-axis,
3 to 12).]
7 Conclusion
We have offered a mechanism for the generation of
lexical paraphrases of queries posed to an Internet
resource. These paraphrases were generated using
WordNet and part-of-speech information to propose
synonyms for the content lemmas in the queries.
Statistical information obtained from a corpus was
used to rank the paraphrases. Our evaluation shows
that paraphrasing improves retrieval performance.
This is achieved despite mis-tagging and erroneous
paraphrasing of co-located words.

References

E. Brill. 1992. A simple rule-based part of speech
tagger. In ANLP-92 – Proceedings of the 3rd
Conference on Applied Natural Language Pro-
cessing, pages 152–155, Trento, IT.

C. Buckley, G. Salton, J. Allan, and A. Sing-
hal. 1995. Automatic query expansion using
SMART. In D. Harman, editor, The Third Text
REtrieval Conference (TREC3). NIST Special
Publication.

J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigar-
ran. 1998. Indexing with WordNet synsets can
improve text retrieval. In Proceedings of the
COLING-ACL’98 Workshop on Usage of Word-
Net in Natural Language Processing Systems,
pages 38–44, Montreal, Canada.

S. Harabagiu, D. Moldovan, M. Pasca, R. Mi-
halcea, M. Surdeanu, R. Bunescu, R. Girju,
V. Rus, and P. Morarescu. 2001. The role of
lexico-semantic feedback in open domain tex-
tual question-answering. In ACL01 – Proceed-
ings of the 39th Annual Meeting of the Associ-
ation for Computational Linguistics, pages 274–
281, Toulouse, France.

D. Lin. 1998. Automatic retrieval and clustering of
similar words. In COLING-ACL’98 – Proceed-
ings of the International Conference on Compu-
tational Linguistics and the Annual Meeting of
the Association for Computational Linguistics,
pages 768–774, Montreal, Canada.

S. Lytinen, N. Tomuro, and T. Repede. 2000. The
use of WordNet sense tagging in FAQfinder. In
Proceedings of the AAAI00 Workshop on AI and
Web Search, Austin, Texas.

R. Mihalcea and D. Moldovan. 1999. A method for
word sense disambiguation of unrestricted text.
In ACL99 – Proceedings of the 37th Annual Meet-
ing of the Association for Computational Linguis-
tics, Baltimore, Maryland.

G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and
K. Miller. 1990. Introduction to WordNet: An
on-line lexical database. Journal of Lexicogra-
phy, 3(4):235–244.

M. Mitra, A. Singhal, and C. Buckley. 1998. Im-
proving automatic query expansion. In SIGIR’98
– Proceedings of the 21st ACM International
Conference on Research and Development in In-
formation Retrieval, pages 206–214, Melbourne,
Australia.

G. Salton and M.J. McGill. 1983. An Introduction
to Modern Information Retrieval. McGraw Hill.

M. Sanderson. 1994. Word sense disambiguation
and information retrieval. In SIGIR’94 – Pro-
ceedings of the 17th ACM International Confer-
ence on Research and Development in Informa-
tion Retrieval, pages 142–151, Dublin, Ireland.

H. Schütze and J.O. Pedersen. 1995. Information
retrieval based on word senses. In Proceedings
of the Fourth Annual Symposium on Document
Analysis and Information Retrieval, pages 161–
175, Las Vegas, Nevada.
