Proceedings of the 43rd Annual Meeting of the ACL, pages 205–214,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Experiments with Interactive Question-Answering
Sanda Harabagiu, Andrew Hickl, John Lehmann, and Dan Moldovan
Language Computer Corporation
Richardson, Texas USA
sanda@languagecomputer.com
Abstract
This paper describes a novel framework
for interactive question-answering (Q/A)
based on predictive questioning. Gen-
erated off-line from topic representations
of complex scenarios, predictive ques-
tions represent requests for information
that capture the most salient (and diverse)
aspects of a topic. We present experimen-
tal results from large user studies (featur-
ing a fully-implemented interactive Q/A
system named FERRET) that demonstrate
that surprising performance is achieved by
integrating predictive questions into the
context of a Q/A dialogue.
1 Introduction
In this paper, we propose a new architecture for
interactive question-answering based on predictive
questioning. We present experimental results from
a currently-implemented interactive Q/A system,
named FERRET, that demonstrates that surprising
performance is achieved by integrating sources of
topic information into the context of a Q/A dialogue.
In interactive Q/A, professional users engage in
extended dialogues with automatic Q/A systems in
order to obtain information relevant to a complex
scenario. Unlike Q/A in isolation, where the per-
formance of a system is evaluated in terms of how
well answers returned by a system meet the specific
information requirements of a single question, the
performance of interactive Q/A systems has traditionally
been evaluated by analyzing aspects of the
dialogue as a whole. Q/A dialogues have been evalu-
ated in terms of (1) efficiency, defined as the number
of questions that the user must pose to find particular
information, (2) effectiveness, defined by the relevance
of the answers returned, and (3) user satisfaction.
In order to maximize performance in these three
areas, interactive Q/A systems need a predictive di-
alogue architecture that enables them to propose re-
lated questions about the relevant information that
could be returned to a user, given a domain of inter-
est. We argue that interactive Q/A systems depend
on three factors: (1) the effective representation of
the topic of a dialogue, (2) the dynamic recognition
of the structure of the dialogue, and (3) the ability to
return relevant answers to a particular question.
In this paper, we describe results from experi-
ments we conducted with our own interactive Q/A
system, FERRET, under the auspices of the ARDA
AQUAINT1 program, involving 8 different dialogue
scenarios and more than 30 users. The results pre-
sented here illustrate the role of predictive question-
ing in enhancing the performance of Q/A interac-
tions.
In the remainder of this paper, we describe a new
architecture for interactive Q/A. Section 2 presents
the functionality of several of FERRET’s modules
and describes the NLP techniques it relies upon. In
Section 3, we present one of the dialogue scenar-
ios and the topic representations we have employed.
Section 4 highlights the management of the inter-
action between the user and FERRET, while Sec-
tion 5 presents the results of evaluating our proposed
model, and Section 6 summarizes the conclusions.
1AQUAINT is an acronym for Advanced QUestion Answering for INTelligence.
[Figure 1: architecture diagram. Labeled components: Dialogue Shell, Dialogue Management, Context Management, Question Decomposition, Answer Fusion, Online Question Answering, Off-line Question Answering, Document Collection, Predictive Dialogue, Topic Representation, Information Extraction, Question Similarity, Predictive Dialogue Network (PDN), Question-Answer Database (QUAB).]
Figure 1: FERRET - A Predictive Interactive Question-Answering Architecture.
2 Interactive Question-Answering
We have found that the quality of interactions pro-
duced by an interactive Q/A system can be greatly
enhanced by predicting the range of questions that
a user might ask in the context of a given topic.
If a large database of topic-relevant questions were
available for a wide variety of topics, the accuracy
of a state-of-the-art Q/A system such as (Harabagiu
et al., 2003) could be enhanced.
In FERRET, our interactive Q/A system, we store
such “predicted” pairs of questions and answers in a
database known as the Question Answer Database
(or QUAB). FERRET uses this large set of topic-
relevant question-and-answer pairs to improve the
interaction with the user by suggesting new ques-
tions. For example, when a user asks a question
like (Q1) (as illustrated in Table 1), FERRET returns
an answer to the question (A1) and proposes (Q2),
(Q3), and (Q4) as suggestions of possible continua-
tions of the dialogue. Users then choose how to con-
tinue the interaction by either (1) ignoring the sug-
gestions made by the system and proposing a differ-
ent question, or by (2) selecting one of the proposed
questions and examining its answer.
Figure 1 illustrates the architecture of FERRET.
The interactions are managed by a dialogue shell,
which processes questions by transforming them
into their corresponding predicate-argument struc-
tures2.
2We have employed the same representation of predicate-argument structures as those encoded in PropBank. We use a semantic parser (described in (Surdeanu et al., 2003)) that recognizes predicate-argument structures.
(Q1) What weapons are included in Egypt’s stockpiles?
(A1) The Israelis point to comments made by former President Anwar Sadat,
who in 1970 stated that Egypt has biological weapons stored in
refrigerators ready to use against Israel if need be. The program might
include “plague, botulism toxin, encephalitis virus, anthrax,
Rift Valley fever and mycotoxicosis.”
(Q2) Where did Egypt inherit its first stockpiles of chemical weapons?
(Q3) Is there evidence that Egypt has dismantled its stockpiles of weapons?
(Q4) Where are Egypt’s weapons stockpiles located?
(Q5) Who oversees Egypt’s weapons stockpiles?
Table 1: User question and proposed questions from QUABs
The data collection used in our experiments was
made available by the Center for Non-Proliferation
Studies (CNS)3.
Modules from FERRET’s dialogue shell interact
with modules from the predictive dialogue block.
Central to the predictive dialogue is the topic repre-
sentation for each scenario, which enables the pop-
ulation of a Predictive Dialogue Network (PDN).
The PDN consists of a large set of questions that
were asked or predicted for each topic. It is a net-
work because questions are related by “similarity”
links, which are computed by the Question Simi-
larity module. The topic representation enables an
Information Extraction module based on (Surdeanu
and Harabagiu, 2002) to find topic-relevant infor-
mation in the document collection and to use it as
answers for the QUABs. The questions associated
with each predicted answer are generated from pat-
terns that are related to the extraction patterns used
for identifying topic-relevant information. The quality
of the dialogue between the user and FERRET depends
on the quality of the topic representations and
the coverage of the QUABs.
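As a rough illustration of the data structure described above, the following Python sketch stores QUAB question-answer pairs as nodes and connects them with weighted similarity links, so that the most similar questions can be proposed after a user question. It is only a sketch under stated assumptions: the class and method names, the placeholder answers, and the similarity values are hypothetical and are not taken from FERRET.

```python
from dataclasses import dataclass, field


@dataclass
class QuabEntry:
    """One predicted question together with its stored answer."""
    question: str
    answer: str


@dataclass
class PredictiveDialogueNetwork:
    """Questions for one topic, connected by weighted similarity links."""
    entries: dict = field(default_factory=dict)  # question id -> QuabEntry
    links: dict = field(default_factory=dict)    # question id -> {other id: similarity}

    def add(self, qid, question, answer):
        self.entries[qid] = QuabEntry(question, answer)
        self.links.setdefault(qid, {})

    def link(self, a, b, similarity):
        # Similarity links are stored in both directions.
        self.links[a][b] = similarity
        self.links[b][a] = similarity

    def suggest(self, qid, k=3):
        """Return the k linked questions with the highest similarity to qid."""
        ranked = sorted(self.links[qid].items(), key=lambda kv: kv[1], reverse=True)
        return [self.entries[other].question for other, _ in ranked[:k]]


# Questions from Table 1; answers and similarity values are invented placeholders.
pdn = PredictiveDialogueNetwork()
pdn.add("Q1", "What weapons are included in Egypt's stockpiles?", "(answer passage)")
pdn.add("Q2", "Where did Egypt inherit its first stockpiles of chemical weapons?", "(answer passage)")
pdn.add("Q3", "Is there evidence that Egypt has dismantled its stockpiles of weapons?", "(answer passage)")
pdn.add("Q4", "Where are Egypt's weapons stockpiles located?", "(answer passage)")
pdn.link("Q1", "Q2", 0.8)
pdn.link("Q1", "Q3", 0.6)
pdn.link("Q1", "Q4", 0.7)
suggestions = pdn.suggest("Q1", k=2)
```

In a real system the link weights would come from the Question Similarity module rather than being set by hand.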
3The Center for Non-Proliferation Studies at the Monterey
Institute of International Studies distributes collections of print
and online documents on weapons of mass destruction. More
information at: http://cns.miis.edu.
GENERAL BACKGROUND
Serving as a background to the scenarios, the following list contains subject areas that may be relevant
to the scenarios under examination, and it is provided to assist the analyst in generating questions.
 1) Country Profile
 2) Government: Type of, Leadership, Relations
 3) Military Operations: Army, Navy, Air Force, Leaders, Capabilities, Intentions
 4) Allies/Partners: Coalition Forces
 5) Weapons: Chemical, Biological, Materials, Stockpiles, Facilities, Access, Research Efforts, Scientists
 6) Citizens: Population, Growth Rate, Education
 7) Industrial: Major Industries, Exports, Power Sources
 8) Economics: Gross Domestic Product, Growth Rate, Imports
 9) Threat Perception: Border and Surrounding States, International, Terrorist Groups
10) Behaviour: Threats, Invasions, Sponsorship and Harboring of Bad Actors
11) Transportation Infrastructure: Kilometers of Road, Rail, Air Runways, Harbors and Ports, Rivers
12) Beliefs: Ideology, Goals, Intentions
13) Leadership:
14) Behaviour: Threats to use WMDs, Actual Usage, Sophistication of Attack, Anecdotal or Simultaneous
15) Weapons: Chemical, Biological, Materials, Stockpiles, Facilities, Access
SCENARIO: Assessment of Egypt’s Biological Weapons
As terrorist Activity in Egypt increases, the Commander
of the United States Army believes a better understanding
of Egypt’s Military capabilities is needed. Egypt’s
biological weapons database needs to be updated to
correspond with the Commander’s request. Focus your
investigation on Egypt’s access to old technology,
assistance received from the Soviet Union for development
of their pharmaceutical infrastructure, production of
toxins and BW agents, stockpiles, exportation of these
materials and development technology to Middle Eastern
countries, and the effect that this information will have on
the United States and Coalition Forces in the Middle East.
Please incorporate any other related information to
your report.
Figure 2: Example of a Dialogue Scenario.
3 Modeling the Dialogue Topic
Our experiments in interactive Q/A were based on
several scenarios that were presented to us as part
of the ARDA Metrics Challenge Dialogue Workshop.
Figure 2 illustrates one of these scenarios. Note
that the general background consists
of a list of subject areas, whereas the scenario is a
narration in which several sub-topics are identified
(e.g. production of toxins or exportation of materi-
als). The creation of scenarios for interactive Q/A
requires several different types of domain-specific
knowledge and a level of operational expertise not
available to most system developers. In addition to
identifying a particular domain of interest, scenar-
ios must specify the set of relevant actors, outcomes,
and related topics that are expected to operate within
the domain of interest, the salient associations that
may exist between entities and events in the sce-
nario, and the specific timeframe and location that
bound the scenario in space and time. In addition,
real-world scenarios need to identify certain operational
parameters, such as the identity of
the scenario’s sponsor (i.e. the organization spon-
soring the research) and audience (i.e. the organiza-
tion receiving the information), as well as a series of
evidence conditions which specify how much verifi-
cation information must be subject to before it can
be accepted as fact. We assume the set of sub-topics
mentioned in the general background and the sce-
nario can be used together to define a topic structure
that will govern future interactions with the Q/A sys-
tem. In order to model this structure, the topic rep-
resentation that we create considers separate topic
signatures for each sub-topic.
The notion of topic signatures was first introduced
in (Lin and Hovy, 2000). For each subtopic in a sce-
nario, given (a) documents relevant to the sub-topic
and (b) documents not relevant to the subtopic, a sta-
tistical method based on the likelihood ratio is used
to discover a weighted list of the most topic-specific
concepts, known as the topic signature. Later work
by (Harabagiu, 2004) demonstrated that topic sig-
natures can be further enhanced by discovering the
most relevant relations that exist between pairs of
concepts. However, both of these types of topic rep-
resentations are limited by the fact that they require
the identification of topic-relevant documents prior
to the discovery of the topic signatures. In our ex-
periments, we were only presented with a set of doc-
uments relevant to a particular scenario; no further
relevance information was provided for individual
subject areas or sub-topics.
In order to solve the problem of finding relevant
documents for each subtopic, we considered four
different approaches:
• Approach 1: All documents in the CNS collection
were initially clustered using K-Nearest
Neighbor (KNN) clustering (Dudani, 1976).
Each cluster that contained at least one key-
word that described the sub-topic was deemed
relevant to the topic.
• Approach 2: Since individual documents may
contain discourse segments pertaining to different
sub-topics, we first used TextTiling (Hearst,
1994) to automatically segment all of the documents
in the CNS collection into individual
text tiles. These individual discourse segments
then served as input to the KNN clustering algorithm
described in Approach 1.
• Approach 3: In this approach, relevant documents
were discovered simultaneously with the
discovery of topic signatures. First, we associated
a binary seed relation $r_i$ with each
sub-topic $S_i$. (Seed relations were created both
by hand and using the method presented in
(Harabagiu, 2004).) Since seed relations are by
definition relevant to a particular subtopic, they
can be used to determine a binary partition of
the document collection $C$ into (1) a relevant
set of documents $R_i$ (that is, the documents
relevant to relation $r_i$) and (2) a set of non-relevant
documents $C - R_i$. Inspired by the method
presented in (Yangarber et al., 2000), a topic
signature (as calculated by (Harabagiu, 2004)) is
then produced for the set of documents in $R_i$.
For each subtopic $S_i$ defined as part of the
dialogue scenario, documents relevant to a
corresponding seed relation $r_i$ are added to $R_i$ iff
the relation $r_i$ meets the density criterion (as
defined in (Yangarber et al., 2000)). If $H$
represents the set of documents where $r_i$ is
recognized, then the density criterion can be defined
as:
$\frac{|H \cap R_i|}{|H \cap C|} \gg \frac{|R_i|}{|C|}$.
Once $H$ is added to $R_i$, a new topic signature
is calculated for $R_i$. Relations
extracted from the new topic signature can
then be used to determine a new document partition
by re-iterating the discovery of the topic
signature and of the documents relevant to each
subtopic.
• Approach 4: Approach 4 implements the technique
described in Approach 3, but operates
at the level of discourse segments (or text tiles)
rather than at the level of full documents. As
with Approach 2, segments were produced using
the TextTiling algorithm.
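The density test used in Approaches 3 and 4 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, documents are represented by integer identifiers, and the "much greater than" comparison is approximated by an assumed multiplicative factor `ratio`, which the criterion itself does not specify.

```python
def meets_density_criterion(H, R, C, ratio=2.0):
    """Check |H & R| / |H & C| against |R| / |C|.

    H: documents where the relation is recognized
    R: documents currently considered relevant to the subtopic
    C: the whole document collection
    ratio: assumed factor standing in for "much greater than"
    """
    H, R, C = set(H), set(R), set(C)
    matched = H & C
    if not matched or not R or not C:
        return False
    density = len(H & R) / len(matched)
    baseline = len(R) / len(C)
    return density >= ratio * baseline


def grow_relevant_set(R, candidates, C, ratio=2.0):
    """One pass over candidate relations.

    candidates maps a relation label to the set H of documents where it
    is recognized. Whenever a relation passes the density test, its
    documents are added to the relevant set.
    """
    R = set(R)
    accepted = []
    for relation, H in candidates.items():
        if meets_density_criterion(H, R, C, ratio):
            R |= set(H)
            accepted.append(relation)
    return R, accepted


# Toy collection of 20 documents; the document sets are invented.
collection = set(range(20))
relevant = {0, 1, 2, 3}
candidates = {
    "produce-TOXIN": {0, 1, 2, 10},
    "house-ORGANIZATION": {11, 12, 13, 14},
}
new_relevant, accepted = grow_relevant_set(relevant, candidates, collection)
```

In the iterative procedure described above, the enlarged relevant set would then be used to recompute the topic signature before the next pass.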
In modeling the dialogue scenarios, we consid-
ered three types of topic-relevant relations: (1)
structural relations, which represent hypernymy
or meronymy relations between topic-relevant con-
cepts, (2) definition relations, which uncover the
characteristic properties of a concept, and (3) ex-
traction relations, which model the most relevant
events or states associated with a sub-topic. Al-
though structural relations and definition relations
are discovered reliably using patterns available from
our Q/A system (Harabagiu et al., 2003), we found
only extraction relations to be useful in determining
the set of documents relevant to a subtopic. Struc-
tural relations were available from concept ontolo-
gies implemented in the Q/A system. The definition
relations were identified by patterns used for pro-
cessing definition questions.
Extraction relations are discovered by processing
documents in order to identify three types of rela-
tions, including: (1) syntactic attachment relations
(including subject-verb, object-verb, and verb-PP
relations), (2) predicate-argument relations, and (3)
salience-based relations that can be used to encode
long-distance dependencies between topic-relevant
concepts. (Salience-based relations are discovered
using a technique first reported in (Harabagiu, 2004)
which approximates a Centering Theory-style ap-
proach (Kameyama, 1997) to the resolution of
coreference.)
Subtopic: Egypt’s production of toxins and BW agents
Topic Signature:
produce − phosphorous trichloride (TOXIN)
house − ORGANIZATION
cultivate − non−pathogenic Bacillus Subtilis (TOXIN)
produce − mycotoxins (TOXIN)
acquire − FACILITY
Subtopic: Egypt’s allies and partners
Topic Signature:
provide − COUNTRY
cultivate − COUNTRY
supply − precursors
cooperate − COUNTRY
train − PERSON
supply − know−how
Figure 3: Example of two topic signatures acquired
for the scenario illustrated in Figure 2.
We made the extraction relations associated with
each topic signature more general (a) by replacing
words with their (morphological) root form (e.g.
wounded with wound, weapons with weapon), (b)
by replacing lexemes with their subsuming category
from an ontology of 100,000 words (e.g. truck is re-
placed by VEHICLE, ARTIFACT, or OBJECT), and (c)
by replacing each name with its name class (Egypt
with COUNTRY). Figure 3 illustrates the topic sig-
natures resulting for the scenario illustrated in Fig-
ure 2.
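The three generalization steps can be illustrated with a small Python sketch. The lookup tables below are hypothetical stand-ins for a morphological analyzer, the 100,000-word ontology, and a named entity recognizer; they contain only the examples mentioned in the text, and the sketch picks a single subsuming category per lexeme.

```python
# Hypothetical lookup tables standing in for real linguistic resources.
ROOT_FORMS = {"wounded": "wound", "weapons": "weapon"}   # (a) morphological roots
CATEGORIES = {"truck": "VEHICLE"}                        # (b) subsuming category
NAME_CLASSES = {"Egypt": "COUNTRY"}                      # (c) name classes


def generalize(term):
    """Generalize one term: name class first, then root form, then category."""
    if term in NAME_CLASSES:
        return NAME_CLASSES[term]
    root = ROOT_FORMS.get(term.lower(), term.lower())
    return CATEGORIES.get(root, root)


def generalize_relation(relation):
    """Generalize both members of a (predicate, argument) relation."""
    predicate, argument = relation
    return (generalize(predicate), generalize(argument))


examples = [
    generalize_relation(("wounded", "Egypt")),
    generalize_relation(("produce", "weapons")),
    generalize_relation(("acquire", "truck")),
]
```

Terms that are not covered by any table are simply lowercased, which keeps the sketch total but would not be adequate for real text.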
Once extraction relations were obtained for a particular
set of documents, the resulting set of relations
was ranked according to a method proposed
in (Yangarber, 2003). Under this approach,
the score associated with each relation is given by:
$Score(r) = \frac{Sup(r)}{|H|} \cdot \log Sup(r)$,
where $|H|$ represents the cardinality of the set of
documents where the relation is identified, and
$Sup(r)$ represents the support associated with the
relation $r$. $Sup(r)$ is defined as the sum of the
relevance of each document in $H$:
$Sup(r) = \sum_{d \in H} Rel(d)$.
The relevance of a document that contains a
topic-significant relation can be defined as:
$Rel(d) = 1 - \prod_{r \in TS} (1 - Prec(r))$,
where $TS$ represents the topic signature
of the subtopic4. The accuracy of the relation, then,
is given by:
$Prec(r) = \frac{1}{|H|} \left( \sum_{d \in H} Rel_{S_i}(d) - \sum_{j \neq i} Rel_{S_j}(d) \right)$.
Here, $Rel_{S_i}(d)$ measures the relevance
of a subtopic $S_i$ to a particular document $d$,
while $Rel_{S_j}(d)$ measures the relevance of
$d$ to another subtopic, $S_j$.
We use a different learner for each subtopic in or-
der to train simultaneously on each iteration. (The
calculation of topic signatures continues to iterate
until there are no more relations that can be added
to the overall topic signature.) When the precision
of a relation to a subtopic $S_i$ is computed, it takes
into account the negative evidence of its relevance
to any other subtopic $S_j$ ($S_j \neq S_i$). If $Prec(r) \le 0$,
the relation is not included in the topic signature,
where relations are ranked by the score
$Score(r) = Prec(r) \cdot \log(Sup(r))$.
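The ranking of relations described above (document relevance, support, precision with negative evidence from other subtopics, and the final score) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names and dictionary layouts are hypothetical, the relevance values in the example are invented, and the negative-evidence term is read as a sum over the same documents as the positive term.

```python
import math


def relevance(doc_relations, prec):
    """Relevance of a document: 1 - product of (1 - precision) over the
    topic-signature relations found in the document."""
    product = 1.0
    for r in doc_relations:
        if r in prec:
            product *= 1.0 - prec[r]
    return 1.0 - product


def support(H, rel):
    """Support of a relation: summed relevance of the documents in H."""
    return sum(rel[d] for d in H)


def precision(subtopic, H, rel_by_subtopic):
    """Relevance of H to `subtopic` minus its relevance to every other
    subtopic, averaged over H (assumed reading of the negative evidence)."""
    own = sum(rel_by_subtopic[subtopic][d] for d in H)
    other = sum(rel_by_subtopic[s][d]
                for s in rel_by_subtopic if s != subtopic
                for d in H)
    return (own - other) / len(H)


def score(prec_value, sup_value):
    """Precision times log-support; None when the relation is discarded."""
    if prec_value <= 0 or sup_value <= 0:
        return None
    return prec_value * math.log(sup_value)


# Invented relevance values for two documents and two subtopics.
rel_by_subtopic = {
    "toxins": {"d1": 0.9, "d2": 0.5},
    "allies": {"d1": 0.1, "d2": 0.1},
}
H = ["d1", "d2"]
p = precision("toxins", H, rel_by_subtopic)
s = support(H, rel_by_subtopic["toxins"])
ranking_score = score(p, s)
```

A relation whose documents are equally relevant to a competing subtopic receives a precision near zero and is dropped, which is the effect the negative evidence is meant to have.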
Representing topics in terms of relevant concepts
and relations is important for the processing of ques-
tions asked within the context of a given topic. For
interactive Q/A, however, the ideal topic-structured
representation would be in the form of question-
answer pairs (QUABs) that model the individual
segments of the scenario. We have currently cre-
ated two sets of QUABs: a handcrafted set and
an automatically-generated set. For the manually-
created set of QUABs, 4 linguists manually gener-
ated 3210 question-answer pairs for each of the 8
dialogue scenarios considered in our experiments.
In a separate effort, we devised a process for au-
tomatically populating the QUAB for each scenario.
In order to generate question-answer pairs for each
subtopic, we first identified relevant text passages in
the document collection to serve as “answers” and
then generated individual questions that could be answered
by each answer passage.
4Initially, $TS$ contains only the seed relation. Additional
relations can be added with each iteration.
• Answer Identification: We defined an answer
passage as a contiguous sequence of sentences
with a positive answer rank and a passage price
of $\le 4$. To select answer passages for each
sub-topic $S_i$, we calculate an answer rank,
$rank(a) = \sum_{r_i} Score(r_i)$, that sums across the scores of each
relation from the topic signature that is identified in
the same text window. Initially, the text window
is set to one sentence. (If the sentence is part of a
quote, however, the text window is immediately
expanded to encompass the entire sentence that
contains the quote.) Each passage with $rank(a) > 0$ is
then considered to be a candidate answer passage.
The text window of each candidate answer passage
is then expanded to include the following sentence.
If the answer rank does not increase with the
addition of the succeeding sentence, then the price ($p$) of
the candidate answer passage is incremented by 1,
otherwise it is decremented by 1. The text window
of each candidate answer passage continues to
expand until $p = 4$. Before the ranked list of candidate
answers can be considered by the Question Generation
module, answer passages with a positive price $p$
are stripped of the last $p$ sentences.
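The window expansion procedure can be sketched in Python as follows. The sketch makes several assumptions that are not part of the description above: each sentence is given as a set of already-extracted relation labels, the rank of a window sums the scores of the distinct relations it contains, the stopping price is passed as the parameter `max_price`, expansion also stops at the end of the document, and the special handling of quotes is omitted. All names and example values are hypothetical.

```python
def answer_rank(window, scores):
    """Sum the scores of the distinct signature relations found in the window."""
    found = set().union(*window)
    return sum(scores.get(r, 0.0) for r in found)


def expand_passage(sentences, start, scores, max_price=4):
    """Grow a passage from sentences[start]; return (start, end) with end
    exclusive, or None if the first sentence has no positive rank."""
    end = start + 1
    rank = answer_rank(sentences[start:end], scores)
    if rank <= 0:
        return None
    price = 0
    while price < max_price and end < len(sentences):
        new_rank = answer_rank(sentences[start:end + 1], scores)
        # A sentence that adds nothing raises the price; a useful one lowers it.
        price += -1 if new_rank > rank else 1
        rank = new_rank
        end += 1
    if price > 0:
        # Strip the last `price` sentences, keeping at least the first one.
        end = max(start + 1, end - price)
    return start, end


# Invented example: two informative sentences followed by uninformative ones.
scores = {"produce-TOXIN": 1.0, "acquire-FACILITY": 0.5}
sentences = [
    {"produce-TOXIN"}, {"acquire-FACILITY"},
    set(), set(), set(), set(), set(), {"unrelated"},
]
passage = expand_passage(sentences, 0, scores)  # (0, 3)
```

Because the price is decremented when a sentence raises the rank, one uninformative sentence survives the final stripping in this example; that follows directly from the rule as stated.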
ANSWER
In the early 1970s, Egyptian President Anwar Sadat
validates that Egypt has a BW stockpile.
PROCESSING
4 entities
E1: "in the early 1970s"; Category: TIME
E2: "Egyptian President Anwar Sadat"; Category: PERSON
E3: "Egypt"; Category: COUNTRY
E4: "BW stockpile"; Category: UNKNOWN
2 predicates: P1="validate"; P2="has"
Predicate−Argument Structures
P1: validate
  arguments: A0 = E2: Answer Type: Definition
             A1 = P2: have
                  arguments: A0 = E3
                             A1 = E4
             ArgM−TMP: E1: Answer Type: Time
Reference 1 (definitional): Egyptian President X
Reference 2 (metonymic)
Reference 3 (part−whole): E5: BW program
Reference 4 (relational): P3: admit
QUESTIONS
Definition Pattern: Who is X?
Q1: Who is Anwar Sadat?
Pattern: When did E3 P1 to P2 E4?
Q2: When did Egypt validate to having BW stockpiles?
Pattern: When did E3 P3 to P2 E4?
Q3: When did Egypt admit to having BW stockpiles?
Pattern: When did E3 P3 to P2 E5?
Q4: When did Egypt admit to having a BW program?
Figure 4: Associating Questions with Answers.
• Question Generation: In order to automatically
generate questions from answer passages, we
considered the following two problems:
• Problem 1: Every word in an answer passage
can refer to an entity, a relation, or an event. In
order for question generation to be successful, we
must determine whether a particular reference
is “interesting” enough to the scenario that
it deserves to be mentioned in a topic-relevant
question. For example, Figure 4 illustrates an
answer that includes two predicates and four
entities. In this case, four types of reference are
used to associate these linguistic objects with
other related objects: (a) definitional reference,
used to link entity (E2) “Anwar Sadat” to a
corresponding attribute “Egyptian President”, (b)
metonymic reference, since (E2) can be coerced
into (E3), (c) part-whole reference, since “BW
stockpiles” (E4) necessarily imply the existence
of a “BW program” (E5), and (d) relational
reference, since validating is subsumed as part
of the meaning of declaring (as determined by
WordNet glosses), while admitting can be
defined in terms of declaring, as in declaring [to
be true].
ANSWER
Egyptian Deputy Minister Mahmud Salim states that Egypt’s
enemies would never use BW because they are aware that the
Egyptians have "adequate means of retaliating without delay".
PROCESSING
Predicates: P’1 = state; P’2 = never use; P’3 = be aware
P’4 = have; P"4 = "the possession"
Causality:
P"4 = "the possession" = nominalization(P’4) = EFFECT(P’2(BW))
P’2(BW) = NON−NEGATIVE RESULT(P5); P’5 = "obstacle"
Reference: P’1 -> P’6 = view (specialization)
QUESTIONS
Pattern: Does Egypt P’6 P"4(BW) as a P’5?
Does Egypt view the possession of BW as an obstacle?
Does Egypt view the possession of BW as a deterrent?
Figure 5: Questions for Implied Causal Relations.
• Problem 2: We have found that the identification
of the association between a candidate answer
and a question depends on (a) the recognition
of predicates and entities based on both the
output of a named entity recognizer and a
semantic parser (Surdeanu et al., 2003) and their
structuring into predicate-argument frames, (b)
the resolution of reference (addressed in Problem 1),
and (c) the recognition of implicit relations
between predications stated in the answer.
Some of these implicit relations are referential,
as is the relation between predicates P1 and P3
illustrated in Figure 4. Causal relations are a
special case of implicit relations. Figure 5
illustrates an answer where a causal relation
exists and is marked by the cue phrase
because. Predicates, like those in Figure 5,
can be phrasal (like P’3) or negative (like P’2).
Causality is established between predicates P’2
and P’4, as they are the ones that ultimately
determine the selection of the answer. The predicate
P’4 can be substituted by its nominalization P"4;
since Arg1 of P’2 is BW, the same argument is
transferred to P"4. The causality implied by the
answer from Figure 5 has two components: (1)
the effect (i.e. the predicate P"4) and (2) the
result, which eliminates the semantic effect of the
negative polarity item never by implying the
predicate P’5, obstacle. The questions that are
generated are based on question patterns associated
with causal relations and therefore allow
different degrees for the specificity of the
resultative, i.e. obstacle or deterrent.
We generated several questions for each answer
passage. Questions were generated based on pat-
terns that were acquired to model interrogations
using relations between predicates and their argu-
ments. Such interrogations are based on (1) as-
sociations between the answer type (e.g. DATE)
and the question stem (e.g. “when”) and (2) the
relation between predicates, question stem and the
words that determine the answer type (Narayanan
and Harabagiu, 2004). In order to obtain these
predicate-argument patterns, we used 30% (approxi-
mately 1500 questions) of the handcrafted question-
answer pairs, selected at random from each of the 8
dialogue scenarios. As Figures 4 and 5 illustrate, we
used patterns based on (a) embedded predicates and
(b) causal or counterfactual predicates.
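To illustrate how a question pattern such as those shown in Figure 4 could be turned into a question, the following Python sketch fills the entity and predicate slots of a pattern with surface strings. It is a simplified illustration only: the function name is hypothetical, and the surface forms bound to each slot (for example, "having" for the predicate P2) are supplied by hand rather than produced by a morphological generator.

```python
import re


def instantiate(pattern, bindings):
    """Replace slot names such as E3 or P1 in a question pattern with the
    surface strings given in `bindings`."""
    def fill(match):
        return bindings[match.group(0)]
    # A slot is the letter E or P followed by digits, as a whole word.
    return re.sub(r"\b[EP]\d+\b", fill, pattern)


# Slot values taken from Figure 4; the inflected forms are assumptions.
bindings = {
    "E3": "Egypt",
    "E4": "BW stockpiles",
    "E5": "a BW program",
    "P1": "validate",
    "P2": "having",
    "P3": "admit",
}
q3 = instantiate("When did E3 P3 to P2 E4?", bindings)
q4 = instantiate("When did E3 P3 to P2 E5?", bindings)
```

A slot that has no binding raises a `KeyError`, which makes incomplete analyses visible instead of silently producing a malformed question.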
4 Managing Interactive Q/A Dialogues
As illustrated in Figure 1, dialogue management in
FERRET is based on the notion of prediction:
the system proposes to the user a small set of questions
that tackle the same subject as her question (as
illustrated in Table 1). The advantage is that the user can
follow up with one of the pre-processed questions,
which has a correct answer and resides in one of the
QUABs. This enhances the effectiveness of the
dialogue. It may also affect efficiency, i.e. the
number of questions being asked, if the QUABs have
good coverage of the subject areas of the scenario.
Moreover, complex questions, which generally are not
processed with high accuracy by current
state-of-the-art Q/A systems, are associated with predictive
questions that represent decompositions based on
similarities between predicates and arguments of the
original question and the predicted questions.
The selection of the questions from the QUABs
that are proposed for each user question is based on
a similarity metric that ranks the QUAB questions.
To compute the similarity metric, we have experi-
mented with seven different metrics. The first four
metrics were introduced in (Lytinen and Tomuro,
2002).
• Similarity Metric 1 is based on two processing
steps:
(a) the content words of the questions are
weighted using the tf·idf measure used in
Information Retrieval:
$w_i = w(t_i) = (1 + \log(tf_i)) \log \frac{N}{df_i}$,
where $N$ is the number of
questions in the QUAB, $df_i$ is the number
of questions containing $t_i$ and $tf_i$ is
the number of times $t_i$ appears in the question.
This allows the user question and any
QUAB question to be transformed into two
vectors, $v_u = \langle w_{u1}, w_{u2}, \ldots, w_{un} \rangle$ and
$v_q = \langle w_{q1}, w_{q2}, \ldots, w_{qm} \rangle$;
(b) the term vector similarity is used to compute
the similarity between the user question and
any question from the QUAB:
$\cos(v_u, v_q) = \frac{\sum_i w_{ui} w_{qi}}{(\sum_i w_{ui}^2)^{1/2} \times (\sum_i w_{qi}^2)^{1/2}}$
• Similarity Metric 2 is based on the percentage of
user question terms that appear in the QUAB
question. It is obtained by finding the intersection
of the terms in the term vectors of the two
questions.
• Similarity Metric 3 is based on semantic
information available from WordNet. It involves:
(a) finding the minimum path between WordNet
concepts. Given two terms $t_1$ and $t_2$,
with $n$ and $m$ WordNet senses
$S_1 = \{s_1, \ldots, s_n\}$ and $S_2 = \{s_1, \ldots, s_m\}$
respectively, the semantic distance between the
terms, $\delta(t_1, t_2)$, is defined by the minimum
of all the possible pair-wise semantic distances
between $S_1$ and $S_2$:
$\delta(t_1, t_2) = \min_{s_i \in S_1, s_j \in S_2} D(s_i, s_j)$,
where $D(s_i, s_j)$ is the path length between
$s_i$ and $s_j$.
(b) defining the semantic similarity between the user
question $T_u = \langle u_1, u_2, \ldots, u_n \rangle$ and the QUAB
question $T_q = \langle q_1, q_2, \ldots, q_m \rangle$ as
$sem(T_u, T_q) = \frac{I(T_u, T_q) + I(T_q, T_u)}{|T_u| + |T_q|}$, where
$I(T_x, T_y) = \sum_{x \in T_x} \frac{1}{1 + \min_{y \in T_y} \delta(x, y)}$
• Similarity Metric 4 is based on the question
type similarity. Instead of using the question
class, determined by its stem, whenever we
could recognize the answer type expected by
the question, we used it for matching. As a
back-off only, we used a question type similarity
based on a matrix akin to the one reported in
(Lytinen and Tomuro, 2002).
• Similarity Metric 5 is based on question concepts
rather than question terms. In order to
translate question terms into concepts, we
replaced (a) question stems (i.e. a WH-word +
NP construction) with expected answer types
(taken from the answer type hierarchy employed
by FERRET’s Q/A system) and (b)
named entities with their corresponding
classes. Remaining nouns and verbs
were also replaced with their WordNet semantic
classes. Each concept was then associated
with a weight: concepts derived from
named entity classes were weighted more heavily
than concepts from answer types, which were
in turn weighted more heavily than concepts taken
from WordNet classes. Similarity was then computed
across “matching” concepts.5 The resultant
similarity score was based on three variables:
$C$ = sum of the weights of all concepts matched
between a user query ($Q_u$) and a QUAB query
($Q_q$);
$U$ = sum of the weights of all unmatched concepts
in $Q_u$;
$V$ = sum of the weights of all unmatched concepts
in $Q_q$;
The similarity between $Q_u$ and $Q_q$ was calculated
as $C - (p_u \times U) - (p_q \times V)$, where
$p_u$ and $p_q$ were used as coefficients to penalize the
contribution of unmatched concepts in $Q_u$ and $Q_q$
respectively.6
5 In the case of ambiguous nouns and verbs associated with
multiple WordNet classes, all possible classes for a term were
considered in matching.
6 We set $p_u$ = 0.4 and $p_d$ = 0.1 in our experiments.

Q1: Does Iran have an indigenous CW program?
Answer (A1): Although Iran is making a concerted effort to attain an
  independent production capability for all aspects of chemical weapons
  program, it remains dependent on foreign sources for chemical
  warfare-related technologies.
QUABs:
  (1a) How did Iran start its CW program?
  (1b) Has the plant at Qazvin been linked to CW production?
  (1c) What CW does Iran produce?
Q2: Where are Iran’s CW facilities located?
Answer (A2): According to several sources, Iran’s primary suspected
  chemical weapons production facility is located in the city of Damghan.
QUABs:
  (2a) What factories in Iran could produce CW?
  (2b) Where are Iran’s stockpiles of CW?
  (2c) Where has Iran bought equipment to produce CW?
Q3: What is Iran’s goal for its CW program?
Answer (A3): In their pursuit of regional hegemony, Iran and Iraq probably
  regard CW weapons and missiles as necessary to support their political
  and military objectives. Possession of chemical weapons would likely
  lead to increased intimidation of their Gulf neighbors, as well as
  increased willingness to confront the United States.
QUABs:
  (3a) What motivated Iran to expand its chemical weapons program?
  (3b) How do CW figure into Iran’s long-term strategic plan?
  (3c) What are Iran’s future CW plans?
Figure 6: A sample interactive Q/A dialogue.

• Similarity Metric 6 is based on the fact that the
QUAB questions are clustered according to their
mapping to a vector of important concepts in the
QUAB. The clustering was done using the K-Nearest
Neighbor (KNN) method (Dudani, 1976). Instead of
measuring the similarity between the user question
and each question in the QUAB, similarities are
computed only between the user question and the
centroid of each cluster.
• Similarity Metric 7 was derived from the results
of Similarity Metrics 5 and 6 above. In this case,
if the QUAB question ($Q_d$) that was deemed to be
most similar to a user question ($Q_u$) under
Similarity Metric 5 is contained in the cluster of
QUAB questions deemed to be most similar to $Q_u$
under Similarity Metric 6, then $Q_d$ receives a
cluster adjustment score in order to boost its
ranking within its QUAB cluster. We calculate the
cluster adjustment
score as a0a2a1a41a3 a0a14a5 a1 a38a66 a7a84a89a85a16a8 a10 a7 a0 a3a41 a8 a23 a7 a49 a51 a4 a16 a8a57a8a57a6
a7
a0
a3a41a3a2
a23
a4 a16 a8 , where a4 a16 represents the difference
in rank between the centroid of the cluster and
the previous rank of the QUAB question $Q_d$.
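Part (b) of Similarity Metric 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the similarity has the symmetric form $sem(T_u, T_q) = (I(T_u, T_q) + I(T_q, T_u)) / (|T_u| + |T_q|)$ with $I(T_x, T_y) = \sum_{x \in T_x} 1 / (1 + \min_{y \in T_y} \delta(x, y))$, and it replaces the WordNet-based term distance with a hypothetical toy function (`toy_delta`, with a hand-made `TOY_GROUPS` table) so that the example is self-contained.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for the WordNet-based term distance:
# 0 for identical terms, 1 for terms in the same hand-made group, 4 otherwise.
TOY_GROUPS = [{"factory", "plant", "facility"}]


def toy_delta(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if any(a in group and b in group for group in TOY_GROUPS):
        return 1.0
    return 4.0


def directed_score(tx: Sequence[str], ty: Sequence[str],
                   delta: Callable[[str, str], float]) -> float:
    """I(T_x, T_y): each term x contributes 1 / (1 + distance to its closest y)."""
    if not ty:
        return 0.0
    return sum(1.0 / (1.0 + min(delta(x, y) for y in ty)) for x in tx)


def sem(tu: Sequence[str], tq: Sequence[str],
        delta: Callable[[str, str], float]) -> float:
    """Symmetric similarity, normalised by the total number of terms."""
    if not tu and not tq:
        return 0.0
    total = directed_score(tu, tq, delta) + directed_score(tq, tu, delta)
    return total / (len(tu) + len(tq))


user_terms = ["iran", "factory"]
quab_terms = ["iran", "plant", "location"]
score = sem(user_terms, quab_terms, toy_delta)
print(round(score, 2))
```

Under this form, two identical term sets score 1.0, and every term with no close counterpart in the other question pulls the score toward 0.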
In the currently-implemented version of FERRET,
we used Similarity Metric 5 to automatically iden-
tify the set of 10 QUAB questions that were most
similar to a user’s question. These question-and-
answer pairs were then returned to the user – along
with answers from FERRET’s automatic Q/A system
– as potential continuations of the Q/A dialogue. We
used the remaining 6 similarity metrics described in
this section to manually assess the impact of simi-
larity on a Q/A dialogue.
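The retrieval step described above, which scores every QUAB question against the user question with Similarity Metric 5 and keeps the 10 best, might look roughly like the sketch below. Several things here are assumptions rather than details from the paper: the score is taken to be matched weight minus penalized unmatched weight, $C - p_u U - p_d V$; the numeric values in `KIND_WEIGHTS` are invented and only respect the stated ordering (named entity classes above answer types above WordNet classes); and the concept labels in the example are hypothetical. The penalties 0.4 and 0.1 are the values given in footnote 6.

```python
from typing import Dict, List, Set, Tuple

# A concept is (kind, label). The weights are hypothetical; only their
# ordering (named entity > answer type > WordNet class) comes from the text.
KIND_WEIGHTS = {"named_entity": 3.0, "answer_type": 2.0, "wordnet": 1.0}
Concept = Tuple[str, str]


def total_weight(concepts: Set[Concept]) -> float:
    return sum(KIND_WEIGHTS[kind] for kind, _ in concepts)


def concept_similarity(user: Set[Concept], quab: Set[Concept],
                       p_u: float = 0.4, p_d: float = 0.1) -> float:
    """Assumed form: C - p_u * U - p_d * V."""
    c = total_weight(user & quab)   # matched concepts
    u = total_weight(user - quab)   # unmatched in the user question
    v = total_weight(quab - user)   # unmatched in the QUAB question
    return c - p_u * u - p_d * v


def top_k(user: Set[Concept], quab_bank: Dict[str, Set[Concept]],
          k: int = 10) -> List[Tuple[str, float]]:
    """Return the k QUAB questions with the highest similarity to the user question."""
    scored = [(qid, concept_similarity(user, concepts))
              for qid, concepts in quab_bank.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


user_q = {("answer_type", "LOCATION"), ("named_entity", "COUNTRY"),
          ("wordnet", "artifact")}
bank = {
    "q1": {("answer_type", "LOCATION"), ("named_entity", "COUNTRY"),
           ("wordnet", "possession")},
    "q2": {("answer_type", "MANNER"), ("named_entity", "COUNTRY")},
}
ranking = top_k(user_q, bank, k=10)
print(ranking[0][0])
```

Because $p_u$ is larger than $p_d$, a concept left unmatched in the user question costs more than an extra concept in the QUAB question, so QUAB questions that cover everything the user asked are favored.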
5 Experiments with Interactive Q/A
Dialogues
To date, we have used FERRET to produce over 90
Q/A dialogues with human users. Figure 6 illustrates
three turns from a real dialogue with a human user
investigating Iran’s chemical weapons program. As
can be seen, coherence can be established between
the user’s questions and the system’s answers (e.g.
Q3 is related to both A1 and A3) as well as between
the QUABs and the user’s follow-up questions (e.g.
QUAB (1b) is more related to Q2 than to either Q1 or
A1). Coherence alone is not sufficient to analyze the
quality of interactions, however.
In order to better understand interactive Q/A dia-
logues, we have conducted three sets of experiments
with human users of FERRET. In these experiments,
users were allotted two hours to interact with FERRET
to gather information requested by a dialogue sce-
nario similar to the one presented in Figure 2. In
Experiment 1 (E1), 8 U.S. Navy Reserve (USNR)
intelligence analysts used FERRET to research 8 dif-
ferent scenarios related to chemical and biological
weapons. Experiment 2 and Experiment 3 consid-
ered several of the same scenarios addressed in E1:
E2 included 24 mixed teams of analysts and novice
users working with 2 scenarios, while E3 featured 4
USNR analysts working with 6 of the original 8 sce-
narios. (Details for each experiment are provided in
Table 2.) Users were also given a task to focus their
research; in E1 and E3, users prepared a short report
detailing their findings; in E2, users were given a list
of “challenge” questions to answer.
Exp  Users  QUABs?  Scenarios  Topics
E1   8      Yes     8          Egypt BW, Russia CW, South Africa CW, India CW,
                               North Korea CBW, Pakistan CW, Libya CW, Iran CW
E2   24     Yes     2          Egypt BW, Russia CW
E3   4      No      6          Egypt BW, Russia CW, North Korea CBW, Pakistan CW,
                               India CW, Libya CW, Iran CW
Table 2: Experiment details
In E1 and E2, users had access to a total of 3210
QUAB questions that had been hand-created by
developers across the 8 dialogue scenarios. (Table 3
provides totals for each scenario.) In E3, users
performed research with a version of FERRET that
included no QUABs at all.
Scenario        Handcrafted QUABs
INDIA           460
LIBYA           414
IRAN            522
NORTH KOREA     316
PAKISTAN        322
SOUTH AFRICA    454
RUSSIA          366
EGYPT           356
Testing Total   3210
Table 3: QUAB distribution over scenarios
We have evaluated FERRET by measuring effi-
ciency, effectiveness, and user satisfaction:
Efficiency FERRET’s QUAB collection enabled
users in our experiments to find more relevant
information by asking fewer questions. When
manually-created QUABs were available, users
submitted an average of 12.25 questions per session
in E1 and 6.55 in E2. When no QUABs were available
(E3), users entered an average of 44.5 questions per
session. Table 4 lists the number of QUAB
question-answer pairs selected by users and the
number of user questions entered by users during the
8 scenarios considered in E1. In E2, freed from the
task of writing a research report, users asked
significantly (p < 0.05) fewer questions and selected
fewer QUABs than they did in E1 (see Table 5).
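The overall E1 averages can be re-derived from the per-scenario rows of Table 4. The short check below is only a sanity calculation, not part of the original study; it relies on each scenario having the same number of sessions (n = 2), so that an unweighted mean over scenarios equals the mean over all 16 sessions. The names `E1_AVERAGES` and `overall_means` are introduced here for illustration.

```python
# Per-scenario session averages from Table 4:
# (QUAB pairs selected, user questions entered); each scenario has n = 2 sessions.
E1_AVERAGES = {
    "India": (21.5, 13.0),
    "Libya": (12.0, 9.0),
    "Iran": (18.5, 11.0),
    "N.Korea": (16.5, 7.5),
    "Pakistan": (29.5, 15.5),
    "S.Africa": (14.5, 6.0),
    "Russia": (13.5, 15.5),
    "Egypt": (15.0, 20.5),
}


def overall_means(rows):
    """Unweighted means over scenarios; valid because every scenario has equal n."""
    n = len(rows)
    quab = sum(q for q, _ in rows.values()) / n
    user = sum(u for _, u in rows.values()) / n
    return quab, user, quab + user


quab_mean, user_mean, total_mean = overall_means(E1_AVERAGES)
print(f"{quab_mean:.3f} {user_mean:.3f} {total_mean:.3f}")
```

The three means come out at 17.625, 12.25 and 29.875, which round to the 17.63, 12.25 and 29.88 reported in the TOTAL(E1) row.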
Country      n    QUAB (avg.)   User Q (avg.)   Total (avg.)
India        2    21.5          13.0            34.5
Libya        2    12.0          9.0             21.0
Iran         2    18.5          11.0            29.5
N.Korea      2    16.5          7.5             24.0
Pakistan     2    29.5          15.5            45.0
S.Africa     2    14.5          6.0             20.5
Russia       2    13.5          15.5            29.0
Egypt        2    15.0          20.5            35.5
TOTAL(E1)    16   17.63         12.25           29.88
Table 4: Efficiency of Dialogues in Experiment 1

Country      n    QUAB (avg.)   User Q (avg.)   Total (avg.)
Russia       24   8.2           5.5             13.7
Egypt        24   10.8          7.6             18.4
TOTAL(E2)    48   9.50          6.55            16.05
Table 5: Efficiency of Dialogues in Experiment 2

Effectiveness QUAB question-answer pairs also
improved the overall accuracy of the answers
returned by FERRET. To measure the effectiveness of
a Q/A dialogue, human annotators performed a
post-hoc analysis of how relevant the QUAB pairs
returned by FERRET were to each question entered by
a user: each QUAB pair returned was graded as
“relevant” or “irrelevant” to a user question in a
forced-choice task. Aggregate relevance scores were
used to calculate (1) the percentage of relevant
QUAB pairs returned and (2) the mean reciprocal rank
(MRR) for each user question. MRR is defined as
$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{r_i}$, where $r_i$ is the
lowest rank of any relevant answer for the $i^{th}$
user query (see footnote 7). Table 6 describes the
performance of FERRET when each of the 7 similarity
measures presented in Section 4 is used to return
QUAB pairs in response to a query.
When only answers from FERRET’s automatic Q/A
system were available to users, only 15.7% of
system responses were deemed to be relevant to a
user’s query. In contrast, when manually-generated
QUAB pairs were introduced, as many as 84% of the
system’s responses were deemed to be relevant. The
results listed in Table 6 show that the best metric
is Similarity Metric 5, which achieves the highest
top-1 relevance (68.36%) and MRR (0.753). These
results suggest that the selection of relevant
questions depends on sophisticated similarity
measures that rely on conceptual hierarchies and
semantic recognizers.
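The two effectiveness measures can be computed from the annotators' relevant/irrelevant judgments along the following lines. This is a minimal sketch: the function names and the list-of-booleans input format are choices made here, and the treatment of a query with no relevant response (it contributes 0 to the MRR sum) is an assumption, since the text does not say how such queries were scored.

```python
from typing import List


def mean_reciprocal_rank(judgments: List[List[bool]]) -> float:
    """judgments[i][j] is True if the response at rank j+1 for query i is relevant.

    A query with no relevant response contributes 0 (an assumption).
    """
    if not judgments:
        return 0.0
    total = 0.0
    for ranked in judgments:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank   # reciprocal of the best (lowest) relevant rank
                break
    return total / len(judgments)


def percent_relevant(judgments: List[List[bool]], k: int) -> float:
    """Percentage of the top-k responses, pooled over all queries, judged relevant."""
    top = [flag for ranked in judgments for flag in ranked[:k]]
    return 100.0 * sum(top) / len(top) if top else 0.0


# Three toy queries with three ranked responses each.
toy = [
    [True, False, False],    # first relevant response at rank 1
    [False, True, False],    # first relevant response at rank 2
    [False, False, False],   # no relevant response
]
print(mean_reciprocal_rank(toy), percent_relevant(toy, k=1))
```

With `k=5` and `k=1`, `percent_relevant` corresponds to the two percentage columns of Table 6.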
7 We chose MRR as our scoring metric because it reflects the
fact that a user is most likely to examine the first few answers
from any system, but that all correct answers returned by the
system have some value because users will sometimes examine
a very large list of query results.

                % of Top 5 Responses   % of Top 1 Responses
                Relevant to User Q     Relevant to User Q     MRR
Without QUAB    15.73%                 26.85%                 0.325
Similarity 1    82.61%                 60.63%                 0.703
Similarity 2    79.95%                 58.45%                 0.681
Similarity 3    79.47%                 56.04%                 0.664
Similarity 4    78.26%                 46.14%                 0.592
Similarity 5    84.06%                 68.36%                 0.753
Similarity 6    81.64%                 56.04%                 0.671
Similarity 7    84.54%                 64.01%                 0.730
Table 6: Effectiveness of dialogs

We evaluated the quality of each of the four
sets of automatically-generated QUABs in a similar
fashion. For each question submitted by a user in
E1, E2, and E3, we collected the top 5 QUAB
question-answer pairs (as determined by Similarity
Metric 5) that FERRET returned. As with the
manually-generated QUABs, the automatically-generated
pairs were submitted to human assessors who annotated
each as “relevant” or “irrelevant” to the user’s
query. Aggregate scores are presented in Table 7.
              Egypt                           Russia
              % of Top 5 Responses            % of Top 5 Responses
Approach      Rel. to User Q          MRR     Rel. to User Q          MRR
Approach 1    40.01%                  0.295   60.25%                  0.310
Approach 2    36.00%                  0.243   72.00%                  0.475
Approach 3    44.62%                  0.271   60.00%                  0.297
Approach 4    68.05%                  0.510   68.00%                  0.406
Table 7: Quality of QUABs acquired automatically
User Satisfaction Users were consistently satisfied
with their interactions with FERRET. In all three
experiments, respondents reported that FERRET
(1) gave meaningful answers, (2) provided useful
suggestions, (3) helped answer specific questions,
and (4) promoted their general understanding of the
issues considered in the scenario. Complete results
of this study are presented in Table 8 (see
footnote 8).
Factor                            E1     E2     E3
Promoted understanding            3.40   3.20   3.75
Helped with specific questions    3.70   3.60   3.25
Make good use of questions        3.40   3.55   3.0
Gave new scenario insights        3.00   3.10   2.2
Gave good collection coverage     3.75   3.70   3.75
Stimulated user thinking          3.50   3.20   2.75
Easy to use                       3.50   3.55   4.10
Expanded understanding            3.40   3.20   3.00
Gave meaningful answers           4.10   3.60   2.75
Was helpful                       4.00   3.75   3.25
Helped with new search methods    2.75   3.05   2.25
Provided novel suggestions        3.25   3.40   2.65
Is ready for work environment     2.85   2.80   3.25
Would speed up work               3.25   3.25   3.00
Overall like of system            3.75   3.60   3.75
Table 8: User Satisfaction Survey Results
8 Evaluation scale: 1 = does not describe the system,
5 = completely describes the system.

6 Conclusions
We believe that the quality of Q/A interactions
depends on the modeling of scenario topics. An ideal
model is provided by question-answer databases
(QUABs) that are created off-line and then used to
suggest potentially relevant continuations of a
discourse to a user. In this paper, we have
presented FERRET, an interactive Q/A system which
makes use of a novel Q/A architecture that integrates
QUAB question-answer pairs into the processing of
questions. Experiments with FERRET have shown that
QUABs are rapidly adopted by users as valid
suggestions and that incorporating them into Q/A can
greatly improve the overall accuracy of an
interactive Q/A dialogue.
References
S. Dudani. 1976. The distance-weighted k-nearest-neighbour
rule. IEEE Transactions on Systems, Man, and Cybernetics,
SMC-6(4):325–327.
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams,
and J. Bensley. 2003. Answer Mining by Combining Ex-
traction Techniques with Abductive Reasoning. In Proceed-
ings of the Twelfth Text Retrieval Conference (TREC 2003).
Sanda Harabagiu. 2004. Incremental Topic Representations.
In Proceedings of the 20th COLING Conference, Geneva,
Switzerland.
Marti Hearst. 1994. Multi-Paragraph Segmentation of Exposi-
tory Text. In Proceedings of the 32nd Meeting of the Associ-
ation for Computational Linguistics, pages 9–16.
Megumi Kameyama. 1997. Recognizing Referential Links: An
Information Extraction Perspective. In Workshop of Opera-
tional Factors in Practical, Robust Anaphora Resolution for
Unrestricted Texts, (ACL-97/EACL-97), pages 46–53.
Chin-Yew Lin and Eduard Hovy. 2000. The Automated Acqui-
sition of Topic Signatures for Text Summarization. In Pro-
ceedings of the 18th COLING Conference, pages 495–501.
S. Lytinen and N. Tomuro. 2002. The Use of Question Types
to Match Questions in FAQFinder. In Papers from the 2002
AAAI Spring Symposium on Mining Answers from Texts and
Knowledge Bases, pages 46–53.
Srini Narayanan and Sanda Harabagiu. 2004. Question An-
swering Based on Semantic Structures. In Proceedings of
the 20th COLING Conference, Geneva, Switzerland.
Mihai Surdeanu and Sanda M. Harabagiu. 2002. Infrastructure
for open-domain information extraction. In Conference for
Human Language Technology (HLT-2002).
Mihai Surdeanu, Sanda M. Harabagiu, John Williams, and Paul
Aarseth. 2003. Using predicate-argument structures for in-
formation extraction. In ACL, pages 8–15.
Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja
Huttunen. 2000. Automatic Acquisition of Domain Knowl-
edge for Information Extraction. In Proceedings of the 18th
COLING Conference, pages 940–946.
Roman Yangarber. 2003. Counter-Training in Discovery of
Semantic Patterns. In Proceedings of the 41st Meeting of the
Association for Computational Linguistics, pages 343–350.
