Proceedings of the Interactive Question Answering Workshop at HLT-NAACL 2006, pages 25–32,
New York City, NY, USA, June 2006. ©2006 Association for Computational Linguistics
Enhanced Interactive Question-Answering with Conditional Random Fields
Andrew Hickl and Sanda Harabagiu
Language Computer Corporation
Richardson, Texas 75080
andy@languagecomputer.com
Abstract
This paper describes a new methodology for enhancing the quality and relevance of suggestions provided to users of interactive Q/A systems. We show that by using Conditional Random Fields to combine relevance feedback gathered from users along with information derived from discourse structure and coherence, we can accurately identify irrelevant suggestions with nearly 90% F-measure.
1 Introduction
Today's interactive question-answering (Q/A) systems enable users to pose questions in the context of extended dialogues in order to obtain information relevant to complex research scenarios. When working with an interactive Q/A system, users formulate sequences of questions which they believe will return answers that will let them reach certain information goals.
Users need more than answers, however: while
they might be cognizant of many of the different
types of information that they need, few – if any –
users are capable of identifying all of the questions
that must be asked and answered for a particular scenario. In order to take full advantage of the Q/A
capabilities of current systems, users need access to
sources of domain-specific knowledge that will ex-
pose them to new concepts and ideas and will allow
them to ask better questions.
In previous work (Hickl et al., 2004; Harabagiu et al., 2005a), we have argued that interactive question-answering systems should be based on a predictive dialogue architecture which can be used to provide users with both precise answers to their questions as well as suggestions of relevant research topics that could be explored throughout the course of an interactive Q/A dialogue.
Typically, the quality of interactive Q/A dialogues has been measured in three ways: (1) efficiency, defined as the number of questions that the user must pose to find particular information, (2) effectiveness, defined by the relevance of the answers returned, and (3) user satisfaction (Scholtz and Morse, 2003).
In our experiments with an interactive Q/A system (known as FERRET), we found that performance in each of these areas improves as users are provided with suggestions that are relevant to their domain of interest. In FERRET, suggestions are made to users in the form of predictive question-answer pairs (known as QUABs) which are either generated automatically from the set of documents returned for a query (using techniques first described in (Harabagiu et al., 2005a)), or are selected from a large database of question-answer pairs created off-line (prior to a dialogue) by human annotators.
Figure 1 presents an example of ten QUABs
that were returned by FERRET in response to the
question “How are EU countries responding to the
worldwide increase of job outsourcing to India?”.
While FERRET’s QUABs are intended to provide
users with relevant information about a domain of
interest, we can see from Figure 1 that users do not
always agree on which QUAB suggestions are rel-
evant. For example, while someone unfamiliar with the notion of "job outsourcing" could benefit from
Relevant?
User 1  User 2  QUAB Question
NO   YES  QUAB1: What EU countries are outsourcing jobs to India?
YES  YES  QUAB2: What EU countries have made public statements against outsourcing jobs to India?
NO   YES  QUAB3: What is job outsourcing?
YES  YES  QUAB4: Why are EU companies outsourcing jobs to India?
NO   NO   QUAB5: What measures has the U.S. Congress taken to stem the tide of job outsourcing to India?
YES  NO   QUAB6: How could the anti-globalization movements in EU countries impact the likelihood that the EU Parliament will take steps to prevent job outsourcing to India?
YES  YES  QUAB7: Which sectors of the EU economy could be most affected by job outsourcing?
YES  YES  QUAB8: How has public opinion changed in the EU on job outsourcing issues over the past 10 years?
YES  YES  QUAB9: What statements has French President Jacques Chirac made about job outsourcing?
YES  YES  QUAB10: How has the EU been affected by anti-job outsourcing sentiments in the U.S.?

Figure 1: Examples of QUABs.
a QUAB like QUAB3: "What is job outsourcing?", we expect that a more experienced researcher would find this definition to be uninformative and potentially irrelevant to his or her particular information needs. In contrast, a complex QUAB like QUAB6: "How could the anti-globalization movements in EU countries impact the likelihood that the EU Parliament will take steps to prevent job outsourcing to India?" could provide a domain expert with relevant information, but would not provide enough background information to satisfy a novice user who might not be able to interpret this information in the appropriate context.
In this paper, we present results of a new set of experiments that seek to combine feedback gathered from users with a relevance classifier based on conditional random fields (CRF) in order to provide suggestions to users that are not only related to the topic of their interactive Q/A dialogue, but provide them with the new types of information they need to know.
Section 2 presents the functionality of several of FERRET's modules and describes the NLP techniques for processing questions as well as the framework for acquiring domain knowledge. In Section 3 we present two case studies that highlight the impact of user background. Section 4 describes a new class of user interaction models for interactive Q/A and presents details of our CRF-based classifier. Section 5 presents results from experiments which demonstrate that user modeling can enhance the quality of suggestions provided to both expert and novice users. Section 6 summarizes the conclusions.
2 The FERRET Interactive Question-Answering System
We believe that the quality of interactions produced
by an interactive Q/A system can be enhanced by
predicting the range of questions that a user might
ask while researching a particular topic. By provid-
ing suggestions from a large database of question-
answer pairs related to a user’s particular area of
interest, interactive Q/A systems can help users
gather the information they need most – without the
need for complex, mixed-initiative clarification dia-
logues.
FERRET uses a large collection of QUAB question-answer pairs in order to provide users with suggestions of new research topics that could be explored over the course of a dialogue. For example, when a user asks a question like What is the result of the European debate on outsourcing to India? (as illustrated in (Q1) in Table 1), FERRET returns a set of answers (including (A1)) and proposes the questions in (Q2), (Q3), and (Q4) as suggestions of possible continuations of the dialogue. Users then have the freedom to choose how the dialogue should be continued, either by (1) ignoring the suggestions made by the system, (2) selecting one of the proposed QUAB questions and examining its associated answer, or (3) resubmitting the text of the QUAB question to FERRET's automatic Q/A system in order to retrieve a brand-new set of answers.
(Q1) What is the result of the European debate on outsourcing to India?
(A1) Supporters of economic openness understand how outsourcing can strengthen the competitiveness of European companies, as well as benefit jobs and growth in India.
(Q2) Has the number of customer service jobs outsourced to India increased since 1990?
(Q3) How many telecom jobs were outsourced to India from EU-based companies in the last 10 years?
(Q4) Which European Union countries have experienced the most job losses due to outsourcing over the past 10 years?

Table 1: Sample Q/A Dialogue.
FERRET was designed to evaluate how databases
of topic-relevant suggestions could be used to en-
hance the overall quality of Q/A dialogues. Fig-
ure 2 illustrates the architecture of the FERRET sys-
tem. Questions submitted to FERRET are initially
processed by a dialogue shell which (1) decomposes complex questions into sets of simpler questions (using techniques first described in (Harabagiu et al., 2005a)), (2) establishes discourse-level relations between the current question and the set of questions
[Figure 2 diagrams FERRET's architecture: a Dialogue Shell (Dialogue Act Management, Question Decomposition, Previous Dialogue Context Management) and a Predictive Dialogue module (Topic Partitioning and Representation, Information Extraction, Question Similarity, Conversation Scenario, Predictive Questions, Answer Fusion, the Question-Answer Database (QUAB), and the Predictive Question Network (PQN)), connected to Online and Off-line Question Answering components over a Document Collection.]
Figure 2: FERRET – A Predictive Interactive Question-Answering Architecture.
already entered into the discourse, and (3) identifies
a set of basic dialogue acts that are used to manage
the overall course of the interaction with a user.
Output from FERRET’s dialogue shell is sent to
an automatic question-answering system which is
used to find answers to the user’s question(s). FER-
RET uses a version of LCC’s PALANTIR question-
answering system (Harabagiu et al., 2005b) in or-
der to provide answers to questions in documents.
Before being returned to users, answer passages are submitted to an answer fusion module, which filters redundant answers and combines answers with compatible information content into single coherent answers.
Questions and relational information extracted by the dialogue shell are also sent to a predictive dialogue module, which identifies the QUABs that best meet the user's expected information requirements. At the core of FERRET's predictive dialogue module is the Predictive Question Network (PQN), a large database of QUABs that were either generated off-line by human annotators or created on-line by FERRET (either during the current dialogue or during some previous dialogue).¹ In order to generate
QUABs automatically, documents identified from
FERRET’s automatic Q/A system are first submit-
ted to a Topic Representation module, which com-
putes both topic signatures (Lin and Hovy, 2000)
and enhanced topic signatures (Harabagiu, 2004) in
order to identify a set of topic-relevant passages.
Passages are then submitted to an Information Ex-
traction module, which annotates texts with a wide
¹Techniques used by human annotators for creating QUABs were first described in (Hickl et al., 2004); full details of FERRET's automatic QUAB generation components are provided in (Harabagiu et al., 2005a).
range of lexical, semantic, and syntactic informa-
tion, including (1) morphological information, (2)
named entity information from LCC's CICEROLITE
named entity recognition system, (3) semantic de-
pendencies extracted from LCC’s PropBank-style
semantic parser, and (4) syntactic parse informa-
tion. Passages are then transformed into natural language questions using a set of question formation
heuristics; the resultant QUABs are then stored in
the PQN. Since we believe that the same set of re-
lations that hold between questions in a dialogue
should also hold between pairs of individual ques-
tions taken in isolation, discourse relations are dis-
covered between each newly-generated QUAB and
the set of QUABs stored in the PQN. FERRET’s
Question Similarity module then uses the similar-
ity function described in (Harabagiu et al., 2005a) –
along with relational information stored in the PQN
– in order to identify the QUABs that represent the
most informative possible continuations of the dia-
logue. QUABs are then ranked in terms of their relevance to the user's submitted question and returned to the user.
3 Two Types of Users of Interactive Q/A Systems
In order to return answers that are responsive to
users’ information needs, interactive Q/A systems
need to be sensitive to the different questioning
strategies that users employ over the course of a di-
alogue. Since users gathering information on the
same topic can have significantly different informa-
tion needs, interactive Q/A systems need to be able
to accommodate a wide range of question types in
order to help users find the specific information that
SQ − How are European Union countries responding to the worldwide increase in job outsourcing to countries like India?
NQ1 − What countries in the European Union are outsourcing jobs to India?
NQ2 − How many jobs have been outsourced to India?
NQ3 − What industries have been most active in outsourcing jobs to India?
NQ4 − Are the companies that are outsourcing jobs to India based in EU countries?
NQ5 − What could European countries do to respond to increases in job outsourcing to India?
NQ6 − Do European Union countries view job outsourcing to countries like India as a problem?
EQ1 − Is the European Union likely to implement protectionist policies to keep EU companies from outsourcing jobs to India?
EQ2 − What impact has public opposition to globalization in EU countries had on companies to relocate EU jobs to India?
EQ3 − What economic ties has the EU maintained historically with India?
EQ4 − What economic advantages could EU countries realize by outsourcing jobs to India?
EQ5 − Will the EU adopt any of the U.S.'s or Japan's anti-outsourcing policies in the near future?
EQ6 − Could the increasing outsourcing of jobs to India ease tensions over immigration in many EU countries?
Figure 3: Expert User Interactions Versus Novice User Interactions with a Q/A System.
they are looking for.
In past experiments with users of interactive Q/A systems (Hickl et al., 2004), we have found that a user's access to sources of domain-specific knowledge significantly affects the types of questions that a user is likely to submit to a Q/A system. Users participate in information-seeking dialogues with Q/A systems in order to learn "new" things – that is, to acquire information that they do not currently possess. Users initiate a set of speech acts which allow them to maximize the amount of new information they obtain from the system while simultaneously minimizing the amount of redundant (or previously-acquired) information they encounter. Our experiments have shown that Q/A systems need to be sensitive to two kinds of users: (1) expert users, who interact with a system based on a working knowledge of the conceptual structure of a domain, and (2) novice users, who are presumed to have limited to no foreknowledge of the concepts associated with the domain. We have found that novice users who possess little or no familiarity with a domain employ markedly different questioning strategies than expert users who possess extensive knowledge of a domain: while novices focus their attention on queries that will allow them to discover basic domain concepts, experts spend their time asking questions that enable them to evaluate their hypotheses in the context of the currently available information. Experts tend to ask questions that refer to the more abstract domain concepts or the complex relations between concepts. In a similar fashion, we have discovered that users who have access to structured sources of domain-specific knowledge (e.g. knowledge bases, conceptual networks or ontologies, or mixed-initiative dialogues) can end up employing more "expert-like" questioning strategies, regardless of the amount of domain-specific knowledge they possess.
In real-world settings, the knowledge that expert
users possess enables them to formulate a set of hy-
potheses – or belief states – that correspond to each
of their perceived information needs at a given mo-
ment in the dialogue context. As can be seen in the
dialogues presented in Figure 3, expert users gener-
ally formulate questions which seek to validate these belief states in the context of a document collection. Given the global information need in SQ, it seems reasonable to presume that questions like EQ1 and EQ2 are motivated by a user's expectation that protectionist policies or public opposition to globalization could impact a European Union country's willingness to take steps to stem job outsourcing to India. Likewise, questions like EQ5 are designed to provide the user with information that can decide between two competing belief states: in this case, the user wants to know whether the European Union is more likely to model the United States or Japan in its policies towards job outsourcing. In contrast, without a pre-existing body of domain-specific knowledge to derive reasonable hypotheses from, novice users ask questions that enable them to discover the concepts (and the relations between concepts) needed to formulate new, more specific hypotheses and questions. Returning again to Figure 3, we can see that questions like NQ1 and NQ3 are designed to discover new knowledge that the user does not currently possess, while questions like NQ6 try to
establish whether or not the user's hypothesis (namely, that EU countries view job outsourcing to India as a problem) is valid and deserves further consideration.
4 User Interaction Models for Relevance Estimation
Unlike systems that utilize mixed-initiative dialogues in order to determine a user's information needs (Small and Strzalkowski, 2004), systems (like FERRET) which rely on interactions based on predictive questioning have traditionally not incorporated techniques that allow them to gather relevance feedback from users. In this section, we describe how we have used a new set of user interaction models (UIM) in conjunction with a relevance classifier based on conditional random fields (CRF) (McCallum, 2003; Sha and Pereira, 2003) in order to improve the relevance of the QUAB suggestions that FERRET returns in response to a user's query.
We believe that systems based on predictive questioning can derive feedback from users in three ways. First, systems can learn which suggestions or answers are relevant to a user's domain of interest by tracking which elements users select throughout the course of a dialogue. With FERRET, each answer or suggestion presented to a user is associated with a hyperlink that links to the original text that the answer or QUAB was derived from. While users do not always follow links associated with passages they deem to be relevant to their query, we expect that the set of selected elements is generally more likely to be relevant to the user's interests than unselected elements. Second, since interactive Q/A systems are often used to gather information for inclusion in written reports, systems can identify relevant content by tracking the text passages that users copy to other applications, such as text editors or word processors. Finally, predictive Q/A systems can gather explicit feedback from users through the graphical user interface itself. In a recent version of FERRET, we experimented with adding a "relevance checkbox" to each answer or QUAB element presented to a user; users were then asked to provide feedback to the system by selecting the checkboxes associated with answers that they deemed to be particularly relevant to the topic they were researching.
4.1 User Interaction Models
We have experimented with three models that we
have used to gather feedback from users of FERRET. The models are illustrated in Figure 4.
UIM1: Under this model, the set of QUABs that users copied from were selected as relevant; all QUABs not copied from were annotated as irrelevant.
UIM2: Under this model, QUABs that users viewed were considered to be relevant; QUABs that remained unviewed were annotated as irrelevant.
UIM3: Under this model, QUABs that were either viewed or copied from were marked as relevant; all other QUABs were annotated as irrelevant.
Figure 4: User Interaction Models.
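As a sketch, the three labeling schemes above can be expressed as a single function over interaction logs. The log structure assumed here (sets of clicked and copied QUAB identifiers) is a hypothetical simplification of FERRET's actual instrumentation.

```python
def label_quabs(quab_ids, clicked, copied, model):
    """Assign 'relevant'/'irrelevant' labels to suggested QUABs under the
    three user interaction models of Figure 4.

    UIM1: only copied-from QUABs are relevant; UIM2: only viewed (clicked)
    QUABs are relevant; UIM3: QUABs viewed or copied from are relevant."""
    relevant_by_model = {
        "UIM1": set(copied),
        "UIM2": set(clicked),
        "UIM3": set(clicked) | set(copied),
    }
    relevant = relevant_by_model[model]
    return {q: ("relevant" if q in relevant else "irrelevant")
            for q in quab_ids}

# Hypothetical session: the user clicked on q1 and copied text from q2.
labels = label_quabs(["q1", "q2", "q3"],
                     clicked={"q1"}, copied={"q2"}, model="UIM3")
```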
With FERRET, users are presented with as many
as ten QUABs for every question they submit to the
system. QUABs – whether they be generated auto-
matically by FERRET’s QUAB generation module,
or selected from FERRET’s knowledge base of over
10,000 manually-generated question/answer pairs –
are presented in terms of their conceptual similarity
to the original question. Conceptual similarity (as first described in (Harabagiu et al., 2005a)) is calculated using the version of the cosine similarity formula presented in Figure 5.
Conceptual Similarity weights content terms in Q_1 and Q_2 using tfidf (w_i = w(t_i) = (1 + log(tf_i)) · log(N/df_i)), where N is the number of questions in the QUAB collection, while df_i is equal to the number of questions containing t_i and tf_i is the number of times t_i appears in Q_1 and Q_2. The questions Q_1 and Q_2 can be transformed into two vectors, v_q = ⟨w_q1, w_q2, ..., w_qm⟩ and v_u = ⟨w_u1, w_u2, ..., w_un⟩; the similarity between Q_1 and Q_2 is measured as the cosine measure between their corresponding vectors:

cos(v_q, v_u) = (Σ_i w_qi · w_ui) / ((Σ_i w_qi²)^(1/2) × (Σ_i w_ui²)^(1/2))

Figure 5: Conceptual Similarity.
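The similarity measure in Figure 5 can be sketched as follows; whitespace term sets stand in for FERRET's actual tokenization, and the toy question collection is hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(question_terms, collection):
    """tf-idf weighting as in Figure 5: w(t) = (1 + log tf) * log(N / df)."""
    N = len(collection)
    tf = Counter(question_terms)
    weights = {}
    for t, f in tf.items():
        df = sum(1 for q in collection if t in q)
        if df == 0:
            continue  # term unseen in the collection: no df statistic
        weights[t] = (1 + math.log(f)) * math.log(N / df)
    return weights

def conceptual_similarity(q1, q2, collection):
    """Cosine similarity between the tf-idf vectors of two questions."""
    w1 = tfidf_weights(q1, collection)
    w2 = tfidf_weights(q2, collection)
    dot = sum(w1[t] * w2[t] for t in set(w1) & set(w2))
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical QUAB collection, each question as a set of terms.
collection = [{"eu", "outsourcing", "india"},
              {"jobs", "india"},
              {"eu", "economy"}]
sim = conceptual_similarity(["eu", "outsourcing"], ["eu", "jobs"], collection)
```

Ranking the QUABs returned for a user question then amounts to sorting candidates by this score.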
In the three models from Figure 4, we allowed
users to perform research as they normally would.
Instead of requiring users to provide explicit forms
of feedback, features were derived from the set of
hyperlinks that users selected and the text passages that users copied to the system clipboard.
Following (Kristjansson et al., 2004) we analyzed the performance of each of these three models using a new metric derived from the number of relevant QUABs that were predicted to be returned for each model. We calculated this metric – which we refer to as the Expected Number of Irrelevant QUABs – using the formula:
p_0(n) = Σ_{k=1}^{10} k · p_0(k)   (1)

p_1(n) = (1 − p_0(0)) + Σ_{k=1}^{10} k · p_1(k)   (2)
where p_m(n) is equal to the probability of finding n irrelevant QUABs in a set of 10 suggestions returned to the user given m rounds of interaction. p_0(n) (equation 1) is equal to the probability that all QUABs are relevant initially, while p_1(n) (equation 2) is equal to the probability of finding an irrelevant QUAB after the set of QUABs has been interacted with by a user. For the purposes of this paper, we assumed that all of the QUABs initially returned by FERRET were relevant, and that p_0(0) = 1.0. This enabled us to calculate p_1(n) for each of the three models provided in Figure 4.
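The expected-count computation above can be sketched directly; the distribution over irrelevant-QUAB counts used in the example is hypothetical, not data from the paper.

```python
def expected_irrelevant_quabs(p, p0_zero=1.0):
    """Expected Number of Irrelevant QUABs after one round of interaction
    (equation 2): (1 - p0(0)) plus the expectation over counts k = 1..10.

    p maps a count k of irrelevant QUABs (out of 10 suggestions) to the
    probability of observing that count after user feedback."""
    return (1.0 - p0_zero) + sum(k * p.get(k, 0.0) for k in range(1, 11))

# Hypothetical post-feedback distribution: 3 irrelevant QUABs with
# probability 0.5, 5 with probability 0.5; expectation is 4.0.
eniq = expected_irrelevant_quabs({3: 0.5, 5: 0.5})
```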
4.2 Relevance Estimation using Conditional Random Fields
Following work done by (Kristjansson et al., 2004),
we used the feedback gathered in Section 4.1 to es-
timate the probability that a QUAB selected from
FERRET's PQN is, in fact, relevant to a user's orig-
inal query. We assume that humans gauge the rel-
evance of QUAB suggestions returned by the system by evaluating the informativeness of the QUAB
with regards to the set of queries and suggestions
that have occurred previously in the discourse. A
QUAB, then, is deemed relevant when it conveys
content that is sufficiently informative to the user,
given what the user knows (i.e. the user’s level of
expertise) and what the user expects to receive as
answers from the system.
Our approach treats a QUAB suggestion as a single node in a sequence of questions ⟨Q_{n−1}, Q_n, QUAB⟩ and classifies the QUAB as relevant or irrelevant based on features from the entire sequence.
We have performed relevance estimation using Conditional Random Fields (CRF). Given a random variable x (corresponding to data points {x_1, ..., x_n}) and another random variable y (corresponding to a set of labels {y_1, ..., y_n}), CRFs can be used to calculate the conditional probability p(y|x). Given a sequence {x_1, ..., x_n} and set of labels {y_1, ..., y_n}, p(y|x) can be defined as:
p(y|x) = (1/z_0) exp( Σ_{n=1}^{N} Σ_k λ_k f_k(y_{n−1}, y_n, x, n) )   (3)
where z_0 is a normalization factor and λ_k is a weight learned for each feature function f_k(y_{n−1}, y_n, x, n).
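The conditional probability in equation (3) can be illustrated with a minimal sketch that computes the partition function z_0 by brute-force enumeration; this is only practical for very short sequences (real CRF implementations use forward-backward), and the two feature functions in the example are hypothetical, not FERRET's features.

```python
import itertools
import math

def crf_prob(y, x, labels, feature_fns, weights):
    """p(y|x) for a linear-chain CRF (equation 3): the exponentiated sum of
    weighted feature functions, normalized over all label sequences."""
    def score(seq):
        s = 0.0
        for n in range(len(x)):
            prev = seq[n - 1] if n > 0 else None
            s += sum(w * f(prev, seq[n], x, n)
                     for f, w in zip(feature_fns, weights))
        return math.exp(s)
    # z_0: sum of scores over every possible labeling of the sequence.
    z0 = sum(score(seq) for seq in itertools.product(labels, repeat=len(x)))
    return score(tuple(y)) / z0

# Toy features: does the label match a noisy observation at position n,
# and do adjacent labels agree?
labels = ("REL", "IRR")
feats = [lambda prev, cur, x, n: 1.0 if cur == x[n] else 0.0,
         lambda prev, cur, x, n: 1.0 if prev == cur else 0.0]
p = crf_prob(("REL", "REL"), ("REL", "IRR"), labels, feats, [1.0, 0.5])
```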
We trained our CRF model in the following way. If we assume that Λ is a set of feature weights (λ_0, ..., λ_k), then we expect that we can use maximum likelihood to estimate values for Λ given a set of training data pairs (x, y).
Training is accomplished by maximizing the log-
likelihood of each labeled data point as in the fol-
lowingequation:
w_Λ = Σ_{i=1}^{N} log(p_Λ(y_i|x_i))   (4)
Again, following (Kristjansson et al., 2004), we
used the CRF Viterbi algorithm to find the most
likely sequence of labels for the data points using the formula:

y* = arg max_y p_Λ(y|x)   (5)
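The decoding step in equation (5) can be sketched as the standard Viterbi dynamic program; the local scoring function below is a simplified stand-in for the weighted feature sums of equation (3).

```python
def viterbi(x, labels, score):
    """Most likely label sequence y* = argmax_y p(y|x) for a linear-chain
    model, where score(prev, cur, x, n) is the log-space local score of
    labeling position n with cur given the previous label prev."""
    # best[n][y] = (best score of any prefix ending in label y, backpointer)
    best = [{y: (score(None, y, x, 0), None) for y in labels}]
    for n in range(1, len(x)):
        best.append({
            y: max(((best[n - 1][p][0] + score(p, y, x, n), p)
                    for p in labels), key=lambda t: t[0])
            for y in labels
        })
    # Trace back from the highest-scoring final label.
    y = max(labels, key=lambda l: best[-1][l][0])
    path = [y]
    for n in range(len(x) - 1, 0, -1):
        y = best[n][y][1]
        path.append(y)
    return list(reversed(path))

# Toy scorer: reward matching a noisy observation and label agreement.
score = lambda prev, cur, x, n: ((1.0 if cur == x[n] else 0.0) +
                                 (0.5 if prev == cur else 0.0))
```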
Motivated by the types of discourse relations that
appear to exist between states in an interactive Q/A
dialogue, we introduced a large number of features
to estimate relevance for each QUAB suggestion.
The features we used are presented in Figure 6.
(a) Rank of QUAB: the rank (1, ..., 10) of the QUAB in question.
(b) Similarity: similarity of QUAB, Q_n and QUAB, Q_{n−1}.
(c) Relation likelihood: equal to the likelihood of each predicate-argument structure included in the QUAB given all QUABs contained in FERRET's QUAB database; calculated for Arg-0, Arg-1, and ArgM-TMP for each predicate found in QUAB suggestions. (Predicate-argument relations were identified using a semantic parser trained on PropBank (Palmer et al., 2005) annotations.)
(d) Conditional Expected Answer Type likelihood: equal to the joint probability p(EAT_QUAB | EAT_question) calculated from a corpus of dialogues collected from human users of FERRET.
(e) Terms in common: real-valued feature equal to the number of terms in common between the QUAB and both Q_n and Q_{n−1}.
(f) Named Entities in common: same as terms in common, but calculated for named entities detected by LCC's CICEROLITE named entity recognition system.
Figure 6: Relevance Features.
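Three of these features – (a) rank, (e) terms in common, and (f) named entities in common – can be sketched directly; whitespace tokenization and pre-computed entity sets are simplified stand-ins for FERRET's lexical processing and the CICEROLITE recognizer.

```python
def quab_features(quab, q_n, q_prev, rank, quab_entities, q_entities):
    """Extract a subset of the Figure 6 relevance features for one QUAB
    suggestion, given the two most recent questions in the dialogue."""
    quab_terms = set(quab.lower().split())
    context_terms = set(q_n.lower().split()) | set(q_prev.lower().split())
    return {
        "rank": rank,                                           # feature (a)
        "terms_in_common": len(quab_terms & context_terms),     # feature (e)
        "entities_in_common": len(quab_entities & q_entities),  # feature (f)
    }
```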
In the next section, we describe how we utilized the user interaction models described in Subsection 4.1 in conjunction with the features described in this subsection in order to improve the relevance of QUAB suggestions returned to users.
5 Experimental Results
In this section, we describe results from two experiments that were conducted using data collected from human interactions with FERRET.
In order to evaluate the effectiveness of our relevance classifier, we gathered a total of 1000 questions from human dialogues with FERRET. 500 of
these came from interactions (41 dialogues) where the user was a self-described "expert" on the topic; another selection of 500 questions came from a total of 23 dialogues resulting from interactions with users who described themselves as "novice" or were otherwise unfamiliar with a topic. In order to validate the user's self-assessment, we selected 5 QUABs at random from the set of manually created QUABs assembled for each topic. Users were asked to provide written answers to those questions. Users that were judged to have correctly answered three out of five questions were considered "experts" for the purpose of our experiments. Table 2 presents the breakdown of questions across these two conditions.
User Type  Unique Topics  # Dialogues  Avg # of Qs/dialogue  Total Qs
Expert     12             41           12.20                 500
Novice     8              23           21.74                 500
Total      12             64           15.63                 1000

Table 2: Question Breakdown.
Each of these experiments was run using a version of FERRET that returned the top 10 most similar QUABs from a database that combined manually-created QUABs with the automatically-generated QUABs created for the user's question. While a total of 10,000 QUABs were returned to users during these experiments, only 3,998 of these QUABs were unique (39.98%).
We conducted two kinds of experiments with
users. In the first set of experiments, users were
asked to mark all of the relevant QUABs that FER-
RET returned in response to questions submitted by
users. After performing research on a particular
scenario, expert and novice users were then sup-
plied with as many as 65 questions (and associ-
ated QUABs) taken from previously-completed di-
alogues on the same scenario; users were then asked
to select checkboxes associated with QUABs that
were relevant. In addition, we also had 2 linguists
(who were familiar with all of the research sce-
narios but did not research any of them) perform
the same task for all of the collected questions and
QUABs. Results from these three sets of annotations are found in Table 3.
User Type  Users  # Qs  # QUABs  # rel. QUABs  % relevant  ENIQ (p_1)
Expert     6      250   2500     699           27.96%      5.88
Novice     4      250   2500     953           38.12%      3.73
Linguists  2      500   5000     2240          44.80%      3.53

Table 3: User Comparison.
As expected, experts believed QUABs to be significantly (p < 0.05) less relevant than novices, who found approximately 38.12% of QUABs to be relevant to the original question submitted by a user. In contrast, the two linguists found 44.8% of the QUABs to be relevant. This number may be artificially high: since the linguists did not engage in actual Q/A dialogues for each of the scenarios they were annotating, they may not have been appropriately prepared to make a relevance assessment.
In the second set of experiments, we used the UIMs in Figure 4 to train CRF-based relevance classifiers. We obtained training data for UIM1 ("copy-and-paste"-based), UIM2 ("click"-based), and UIM3 ("hybrid") from 16 different dialogue histories collected from 8 different novice users. During these dialogues, users were asked to perform research as they normally would; no special instructions were given to users to provide additional relevance feedback to the system. After the dialogues were completed, QUABs that were copied from or clicked were annotated as "relevant" examples (according to each UIM); the remaining QUABs were annotated as "irrelevant". Once features (as described in Figure 6) were extracted and the classifiers were trained, they were evaluated on a set of 1000 QUABs (500 "relevant", 500 "irrelevant") selected at random from the annotations performed in the first experiment. Table 4 presents results from these classifiers.
UIM1        P       R       F(β = 1)
Irrelevant  0.9523  0.9448  0.9485
Relevant    0.3137  0.3478  0.3299

UIM2        P       R       F(β = 1)
Irrelevant  0.8520  0.8442  0.8788
Relevant    0.3214  0.4285  0.3673

UIM3        P       R       F(β = 1)
Irrelevant  0.9384  0.9114  0.9247
Relevant    0.3751  0.3961  0.3853

Table 4: Experimental Results from 3 User Models.
Our results suggest that feedback gathered from a user's "normal" interactions with FERRET could be used to provide valuable input to a relevance classifier for QUABs. When "copy-and-paste" events were used to train the classifier, the system detected instances of irrelevant QUABs with over 80% F. When the much more frequent "clicking" events were used to train the classifier, irrelevant QUABs were detected at over 90% F for both UIM2 and UIM3. In
each of these three cases, however, detection of rel-
evant QUABs lagged behind significantly: relevant QUABs were detected at 42% F in UIM1, at nearly 33% F under UIM2, and at 39% under UIM3.
We feel that these results suggest that the detection of relevant QUABs (or the filtering of irrelevant QUABs) may be feasible, even without requiring users to provide additional forms of explicit feedback to the system. While we acknowledge that training models on these types of events may not always provide reliable sources of training data – especially as users copy or click on QUAB passages that may not be relevant to their interests in the research scenario – we believe the initial performance of these models suggests that accurate forms of relevance feedback can be gathered without the use of mixed-initiative clarification dialogues.
6 Conclusions
In this paper, we have presented a methodology that combines feedback gathered from users in conjunction with a CRF-based classifier in order to enhance the quality of suggestions returned to users of interactive Q/A systems. We have shown that irrelevant QUAB suggestions can be identified at over 90% F-measure when systems combine information from a user's interactions with semantic and pragmatic features derived from the structure and coherence of an interactive Q/A dialogue.
7 Acknowledgments
This material is based upon work funded in whole or in part by the U.S. Government and any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

References
Sanda Harabagiu, Andrew Hickl, John Lehmann, and
Dan Moldovan. 2005a. Experiments with Interac-
tive Question-Answering. In Proceedings of the 43rd
Annual Meeting of the Association for Computational
Linguistics (ACL’05).
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang. 2005b. Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference.
Sanda Harabagiu. 2004. Incremental Topic Represen-
tations. In Proceedings of the 20th COLING Confer-
ence.
Andrew Hickl, John Lehmann, John Williams, and Sanda
Harabagiu. 2004. Experiments with Interactive
Question-Answering in Complex Scenarios. In Pro-
ceedings of the Workshop on the Pragmatics of Ques-
tion Answering at HLT-NAACL 2004.
T. Kristjansson, A. Culotta, P. Viola, and A. McCallum.
2004. Interactive information extraction with con-
strained conditional random fields. In Proceedings of
AAAI-2004.
Chin-Yew Lin and Eduard Hovy. 2000. The Automated
Acquisition of Topic Signatures for Text Summariza-
tion. In Proceedings of the 18th COLING Conference.
A. McCallum. 2003. Efficiently inducing features of
conditional random fields. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03).
M. Palmer, D. Gildea, and P. Kingsbury. 2005. The
Proposition Bank: An Annotated Corpus of Semantic
Roles. In Computational Linguistics, 31(1):71–106.
Jean Scholtz and Emile Morse. 2003. Using consumer
demands to bridge the gap between software engineering and usability engineering. In Software Process: Improvement and Practice, 8(2):89–98.
F. Sha and F. Pereira. 2003. Shallow parsing with condi-
tional random fields. In Proceedings of HLT-NAACL-
2003.
Sharon Small and Tomek Strzalkowski. 2004. HITIQA:
Towards analytical question answering. In Proceed-
ings of Coling 2004.
