Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 57–64,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Towards Conversational QA: Automatic Identification of Problematic
Situations and User Intent∗
Joyce Y. Chai, Chen Zhang, Tyler Baldwin
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824
{jchai, zhangch6, baldwi96}@cse.msu.edu
Abstract
To enable conversational QA, it is impor-
tant to examine key issues addressed in
conversational systems in the context of
question answering. In conversational sys-
tems, understanding user intent is criti-
cal to the success of interaction. Recent
studies have also shown that the capabil-
ity to automatically identify problematic
situations during interaction can signifi-
cantly improve the system performance.
Therefore, this paper investigates the new
implications of user intent and problem-
atic situations in the context of question
answering. Our studies indicate that, in
basic interactive QA, there are different
types of user intent that are tied to dif-
ferent kinds of system performance (e.g.,
problematic/error free situations). Once
users are motivated to find specific infor-
mation related to their information goals,
the interaction context can provide useful
cues for the system to automatically iden-
tify problematic situations and user intent.
1 Introduction
Interactive question answering (QA) has been
identified as one of the important directions in QA
research (Burger et al., 2001). One ultimate goal is
to support intelligent conversation between a user
and a QA system to better facilitate user informa-
tion needs. However, except for a few systems that
use dialog to address complex questions (Small et
al., 2003; Harabagiu et al., 2005), the general dialog
capabilities have been lacking in most question
answering systems. To move towards conversational
QA, it is important to examine key issues
relevant to conversational systems in the context
of interactive question answering.
∗ This work was partially supported by IIS-0347548 from
the National Science Foundation.
This paper focuses on two issues related to con-
versational QA. The first issue is concerned with
user intent. In conversational systems, understand-
ing user intent is the key to the success of the inter-
action. In the context of interactive QA, one ques-
tion is what type of user intent should be captured.
Unlike most dialog systems where user intent can
be characterized by dialog acts such as question,
reply, and statement, in interactive QA, user inputs
are already in the form of questions. The
problem then becomes whether there are different
types of intent behind these questions that should
be handled differently by a QA system and how to
automatically identify them.
The second issue is concerned with problem-
atic situations during interaction. In spoken di-
alog systems, many problematic situations could
arise from insufficient speech recognition and lan-
guage understanding performance. Recent work
has shown that the capability to automatically
identify problematic situations (e.g., speech recog-
nition errors) can help control and adapt dialog
strategies to improve performance (Litman and
Pan, 2000). Similarly, QA systems also face challenges
arising from technological limitations in language
understanding and information retrieval. Thus one
question is, in the context of interactive QA, how
to characterize problematic situations and auto-
matically identify them when they occur.
In interactive QA, these two issues are inter-
twined. Questions formed by a user not only de-
pend on his/her information goals, but are also in-
fluenced by the answers from the system. Prob-
lematic situations will impact user intent in the
follow-up questions, which will further influence
system performance. Both the awareness of problematic
situations and the understanding of user intent
will allow QA systems to adopt better strategies
during interaction and move towards intelligent
conversational QA.
To address these two questions, we conducted
a user study where users interacted with a con-
trolled QA system to find information of inter-
est. These controlled studies allowed us to fo-
cus on the interaction aspect rather than informa-
tion retrieval or answer extraction aspects. Our
studies indicate that in basic interactive QA where
users always ask questions and the system always
provides some kind of answers, there are differ-
ent types of user intent that are tied to differ-
ent kinds of system performance (e.g., problem-
atic/error free situations). Once users are moti-
vated to find specific information related to their
information goals, the interaction context can pro-
vide useful cues for the system to automatically
identify problematic situations and user intent.
2 Related Work
Open domain question answering (QA) systems
are designed to automatically locate answers from
large collections of documents to users’ natural
language questions. In the past few years, au-
tomated question answering techniques have ad-
vanced tremendously, partly motivated by a se-
ries of evaluations conducted at the Text Retrieval
Conference (TREC) (Voorhees, 2001; Voorhees,
2004). To better facilitate user information needs,
recent trends in QA research have shifted towards
complex, context-based, and interactive question
answering (Voorhees, 2001; Small et al., 2003;
Harabagiu et al., 2005). For example, NIST initi-
ated a special task on context question answering
in TREC 10 (Voorhees, 2001), which later became
a regular task in TREC 2004 (Voorhees, 2004) and
2005. The motivation is that users tend to ask a
sequence of related questions rather than isolated
single questions to satisfy their information needs.
Therefore, the context QA task was designed to
investigate the system capability to track context
through a series of questions. Based on context
QA, some work has been done to identify clarifica-
tion relations between questions (Boni and Man-
andhar, 2003). However context QA is different
from interactive QA in that context questions are
specified ahead of time rather than incrementally
as in an interactive setting.
Interactive QA has been applied to process com-
plex questions. For analytical and non-factual
questions, it is hard to anticipate answers. Clari-
fication dialogues can be applied to negotiate with
users about the intent of their questions (Small et
al., 2003). Recently, an architecture for interactive
question answering has been proposed based on a
notion of predictive questioning (Harabagiu et al.,
2005). The idea is that, given a complex ques-
tion, the system can automatically identify a set of
potential follow-up questions from a large collec-
tion of question-answer pairs. The empirical re-
sults have shown the system with predictive ques-
tioning is more efficient and effective for users to
accomplish information seeking tasks in a partic-
ular domain (Harabagiu et al., 2005).
The work reported in this paper addresses a
different aspect of interactive question answering.
Both issues raised earlier (Section 1) are inspired
by earlier work on intelligent conversational sys-
tems. Automated identification of user intent has
played an important role in conversational systems.
A tremendous amount of work has focused
on this aspect (Stolcke et al., 2000). To improve
dialog performance, much effort has also been put
into techniques to automatically detect errors during
interaction. It has been shown that during human-machine
dialog, there are sufficient cues for machines
to automatically identify error conditions (Levow,
1998; Litman et al., 1999; Hirschberg et al., 2001;
Walker et al., 2002). The awareness of erroneous
situations can help systems make intelligent de-
cisions about how to best guide human partners
through the conversation and accomplish the tasks.
Motivated by these earlier studies, the goal of this
paper is to investigate whether these two issues can
be applied in question answering to facilitate intel-
ligent conversational QA.
3 User Studies
We conducted a user study to collect data concern-
ing user behavior in a basic interactive QA set-
ting. We are particularly interested in how users
respond to different system performance and its
implication in identifying problematic situations
and user intent. As a starting point, we charac-
terize system performance as either problematic,
which indicates the answer has some problem, or
error-free, which indicates the answer is correct.
In this section, we first describe the methodology
and the system used in this effort and then discuss
the observed user behavior and its relation to prob-
lematic situations and user intent.
3.1 Methodology and System
The system used in our experiments has a user in-
terface that takes a natural language question and
presents an answer passage. Currently, our inter-
face only presents to the user the top one retrieved
result. This simplification on one hand helps us
focus on the investigation of user responses to dif-
ferent system performances and on the other hand
represents a possible situation where a list of po-
tential answers may not be practical (e.g., through
PDA or telephone line).
We implemented a Wizard-of-Oz (WOZ) mech-
anism in the interaction loop to control and simu-
late problematic situations. Users were not aware
of the existence of this human wizard and were
led to believe they were interacting with a real
QA system. This controlled setting allowed us
to focus on the interaction aspect rather than in-
formation retrieval or answer extraction aspect of
question answering. More specifically, during in-
teraction after each question was issued, a ran-
dom number generator was used to decide if a
problematic situation should be introduced. If
the number indicated no, the wizard would re-
trieve a passage from a database with correct ques-
tion/answer pairs. Note that in our experiments
we used specific task scenarios (described later),
so it was possible to anticipate user information
needs and create this database. If the number indicated
that a problematic situation should be introduced,
then the Lemur retrieval engine¹ was
used on the AQUAINT collection to retrieve the
answer. Our assumption is that AQUAINT data
are not likely to provide an exact answer given our
specific scenarios, but they can provide a passage
that is most related to the question. The use of the
random number generator was to control the ratio
between the occurrence of problematic situations
and error-free situations. In our initial investiga-
tion, since we are interested in observing user be-
havior in problematic situations, we set the ratio as
50/50. In our future work, we will vary this ratio
(e.g., 70/30) to reflect the performance of state-of-
the-art factoid QA and investigate the implication
of this ratio in automated performance assessment.
¹ http://www-2.cs.cmu.edu/~lemur/
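The random routing described in Section 3.1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the study: the function names, the seed, and the string labels are hypothetical, and the two answer sources (the curated question/answer database and the Lemur retrieval over AQUAINT) are represented only by their labels.

```python
import random


def choose_answer_source(rng: random.Random, problematic_ratio: float = 0.5) -> str:
    """Decide, for one question, whether a problematic situation is introduced.

    'problematic' stands for answering via open retrieval (Lemur on AQUAINT in
    the study); 'error_free' stands for the curated question/answer database.
    Both labels are illustrative names, not identifiers from the paper.
    """
    return "problematic" if rng.random() < problematic_ratio else "error_free"


def simulate_session(num_questions: int, problematic_ratio: float = 0.5, seed: int = 0) -> list:
    """Draw one routing decision per question for a whole session."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [choose_answer_source(rng, problematic_ratio) for _ in range(num_questions)]


if __name__ == "__main__":
    # The 50/50 setting mirrors the ratio used in the initial investigation.
    print(simulate_session(10, problematic_ratio=0.5))
```

Changing `problematic_ratio` (for example to 0.3) corresponds to the varied ratios mentioned as future work.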
3.2 Experiments
Eleven users participated in our study. Each user
was asked to interact with our system to com-
plete information seeking tasks related to four
specific scenarios: the 2004 presidential debates,
Tom Cruise, Hawaii, and Pompeii. The exper-
imental scenarios were further divided into two
types: structured and unstructured. In the struc-
tured task scenarios (for topics Tom Cruise and
Pompeii), users had to fill in blanks on a diagram
pertaining to the given topic. The diagram was used
to avoid influencing how users worded the relevant
questions. Because users had to find certain information,
they were constrained in the range of questions
they could ask, but not in the way they
asked those questions. The task was completed when
all of the blanks on the diagram were filled. The
structured scenarios were designed to mimic the
real information seeking practice in which users
have real motivation to find specific information
related to their information goals. In the unstruc-
tured scenarios (for topics the 2004 presidential
debates and Hawaii), users were given a general
topic to investigate, but were not required to find
specific information. This gave the user the abil-
ity to ask a much wider range of questions than
the structured scenarios. Users were generally in
an exploration mode when performing these un-
structured tasks. They were not motivated to find
specific information and were content with any in-
formation provided by the system. In our view,
the unstructured scenarios are less representative
of the true information seeking situations.
3.3 Observations and Analysis
From our studies, a total of 44 interaction sessions
with 456 questions were collected. Figure 1 shows
an example of a fragment of interaction related to
Tom Cruise. In this example, both problematic sit-
uations applied to answers (e.g., Problematic and
Error-Free) and user intent (described later) ap-
plied to questions are annotated.
There are several observations from this data.
First, questions formed during interactive QA tend
to be self-contained and free of definite noun
phrases, pronouns, or ellipsis. Only one question
in the entire data set has a pronoun (i.e., What
are the best movies with Tom Cruise in them?).
Even in this case, the pronoun them did not refer
to any entities that occurred previously in the
QA process.

#    Question/Answer                                      Annotation
Q1   When was Tom Cruise born?
Q2   What year was Tom Cruise born?                       Re-try
Q3   What does Tom Cruise do for a living?                Continue
Q4   What is Tom Cruise's profession?                     Re-try
Q5   What are the best movies with Tom Cruise in them?    Switch
A1   … Based on the memoir of combat veteran Ron Kovic,   Problematic
     the film stars Tom Cruise as Kovic, whose gunshot
     wound in Vietnam left him paralyzed from the chest
     down. … a powerfully intimate portrait that unfolds
     on an epic scale, Born on the Fourth of July is
     arguably Stone's best film (if you can forgive its
     often strident tone), …
A2   … Thomas Cruise Mapother IV was born on the 3rd of   Error-Free
     July, 1962 (eerily similar to his film Born on the
     4th of July), in Syracuse, New York. He was the
     only boy of four children …
A3   … you get a very nice role that shows you            Problematic
     differently, not the heavy, and you're working
     with a George Clooney or a Tom Cruise or a Nicolas
     Cage or a Martin Scorsese. I can live with that …
A4   … So we may agree that Cruise’s professional         Problematic
     standing accounts for some measure of his fame …
A5   … Cruise proved his dramatic talents in the 1988     Error-Free
     drama Rain Man, where he co-starred with
     Oscar-winner Dustin Hoffman. Oliver Stone's Born
     on the Fourth of July (1989) earned him a Best
     Actor Oscar nomination for his hard-hitting
     portrayal of anti-war activist Ron Kovic …

Figure 1: An example fragment of interaction

This phenomenon could be caused by
how the answers are presented. Unlike specific an-
swer entities, the answer passages provided by our
system do not support the natural use of referring
expressions in the follow-up questions. Another
possible explanation could be that in an interac-
tive environment, users seem to be more aware of
the potential limitation of a computer system and
thus tend to specify self-contained questions in
the hope of reducing the system’s inference load.
The second observation is about user behavior
in response to different system performances (i.e.,
problematic or error-free situations). We were
hoping to see different strategies users might ap-
ply to deal with the problematic situations. How-
ever, based on the data, we found that when a prob-
lem occurred, users either rephrased their ques-
tions (i.e., the same question expressed in a dif-
ferent way) or gave up the question and went on
specifying a new question. (Here we use Rephrase
and New to denote these two kinds of behaviors.)
We have not observed any sub-dialogs initiated by
the user to clarify a previous question or answer.

                Problematic   Error-free   Total
New             Switch        Continue
  unstruct.     29            90           119
  struct.       29            133          162
  entire        58            223          281
Rephrase        Re-try        Negotiate
  unstruct.     19            4            23
  struct.       102           6            108
  entire        121           10           131
Total-unst      48            94           142
Total-st        131           139          270
Total-ent       179           233          412

Table 1: Categorization of user intent with the corresponding
number of occurrences from the unstructured scenarios, the
structured scenarios, and the entire dataset.

One possible explanation is that the current inves-
tigation was conducted in a basic interactive mode
where the system was only capable of providing
some sort of answers. This may limit users’ expec-
tation in the kind of questions that can be handled
by the system. Our assumption is that, once the
QA system becomes more intelligent and able to
carry on conversation, different types of questions
(i.e., other than rephrase or new) will be observed.
This hypothesis certainly needs to be validated in
a conversational setting.
The third observation is that the rephrased ques-
tions seem to strongly correlate with problematic
situations, although not always. New questions
cannot distinguish a problematic situation from
an error-free situation. Table 1 shows the statis-
tics from our data about different combinations
of new/rephrase questions and performance situations².
What is interesting is that these different
combinations can reflect different types of user in-
tent behind the questions. More specifically, given
a question, four types of user intent can be cap-
tured with respect to the context (e.g., the previous
question and answer):
Continue indicates that the user is satisfied with
    the previous answer and now moves on to this
    new question.
Switch indicates that the user has given up on the
    previous question and now moves on to this
    new question.
Re-try indicates that the user is not satisfied with
    the previous answer and now tries to get a
    better answer.
Negotiate indicates that the user is not satisfied
    with the previous answer (although it appears
    to be correct from the system’s point
    of view) and now tries to get a better answer
    for his/her own needs.
² The last question from each interaction session is not included
in these statistics because there is no follow-up question
after that.
Table 1 summarizes these different types of
intent together with the number of correspond-
ing occurrences from both structured and unstruc-
tured scenarios. Since in the unstructured sce-
narios it was hard to anticipate user’s questions
and therefore take a correct action to respond to a
problematic/error-free situation, the distribution of
these two situations is much more skewed than the
distribution for the structured scenarios. Also as
mentioned earlier, in unstructured scenarios, users
lacked the motivation to pursue specific informa-
tion, so the ratio between switch and re-try is much
larger than that observed in the structured scenar-
ios. Nevertheless, we did observe different user
behavior in response to different situations. As
discussed later in Section 5, identifying these fine-
grained intents will allow QA systems to be more
proactive in helping users find satisfying answers.
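The four intent types defined in this section amount to a lookup on two observations: whether the follow-up question is new or a rephrase, and whether the previous answer was problematic or error-free. The sketch below illustrates that reading using the structured-scenario counts from Table 1. The dictionary keys and function names are hypothetical labels chosen for the example.

```python
# Mapping from (kind of follow-up question, status of previous answer) to intent,
# as described in Section 3.3. Key strings are illustrative labels.
INTENT_BY_CONTEXT = {
    ("new", "problematic"): "Switch",
    ("new", "error_free"): "Continue",
    ("rephrase", "problematic"): "Re-try",
    ("rephrase", "error_free"): "Negotiate",
}


def identify_intent(question_kind: str, previous_answer_status: str) -> str:
    """Return the annotated intent for one follow-up question."""
    return INTENT_BY_CONTEXT[(question_kind, previous_answer_status)]


# Counts for the structured scenarios, copied from Table 1.
STRUCTURED_COUNTS = {"Switch": 29, "Continue": 133, "Re-try": 102, "Negotiate": 6}


def counts_by_answer_status(intent_counts: dict) -> dict:
    """Aggregate intent counts into problematic vs. error-free totals."""
    totals = {"problematic": 0, "error_free": 0}
    for (_kind, status), intent in INTENT_BY_CONTEXT.items():
        totals[status] += intent_counts.get(intent, 0)
    return totals


if __name__ == "__main__":
    print(identify_intent("rephrase", "problematic"))
    print(counts_by_answer_status(STRUCTURED_COUNTS))
```

The aggregated totals correspond to the Total-st row of Table 1.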
4 Automatic Identification of
Problematic Situations and User Intent
Given the discussion above, the next question is
how to automatically identify problematic situations
and user intent. We formulate this as a classification
problem. Given a question $Q_i$, its answer
$A_i$, and the follow-up question $Q_{i+1}$:
(1) Automatic identification of problematic situations
is to decide whether $A_i$ is problematic or error-free
(i.e., incorrect or correct) based on the follow-up
question $Q_{i+1}$ and the interaction context. This is a
binary classification problem.
(2) Automatic identification of user intent is to
identify the intent of $Q_{i+1}$ given the interaction
context. Because we only have very limited instances
of Negotiate (see Table 1), we currently
merge Negotiate with Re-try since both of them
represent a situation where a better answer is requested.
Thus, this problem becomes a three-way
classification problem.
To build these classifiers, we identified a set of
features, which are described next.
4.1 Features
Given a question $Q_i$, its answer $A_i$, and the follow-up
question $Q_{i+1}$, the following set of features is
used:
Target matching (TM): a binary feature indicating
whether the target type of $Q_{i+1}$ is the same as
the target type of $Q_i$. Our data shows that the repetition
of the target type may indicate a rephrase,
which could signal that a problematic situation has just
happened.
Named entity matching (NEM): a binary feature
indicating whether all the named entities in $Q_{i+1}$
also appear in $Q_i$. If no new named entity is introduced
in $Q_{i+1}$, it is likely that $Q_{i+1}$ is a rephrase of
$Q_i$.
Similarity between questions (SQ): a numeric
feature measuring the similarity between $Q_{i+1}$ and
$Q_i$. Our assumption is that the higher the similarity
is, the more likely the current question is a
rephrase of the previous one.
Similarity between content words of questions
(SQC): this feature is similar to the previous feature
(i.e., SQ) except that the similarity measurement
is based on the content words excluding
named entities. This is to prevent the similarity
measurement from being dominated by the named
entities.
Similarity between $Q_i$ and $A_i$ (SA): this feature
measures how closely the retrieved passage matches
the question. Our assumption is that although a retrieved
passage is the most relevant passage compared
to others, it still may not contain the answer
(e.g., when an answer does not even exist in the
data collection).
Similarity between $Q_i$ and $A_i$ based on the content
words (SAC): this feature is essentially the
same as the previous feature (SA) except that the
similarity is calculated after named entities are removed
from the questions and answers.
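As an illustration of the two binary features, the sketch below computes TM and NEM for a pair of consecutive questions. It assumes that target types and named entities have already been annotated by some upstream component. The `AnnotatedQuestion` class, the target-type labels ("DATE", "PROFESSION"), and the function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedQuestion:
    """A question with pre-computed annotations (assumed to come from upstream tools)."""
    text: str
    target_type: str           # hypothetical label set, e.g. "DATE"
    named_entities: frozenset  # surface strings of the named entities


def target_matching(prev: AnnotatedQuestion, follow: AnnotatedQuestion) -> bool:
    """TM: the follow-up question asks for the same target type as the previous one."""
    return prev.target_type == follow.target_type


def named_entity_matching(prev: AnnotatedQuestion, follow: AnnotatedQuestion) -> bool:
    """NEM: every named entity in the follow-up also appears in the previous question."""
    return follow.named_entities <= prev.named_entities


if __name__ == "__main__":
    q1 = AnnotatedQuestion("When was Tom Cruise born?", "DATE", frozenset({"Tom Cruise"}))
    q2 = AnnotatedQuestion("What year was Tom Cruise born?", "DATE", frozenset({"Tom Cruise"}))
    print(target_matching(q1, q2), named_entity_matching(q1, q2))
```

Under this reading, a follow-up question with no named entities trivially satisfies NEM, which is consistent with the "no new named entity is introduced" wording above.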
Note that since our data is currently collected
from simulation studies, we do not have the confi-
dence score from the retrieval engine associated
with every answer. In practice, the confidence
score can be used as an additional feature.
Since our focus is not on the similarity measure-
ment but rather the use of the measurement in the
classification models, our current similarity mea-
surement is based on a simple approach that mea-
sures commonality and difference between two
objects as proposed by Lin (1998). More specifically,
the following equation is applied to measure
the similarity between two chunks of text $T_1$ and $T_2$:
\[ \mathrm{sim}_1(T_1, T_2) = \frac{-\log P(T_1 \cap T_2)}{-\log P(T_1 \cup T_2)} \]
Assuming that the occurrence of each word is independent, then:
\[ \mathrm{sim}_1(T_1, T_2) = \frac{-\sum_{w \in T_1 \cap T_2} \log P(w)}{-\sum_{w \in T_1 \cup T_2} \log P(w)} \]
where $P(w)$ was calculated based on the data used
in the previous TREC evaluations.
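A minimal sketch of the word-independence form of this similarity is given below. It treats each text as a set of tokens and takes word probabilities from a dictionary. The fallback probability for unseen words (`default_prob`) and the toy probability values are assumptions made for the example, since the paper does not state how unseen words were handled.

```python
import math


def lin_similarity(tokens_1, tokens_2, word_prob, default_prob=1e-6):
    """Ratio of the information in shared words to the information in all words.

    word_prob maps a word to P(w); default_prob is an assumed fallback for
    words missing from the table.
    """
    words_1, words_2 = set(tokens_1), set(tokens_2)
    union = words_1 | words_2
    if not union:
        return 0.0

    def information(words):
        # -sum of log P(w), computed with fsum for order-independent rounding
        return -math.fsum(math.log(word_prob.get(w, default_prob)) for w in words)

    denominator = information(union)
    if denominator == 0.0:  # every word has probability 1
        return 0.0
    return information(words_1 & words_2) / denominator


if __name__ == "__main__":
    # Toy probabilities, invented for illustration only.
    probs = {"when": 0.1, "what": 0.1, "year": 0.05, "was": 0.2,
             "tom": 0.01, "cruise": 0.001, "born": 0.01}
    q1 = "when was tom cruise born".split()
    q2 = "what year was tom cruise born".split()
    print(lin_similarity(q1, q2, probs))
```

Because rare words carry more information, shared named entities such as "cruise" contribute heavily to the score, which is the motivation given earlier for the content-word variants SQC and SAC.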
4.2 Identification of Problematic Situations
To identify problematic situations, we experimented
with three different classifiers: Maximum
Entropy Model (MEM) from MALLET³,
SVM from SVM-Light⁴, and Decision Trees from
WEKA⁵. A leave-one-out validation was applied
where one interaction session was used for testing
and the remaining interaction sessions were used
for training.
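The session-level leave-one-out protocol described above could be organized as in the following sketch. It only shows how the train/test folds are formed. The function name and the representation of a session as a list of examples are assumptions for illustration, and no classifier is trained here.

```python
def leave_one_session_out(sessions):
    """Yield (train_examples, test_examples) pairs, holding out one session at a time.

    `sessions` is a list of sessions, each a list of labelled examples.
    """
    for held_out in range(len(sessions)):
        train = [example
                 for index, session in enumerate(sessions) if index != held_out
                 for example in session]
        test = list(sessions[held_out])
        yield train, test


if __name__ == "__main__":
    toy_sessions = [["a", "b"], ["c"], ["d", "e", "f"]]
    for train, test in leave_one_session_out(toy_sessions):
        print(len(train), "training examples;", len(test), "test examples")
```

Holding out whole sessions, rather than single questions, keeps consecutive questions from the same interaction out of the training data for the fold in which they are tested.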
Table 2 shows the performance of the three
models based on different combinations of features
in terms of classification accuracy. The baseline
result is the performance achieved by simply
assigning the most frequently occurring class.
For the unstructured scenarios, the performance
of the classifiers is rather poor, which indicates
that it is quite difficult to make any generalization
based on the current feature sets when users
are less motivated to find specific information.
For the structured scenarios, the best performance
for each model in Table 2 is 76.7% (MEM), 75.2% (SVM),
and 77.8% (Decision Tree).
The Decision Tree model thus achieves the best performance
of 77.8% in identifying problematic situations,
which is more than 25 percentage points above the
baseline performance (51.5%).
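The majority-class baseline can be checked directly against the counts in Table 1. The short sketch below does this arithmetic; the function name and the class labels are illustrative.

```python
def majority_baseline(class_counts):
    """Return the most frequent class and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


if __name__ == "__main__":
    # Problematic vs. error-free totals from Table 1.
    structured = {"problematic": 131, "error_free": 139}
    unstructured = {"problematic": 48, "error_free": 94}
    for name, counts in (("structured", structured), ("unstructured", unstructured)):
        label, accuracy = majority_baseline(counts)
        print(name, label, round(100 * accuracy, 1))
```

For the structured scenarios this gives 139/270, which rounds to the 51.5% baseline reported in Table 2, and for the unstructured scenarios 94/142, which rounds to 66.2%.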
4.3 Identification of User Intent
To identify user intent, we formulate the problem
as follows: given an observation feature vector $f$
where each element of the vector corresponds to
a feature described earlier, the goal is to identify
an intent $c^*$ from a set of intents $I$ = {Continue,
Switch, Re-try/Negotiate} that satisfies the following
equation:
\[ c^* = \arg\max_{c \in I} P(c \mid f) \]
³ http://mallet.cs.umass.edu/index.php/
⁴ http://svmlight.joachims.org/
⁵ http://www.cs.waikato.ac.nz/ml/weka/
Our assumption is that user intent for a ques-
tion can be potentially influenced by the intent
from a preceding question. For example, Switch
is likely to follow Re-try. Therefore, we have im-
plemented a Maximum Entropy Markov Model
(MEMM) (McCallum et al., 2000) to take the se-
quence of interactions into account.
Given a sequence of questions $Q_1$, $Q_2$, up to $Q_t$,
there is an observation feature vector $f_i$ associated
with each $Q_i$. In MEMM, the prediction of user
intent $c_t$ for $Q_t$ not only depends on the observation
$f_t$, but also the intent $c_{t-1}$ from the preceding
question $Q_{t-1}$. In fact, this approach finds the best
sequence of user intent $C^*$ for $Q_1$ up to $Q_t$ based
on a sequence of observations $f_1, f_2, \ldots, f_t$ as follows:
\[ C^* = \arg\max_{C \in I^t} P(C \mid f_1, f_2, \ldots, f_t) \]
where $C$ is a sequence of intent and $I^t$ is the set of
all possible sequences of intent with length $t$.
To find this sequence of intent $C^*$, MEMM
keeps a variable $\alpha_t(i)$ which is defined to be the
maximum probability of seeing a particular sequence
of intent ending at intent $i$ ($i \in I$) for
question $Q_t$, given the observation sequence for
questions $Q_1$ up to $Q_t$:
\[ \alpha_t(i) = \max_{c_1, \ldots, c_{t-1}} P(c_1, \ldots, c_{t-1}, c_t = i \mid f_1, \ldots, f_t) \]
This variable can be calculated by a dynamic
optimization procedure similar to the Viterbi algorithm
in the Hidden Markov Model:
\[ \alpha_t(i) = \max_j \alpha_{t-1}(j) \times P(c_t = i \mid c_{t-1} = j, f_t) \]
where $P(c_t = i \mid c_{t-1} = j, f_t)$ is estimated by the
Maximum Entropy Model.
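The recursion for the variable α can be sketched as a small Viterbi-style decoder. This is an illustration under assumptions rather than the implementation used in the experiments: the local model is passed in as a plain function `local_prob(i, j, f)` standing for P(c_t = i | c_{t-1} = j, f_t), and the first question is handled by passing `None` as the previous intent, a convention the paper does not specify.

```python
INTENTS = ("Continue", "Switch", "Re-try/Negotiate")


def memm_decode(observations, local_prob, intents=INTENTS):
    """Return the most probable intent sequence for a list of observations.

    local_prob(i, j, f) should return P(c_t = i | c_{t-1} = j, f_t); j is None
    for the first question (an assumed start convention).
    """
    if not observations:
        return []
    # alpha[i]: best score of any intent sequence ending in intent i
    alpha = {i: local_prob(i, None, observations[0]) for i in intents}
    backpointers = []
    for f in observations[1:]:
        new_alpha, pointers = {}, {}
        for i in intents:
            best_j = max(intents, key=lambda j: alpha[j] * local_prob(i, j, f))
            new_alpha[i] = alpha[best_j] * local_prob(i, best_j, f)
            pointers[i] = best_j
        alpha = new_alpha
        backpointers.append(pointers)
    # Trace back from the best final intent.
    path = [max(intents, key=alpha.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy local model that ignores the previous intent; values are invented.
    def toy_prob(i, j, f):
        return f[i]

    obs = [
        {"Continue": 0.7, "Switch": 0.2, "Re-try/Negotiate": 0.1},
        {"Continue": 0.1, "Switch": 0.1, "Re-try/Negotiate": 0.8},
    ]
    print(memm_decode(obs, toy_prob))
```

In practice `local_prob` would wrap a trained maximum entropy classifier whose features include the previous intent. For long sessions, summing log probabilities instead of multiplying probabilities avoids numerical underflow.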
Table 3 shows the best results of identifying
user intent based on the Maximum Entropy Model
and MEMM using the leave-one-out approach.
The results show that neither model outperformed
the baseline on the data collected from unstructured
scenarios (overall accuracy of 62.7% and 59.9%
against a baseline of 63.4%). For structured scenarios, in
terms of the overall accuracy, both models performed
significantly better than the baseline of 49.3%
(72.2% and 73.7%). The MEMM worked only slightly better
than the MEM. Given our limited data, it is not
conclusive whether the transitions between questions
will help identify user intent in a basic interactive
mode. However, we expect to see more influence
from the transitions in fully conversational
QA.
                     MEM                 SVM                 DTree
Features             un    s     ent     un    s     ent     un    s     ent
Baseline             66.2  51.5  56.3    66.2  51.5  56.3    66.2  51.5  56.3
TM, SQC              50.0  57.4  54.9    53.5  60.0  57.8    53.5  55.9  55.1
NEM, SQC             37.3  74.4  61.7    37.3  74.4  61.7    37.3  74.4  61.7
TM, SQ               61.3  64.8  63.6    57.0  64.1  61.7    59.9  64.4  62.9
NEM, SQC, SAC        40.8  76.7  64.3    38.0  74.4  61.9    49.3  77.8  68.0
TM, SQ, SAC          59.2  67.4  64.6    61.3  66.3  64.6    62.7  65.6  64.6
TM, NEM, SQC         54.2  75.2  68.0    54.2  75.2  68.0    53.5  74.4  67.2
TM, SQ, SA           63.4  71.9  68.9    58.5  71.5  67.0    67.6  75.6  72.8
TM, NEM, SQC, SAC    54.9  75.6  68.4    54.2  75.2  68.0    55.6  74.4  68.0
* un - unstructured, s - structured, ent - entire
Table 2: Performance of automatic identification of problematic situations

                          MEM            MEMM
                          un     s       un     s
CONTINUE            P     64.4   69.7    67.3   70.8
                    R     96.7   85.8    80.0   88.8
                    F     77.3   76.8    73.1   78.7
RE-TRY/NEGOTIATE    P     28.6   76.2    37.1   79.0
                    R      8.7   74.1    56.5   73.1
                    F     13.3   75.1    44.8   75.9
SWITCH              P      -      -       -     50.0
                    R      0      0       0      3.6
                    F      -      -       -      6.7
Overall accuracy          62.7   72.2    59.9   73.7
* un - unstructured, s - structured
Table 3: Performance of automatic identification
of user intent
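The F rows in Table 3 are consistent with the balanced F-measure, i.e., the harmonic mean of precision (P) and recall (R). The paper does not state the formula, so this reading is an assumption, which the following sketch checks on two cells of the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # CONTINUE, MEM, unstructured: P = 64.4, R = 96.7
    print(round(f_measure(64.4, 96.7), 1))
    # SWITCH, MEMM, structured: P = 50.0, R = 3.6
    print(round(f_measure(50.0, 3.6), 1))
```

The two calls reproduce the reported F values of 77.3 and 6.7, respectively.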
5 Implications of Problematic Situations
and User Intent
Automated identification of problematic situations
and user intent has potential implications in the
design of conversational QA systems. Identifica-
tion of problematic situations can be considered as
implicit feedback. The system can use this feed-
back to improve its answer retrieval performance
and proactively adapt its strategy to cope with
problematic situations. One might think that an
alternative way is to explicitly ask users for feed-
back. However, this explicit approach will defeat
the purpose of intelligent conversational systems.
Soliciting feedback after each question not only
will frustrate users and lengthen the interaction,
but also will interrupt the flow of user thoughts and
conversation. Therefore, our focus here is to inves-
tigate the more challenging end of implicit feed-
back. In practice, the explicit feedback and im-
plicit feedback should be intelligently combined.
For example, if the confidence for automatically
identifying a problematic situation or an error-free
situation is low, then perhaps explicit feedback can
be solicited.
Automatic identification of user intent also has
important implications in building intelligent con-
versational QA systems. For example, if Con-
tinue is identified during interaction, then the sys-
tem can automatically collect the question answer
pairs for potential future use. If Switch is identi-
fied, the system may put aside the question that has
not been correctly answered and proactively come
back to that question later after more information
is gathered. If Re-try is identified, the system may
avoid repeating the same answer and at the same
time may take the initiative to guide users on how
to rephrase a question. If Negotiate is identified,
the system may want to investigate the user’s par-
ticular needs that may be different from the gen-
eral needs. Overall, different strategies can be de-
veloped to address problematic situations and dif-
ferent intents. We will investigate these strategies
in our future work.
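The strategies outlined in this section could be wired together as a simple dispatch on the identified intent, with a fallback to explicit feedback when the classifier is unsure. The sketch below is purely illustrative: the action names, the confidence threshold of 0.6, and the idea of applying the low-confidence fallback to intent identification (the text proposes it for problematic-situation identification) are all assumptions.

```python
# Hypothetical action names summarizing the strategies discussed in Section 5.
STRATEGY_BY_INTENT = {
    "Continue": "store_question_answer_pair",
    "Switch": "defer_unanswered_question",
    "Re-try": "avoid_repeating_answer_and_suggest_rephrasing",
    "Negotiate": "ask_about_user_specific_needs",
}


def choose_action(intent: str, confidence: float, threshold: float = 0.6) -> str:
    """Pick a system action; ask for explicit feedback when confidence is low.

    The threshold value is an arbitrary placeholder, not a tuned parameter.
    """
    if confidence < threshold:
        return "solicit_explicit_feedback"
    return STRATEGY_BY_INTENT[intent]


if __name__ == "__main__":
    print(choose_action("Switch", confidence=0.9))
    print(choose_action("Switch", confidence=0.3))
```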
This paper reports our initial effort in investi-
gating interactive QA from a conversational point
of view. The current investigation has several
simplifications. First, our current work has fo-
cused on factoid questions where it is relatively
easy to judge a problematic or error-free situation.
However, as discussed in earlier work (Small et
al., 2003), sometimes it is very hard to judge the
truthfulness of an answer, especially for analyti-
cal questions. Therefore, our future work will ex-
amine the new implications of problematic situa-
tions and user intent for analytical questions. Sec-
ond, our current investigation is based on a ba-
sic interactive mode. As mentioned earlier, once
the QA systems become more intelligent and con-
versational, more varieties of user intent are an-
ticipated. How to characterize and automatically
identify more complex user intent under these dif-
ferent situations is another direction of our future
work.
6 Conclusion
This paper presents our initial investigation into
automatic identification of problematic situations
and user intent in interactive QA. Our results have
shown that, once users are motivated to find
specific information related to their information
goals, user behavior and interaction context can
help automatically identify problematic situations
and user intent. Although our current investigation
is based on the data collected from a controlled
study, the same approaches can be applied during
online processing as the question answering
proceeds. The identified problematic situations
and/or user intent will provide immediate feedback
for a QA system to adjust its behavior and
adopt better strategies to cope with different situations.
This is an important step toward intelligent
conversational question answering.
References
Marco De Boni and Suresh Manandhar. 2003. An
analysis of clarification dialogues for question an-
swering. In Proceedings of HLT-NAACL 2003,
pages 48–55.
John Burger, Claire Cardie, Vinay Chaudhri, Robert
Gaizauskas, Sanda Harabagiu, David Israel, Chris-
tian Jacquemin, Chin-Yew Lin, Steve Maiorano,
George Miller, Dan Moldovan, Bill Ogden, John
Prager, Ellen Riloff, Amit Singhal, Rohini Shrihari,
Tomek Strzalkowski, Ellen Voorhees, and Ralph
Weishedel. 2001. Issues, tasks and program struc-
tures to roadmap research in question & answering.
In NIST Roadmap Document.
Sanda Harabagiu, Andrew Hickl, John Lehmann, and
Dan Moldovan. 2005. Experiments with interactive
question-answering. In Proceedings of the 43rd An-
nual Meeting of the Association for Computational
Linguistics (ACL’05), pages 205–214, Ann Arbor,
Michigan, June. Association for Computational Lin-
guistics.
Julia Hirschberg, Diane J. Litman, and Marc Swerts.
2001. Identifying user corrections automatically
in spoken dialogue systems. In Proceedings of
the Second Meeting of the North American Chap-
ter of the Association for Computational Linguistics
(NAACL’01).
Gina-Anne Levow. 1998. Characterizing and recognizing
spoken corrections in human-computer dialogue.
In Proceedings of the 36th Annual Meeting
of the Association of Computational Linguistics
(COLING/ACL-98), pages 736–742.
Dekang Lin. 1998. An information-theoretic defini-
tion of similarity. In Proceedings of International
Conference on Machine Learning, Madison, Wis-
consin, July.
Diane J. Litman and Shimei Pan. 2000. Predicting
and adapting to poor speech recognition in a spo-
ken dialogue system. In Proceedings of the Seven-
teenth National Conference on Artificial Intelligence
(AAAI-2000), pages 722–728.
Diane J. Litman, Marilyn A. Walker, and Michael S.
Kearns. 1999. Automatic detection of poor speech
recognition at the dialogue level. In Proceedings of
the 37th Annual meeting of the Association of Com-
putational Linguistics (ACL-99), pages 309–316.
Andrew McCallum, Dayne Freitag, and Fernando
Pereira. 2000. Maximum entropy Markov models
for information extraction and segmentation. In
Proceedings of International Conference on Machine
Learning (ICML 2000), pages 591–598.
Sharon Small, Ting Liu, Nobuyuki Shimizu, and
Tomek Strzalkowski. 2003. HITIQA: An interac-
tive question answering system: A preliminary re-
port. In Proceedings of the ACL 2003 Workshop on
Multilingual Summarization and Question Answer-
ing.
Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth
Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Tay-
lor, Rachel Martin, Marie Meteer, and Carol Van
Ess-Dykema. 2000. Dialogue act modeling for au-
tomatic tagging and recognition of conversational
speech. In Computational Linguistics, volume 26.
Ellen Voorhees. 2001. Overview of TREC 2001 ques-
tion answering track. In Proceedings of TREC.
Ellen Voorhees. 2004. Overview of TREC 2004. In
Proceedings of TREC.
Marilyn Walker, Irene Langkilde-Geary, Helen Wright
Hastie, Jerry Wright, and Allen Gorin. 2002. Auto-
matically training a problematic dialogue predictor
for the HMIHY spoken dialog system. In Journal of
Artificial Intelligence Research.
