Monolingual Web-based Factoid Question Answering in Chinese,
Swedish, English and Japanese
E.W.D. Whittaker J. Hamonic D. Yang T. Klingberg S. Furui
Dept. of Computer Science
Tokyo Institute of Technology
2-12-1, Ookayama, Meguro-ku
Tokyo 152-8552 Japan
{edw,yuuki,raymond,tor,furui}@furui.cs.titech.ac.jp
Abstract
In this paper we extend the application
of our statistical pattern classification ap-
proach to question answering (QA) which
has previously been applied successfully
to English and Japanese to develop two
prototype QA systems in Chinese and
Swedish. We show what data is necessary
to achieve this and also evaluate the per-
formance of the two new systems using a
translation of the TREC 2003 factoid QA
task. While performance for Chinese and
Swedish is found to be lower than that for
the more developed English and Japanese
systems we explain why this is the case
and offer solutions for their improvement.
All systems form the basis of our pub-
licly accessible web-based multilingual
QA system at http://asked.jp.
1 Introduction
Much of the research into automatic question an-
swering (QA) has understandably concentrated on
the English language with little regard to portabil-
ity or efficacy in other languages. It is only rela-
tively recently, with the introduction of the CLEF
and NTCIR QA evaluations, that researchers have
started to look at porting and evaluating the tech-
niques that have been shown to work well for En-
glish to other languages.
One of the major drawbacks of porting an En-
glish language QA system or approach to other
languages is often the lack of the corresponding
NLP tools in the target language. For instance,
parsers and named-entity (NE)-taggers, which are
typical components in many QA systems, are cer-
tainly not available for all the world’s languages.
Trainable parsers and NE-taggers similarly require
appropriate training data which, if not available,
is costly and requires specialized knowledge to
produce. Language-specific databases are also a
common feature of many systems, some of which
have taken many man-years to construct and ver-
ify. Such a component in many English-language
systems, for example, is WordNet. While porting
WordNet to other languages has been started in the
EuroWordNet and Global WordNet projects, these still
only cover a relatively small number of languages.
In this paper we describe the application of our
data-driven approach to QA which was developed
right from the outset with the aim of portability
and robustness in mind. This statistical pattern
classification approach to QA is essentially lan-
guage independent and trainable given appropriate
language-specific training data. No assumptions
about the language are made by the model except
that some notion of words or space-separated to-
kens must exist or be introduced where it is ab-
sent. Our only other requirements to build a QA
system in a new target language are: a large col-
lection of text data in the target language that can
be searched for answers (e.g. the web), and a list
of example questions-and-answers (q-and-a) in the
target language. Given these data sources the re-
maining components can be obtained automati-
cally for each language.
So, in contrast to other contemporary ap-
proaches to QA our English language system
does not use WordNet as in (Hovy et al., 2001;
Moldovan et al., 2002), NE extraction, or any
other linguistic information e.g. from semantic
analysis (Hovy et al., 2001) or from question pars-
ing (Hovy et al., 2001; Moldovan et al., 2002) and
uses capitalised (where appropriate for the lan-
guage) word tokens as the only features for mod-
EACL 2006 Workshop on Multilingual Question Answering - MLQA06
elling. For our Japanese system, although we cur-
rently use Chasen to segment Japanese charac-
ter sequences into units that resemble words, we
make no use of any morphological information
as used for example in (Fuchigami et al., 2004).
Moreover, it should be noted that our approach
is at the same time very different to other purely
web-based approaches such as askMSR (Brill et
al., 2002) and Aranea (Lin et al., 2002). For exam-
ple, we use entire documents rather than the snip-
pets of text returned by web search engines; we do
not use structured document sources or databases
and we do not transform the query in any way ei-
ther by term re-ordering or by modifying the tense
of verbs. These basic principles apply to each of
our language-specific QA systems thus simplify-
ing and accelerating development.
The approach to QA that we adopt has previ-
ously been described in (Whittaker et al., 2005a;
Whittaker et al., 2005b; Whittaker et al., 2005c)
where the details of the mathematical model and
how it was trained for English and Japanese were
given. Our approach has also been successfully
evaluated in the text retrieval conference (TREC)
2005 QA track evaluation (Voorhees, 2003) where
our group placed eleventh out of thirty partici-
pants (Whittaker et al., 2005a). Although the
TREC QA task is substantially different to web-
based QA the TREC evaluation confirmed that our
approach works and also provides an objective as-
sessment of its quality. Similarly, for our Japanese
language system we have previously evaluated the
performance of our approach on the NTCIR-3
QAC-1 task (Whittaker et al., 2005c). Although
our Japanese experiments were applied retrospec-
tively, the results would have placed us in the mid-
range of participating systems in that year’s eval-
uation. In this paper we present additional ex-
periments on Chinese and Swedish and explain
how our statistical pattern classification approach
to QA was successfully applied to these two new
languages. Using our approach and given ap-
propriate training data it is found that a reason-
ably proficient developer can build a QA system
in a new language in around 10 hours. Evalua-
tion of the Chinese and Swedish systems is per-
formed using a translation of the first 200 fac-
toid questions from the TREC 2003 evaluation
which we have also made available online. We
compare these results both qualitatively and quan-
titatively against results obtained previously for
English and Japanese. The systems, built us-
ing this method, form the basis of our multi-
language web demo which is publicly available at
http://asked.jp.
An outline of the remainder of this paper is as
follows: we briefly describe our statistical pattern
classification approach to QA in Section 2 repeat-
ing the important elements of our approach as nec-
essary to understand the remainder of the paper.
In Section 3 we describe the basic building blocks
of our QA system and how they can typically be
trained. We also give a breakdown of the data
used to train each language specific QA system.
In Section 4 we present the results of experiments
on Chinese, Swedish, English and Japanese and in
Section 5 we compare and analyse these results.
We wrap up with a conclusion and further work in
Sections 6 and 7.
2 Statistical pattern classification
approach to QA
The answer to a question depends primarily on
the question itself but also on many other factors
such as the person asking the question, the loca-
tion of the person, what questions the person has
asked before, and so on. Although such factors
are clearly relevant in a real-world scenario they
are difficult to model and also to test in an off-line mode,
for example, in the context of the NTCIR and TREC evaluations.
We therefore choose to consider only the dependence of an
answer $A$ on the question $Q$, where each is considered to
be a string of $l_A$ words $A = a_1, \ldots, a_{l_A}$ and
$l_Q$ words $Q = q_1, \ldots, q_{l_Q}$, respectively. In
particular, we hypothesize that the answer $A$ depends on
two sets of features $W = W(Q)$ and $X = X(Q)$ as follows:
$$P(A \mid Q) = P(A \mid W, X), \tag{1}$$
where $W = w_1, \ldots, w_{l_W}$ can be thought of as a
set of $l_W$ features describing the “question-type”
part of $Q$ such as who, when, where, which, etc.
and $X = x_1, \ldots, x_{l_X}$ is a set of $l_X$ features
comprising the “information-bearing” part of $Q$
i.e. what the question is actually about and what it
refers to. For example, in the questions, “Where
was Tom Cruise married?” and “When was
Tom Cruise married?” the information-bearing
component is identical in both cases whereas the
question-type component is different.
Finding the best answer $\hat{A}$ involves a search
over all $A$ for the one which maximizes the probability
of the above model:
$$\hat{A} = \arg\max_{A} P(A \mid W, X). \tag{2}$$
Using Bayes rule, making further conditional
independence assumptions and assuming uniform
prior probabilities, which therefore do not affect
the optimisation criterion, we obtain the final
optimisation criterion (see (Whittaker et al., 2005a)
for more details):
$$\arg\max_{A} \underbrace{P(A \mid X)}_{\text{retrieval model}} \cdot \underbrace{P(W \mid A)}_{\text{filter model}}. \tag{3}$$
The $P(A \mid X)$ model is essentially a language
model which models the probability of an answer
sequence $A$ given a set of information-bearing
features $X$. It models the proximity of $A$ to features
in $X$. We call this model the retrieval model and
examine it further in Section 2.1.
The $P(W \mid A)$ model matches an answer $A$
with features in the question-type set $W$. Roughly
speaking this model relates ways of asking a question
with classes of valid answers. For example, it
associates dates, or days of the week with when-type
questions. In general, there are many valid
and equiprobable $A$ for a given $W$ so this component
can only re-rank candidate answers retrieved
by the retrieval model. Consequently, we call it the
filter model and examine it further in Section 2.2.
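As a rough illustration of Equation (3), the following Python sketch ranks candidate answers by the product of a retrieval-model score and a filter-model score, computed in the log domain. The candidate strings, the probability values and the function name rank_answers are all invented for illustration; the actual form of both models is given in (Whittaker et al., 2005a).

```python
import math

def rank_answers(retrieval, filt):
    """Rank candidates A by log P(A|X) + log P(W|A), as in Equation (3).

    `retrieval` maps candidate -> P(A|X); `filt` maps candidate -> P(W|A).
    Both tables are toy stand-ins for the real models.
    """
    scored = []
    for answer, p_ret in retrieval.items():
        p_fil = filt.get(answer, 0.0)
        if p_ret > 0.0 and p_fil > 0.0:
            scored.append((math.log(p_ret) + math.log(p_fil), answer))
    scored.sort(reverse=True)
    return [answer for _, answer in scored]

# Invented scores for "Who was the U.S. President who first used Camp David?"
retrieval = {"FRANKLIN ROOSEVELT": 0.04, "CAMP DAVID": 0.05, "BILL CLINTON": 0.01}
filt = {"FRANKLIN ROOSEVELT": 0.30, "CAMP DAVID": 0.01, "BILL CLINTON": 0.30}
ranking = rank_answers(retrieval, filt)
print(ranking)
```

In this toy case the string "CAMP DAVID" has the highest retrieval score because it sits close to the query words, but the filter model demotes it because it does not look like an answer to a who-question.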
2.1 Retrieval model
The retrieval model essentially models the proximity
of $A$ to features in $X$. Since
$A = a_1, \ldots, a_{l_A}$ we are actually modelling the
distribution of multi-word sequences. This should be
borne in mind in the following discussion whenever
$A$ is used. As mentioned above, we currently
use a deterministic information-feature mapping
function $X = X(Q)$. This mapping only generates
word $m$-tuples ($m = 1, 2, \ldots$) from single
words in $Q$ that are not present in a stoplist of
50-100 high-frequency words. For more details on
the exact form of the retrieval model please refer
to (Whittaker et al., 2005a).
2.2 Filter model
A set of $|V_W|$ single-word features is extracted
based on frequency of occurrence in question data.
Some examples include: HOW, MANY, WHEN,
WHO, UNTIL etc. The question-type mapping
function $W(Q)$ extracts $n$-tuples ($n = 1, 2, \ldots$) of
question-type features from the question $Q$, such
as HOW MANY and UNTIL WHEN.
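The two mapping functions described in Sections 2.1 and 2.2 can be sketched in Python as follows. This is only a minimal sketch under several assumptions that are not stated in the text: tuples are formed from adjacent surviving words, the tuple length is capped at 2, punctuation has already been removed, and the question words happen to be in the stoplist. The qlist and stoplist shown are toy lists.

```python
def ntuples(words, max_len):
    """All contiguous tuples of length 1..max_len from a word list."""
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def X(question, stoplist, max_m=2):
    """Information-bearing features: m-tuples of non-stoplist words."""
    words = [w for w in question.upper().split() if w not in stoplist]
    return ntuples(words, max_m)

def W(question, qlist, max_n=2):
    """Question-type features: n-tuples of words found in the qlist."""
    words = [w for w in question.upper().split() if w in qlist]
    return ntuples(words, max_n)

# Toy lists; the real qlist and stoplist are derived from data (Section 3).
qlist = {"WHO", "WHEN", "WHERE", "HOW", "MANY", "UNTIL"}
stoplist = {"WHERE", "WHEN", "WAS", "THE", "OF"}

q1 = "Where was Tom Cruise married"
q2 = "When was Tom Cruise married"
print(W(q1, qlist), W(q2, qlist))            # question-type parts differ
print(X(q1, stoplist) == X(q2, stoplist))    # information-bearing parts agree
```

For the two Tom Cruise questions the information-bearing features coincide while the question-type features differ, which is the separation the model relies on.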
Modelling the complex relationship between $W$
and $A$ directly is non-trivial. We therefore introduce
an intermediate variable representing classes
of example questions-and-answers (q-and-a) $c_e$ for
$e = 1 \ldots |C_E|$ drawn from the set $C_E$, and to
facilitate modelling we say that $W$ is conditionally
independent of $A$ given $c_e$ as follows:
$$P(W \mid A) = \sum_{e=1}^{|C_E|} P(W \mid c_e) \cdot P(c_e \mid A). \tag{4}$$
Given a set $E$ of example q-and-a $t_j$ for
$j = 1 \ldots |E|$ where
$t_j = (q^j, a^j) = (q^j_1, \ldots, q^j_{l_{Q^j}}, a^j_1, \ldots, a^j_{l_{A^j}})$
we define a mapping function $f : E \to C_E$ by $f(t_j) = e$.
Each class
$c_e = (w^e_1, \ldots, w^e_{l_{W^e}}, a^e_1, \ldots, a^e_{l_{A^e}})$
is then obtained by
$c_e = \bigcup_{j : f(t_j) = e} W(q^j) \bigcup_{i=1}^{l_{A^j}} a^j_i$.
In all the experiments in this paper no clustering of the
q-and-a is actually performed so each q-and-a example
forms its own unique class i.e. each $c_e$ corresponds
to a single $t_j$ and vice-versa.
Assuming conditional independence of the answer
words in class $c_e$ given $A$ and making the
modelling assumption that the $j$th answer word $a^e_j$
in the example class $c_e$ is dependent only on the
$j$th answer word in $A$ we obtain:
$$\begin{aligned}
P(W \mid A) &= \sum_{e=1}^{|C_E|} P(W \mid c_e) \cdot \prod_{j=1}^{l_{A^e}} P(a^e_j \mid a_j) \\
&= \sum_{e=1}^{|C_E|} P(W \mid c_e) \prod_{j=1}^{l_{A^e}} \sum_{a=1}^{|C_A|} P(a^e_j \mid c_a) P(c_a \mid a_j),
\end{aligned} \tag{5}$$
where $c_a$ is a concrete class in the set of $|C_A|$
answer classes $C_A$, and assuming $a^e_j$ is conditionally
independent of $a_j$ given $c_a$. The system using the
formulation of filter model given by Equation (5)
is referred to as model ONE. The model given by
Equation (4) is referred to as model TWO; however,
we are only concerned with model ONE in
this paper.
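To make the structure of Equation (5) concrete, the following Python sketch evaluates it for a candidate answer using small hand-made probability tables. All table contents, class names and the function name filter_score are hypothetical. The sketch also assumes that only example answers with the same number of words as the candidate contribute to the sum, which is one possible reading of the position-wise product and is not stated in the text.

```python
def filter_score(answer, p_w_given_c, examples,
                 p_word_given_class, p_class_given_word):
    """Evaluate Equation (5) for a candidate answer (list of words).

    p_w_given_c[e]           : P(W | c_e) for the current question
    examples[e]              : answer words a^e_1..a^e_l of example class c_e
    p_word_given_class[c][w] : P(w | c_a)
    p_class_given_word[w][c] : P(c_a | w)
    """
    total = 0.0
    for e, ex_answer in examples.items():
        if len(ex_answer) != len(answer):   # assumption: lengths must match
            continue
        prod = 1.0
        for ex_word, cand_word in zip(ex_answer, answer):
            # Inner sum over answer classes c_a for position j.
            prod *= sum(
                p_word_given_class[c].get(ex_word, 0.0) * p_c
                for c, p_c in p_class_given_word.get(cand_word, {}).items()
            )
        total += p_w_given_c.get(e, 0.0) * prod
    return total

# Toy tables (invented values).
examples = {"e1": ["RICHARD", "NIXON"], "e2": ["1969"]}
p_w_given_c = {"e1": 0.5, "e2": 0.1}
p_word_given_class = {
    "first_names": {"RICHARD": 0.2, "FRANKLIN": 0.2},
    "surnames": {"NIXON": 0.1, "ROOSEVELT": 0.1},
}
p_class_given_word = {
    "FRANKLIN": {"first_names": 0.9, "surnames": 0.1},
    "ROOSEVELT": {"surnames": 1.0},
}

score_fr = filter_score(["FRANKLIN", "ROOSEVELT"], p_w_given_c, examples,
                        p_word_given_class, p_class_given_word)
score_cd = filter_score(["CAMP", "DAVID"], p_w_given_c, examples,
                        p_word_given_class, p_class_given_word)
print(score_fr, score_cd)
```

The class labels such as first_names are used here only to make the toy tables readable; as Section 2.3 explains, the classes in the actual system are unlabelled.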
Answer typing, such as it exists in our model
ONE system, is performed by the filter model and
is effected by matching the input query against
our set of example questions $C_E$. Each one of
these example questions $c_e$ has an answer associated
with it which in turn is expanded via the
$C_A$ classes into a set of possible answers for each
question. At each step this matching process is
entirely probabilistic as given by Equation (5). To
take a simple example, if we are presented with
the input query “Who was the U.S. President who
first used Camp David?” the first step effectively
matches the words (and bigrams, trigrams) in that
query against the words (and bigrams, trigrams)
in our example questions with a probability of its
match assigned to all example questions. Suppose,
for example, that the top-scoring example question
in our set is “Who was the U.S. President who re-
signed as a result of the Watergate affair?” since
the first six words in each question match each
other and will likely result in a high probability
being assigned. The corresponding answer to this
example question is “Richard Nixon”. So the next
step expands “Richard” using $C_A$ and results in
a high score being assigned to a class of words
which tend to share the feature of being male first
names, one of which is “Franklin”. Expanding
the second word in the answer “Nixon” using $C_A$
will possibly result in a high score being assigned
to a class of words that share the feature of being
the surnames of U.S. presidents, one of which is
“Roosevelt”. In this way, we end up with a high
score assigned by the filter model to “Franklin
Roosevelt” but also to “Abraham Lincoln”, “Bill
Clinton”, etc. In combination with the retrieval
model, we hope that the documents obtained for
this query assign a higher retrieval model score
to “Franklin Roosevelt” over the names of other
U.S. presidents and thus output it in first place
with the highest overall probability. While this ap-
proach works well for names, dates and short place
names it does fall down, for example, on names
of books, plays and films where there is typically
less of a clear correspondence between the words
in any given position of two answers. This situa-
tion could be avoided by using multi-word answer
strings and not making the position-dependence
modelling assumption that was made to arrive at
Equation (5) but this has its own drawbacks.
The above description of the operation of the
filter model highlights the need for homogeneous
classes of $C_A$ of sufficiently wide coverage. In the
next section we describe a way in which this can
be achieved in an efficient, data-driven manner for
essentially any language.
2.3 Obtaining $C_A$
As we saw in the previous section the $C_A$ are the
closest thing we have to named entities since they
define classes of words that share some similarity
with each other. However, in some sense they are
more flexible than named entities since any word
can actually belong to any class but with a certain
probability. In this way we don’t rule out with a
zero probability the possibility of a word belong-
ing to some class, just that a word is more likely
to belong to some classes than others. In addition,
the entities are not actually named i.e. we do not
impose our own label on the classes so we do not
explicitly have a class of first names, or a class of
days of the week. Although we hope that we will
end up with classes containing such similar words
we do not make it explicit and we do not label what
each class of words is supposed to represent.
In keeping with our data-driven philosophy
and related objective to make our approach as
language-independent as possible we use an ag-
glomerative clustering algorithm to derive classes
automatically from data. The set of potential answer
words $V_{C_A}$ that are clustered should ideally
cover all possible words that might ever be answers
to questions. We therefore take the most frequent
$|V_{C_A}|$ words from a language-specific corpus
$T$ comprising $|T|$ word tokens as our set of
potential answers. The “seeds” for the clusters are
chosen to be the most frequent $|C_A|$ words in $T$.
The algorithm then uses the co-occurrence probabilities
of words in $T$ to group together words with
similar co-occurrence statistics. For each word $a$
in $V_{C_A}$ the co-occurrence probability $P(a_i \mid a, d)$
is the probability of $a_i$ given $a$ occurring $d$ words
away. If $d$ is positive, $a_i$ occurs after $a$, and if
negative, before $a$. We then construct a vector
of co-occurrences with maximum separation between
words $D$, as follows:
$$\begin{aligned}
\vec{a} = [\, & P(a_1 \mid a, 1), \ldots, P(a_R \mid a, 1), P(a_1 \mid a, -1), \ldots, P(a_R \mid a, -1), \ldots\ldots, \\
& P(a_1 \mid a, D), \ldots, P(a_R \mid a, D), P(a_1 \mid a, -D), \ldots, P(a_R \mid a, -D) \,].
\end{aligned} \tag{6}$$
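The layout of the vector in Equation (6) can be sketched in Python as below. Note that this sketch uses plain relative frequencies, whereas the paper estimates the probabilities with Katz back-off and absolute discounting; the function name, the toy token sequence and the toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_vector(word, tokens, vocab, max_sep):
    """Relative-frequency estimate of the vector in Equation (6) for `word`.

    Entries are ordered d = 1, -1, 2, -2, ..., D, -D, and within each
    offset by the order of `vocab`. No smoothing is applied.
    """
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok != word:
            continue
        for d in range(1, max_sep + 1):
            for offset in (d, -d):
                j = pos + offset
                if 0 <= j < len(tokens):
                    counts[(tokens[j], offset)] += 1
    vec = []
    for d in range(1, max_sep + 1):
        for offset in (d, -d):
            # Number of times any word was seen at this offset from `word`.
            total = sum(c for (_, o), c in counts.items() if o == offset)
            for v in vocab:
                vec.append(counts[(v, offset)] / total if total else 0.0)
    return vec

tokens = "ON MONDAY WE MET ON FRIDAY WE LEFT".split()
vocab = ["ON", "WE", "MET", "LEFT"]
v_monday = cooccurrence_vector("MONDAY", tokens, vocab, max_sep=1)
v_friday = cooccurrence_vector("FRIDAY", tokens, vocab, max_sep=1)
print(v_monday)
print(v_monday == v_friday)
```

In this toy text MONDAY and FRIDAY appear in identical contexts and therefore receive identical vectors, which is the property the clustering exploits.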
Language  Corpus $T$                         $|T|$  $|V_{C_A}|$  $|C_E|$  $|C_A|$
Chinese   TREC Mandarin (Rogers, 2000)       68M    33k          7k       1000
Swedish   PAROLE (University, 1997)          19M    367k         5k       1000
English   AQUAINT (Voorhees, 2002)           300M   215k         290k     500
Japanese  MAINICHI (Fukumoto et al., 2002)   150M   300k         270k     5000
Table 1: Description of each of the four monolingual QA systems.
Rather than storing $2 D V_{C_A}^2$ elements we can compute
most terms efficiently and on-the-fly using
the Katz back-off method (Katz, 1987) and absolute
discounting for estimating the probabilities
of unobserved events. To find the distance between
two vectors, for efficiency, we use an $L_1$
distance metric: $D(\vec{a}_i, \vec{a}_j) = |\vec{a}_i - \vec{a}_j|$.
Merging two vectors then involves a straightforward update
of the co-occurrence counts for a cluster and re-computing
the affected conditional probabilities
and back-off weights. The clustering algorithm is
described below:
Algorithm 2.1 Cluster words to generate $C_A$
  initialize most frequent words to $|C_A|$ classes
  for i := 1 to $|V_{C_A}|$
    for j := 1 to $|C_A|$
      compute $D(\vec{a}_i, \vec{c}_j)$
    move $\vec{a}_i$ to $\vec{c}^{\,best}_j$
    update $\vec{c}^{\,best}_j$
Although the algorithm is currently applied off-
line before system deployment it could also be eas-
ily adapted to handle multi-word sequences and
could also be applied at run-time thus reducing the
effects of out-of-vocabulary answers.
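The single pass of Algorithm 2.1 might look roughly like the Python sketch below. It is a simplified stand-in, not the authors' implementation: it works on raw co-occurrence count vectors, normalises them to relative frequencies instead of applying Katz back-off, and skips the seed words when iterating over the vocabulary. The function names and the toy counts are hypothetical.

```python
def normalise(counts):
    """Turn a count vector into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def l1(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))

def cluster_words(count_vectors, frequency, num_classes):
    """Single-pass clustering in the spirit of Algorithm 2.1.

    count_vectors[w] : co-occurrence counts for word w (equal-length lists)
    frequency[w]     : corpus frequency of w, used to pick the seeds
    """
    by_freq = sorted(count_vectors, key=lambda w: -frequency[w])
    seeds = by_freq[:num_classes]
    members = {j: [w] for j, w in enumerate(seeds)}
    cluster_counts = {j: list(count_vectors[w]) for j, w in enumerate(seeds)}
    for w in by_freq[num_classes:]:
        vec = normalise(count_vectors[w])
        best = min(cluster_counts,
                   key=lambda j: l1(vec, normalise(cluster_counts[j])))
        members[best].append(w)
        # Merge: add the word's counts to the cluster; probabilities are
        # re-derived from the summed counts at the next comparison.
        cluster_counts[best] = [a + b for a, b in
                                zip(cluster_counts[best], count_vectors[w])]
    return members

# Invented counts over three context features.
count_vectors = {"MONDAY": [8, 0, 1], "JOHN": [0, 9, 1], "FRIDAY": [4, 0, 0],
                 "MARY": [0, 5, 1], "SUNDAY": [3, 1, 0]}
frequency = {"MONDAY": 90, "JOHN": 80, "FRIDAY": 40, "MARY": 30, "SUNDAY": 20}
classes = cluster_words(count_vectors, frequency, num_classes=2)
print(classes)
```

With these toy counts the day names gather around the seed MONDAY and the given names around the seed JOHN, without any class ever being labelled.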
3 System components
There are four basic ingredients to building a QA
system using our approach: (1) a collection of example
q-and-a ($C_E$) pairs used for answer-typing
(answers need not necessarily be correct but must
be of the correct answer type); (2) a classification
of words (or word-like units cf. Japanese/Chinese)
into classes of similar words ($C_A$) e.g. a class of
country names, of given names, of numbers etc.;
(3) a list of question words (qlist) such as “WHO”,
“WHERE”, “WHEN” etc.; and (4) a stop list
of words that should be ignored by the retrieval
model (stoplist) as described in Section 2.1.
The q-and-a ($C_E$) for different languages can
often be found on the web or in commercial quiz
software. To date, we have not examined how
many or what distribution of types of example q-
and-a are important for good performance. How-
ever, intuitively, one expects that the richer and
more varied the questions the better the perfor-
mance will be. For the time being we aim to in-
clude as many examples as possible and hope this
covers the kinds of questions that are asked in re-
ality (or at least in the evaluations). In principle, in
a working system it should also be possible to feed
a user’s question together with the user’s response
back into the system to improve performance in an
unsupervised manner, so that performance would
gradually improve over time.
To obtain the classes of answers $C_A$ for each
language Algorithm 2.1 is applied. The qlist is
generated by taking the most frequently occurring
terms from the question portion of the example
q-and-a. The stoplist is formed from the 50-100
most frequently occurring words in $T$.
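Both lists are simple frequency cut-offs, which the following Python sketch illustrates on toy data. The example questions, the toy corpus and the cut-off sizes are invented; in the real systems the stoplist has 50-100 entries and the qlist size is not specified in the text.

```python
from collections import Counter

def most_frequent(tokens, n):
    """Return the n most frequent tokens, most frequent first."""
    return [w for w, _ in Counter(tokens).most_common(n)]

# Toy stand-ins for the example questions and for the corpus T.
example_questions = ["WHO WROTE HAMLET", "WHO PAINTED GUERNICA",
                     "WHEN DID ROME FALL"]
corpus = "THE CAT SAT ON THE MAT AND THE DOG SAT".split()

qlist = most_frequent([w for q in example_questions for w in q.split()], 1)
stoplist = most_frequent(corpus, 2)
print(qlist, stoplist)
```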
Google is used to retrieve documents for all
questions, except Japanese, where an index of the
NTCIR 2001 web snapshot is used instead. The
question is passed as-is to Google after the removal
of stop words. The top $U$ documents are
downloaded in their entirety, HTML markup removed,
the text cleaned and upper-cased (where
appropriate). For consistency, all data in our system
is encoded using UTF-8.
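The document clean-up step could be approximated with the Python standard library as in the sketch below. The class and function names are hypothetical, the handling of script and style elements is an assumption about what "cleaned" involves, and the downloading step itself is left out.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def clean_document(raw_bytes):
    """Decode as UTF-8, strip markup, collapse whitespace and upper-case."""
    html = raw_bytes.decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return text.upper()

page = (b"<html><head><style>p{}</style></head><body><p>Camp David was "
        b"first used by <b>Franklin Roosevelt</b>.</p></body></html>")
cleaned = clean_document(page)
print(cleaned)
```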
The data source and relevant system details for
each language-specific QA system are given in Table 1.
For the Japanese system Chasen$^1$ is used to
segment character sequences into word-like units.
For Chinese each sentence is mapped to a sequence
of space-separated characters.
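The Chinese pre-processing amounts to a one-line transformation, sketched here in Python. The function name is hypothetical, and the sketch splits every non-space character, including any Latin letters or digits, which may differ from what the actual system does.

```python
def segment_characters(sentence):
    """Map a sentence to space-separated characters, dropping whitespace."""
    return " ".join(ch for ch in sentence if not ch.isspace())

segmented = segment_characters("北京是中国的首都")
print(segmented)
```

After this step each character behaves like a word token, so the rest of the pipeline can be reused unchanged.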
$^1$ http://chasen.naist.jp/hiki/ChaSen
4 Experimental work
A substantial amount of time has gone into investigating
combinations and ranges of parameters
which gave good performance on English
development questions. These same parameters
were largely used as-is for the Japanese system
with only minor optimisations performed on development
questions disjoint from any evaluation
set. For the two new QA systems in Chinese and
Swedish we used exactly the same parameters that
had been found to work well for the English system
and performed no optimisation, mainly due to
a lack of questions that we could hold-out for development.
This is clearly sub-optimal but will be
addressed in future work.
Top               Chinese       Swedish       English       English       Japanese
Answers           “TREC 2003”   “TREC 2003”   TREC 2003     TREC 2005     QAC-1
1                 14 (7.0%)     23 (11.5%)    99 (24.0%)    64 (17.7%)    37 (18.5%)
5                 24 (12.0%)    34 (17.0%)    175 (42.4%)   121 (33.4%)   61 (30.5%)
10                33 (16.5%)    46 (23.0%)    196 (47.5%)   151 (41.7%)   69 (34.5%)
20                39 (19.5%)    50 (25.0%)    216 (52.3%)   167 (46.1%)   81 (40.5%)
Total Questions   200           200           413           362           200
Table 2: Performance of each language-specific system in the top 1, 5, 10 and 20 answers output.
It has been found for English and Japanese
that system performance consistently increases the
more documents that are used to search for an answer.
Experiments with English used $U = 500$
documents and for Japanese $U = 5000$ documents
were used. Due to time constraints, however, for
the Swedish experiments we used a maximum of
$U = 100$ documents. For the Chinese experiments
we were forced to use a maximum of only $U = 20$
documents. This was because the segmentation
of Chinese text into individual characters resulted
in a huge number of character combinations being
generated in the retrieval model and due to
memory limitations the only way round this problem
was to limit the number of documents used.
In addition, for an answer to be output the number
of times it had to occur in the data was set to
5 for Swedish and 2 for Chinese. These values
were based on the number of documents we were
using and our prior experience with English and
Japanese.
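The minimum-occurrence threshold can be pictured as a simple count filter over the candidate answer strings found in the downloaded documents. The Python sketch below uses invented candidates; only the threshold values 5 and 2 come from the text, and the function name is hypothetical.

```python
from collections import Counter

def frequent_candidates(candidates, min_count):
    """Keep only candidate answers seen at least min_count times."""
    counts = Counter(candidates)
    return {a: n for a, n in counts.items() if n >= min_count}

# Invented candidate occurrences extracted from retrieved documents.
candidates = ["1969"] * 5 + ["1970"] * 2 + ["1971"]
swedish_style = frequent_candidates(candidates, min_count=5)
chinese_style = frequent_candidates(candidates, min_count=2)
print(swedish_style, chinese_style)
```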
4.1 Test data
For evaluating the English and Japanese systems
there were obvious candidates for evaluation sets
well-known to the QA community since the TREC
and NTCIR QA evaluations have been running
now for several years. For the English experi-
ments we performed two sets of experiments us-
ing the TREC 2003 and TREC 2005 factoid QA
questions. For the Japanese experiments we used
the formal run of the NTCIR-3 QAC-1 task.
For Chinese and Swedish there were no such
standard test sets available. We therefore decided
to translate the first 200 questions in the TREC
2003 QA factoid task into Chinese and Swedish.
For Swedish this was relatively straightforward.
For Chinese, we didn’t know the Chinese trans-
lation of the names, places etc. for about 40%
of the original TREC questions. Such questions
were therefore “translated” into a roughly equiv-
alent question about something similar in China.
However, the final set of Chinese questions still
contained 17% of questions that had a U.S. back-
ground. Correct answers were found by a hu-
man to 91% and 93% of questions for Chinese
and Swedish, respectively. We hope that using
the TREC questions should at least give the reader
some idea of the kind of the questions that were
asked so they can appreciate the difficulty of find-
ing the answers in Chinese and Swedish web data.
To aid comprehension of the task we are tackling
we have made the questions and the answers we
used for evaluation freely available$^2$ for anyone
else interested in running similar experiments.
4.2 Results
The TREC, NTCIR and CLEF QA evaluations
have all converged on the use of exact and sup-
ported answers as a metric of QA system perfor-
mance. In web-based QA we are also concerned
with document support but primarily with answer
correctness so for now we only use answer correct-
ness as our evaluation metric. However, we also
adopt the notion of inexact answers to mark as in-
correct, answers that have extraneous or missing
parts such as units of measurement or in the case
of, say Chinese, an extra or missing character.
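The counting behind this kind of top-k evaluation can be sketched in Python as follows. The question identifiers, answers and function name are invented, and exact string matching against a set of accepted answers is an assumption standing in for the human judgement used in the paper; an answer with an extraneous part, such as "1945 AD" below, counts as incorrect.

```python
def top_k_correct(ranked_answers, gold, ks=(1, 5, 10, 20)):
    """Count questions with an exactly matching answer in the top k outputs."""
    totals = {k: 0 for k in ks}
    for qid, ranking in ranked_answers.items():
        for k in ks:
            if any(a in gold[qid] for a in ranking[:k]):
                totals[k] += 1
    return totals

# Invented system output and accepted answers.
ranked_answers = {
    "q1": ["TOKYO", "KYOTO"],
    "q2": ["1945 AD", "1945"],   # first answer is inexact, second is exact
    "q3": ["PARIS"],
}
gold = {"q1": {"TOKYO"}, "q2": {"1945"}, "q3": {"LYON"}}

totals = top_k_correct(ranked_answers, gold, ks=(1, 5))
percentages = {k: round(100 * n / len(gold), 1) for k, n in totals.items()}
print(totals, percentages)
```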
In Table 2 we present the number and percent-
age of answers that were marked correct in the top
1, 5, 10 and 20 answers output by each of the four
monolingual systems.
$^2$ http://asked.jp/About.html
5 Discussion
Probably the first observation one makes looking
at Table 2 is that the performance of the Chinese
and Swedish systems is significantly lower than
the English and Japanese systems. On the face
of it this is not so surprising since far more docu-
ments and many more hours went into developing
the English and Japanese systems for which the
system parameters were also optimised on held-
out question data prior to evaluation.
One significant issue common to our evaluation
of the Chinese and Swedish systems is the use of
a translation of the TREC 2003 questions. Al-
though it was hoped this would aid comprehen-
sion of the task we were attempting it had the
predictable problem that not enough, or any (in
some cases), relevant Chinese and Swedish web
data could be found to answer some questions.
Particularly problematic were U.S. centric ques-
tions about baseball and U.S. geography of which
there were around 5% and 10%, respectively, in
the Swedish test set. However, aside from the
choice of test data there are several other possi-
ble reasons for the drop in performance and we’ll
examine them separately for Chinese and Swedish
in the following sections.
5.1 Chinese
One of the most important issues in building a Chi-
nese QA system was deciding on the segmenta-
tion for text in which there are no spaces between
words. In contrast to our Japanese system where
we used a freely available segmentation tool we
chose to segment Chinese text into single char-
acters and to not even try to recover a notion of
words from the text. Clearly this is far from satis-
factory but the redundancy aspect of our approach
does compensate for this to some extent by favour-
ing frequent character sequences which are more
likely to represent words. The second major prob-
lem we had was downloading sufficient and rel-
evant data for each question. The largest culprit
here was our restriction on using only 20 docu-
ments, the reasons for which were explained in
Section 4. However, the quality of the documents
themselves was also a problem: only 47.5% of
questions had data in which the correct answer was
observed at least once in the data and this was re-
duced to 43.5% when an answer had to occur 2 or
more times. This places a much lower value on the
performance upper bound (i.e. 47.5%) compared
to English where it is typically 80-90% and con-
sequently puts the results we obtained in a more
positive light. Another interesting observation was
that there were very few inexact answers (10% in
the top 20) which shows that the modelling as-
sumption made in Equation (5) was not detrimen-
tal for Chinese even when segmenting by charac-
ters.
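To see what the lower upper bound means in practice, the short Python calculation below relates the Chinese top-20 result from Table 2 to the 47.5% of questions whose downloaded data contained a correct answer. This normalised figure is not reported in the paper; it is only a back-of-the-envelope reading of the numbers given above.

```python
total_questions = 200
top20_correct = 39                              # Chinese, top 20, Table 2
answerable = total_questions * 475 // 1000      # 47.5% of 200 questions
relative = round(100 * top20_correct / answerable, 1)
print(answerable, relative)
```

Under this reading, a correct answer appears in the top 20 for roughly four in ten of the questions that could have been answered at all from the downloaded data.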
5.2 Swedish
The general structure of Swedish and its charac-
ter set are very similar to English so there were
very few modifications that needed to be made
to port the English system to Swedish. In some
senses Swedish (and indeed many other European
languages) is much easier than English for
a purely statistical keyword-based system since
it does not have the same phenomenon that English
has where many$^3$ questions use does/did as
in “Who/when/where does/did ...?” etc. questions.
Swedish does, however, suffer from some of the
same problems as English such as the need to
re-order words in a question to produce a more
statement-like formulation. A more significant
problem is that Swedish is a language spoken by
only about 9 million people and correspondingly
there were far fewer training questions and an-
swers freely available on the web and much less
web-data available for finding answers. Although
our maximum number of documents was set to
100, we only downloaded 48 documents per ques-
tion on average (s.d. of 30). In a data-driven ap-
proach that relies to a large extent on data redun-
dancy and the philosophy that “the more data the
better” to achieve good performance this is always
likely to be problematic. Indeed, of the documents
that were downloaded for each question 67.5%
contained a correct answer. This was reduced
to 50% for correct answers occurring the threshold
number of 5 times or more. Clearly a more
linguistics-based system suffers less from this lack
of redundancy but as pointed out earlier the prob-
lem is simply transferred to the development stage
instead, where the initial data or linguistic knowl-
edge needed to train a system might well be lack-
ing.
$^3$ For example, 28% of TREC 2003 factoid questions.
6 Conclusion
In this paper we have shown that usable performance
from two prototype QA systems developed
for Chinese and Swedish could be obtained in a
matter of hours using our data-driven approach
to question answering. While the performance of
such systems was found to be below that of more
developed systems that used the same approach
we feel that this demonstrates the effectiveness of
our language independent approach for web-based
QA. One of the interesting observations from our
experiments is the inherent difficulty of finding an-
swers to questions when there is very little data
available in the target language in which to search
for answers. While one could argue that some-
one in Sweden or China may not be interested in
finding out the winning team of the 1996 World
Series this clearly misses the point and instead
makes a compelling argument for cross-language
QA (CLQA) whose profile has been steadily in-
creasing due to the recent CLEF and NTCIR eval-
uations. While most attempts at CLQA have so
far concentrated on translating the query into En-
glish and performing monolingual English QA on
the translated query, we feel that an approach that
also combines a simple system trained in the man-
ner described in this paper would almost certainly
benefit overall performance.
7 Further work
In the future we aim to improve the Chinese and
Swedish systems by increasing the amount and
quality of data used for training and also the data
used for searching for answers as well as solving
the explosion in memory usage that occurred with
large amounts of Chinese text. We also intend to
optimise the system parameters for each system
using the questions and answers used for training
with a rotating form of cross-validation so as to
maximise use of the little data we have available.
Given the rapid development time and reasonable
performance of the approach outlined in this paper
combined with the fact that specialized linguistic
knowledge is unnecessary we also plan the devel-
opment of yet more monolingual systems in other
languages and hope to participate in the upcoming
CLEF QA track.
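One way to realise a rotating form of cross-validation over the example q-and-a is sketched below in Python. The number of folds, the interleaved assignment of items to folds and the function name are assumptions; the text does not specify how the rotation would be carried out.

```python
def rotating_folds(items, k):
    """Yield (train, held_out) splits so each item is held out exactly once."""
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out

qa_pairs = list(range(10))          # stand-ins for example q-and-a pairs
folds = list(rotating_folds(qa_pairs, k=5))
print(folds[0])
```

Each fold would be used in turn to tune the system parameters while the remaining pairs serve as the training q-and-a.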
8 Acknowledgments
This research was supported by JSPS and the
Japanese government 21st century COE programme.
The authors also wish to thank Dietrich
Klakow for his contributions to the model development.

References
E. Brill, S. Dumais, and M. Banko. 2002. An
Analysis of the AskMSR Question-answering Sys-
tem. In Proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing
(EMNLP).
M. Fuchigami, H. Ohnuma, and A. Ikeno. 2004. Oki
QA System for QAC-2. In Proceedings of NTCIR-4
Workshop.
J. Fukumoto, T. Kato, and F. Masui. 2002. Ques-
tion Answering Challenge (QAC-1) An Evaluation
of Question Answering Task at NTCIR Workshop 3.
In Proceedings of NTCIR-3 Workshop.
E. Hovy, U. Hermjakob, and C.-Y. Lin. 2001. The Use
of External Knowledge in Factoid QA. In Proceedings
of the TREC 2001 Conference.
S. M. Katz. 1987. Estimation of Probabilities from
Sparse Data for the Language Model Component of
a Speech Recognizer. IEEE Transactions on Acoustics,
Speech, and Signal Processing, 35(3):400–401.
J. Lin, A. Fernandes, B. Katz, G. Marton, and S. Tellex.
2002. Extracting Answers from the Web Us-
ing Knowledge Annotation and Knowledge Min-
ing Techniques. In Proceedings of the Eleventh
Text REtrieval Conference (TREC2002), Gaithers-
burg, Maryland, November.
D. Moldovan, S. Harabagiu, R. Girju, P. Morarescu,
F. Lacatusu, A. Novischi, A. Badulescu, and
O. Bolohan. 2002. LCC Tools for Question An-
swering. In Proceedings of the TREC 2002 Confer-
ence.
Willie Rogers. 2000. TREC Mandarin. Linguistic
Data Consortium.
Sprakbanken Gothenburg University. 1997.
PAROLE Swedish Language Corpus.
http://spraakbanken.gu.se/lbeng.html.
E. Voorhees. 2002. Overview of the TREC 2002 Ques-
tion Answering Track. In Proceedings of the TREC
2002 Conference.
E. Voorhees. 2003. Overview of the TREC 2003 Ques-
tion Answering Track. In Proceedings of the TREC
2003 Conference.
E.W.D. Whittaker, P. Chatain, S. Furui, and D. Klakow.
2005a. TREC2005 Question Answering Experi-
ments at Tokyo Institute of Technology. In Proceed-
ings of the 14th Text Retrieval Conference.
E.W.D. Whittaker, S. Furui, and D. Klakow. 2005b.
A Statistical Pattern Recognition Approach to Ques-
tion Answering using Web Data. In Proceedings of
Cyberworlds.
E.W.D. Whittaker, J. Hamonic, and S. Furui. 2005c. A
Unified Approach to Japanese and English Question
Answering. In Proceedings of NTCIR-5.
