Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 307–314, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Combining Deep Linguistics Analysis and Surface Pattern Learning:
A Hybrid Approach to Chinese Definitional Question Answering
Fuchun Peng, Ralph Weischedel, Ana Licuanan, Jinxi Xu
BBN Technologies
50 Moulton Street, Cambridge, MA, 02138
a0 fpeng, rweisched, alicuan, jxu
a1 @bbn.com
Abstract
We explore a hybrid approach for Chinese
definitional question answering by com-
bining deep linguistic analysis with sur-
face pattern learning. We answer four
questions in this study: 1) How helpful are
linguistic analysis and pattern learning? 2)
What kind of questions can be answered
by pattern matching? 3) How much an-
notation is required for a pattern-based
system to achieve good performance? 4)
What linguistic features are most useful?
Extensive experiments are conducted on
biographical questions and other defini-
tional questions. Major findings include:
1) linguistic analysis and pattern learning
are complementary; both are required to
make a good definitional QA system; 2)
pattern matching is very effective in an-
swering biographical questions while less
effective for other definitional questions;
3) only a small amount of annotation is
required for a pattern learning system to
achieve good performance on biographi-
cal questions; 4) the most useful linguistic
features are copulas and appositives; re-
lations also play an important role; only
some propositions convey vital facts.
1 Introduction
Due to the ever increasing large amounts of online
textual data, learning from textual data is becom-
ing more and more important. Traditional document
retrieval systems return a set of relevant documents
and leave the users to locate the specific information
they are interested in. Question answering, which
combines traditional document retrieval and infor-
mation extraction, solves this problem directly by
returning users the specific answers. Research in
textual question answering has made substantial ad-
vances in the past few years (Voorhees, 2004).
Most question answering research has been focus-
ing on factoid questions where the goal is to return
a list of facts about a concept. Definitional ques-
tions, however, remain largely unexplored. Defini-
tional questions differ from factoid questions in that
the goal is to return the relevant “answer nuggets”
of information about a query. Identifying such an-
swer nuggets requires more advanced language pro-
cessing techniques. Definitional QA systems are
not only interesting as a research challenge. They
also have the potential to be a valuable comple-
ment to static knowledge sources like encyclopedias.
This is because they create definitions dynamically,
and thus answer definitional questions about terms
which are new or emerging (Blair-Goldensoha et
al., 2004).
One success in factoid question answering
is pattern based systems, either manually con-
structed (Soubbotin and Soubbotin, 2002) or ma-
chine learned (Cui et al., 2004). However, it is
unknown whether such pure pattern based systems
work well on definitional questions where answers
are more diverse.
Deep linguistic analysis has been found useful in
factoid question answering (Moldovan et al., 2002)
and has been used for definitional questions (Xu et
al., 2004; Harabagiu et al., 2003). Linguistic analy-
307
sis is useful because full parsing captures long dis-
tance dependencies between the answers and the
query terms, and provides more information for in-
ference. However, merely linguistic analysis may
not be enough. First, current state of the art lin-
guistic analysis such as parsing, co-reference, and
relation extraction is still far below human perfor-
mance. Errors made in this stage will propagate and
lower system accuracy. Second, answers to some
types of definitional questions may have strong local
dependencies that can be better captured by surface
patterns. Thus we believe that combining linguistic
analysis and pattern learning would be complemen-
tary and be beneficial to the whole system.
Work in combining linguistic analysis with pat-
terns include Weischedel et al. (2004) and Jijkoun et
al. (2004) where manually constructed patterns are
used to augment linguistic features. However, man-
ual pattern construction critically depends on the do-
main knowledge of the pattern designer and often
has low coverage (Jijkoun et al., 2004). Automatic
pattern derivation is more appealing (Ravichandran
and Hovy, 2002).
In this work, we explore a hybrid approach to
combining deep linguistic analysis with automatic
pattern learning. We are interested in answering
the following four questions for Chinese definitional
question answering:
a0 How helpful are linguistic analysis and pattern
learning in definitional question answering?
a0 If pattern learning is useful, what kind of ques-
tion can pattern matching answer?
a0 How much human annotation is required for a
pattern based system to achieve reasonable per-
formance?
a0 If linguistic analysis is helpful, what linguistic
features are most useful?
To our knowledge, this is the first formal study of
these questions in Chinese definitional QA. To an-
swer these questions, we perform extensive experi-
ments on Chinese TDT4 data (Linguistic Data Con-
sortium, 2002-2003). We separate definitional ques-
tions into biographical (Who-is) questions and other
definitional (What-is) questions. We annotate some
question-answer snippets for pattern learning and
we perform deep linguistic analysis including pars-
ing, tagging, name entity recognition, co-reference,
and relation detection.
2 A Hybrid Approach to Definitional Ques-
tion Answering
The architecture of our QA system is shown in Fig-
ure 1. Given a question, we first use simple rules to
classify it as a “Who-is” or “What-is” question and
detect key words. Then we use a HMM-based IR
system (Miller et al., 1999) for document retrieval
by treating the question keywords as a query. To
speed up processing, we only use the top 1000 rel-
evant documents. We then select relevant sentences
among the returned relevant documents. A sentence
is considered relevant if it contains the query key-
word or contains a word that is co-referent to the
query term. Coreference is determined using an in-
formation extraction engine, SERIF (Ramshaw et
al., 2001). We then conduct deep linguistic anal-
ysis and pattern matching to extract candidate an-
swers. We rank all candidate answers by predeter-
mined feature ordering. At the same time, we per-
form redundancy detection based on a1 -gram over-
lap.
2.1 Deep Linguistic Analysis
We use SERIF (Ramshaw et al., 2001), a linguistic
analysis engine, to perform full parsing, name entity
detection, relation detection, and co-reference reso-
lution. We extract the following linguistic features:
1. Copula: a copula is a linking verb such as “is”
or “become”. An example of a copula feature
is “Bill Gates is the CEO of Microsoft”. In this
case, “CEO of Microsoft” will be extracted as
an answer to “Who is Bill Gates?”. To extract
copulas, SERIF traverses the parse trees of the
sentences and extracts copulas based on rules.
In Chinese, the rule for identifying a copula is
the POS tag “VC”, standing for “Verb Copula”.
The only copula verb in Chinese is “ a2 ”.
2. Apposition: appositions are a pair of noun
phrases in which one modifies the other. For
example, In “Tony Blair, the British Prime Min-
ister, ...”, the phrase “the British Prime Min-
ister” is in apposition to “Blair”. Extraction
of appositive features is similar to that of cop-
ula. SERIF traverses the parse tree and iden-
tifies appositives based on rules. A detailed
description of the algorithm is documented
308
Question Classification
Document Retrieval
Linguistic Analysis
Semantic Processing
Phrase Ranking
Redundancy Remove
Lists of Response
Answer Annotation
Name Tagging
Parsing
Preposition finding
Co−reference
Relation Extraction Training data
TreeBank
Name Annotation
Linguistic motivated
Pattern motivated
Question
Pattern MatchingPattern Learning
Figure 1: Question answering system structure
in (Ramshaw et al., 2001).
3. Proposition: propositions represent predicate-
argument structures and take the form:
predicate(a0a2a1 a3a5a4a7a6 : a8a9a0a11a10 a6 , ..., a0a2a1 a3a5a4a13a12 : a8a14a0a15a10 a12 ). The
most common roles include logical subject,
logical object, and object of a prepositional
phrase that modifies the predicate. For ex-
ample, “Smith went to Spain” is represented
as a proposition, went(logical subject: Smith,
PP-to: Spain).
4. Relations: The SERIF linguistic analysis en-
gine also extracts relations between two ob-
jects. SERIF can extract 24 binary relations
defined in the ACE guidelines (Linguistic Data
Consortium, 2002), such as spouse-of, staff-of,
parent-of, management-of and so forth. Based
on question types, we use different relations, as
listed in Table 1.
Relations used for Who-Is questions
ROLE/MANAGEMENT, ROLE/GENERAL-STAFF,
ROLE/CITIZEN-OF, ROLE/FOUNDER,
ROLE/OWNER, AT/RESIDENCE,
SOC/SPOUSE, SOC/PARENT,
ROLE/MEMBER, SOC/OTHER-PROFESSIONAL
Relation used for What-Is questions
AT/BASED-IN, AT/LOCATED, PART/PART-OF
Table 1: Relations used in our system
Many relevant sentences do not contain the query
key words. Instead, they contain words that are co-
referent to the query. For example, in “Yesterday UN
Secretary General Anan Requested Every Side...,
He said ... ”. The pronoun “He” in the second sen-
tence refers to “Anan” in the first sentence. To select
such sentences, we conduct co-reference resolution
using SERIF.
In addition, SERIF also provides name tagging,
identifying 29 types of entity names or descriptions,
such as locations, persons, organizations, and dis-
eases.
We also select complete sentences mentioning the
term being defined as backup answers if no other
features are identified.
The component performance of our linguistic
analysis is shown in Table 2.
Pre. Recall F
Parsing 0.813 0.828 0.820
Co-reference 0.920 0.897 0.908
Name-entity detection 0.765 0.753 0.759
Table 2: Linguistic analysis component performance
for Chinese
2.2 Surface Pattern Learning
We use two kinds of patterns: manually constructed
patterns and automatically derived patterns. A man-
ual pattern is a commonly used linguistic expression
that specifies aliases, super/subclass and member-
ship relations of a term (Xu et al., 2004). For exam-
ple, the expression “tsunamis, also known as tidal
waves” gives an alternative term for tsunamis. We
309
use 23 manual patterns for Who-is questions and 14
manual patterns for What-is questions.
We also classify some special propositions as
manual patterns since they are specified by compu-
tational linguists. After a proposition is extracted,
it is matched against a list of predefined predicates.
If it is on the list, it is considered special and will
be ranked higher. In total, we designed 22 spe-
cial propositions for Who-is questions, such as a0
a1 (become),
a2a4a3
a1 (elected as), and
a5a7a6 (resign),
14 for What-is questions, such as a8a10a9 (located at),
a11a13a12
a9 (created at), and a14a16a15
a1 (also known as).
However, it is hard to manually construct such
patterns since it largely depends on the knowledge
of the pattern designer. Thus, we prefer patterns
that can be automatically derived from training data.
Some annotators labeled question-answer snippets.
Given a query question, the annotators were asked
to highlight the strings that can answer the question.
Though such a process still requires annotators to
have knowledge of what can be answers, it does not
require a computational linguist. Our pattern learn-
ing procedure is illustrated in Figure 2.
Generate Answer Snippet
Pattern Generalization
Pattern Selection
POS Tagging
Merging POS Tagging
and Answer Tagging
Answer Annotation
Figure 2: Surface Pattern Learning
Here we give an example to illustrate how pat-
tern learning works. The first step is annotation. An
example of Chinese answer annotation with English
translation is shown in Figure 3. Question words are
assigned the tag QTERM, answer words are tagged
ANSWER, and all other words are assigned BKGD,
standing for background words (not shown in the ex-
ample to make the annotation more readable).
To obtain patterns, we conduct full parsing to ob-
tain the full parse tree for a sentence. In our current
Chinese annotation: a17a19a18a21a20a23a22 “a24a21a25a23a26a21a27a21a28 ”a29 (a30a32a31a23a31a34a33
a35 ANSWER)(
a36a38a37a40a39a19a41a40a42 QTERM),a43a45a44a47a46a48a18a19a20a19a49a38a50
a51a21a52a54a53
a22a40a24a19a55a19a56a45a57a45a58a19a59
English translation: (U.S. Secretary of the State ANWER)
(Albright QTERM), who visited North Korea for the “ice-
breaking trip”, had a historical meeting with the leader of
North Korea, Kim Jong Il.
Figure 3: Answer annotation example
patterns, we only use POS tagging information, but
other higher level information could also be used.
The segmented and POS tagged sentence is shown
in Figure 4. Each word is assigned a POS tag as
defined by the Penn Chinese Treebank guidelines.
(a17 P)(a18 a20 NR)(a22 a24 VV)(“ PU)(a25 VV)(a26
NN)(a27a60a28 NN)(” PU)(a29 DEC)(a30a61a31 NR)(a31a7a33
a35
NR)(a36a60a37 NR)(a39a60a41a60a42 NR)(, PU) (a43a62a44 NT)(a46
DT)(a18a21a20 NR)(a49a21a50 NR)(a51a63a52a4a53 NN)(a64a23a24 VV)(a55a21a56
a57 JJ)(a58a19a59 NN).
Figure 4: POS tagging
Next we combine the POS tags and the answer
tags by appending these two tags to create a new tag,
as shown in Figure 5.
(a17 P/BKGD)(a18a65a20 NR/BKGD)(a22a65a24 VV/BKGD)(“
PU/BKGD)(a25 VV/BKGD)(a26 NN/BKGD)(a27 a28
NN/BKGD)(” PU/BKGD)(a29 DEC/BKGD)(a30
a31 NR/ANSWER)(a31 a33
a35 NR/ANSWER)(
a36 a37
NR/QTERM)(a39a13a41a66a42 NR/QTERM)(, PU/BKGD) (a43
a44 NT/BKGD)(a46 DT/BKGD)(a18a67a20 NR/BKGD)(a49a68a50
NR/BKGD)(a69 a52a16a53 NN/BKGD)(a22a21a24 VV/BKGD)(a55a70a56
a57 JJ/BKGD)(a58a19a59 NN/BKGD)
Figure 5: Combined POS and Answer tagging
We can then obtain an answer snippet from this
training sample. Here we obtain the snippet (a71a73a72
a72a75a74a77a76 NR/ANSWER)(TERM).
We generalize a pattern using three heuristics (this
particular example does not generalize). First, we
replace all Chinese sequences longer than 3 charac-
ters with their POS tags, under the theory that long
sequences are too specific. Second, we also replace
NT (time noun, such as a78a4a79 ), DT (determiner, such
as a80 ,a81 ), cardinals (CD, such as a82 ,a83 ,a84 ) and M
310
(measurement word such as a0a2a1a4a3 ) with their POS
tags. Third, we ignore adjectives.
After obtaining all patterns, we run them on the
training data to calculate their precision and recall.
We select patterns whose precision is above 0.6 and
which fire at least 5 times in training data (parame-
ters are determined with a held out dataset).
3 Experiments
3.1 Data Sets
We produced a list of questions and asked annota-
tors to identify answer snippets from TDT4 data. To
produce as many training answer snippets as pos-
sible, annotators were asked to label answers ex-
haustively; that is, the same answer can be labeled
multiple times in different places. However, we re-
move duplicate answers for test questions since we
are only interested in unique answers in evaluation.
We separate questions into two types, biographi-
cal (Who-is) questions, and other definitional ques-
tions (What-is). For “Who-is” questions, we used
204 questions for pattern learning, 10 for parame-
ter tuning and another 42 questions for testing. For
“What-is” questions, we used 44 for training and an-
other 44 for testing.
3.2 Evaluation
The TREC question answering evaluation is based
on human judgments (Voorhees, 2004). However,
such a manual procedure is costly and time consum-
ing. Recently, researchers have started automatic
question answering evaluation (Xu et al., 2004;
Lin and Demner-Fushman, 2005; Soricut and Brill,
2004). We use Rouge, an automatic evaluation met-
ric that was originally used for summarization eval-
uation (Lin and Hovy, 2003) and was recently found
useful for evaluating definitional question answer-
ing (Xu et al., 2004). Rouge is based on a1 -gram
co-occurrence. An a1 -gram is a sequence of a1 con-
secutive Chinese characters.
Given a reference answer a5 and a system answer
a6 , the Rouge score is defined as follows:
a7a9a8a11a10a13a12a15a14a17a16a18a7a20a19a22a21a23a19a25a24a27a26a29a28a31a30
a32a33
a33
a34 a35
a36
a37a39a38a41a40a41a42
a8a43a10a13a44a46a45a22a47a49a48a51a50a53a52a22a54a13a16a18a7a20a19a55a21a41a19a25a44a56a26
a42
a8a43a10a13a44a46a45a57a16a18a7a20a19a58a44a56a26
where a59 is the maximum length of a1 -grams,
a60
a1a13a61 a1a63a62a57a64a20a65a67a66a18a68a55a69a71a70a22a5a73a72
a6
a72 a1a75a74 is the number of common a1 -
grams of a5 and a6 , and a60 a1a13a61 a1a63a62a17a70a22a5a73a72 a1a75a74 is the number
of a1 -grams in a5 . If a59 is too small, stop words and
bi-grams of such words will dominate the score; If
a59 is too large, there will be many questions without
answers. We select a59 to be 3, 4, 5 and 6.
To make scores of different systems comparable,
we truncate system output for the same question
by the same cutoff length. We score answers trun-
cated at length a76 times that of the reference answers,
where a76 is set to be 1, 2, and 3. The rationale is that
people would like to read at least the same length
of the reference answer. On the other hand, since
the state of the art system answer is still far from
human performance, it is reasonable to produce an-
swers somewhat longer than the references (Xu et
al., 2004).
In summary, we run experiments with parameters
a59a78a77a80a79a81a72a51a82a23a72a11a83a81a72a11a84 and a76a85a77a4a86a87a72a11a88a81a72a11a79 , and take the average
over all of the 12 runs.
3.3 Overall Results
We set the pure linguistic analysis based system as
the baseline and compare it to other configurations.
Table 3 and Table 4 show the results on “Who-is”
and “What-is” questions respectively. The baseline
(Run 1) is the result of using pure linguistic features;
Run 2 is the result of adding manual patterns to the
baseline system; Run 3 is the result of using learned
patterns only. Run 4 is the result of adding learned
patterns to the baseline system. Run 5 is the result
of adding both manual patterns and learned patterns
to the system.
The first question we want to answer is how help-
ful the linguistic analysis and pattern learning are
for definitional QA. Comparing Run 1 and 3, we
can see that both pure linguistic analysis and pure
pattern based systems achieve comparable perfor-
mance; Combining them together improves perfor-
mance (Run 4) for “who is” questions, but only
slightly for “what is” questions. This indicates that
linguistic analysis and pattern learning are comple-
mentary to each other, and both are helpful for bio-
graphical QA.
The second question we want to answer is what
kind of questions can be answered with pattern
matching. From these two tables, we can see
that patterns are very effective in “Who-is” ques-
tions while less effective in “What-is” questions.
Learned patterns improve the baseline from 0.3399
311
to 0.3860; manual patterns improve the baseline to
0.3657; combining both manual and learned patterns
improve it to 0.4026, an improvement of 18.4%
compared to the baseline. However, the effect of
patterns on “What-is” is smaller, with an improve-
ment of only 3.5%. However, the baseline perfor-
mance on “What-is” is also much worse than that
of “Who-is” questions. We will analyze the reasons
in Section 4.3. This indicates that answering gen-
eral definitional questions is much more challenging
than answering biographical questions and deserves
more research.
Run Run description Rouge
(1) Baseline 0.3399
(2) (1)+ manual patterns 0.3657
(3) Learned patterns 0.3549
(4) (1)+ learned patterns 0.3860
(5) (2)+ learned patterns 0.4026
Table 3: Results on Who-is (Biographical) Ques-
tions
Run Run description Rouge
(1) Baseline 0.2126
(2) (1)+ manual patterns 0.2153
(3) Learned patterns 0.2117
(4) (1)+ learned patterns 0.2167
(5) (2)+ learned patterns 0.2201
Table 4: Results on “What-is” (Other Definitional)
Questions
4 Analysis
4.1 How much annotation is needed
The third question is how much annotation is needed
for a pattern based system to achieve good perfor-
mance. We run experiments with portions of train-
ing data on biographical questions, which produce
different number of patterns. Table 5 shows the de-
tails of the number of training snippets used and the
number of patterns produced and selected. The per-
formance of different system is illustrated in Fig-
ure 6. With only 10% of the training data (549 snip-
pets, about two person days of annotation), learned
patterns achieve good performance of 0.3285, con-
sidering the performance of 0.3399 of a well tuned
system with deep linguistic features. Performance
saturates with 2742 training snippets (50% train-
ing, 10 person days annotation) at a Rouge score
of 0.3590, comparable to the performance of a well
tuned system with full linguistic features and man-
ual patterns (Run 2 in Table 3). There could even
be a slight, insignificant performance decrease with
more training data because our sampling is sequen-
tial instead of random. Some portions of training
data might be more useful than others.
Training Patterns Patterns
snippets learned selected
10% train 549 56 33
30% train 1645 144 88
50% train 2742 211 135
70% train 3839 281 183
90% train 4935 343 222
100% train 5483 381 266
Table 5: Number of patterns with different size of
training data
Figure 6: How much annotation is required (mea-
sured on biographical questions)
4.2 Contributions of different features
The fourth question we want to answer is: what fea-
tures are most useful in definitional question answer-
ing? To evaluate the contribution of each individ-
ual feature, we turn off all other features and test
the system on a held out data (10 questions). We
calculate the coverage of each feature, measured by
Rouge. We also calculate the precision of each fea-
ture with the following formula, which is very sim-
ilar to Rouge except that the denominator here is
based on system output a60 a1a13a61 a1a63a62a17a70 a6 a72 a1a75a74 instead of ref-
erence a60 a1a13a61 a1a63a62a17a70a22a5a73a72 a1a75a74 . The notations are the same as
312
those in Rouge.
a0a2a1a43a14
a42
a3a5a4a6a3a8a43a44 a16a18a7a20a19a55a21a41a19a58a24a27a26 a28 a30
a32a33
a33
a34 a35
a36
a37a39a38a41a40 a42
a8a43a10a87a44 a45 a47a49a48a51a50a53a52a22a54 a16a18a7 a19a58a21a41a19a25a44a56a26
a42
a8a11a10a87a44 a45 a16 a21a41a19a58a44a56a26
Figure 7 is the precision-recall scatter plot of the
features measured on “who is” questions. Interest-
ingly, the learned patterns have the highest coverage
and precision. The copula feature has the second
highest precision; however, it has the lowest cover-
age. This is because there are not many copulas in
the dataset. Appositive and manual pattern features
have the same level of contribution. Surprisingly,
the relation feature has a high coverage. This sug-
gests that relations could be more useful if relation
detection were more accurate; general propositions
are not more useful than whole sentences since al-
most every sentence has a proposition, and since the
high value propositions are identified by the lexical
head of the proposition and grouped with the manual
patterns.
Figure 7: Feature precision recall scatter plot (mea-
sured on the biographical questions)
4.3 Who-is versus What-is questions
We have seen that “What-is” questions are more
challenging than “Who-is” questions. We compare
the precision and coverage of each feature for “Who-
is” and “What-is” in Table 6 and Table 7. We see that
although the precisions of the features are higher
for “What-is”, their coverage is too low. The most
useful features for “What-is” questions are propo-
sitions and raw sentences, which are the worst two
features for “Who-is”. Basically, this means that
most of the answers for “What-is” are from whole
sentences. Neither linguistic analysis nor pattern
matching works as efficiently as in biographical
questions.
feature who-is what-is
copula 0.567 0.797
appositive 0.3460 0.3657
proposition 0.1162 0.1837
relation 0.3509 0.4422
sentence 0.1074 0.1556
learned patterns 0.6542 0.6858
Table 6: Feature Precision Comparison
feature who-is what-is
copula 0.055 0.049
appositive 0.2028 0.0026
proposition 0.2101 0.1683
relation 0.2722 0.043
sentence 0.1619 0.1717
learned patterns 0.3517 0.0860
Table 7: Feature Coverage Comparison
To identify the challenges of “What-is” questions,
we conducted an error analysis. The answers for
“What-is” are much more diverse and are hard to
capture. For example, the reference answers for the
question of “a7a9a8 a2a11a10a13a12a15a14a17a16a19a18a21a20 / What is the in-
ternational space station?” include the weight of the
space station, the distance from the space station to
the earth, the inner structure of the space station, and
the cost of its construction. Such attributes are hard
to capture with patterns, and they do not contain any
of the useful linguistic features we currently have
(copula, appositive, proposition, relation). Identify-
ing more useful features for such answers remains
for future work.
5 Related Work
Ravichandran and Hovy (2002) presents a method
that learns patterns from online data using some seed
questions and answer anchors. The advantage is
that it does not require human annotation. How-
ever, it only works for certain types of questions that
313
have fixed anchors, such as “where was X born”.
For general definitional questions, we do not know
what the anchors should be. Thus we prefer using
small amounts of human annotation to derive pat-
terns. Cui et al. (2004) uses a similar approach for
unsupervised pattern learning and generalization to
soft pattern matching. However, the method is actu-
ally used for sentence selection rather than answer
snippet selection. Combining information extrac-
tion with surface patterns has also seen some suc-
cess. Jikoun et al. (2004) shows that information
extraction can help improve the recall of a pattern
based system. Xu et al. (2004) also shows that man-
ually constructed patterns are very important in an-
swering English definitional questions. Hildebrandt
et al. (2004) uses manual surface patterns for tar-
get extraction to augment database and dictionary
lookup. Blair-Goldensohn et al. (2004) apply su-
pervised learning for definitional predicates and then
apply summarization methods for question answer-
ing.
6 Conclusions and Future Work
We have explored a hybrid approach for definitional
question answering by combining deep linguistic
analysis and surface pattern learning. For the first
time, we have answered four questions regarding
Chinese definitional QA: deep linguistic analysis
and automatic pattern learning are complementary
and may be combined; patterns are powerful in an-
swering biographical questions; only a small amount
of annotation (2 days) is required to obtain good per-
formance in a biographical QA system; copulas and
appositions are the most useful linguistic features;
relation extraction also helps.
Answering “What-is” questions is more challeng-
ing than answering “Who-is” questions. To improve
the performance on “What-is” questions, we could
divide “What-is” questions into finer classes such
as organization, location, disease, and general sub-
stance, and process them specifically.
Our current pattern matching is based on simple
POS tagging which captures only limited syntactic
information. We generalize words to their corre-
sponding POS tags. Another possible improvement
is to generalize using automatically derived word
clusters, which provide semantic information.
Acknowledgements This material is based upon work sup-
ported by the Advanced Research and Development Activity
(ARDA) under Contract No. NBCHC040039. We are grate-
ful to Linnea Micciulla for proof reading and three anonymous
reviewers for suggestions on improving the paper.
References
S. Blair-Goldensoha, K. McKeown, and A. Hazen
Schlaikjer. 2004. Answering Definitional Questions:
A Hybrid Approach. New Directions In Question An-
swering., pages 47–58.
H. Cui, M. Kan, and T. Chua. 2004. Unsupervised
Learning of Soft Patterns for Definitional Question
Answering. In WWW 2004, pages 90–99.
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden,
J. Williams, and J. Bensley. 2003. Answer Mining
by Combining Extraction Techniques with Abductive
Reasoning. In TREC2003 Proceedings.
W. Hildebrandt, B. Katz, and J. Lin. 2004. Answer-
ing Definition Questions with Multiple Knowledge
Sources. In HLT-NAACL 2004, pages 49–56.
V. Jijkoun, M. Rijke, and J. Mur. 2004. Information
Extraction for Question Answering: Improving Recall
Through Syntactic Patterns. In COLING 2004.
J. Lin and D. Demner-Fushman. 2005. Automati-
cally Evaluating Answers to Definition Questions. In
ACL2005. to appear.
C. Lin and E. Hovy. 2003. Automatic Evaluation of
Summaries Using N-gram Co-occurrence Statistics.
In HLT-NAACL 2003.
D. Miller, T. Leek, and R. Schwartz. 1999. A Hidden
Markov Model Information Retrieval System. In SI-
GIR 1999, pages 214 – 221.
D. Moldovan, M. Pasca, S. Harabagiu, and M. Sur-
deanu. 2002. Performance Issues and Error Analysis
in an Open-Domain Question Answering System. In
ACL2002.
L. Ramshaw, E. Boshee, S. Bautus, S. Miller, R. Stone,
R. Weischedel, and A. Zamanian. 2001. Experi-
ments in Multi-Model Automatic Content Extraction.
In HLT2001.
D. Ravichandran and E. Hovy. 2002. Learning surface
text patterns for a Question Answering System. In
ACL2002, pages 41–47.
R. Soricut and E. Brill. 2004. A Unified Framework For
Automatic Evaluation Using N-Gram Co-occurrence
Statistics. In ACL 2004, pages 613–620.
M. Soubbotin and S. Soubbotin. 2002. Use of Patterns
for Detection of Likely Answer Strings: A Systematic
Approach. In TREC2002 Proceedings.
E. Voorhees. 2004. Overview of the TREC 2003 Ques-
tion Answering Track. In TREC Proceedings.
R. Weischedel, J. Xu, and A. Licuanan. 2004. A Hybrid
Approach to Answering Biographical Questions. New
Directions In Question Answering., pages 59–70.
J. Xu, R. Weischedel, and A. Licuanan. 2004. Evaluation
of an Extraction-based Approach to Answering Defini-
tional Questions. In SIGIR 2004, pages 418–424.
314
