Automatic Derivation of Surface Text Patterns for a Maximum Entropy
Based Question Answering System
Deepak Ravichandrana0
USC Information Sciences Institute
4676, Admiralty Way
Marina del Rey, CA, 90292
ravichan@isi.edu
Abraham Ittycheriah and Salim Roukos
IBM TJ Watson Research Center
Yorktown Heights, NY, 10598
a1 abei,roukos
a2 @us.ibm.com
Abstract
In this paper we investigate the use of surface
text patterns for a Maximum Entropy based
Question Answering (QA) system. These text
patterns are collected automatically in an un-
supervised fashion using a collection of trivia
question and answer pairs as seeds. These pat-
terns are used to generate features for a statis-
tical question answering system. We report our
results on the TREC-10 question set.
1 Introduction
Several QA systems have investigated the use of text pat-
terns for QA (Soubbotin and Soubbotin, 2001), (Soub-
botin and Soubbotin, 2002), (Ravichandran and Hovy,
2002). For example, for questions like “When was
Gandhi born?”, typical answers are “Gandhi was born in
1869” and “Gandhi (1869-1948)”. These examples sug-
gest that the text patterns such as “a3 NAMEa4 was born in
a3 BIRTHDATEa4 ” and “a3 NAMEa4 (a3 BIRTHDATEa4 -
a3 DEATHYEARa4 )” when formulated as regular expres-
sions, can be used to select the answer phrase to ques-
tions. Another approach to a QA system is learning cor-
respondences between question and answer pairs. IBM’s
Statistical QA (Ittycheriah et al., 2001a) system uses a
probabilistic model trainable from Question-Answer sen-
tence pairs. The training is performed under a Maximum
Entropy model, using bag of words, syntactic and name
entity features. This QA system does not employ the use
of patterns. In this paper, we explore the inclusion of
surface text patterns into the framework of a statistical
question answering system.
2 KM Corpus
A corpus of question-answer pairs was obtained from
Knowledge Master (1999). We refer to this corpus as the
1Work done while the author was an intern at IBM TJ Wat-
son Research Center during Summer 2002.
KM database. Each of the pairs in KM represents a trivia
question and its corresponding answer, such as the ones
used in the trivia card game. The question-answer pairs
in KM were filtered to retain only questions that look
similar to the ones presented in the TREC task2. Some
examples of QA pairs in KM:
1. Which country was invaded by the Libyan troops in
1983? - Chad
2. Who led the 1930 Salt March in India? - Mohandas
Gandhi
3 Unsupervised Construction of Training
Set for Pattern Extraction
We use an unsupervised technique that uses the QA in
KM as seeds to learn patterns. This method was first de-
scribed in Ravichandran and Hovy (2002). However, in
this work we have enriched the pattern format by induc-
ing specific semantic types of QTerms, and have learned
many more patterns using the KM.
3.1 Algorithm for sentence construction
1. For every question, we run a Named Entity Tagger
HMMNE 3 and identify chunks of words, that sig-
nify entities. Each such entity obtained from the
Question is defined as a Question term (QTerm).
The Answer Term (ATerm) is the Answer given by
the KM corpus.
2. Each of the question-answer pairs is submitted as
query to a popular Internet search engine4. We use
the top 50 relevant documents after stripping off the
HTML tags. The text is then tokenized to smoothen
white space variations and chopped to individual
sentences.
3. For every sentence obtained from Step (3) apply
2This was done by retaining only those questions that had
10 words or less, and were not multiple choice.
3In these experiments we use HMMNE, a named entity tag-
ger similar to the BBN’s Identifinder HMM Tagger (Bikel et al.,
1999).
4Alta Vista http://www.altavista.com
HMMNE and retain only those sentences that con-
tains at least one of the QTerms plus the ATerm.
For example, we obtain the following sentences for the
QA pair “Which country was invaded by the Libyan
troops in 1983? - Chad”:
1. More than 7,000 Libyan troops entered Chad.
2. An OUA peacekeeping force of 3,500 troops replaced the
Libyan forces in the remainder of Chad.
3. In the summer of 1983, GUNT forces launched an offen-
sive against government positions in northern and eastern
Chad.
The underlined words indicate the QTerms and the
ATerms that helped to select the sentence as a potential
way of answering the Question. The algorithm described
above was applied to each of the 16,228 QA pairs in our
KM database. A total of more than 250K sentences was
obtained.
3.2 Sentence Canonicalization
Every sentence obtained from the sentence construction
algorithm is canonicalized. Canonicalization of a sen-
tence is performed on the basis of the information pro-
vided by HMMNE, the QTerms and the ATerm. Canon-
icalization in this context may be defined as the general-
ization of a sentence based on the following process:
1. Apply HMMNE to each sentence obtained from the
sentence construction algorithm.
2. Identify the QTerms and ATerm in the answer sen-
tence.
3. Replace the ATerm by the tag “a3 ANSWERa4 ”.
4. Replace each identified Named Entity by the class
of entity it represents.
5. If a given Named Entity is also a QTerm, indicate it
by the tag “QT”.
The following example illustrates canonicalization.
Consider the sentence:
More than 7,000 Libyan troops entered Chad.
The application of HMMNE results in:
More than a5 NUMEX TYPE=CARDINALa6 7,000
a5 /NUMEXa6 a5 HUMAN TYPE=PEOPLEa6 Libyan
a5 /HUMANa6 troops entered a5 ENAMEX TYPE=
COUNTRYa6 Chada5 /ENAMEXa6 .
The canonicalization step gives the sentence:
More than a5 CARDINALa6a7a5 PEOPLE QTa6 troops en-
tered a5 ANSWERa6 .
3.3 Pattern Extraction
Pattern extraction algorithm.
1. Every sentence obtained from sentence canon-
icalization algorithm is delimited by the tags
“a3 STARTa4 ” and “a3 ENDa4 ” and then passed
through a Suffix Tree. The Suffix Tree algorithm
obtains the counts of all sub-strings of the sentence.
2. From the Suffix Tree we obtain only those sub-
strings that are at least a trigram, contain both the
“a3 ANSWERa4 ” and the “a3 QTa4 ” tag and have at
least a count of 3 occurrences.
Source Number of Questions
Trec8 200
Trec9 500
KM 4200
Table 1: Training source and sizes.
Some examples of patterns obtained from the Suffix Tree
algorithm are as follows:
1. son of a5 PERSON QTa6 and a5 ANSWERa6
2. of the a5 ANSWERa6a8a5 DISEASE QTa6
3. of a5 ANSWERa6 at a5 LOCATION QTa6
4. a5 ANSWERa6 was the a5 ORDINALa6a9a5 OCCUPATION
QTa6 to
5. a5 ANSWERa6 was elected a5 OCCUPATION QTa6 of the
a5 LOCATION QTa6
6. a5 ANSWERa6 was a prolific a5 OCCUPATION QTa6
7. a5 LOCATION QTa6 , a5 ANSWERa6
8. a5 ANSWERa6 , a5 LOCATION QTa6
9. a5 STARTa6a10a5 ANSWERa6 served as a5 OCCUPATION
QTa6 from a5 DATEa6
10. a5 STARTa6a11a5 ANSWERa6 is the a5 PEOPLE QTa6
name for
A set of 22,353 such patterns were obtained by the ap-
plication of the pattern extraction algorithm from more
than 250,000 sentences. Some patterns are very general
and applicable to many questions, such as the ones in ex-
amples (7) and (8) while others are more specific to a
few questions, such as examples (9) and (10). Having
obtained these patterns we now can learn the appropriate
“weights” to use these patterns in a Question Answering
System.
4 Maximum Entropy Training
For these experiments we use the Maximum Entropy for-
mulation (Della Pietra et al., 1995) and model the distri-
bution (Ittycheriah, 2001b),
a12a14a13a16a15a18a17a19a21a20a23a22a25a24a27a26a29a28a31a30a21a12a14a13a16a15a18a17a32a18a20a33a19a21a20a34a22a25a24a35a12a14a13a36a32a21a17a19a21a20a33a22a25a24 (1)
The patterns derived above are used as features to model
the distribution a37a39a38a41a40a43a42a44a46a45a48a47a49a45a48a50a52a51 , which predicts the “correct-
ness” of the configuration of the question, a47 , the predicted
answer tag, a44 , and the answer candidate, a50 . The training
data for the algorithm consists of TREC-8, TREC-9, and
a subset of the KM questions which have been judged to
have answers in the TREC corpus5. The total number of
questions available for training is shown in Table 1.
We perform 3 sets of experiment with different choice
of feature sets for training:
1. In the first experiment, the patterns obtained auto-
matically from the web are trained along with the
expected type of answer using the Maximum En-
tropy Framework. We refer to this system as the
Pat Only System. This feature collection consisted
5Tagging of answers was done in a semi automatic way by
human judges.
Number of questions correct
Rank PAT ONLY IBM TREC11 ME PAT
1 117 157 167
2 24 21 32
3 16 21 14
4 16 22 11
5 8 8 10
MRR 0.29934 0.37573 0.39703
Table 2: Results on TREC-10.
of roughly 22,353 pattern features along with the
30 different expected answer types (the ones recog-
nized by HMMNE).
2. In the second experiment we use a Statistical QA
system that contains bag of words, syntactic and
named-entity features. We refer to this system as
the IBM TREC11 System. Details of this system
appear in (Ittycheriah and Roukos, 2002). This sys-
tem has approximately 8,000 features.
3. In the third experiment we add the patterns as ad-
ditional features to the base system IBM TREC11
and train the system. We refer to this system as the
ME PAT System. Hence, the total number of fea-
tures in this system is equal to the sum of the ones
in Pat Only and IBM TREC11 system.
These systems were trained on TREC-9 and KM and for
picking the optimum model we used TREC-8 as held-out
test data.
5 Results on TREC-10
We then tested the model on TREC-10. We tabulate the
results in Table 2. The TREC-10 collection consisted of
500 questions. The Rank column indicates the number
of questions answered by the QA systems with that par-
ticular rank. Finally the Mean Rank Reciprocal (MRR)
scores are reported.
6 Conclusion and Future Work
Not surprisingly, the PAT ONLY system shows only av-
erage performance as compared to other TREC-10 sys-
tems. This is because the system has no information
about the question except about its expected answer-
type. Hence, the PAT ONLY system would answer all the
questions involving TIME such as: “When was A born?”,
“When did A die?”, “Which year did A start attending
college?”, “When did A author book B?”with the same
answer!
Nonetheless, the ME PAT results show that surface
text patterns are useful for a Question Answering System.
Although in these experiments a feature set of 22,353
patterns was trained on approximately 210,000 instances,
only 1500 patterns was actually found in the final train-
ing data which had a count of at least 8 instances. This
suggests that the approach used here to train weights suf-
fers from the problem of having very little training data
as compared to the number of features. A much better ap-
proach would be to train the weights of the patterns from
the unsupervised collection itself. However, the effect of
noise introduced due to such unsupervised training is un-
clear.
The above technique represents a very clean approach
to integrating the use of patterns into a QA system. Most
of the rule based systems take years to engineer and are
very difficult to duplicate. However, a good statistical
system can be duplicated to give good performance in a
relatively short amount of time.
References
D. Bikel, R. Schwartz and R. Weischedel. 1999. An Al-
gorithm that Learns What´s in a Name. Machine Learn-
ing Special Issue on NL Learning, 34, 1–3.
A. Ittycheriah, M. Franz, W. Zhu, A. Ratnaparki and R.
Mammone. 2001a. Question Answering Using Maxi-
mum Entropy Components Proceedings of the NAACL
Conference, Pittsburgh, PA, 33–39.
A. Ittycheriah. 2001b. Trainable Question Answering
System. PhD Thesis, Rutgers, The State University of
New Jersey, New Brunswick, NJ.
A. Ittycheriah and S. Roukos. 2002. IBM’s Statistical
Question Answering System for TREC-11. Proceed-
ings of the TREC-11 Conference, NIST, Gaithersburg,
MD, 394–401.
KnowledgeMaster. 1999. http://www.greatauk.com.
Academic Hallmarks.
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1995. In-
ducing Features of Random Fields. Technical Report,
Department of Computer Science, Carnegie-Mellon
University, CMU-CS-95–144.
D. Ravichandran and E. Hovy. 2002. Surface Text Pat-
terns for a Question Answering system. Proceedings
of the ACL Conference. Philadelphia, PA, 425–432.
41–47.
M.M. Soubbotin and S.M. Soubbotin. 2001. Patterns
of Potential Answer Expressions as Clues to the Right
Answer. Proceedings of the TREC-10 Conference.
NIST, Gaithersburg, MD, 134–143.
M.M. Soubbotin and S.M. Soubbotin. 2002. Use of Pat-
terns for detection of likely Answer Strings: A System-
atic Approach Answer. Proceedings of the TREC-2002
Conference. NIST, Gaithersburg, MD, 175–182.
