Question Terminology and Representation for
Question Type Classification
Noriko Tomuro
DePaul University
School of Computer Science, Telecommunications and Information Systems
243 S. Wabash Ave.
Chicago, IL 60604 U.S.A.
tomuro@cs.depaul.edu
Abstract
Question terminology is a set of terms which appear in keywords, idioms and fixed expressions commonly observed in questions. This paper investigates ways to automatically extract question terminology from a corpus of questions and represent them for the purpose of classifying by question type. Our key interest is to see whether or not semantic features can enhance the representation of the strongly lexical nature of question sentences. We compare two feature sets: one with lexical features only, and another with a mixture of lexical and semantic features. For evaluation, we measure the classification accuracy achieved by two machine learning algorithms, C5.0 and PEBLS, using a procedure called domain cross-validation, which effectively measures the domain transferability of features.
1 Introduction
In Information Retrieval (IR), text categorization and clustering, documents are usually indexed and represented by domain terminology: terms which are particular to the domain/topic of a document. However, when documents must be retrieved or categorized according to criteria which do not correspond to the domains, such as genre (text style) (Kessler et al., 1997; Finn et al., 2002) or subjectivity (e.g. opinion vs. factual description) (Wiebe, 2000), we must use different, domain-independent features to index and represent documents. In those tasks, selection of the features is in fact one of the most critical factors which affect the performance of a system.
Question type classification is one such task, where the categories are question types (e.g. 'how-to', 'why' and 'where'). In recent years, question type has been successfully used in many Question-Answering (Q&A) systems for determining the kind of entity or concept being asked and extracting an appropriate answer (Voorhees, 2000; Harabagiu et al., 2000; Hovy et al., 2001). Just like genre, question types cut across domains; for instance, we can ask 'how-to' questions in the cooking domain, the legal domain, etc. However, features that constitute question types are different from those used for genre classification (typically part-of-speech or meta-linguistic features) in that they are strongly lexical, due to the large amount of idiosyncrasy (keywords, idioms or syntactic constructions) frequently observed in question sentences. For example, we can easily think of question patterns such as "What is the best way to .." and "What do I have to do to ..". In this regard, terms which identify question type are considered to form a terminology of their own, which we define as question terminology.
Terms in question terminology have some characteristics. First, they are mostly domain-independent, non-content words. Second, they include many closed-class words (such as interrogatives, modals and pronouns), and some open-class words (e.g. the noun "way" and the verb "do"). In a way, question terminology is a complement of domain terminology.

Automatic extraction of question terminology is a rather difficult task, since question terms are mixed in with content terms. Another complicating factor is paraphrasing: there are many ways to ask the same question. For example,

- "How can I clean teapots?"
- "In what way can we clean teapots?"
- "What is the best way to clean teapots?"
- "What method is used for cleaning teapots?"
- "How do I go about cleaning teapots?"
In this paper, we present the results of our investigation on how to automatically extract question terminology from a corpus of questions and represent it for the purpose of classifying by question type. This work is an extension of our previous work (Tomuro and Lytinen, 2001), where we compared automatic and manual techniques to select features from questions, but only (stemmed) words were considered as features. The focus of the current work is to investigate the kind(s) of features, rather than selection techniques, which are best suited for representing questions for classification. Specifically, from a large dataset of questions, we automatically extracted two sets of features: one set consisting of terms (i.e., lexical features) only, and another set consisting of a mixture of terms and semantic concepts (i.e., semantic features). Our particular interest is to see whether or not semantic concepts can enhance the representation of the strongly lexical nature of question sentences. To this end, we apply two machine learning algorithms (C5.0 (Quinlan, 1994) and PEBLS (Cost and Salzberg, 1993)), and compare the classification accuracy produced for the two feature sets. The results show that there is no significant increase for either algorithm from the addition of semantic features.
The original motivation behind our work on question terminology was to improve the retrieval accuracy of our system called FAQFinder (Burke et al., 1997; Lytinen and Tomuro, 2002). FAQFinder is a web-based, natural language Q&A system which uses Usenet Frequently Asked Questions (FAQ) files to answer users' questions. Figures 1 and 2 show an example session with FAQFinder. First, the user enters a question in natural language. The system then searches the FAQ files for questions that are similar to the user's. Based on the results of the search, FAQFinder displays a maximum of 5 FAQ questions which are ranked the highest by the system's similarity measure. Currently FAQFinder incorporates question type as one of the four metrics in measuring the similarity between the user's question and FAQ questions (the other three metrics are vector similarity, semantic similarity and coverage (Lytinen and Tomuro, 2002)). In the present implementation, the system uses a small set of manually selected words to determine the type of a question. The goal of our work here is to derive optimal features which would produce improved classification accuracy.
Figure 1: User question entered as a natural
language query to FAQFinder
Figure 2: The 5 best-matching FAQ questions
2 Question Types
In our work, we defined the 12 question types below.

1. DEF (definition)
2. REF (reference)
3. TME (time)
4. LOC (location)
5. ENT (entity)
6. RSN (reason)
7. PRC (procedure)
8. MNR (manner)
9. DEG (degree)
10. ATR (atrans)
11. INT (interval)
12. YNQ (yes-no)

Descriptive definitions of these types are found in (Tomuro and Lytinen, 2001). Table 1 shows example FAQ questions which we had used to develop the question types. Note that our question types are general question categories. They are aimed to cover a wide variety of questions entered by the FAQFinder users.
3 Selection of Feature Sets
In our current work, we utilized two feature sets: one set consisting of lexical features only (LEX), and another set consisting of a mixture of lexical features and semantic concepts (LEXSEM). Obviously, there are many known keywords, idioms and fixed expressions commonly observed in question sentences. However, categorization of some of our 12 question types seems to depend on open-class words, for instance, "What does mpg mean?" (DEF) and "What does Belgium import and export?" (REF). To distinguish those types, semantic features seem effective. Semantic features could also be useful as back-off features since they allow for generalization. For example, in WordNet (Miller, 1990), the noun "know-how" is encoded as a hypernym of "method", "methodology", "solution" and "technique". By selecting such abstract concepts as semantic features, we can cover a variety of paraphrases even for fixed expressions, and supplement the coverage of lexical features.
We selected the two feature sets in the following two steps. In the first step, using a dataset of 5105 example questions taken from 485 FAQ files/domains, we first manually tagged each question by question type, and then automatically derived an initial lexical set and an initial semantic set. Then in the second step, we refined those initial sets by pruning irrelevant features and derived two subsets: LEX from the initial lexical set and LEXSEM from the union of the lexical and semantic sets.
To evaluate the various subsets tried during the selection steps, we applied two machine learning algorithms: C5.0 (the commercial version of C4.5 (Quinlan, 1994), available at http://www.rulequest.com), a decision tree classifier; and PEBLS (Cost and Salzberg, 1993), a k-nearest neighbor algorithm (we used k = 3 and a majority voting scheme for all experiments in our current work). We also measured the classification accuracy by a procedure we call domain cross-validation (DCV). DCV is a variation of the standard cross-validation (CV) where the data is partitioned according to domains instead of random choice. To do a k-fold DCV on a set of examples from n domains, the set is first broken into k non-overlapping blocks, where each block contains examples from exactly m = n/k domains. Then in each fold, a classifier is trained with (k − 1) × m domains and tested on examples from the m unseen domains. Thus, by observing the classification accuracy of the target categories using DCV, we can measure the domain transferability: how well the features extracted from some domains transfer to other domains. Since question terminology is essentially domain-independent, DCV is a better evaluation measure than CV for our purpose.
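The partitioning step of DCV can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the (domain, question, question_type) tuple layout are assumptions, and it assumes the number of distinct domains is divisible by k.

```python
def domain_cv_folds(examples, k):
    """Partition examples into k folds by domain rather than by random choice.

    examples: list of (domain, question, question_type) tuples.
    Assumes the number of distinct domains n is divisible by k.
    """
    domains = sorted({d for d, _, _ in examples})
    m = len(domains) // k  # m = n / k domains per block
    blocks = [set(domains[i * m:(i + 1) * m]) for i in range(k)]
    folds = []
    for i in range(k):
        # Train on the (k - 1) * m remaining domains, test on the m held-out ones.
        test = [e for e in examples if e[0] in blocks[i]]
        train = [e for e in examples if e[0] not in blocks[i]]
        folds.append((train, test))
    return folds
```

Because every fold tests on domains that contributed no training examples, accuracy under this scheme reflects domain transferability rather than within-domain memorization.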
3.1 Initial Lexical Set
The initial lexical set was obtained by ordering the words in the dataset by their Gain Ratio scores, then selecting the subset which produced the best classification accuracy by C5.0 and PEBLS. Gain Ratio (GR) is a metric often used in classification systems (notably in C4.5) for measuring how well a feature predicts the categories of the examples. GR is a normalized version of another metric called Information Gain (IG), which measures the informativeness of a feature by the number of bits required to encode the examples if they are partitioned into two sets, based on the presence or absence of the feature.
Let C denote the set of categories c_1, .., c_m for which the examples are classified (i.e., the target categories). Given a collection of examples S, the Gain Ratio of a feature A, GR(S, A), is defined as:

GR(S, A) = IG(S, A) / SI(S, A)

where IG(S, A) is the Information Gain defined to be:

IG(S, A) = − Σ_{i=1..m} Pr(c_i) log_2 Pr(c_i)
           + Pr(A) Σ_{i=1..m} Pr(c_i|A) log_2 Pr(c_i|A)
           + Pr(¬A) Σ_{i=1..m} Pr(c_i|¬A) log_2 Pr(c_i|¬A)

and SI(S, A) is the Splitting Information defined to be:

SI(S, A) = − Pr(A) log_2 Pr(A) − Pr(¬A) log_2 Pr(¬A)

(The description of Information Gain here is for binary partitioning. Information Gain can also be generalized to m-way partitioning, for all m >= 2.)
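The formulas above translate directly into code. The sketch below is illustrative (the function names and the representation of a question as a set of words are our assumptions), computing GR for a binary presence/absence feature exactly as defined.

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum_i Pr(c_i) log2 Pr(c_i) over the categories in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, feature):
    """GR(S, A) = IG(S, A) / SI(S, A) for a binary presence/absence feature A.

    examples: list of (word_set, category) pairs.
    """
    n = len(examples)
    present = [c for words, c in examples if feature in words]
    absent = [c for words, c in examples if feature not in words]
    # IG: entropy of the categories minus the weighted entropy of each partition.
    ig = entropy([c for _, c in examples])
    for part in (present, absent):
        if part:
            ig -= (len(part) / n) * entropy(part)
    # SI: entropy of the presence/absence split itself.
    si = 0.0
    for part in (present, absent):
        p = len(part) / n
        if p > 0:
            si -= p * math.log2(p)
    return ig / si if si > 0 else 0.0
```

On a balanced two-category set, a feature that perfectly separates the categories scores GR = 1.0, while a feature that appears in no example scores 0.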
Table 1: Example FAQ questions

Question Type  Question
DEF   "What does 'reactivity' of emissions mean?"
REF   "What do mutual funds invest in?"
TME   "What dates are important when investing in mutual funds?"
ENT   "Who invented Octane Ratings?"
RSN   "Why does the Moon always show the same face to the Earth?"
PRC   "How can I get rid of a caffeine habit?"
MNR   "How did the solar system form?"
ATR   "Where can I get British tea in the United States?"
INT   "When will the sun die?"
YNQ   "Is the Moon moving away from the Earth?"
Then, features which yield high GR values are good predictors. In previous work in text categorization, GR (or IG) has been shown to be one of the most effective methods for reducing dimensions (i.e., words to represent each text) (Yang and Pedersen, 1997).

Here, in applying GR, there was one issue we had to consider: how to distinguish content words from non-content words. This issue arose from the uneven distribution of the question types in the dataset. Since not all question types were represented in every domain, if we chose question type as the target category, features which yield high GR values might include some domain-specific words. In effect, good predictors for our purpose are words which predict question types very well, but do not predict domains. Therefore, we defined the GR score of a word to be the combination of two values: the GR value when the target category is question type, minus the GR value when the target category is domain.
We computed the (modified) GR score for 1485 words which appeared more than twice in the dataset, and applied C5.0 and PEBLS. Then we gradually reduced the set by taking the top n words according to the GR scores and observed the changes in classification accuracy. Figure 3 shows the result. The evaluation was done using 5-fold DCV, and the accuracy percentages indicated in the figure are an average of 3 runs. The best accuracy was achieved by the top 350 words for both algorithms; the remaining words seemed to have caused overfitting, as the accuracy showed a slight decline. Thus, we took the top 350 words as the initial lexical feature set.
Figure 3: Classification accuracy (%) of C5.0 and PEBLS on the training data, measured by domain cross-validation (DCV), as the number of features varies
3.2 Initial Semantic Set
The initial semantic set was obtained by automatically selecting some nodes in the WordNet (Miller, 1990) noun and verb trees. For each question type, we chose questions of certain structures and applied a shallow parser to extract nouns and/or verbs which appeared at a specific position. For example, for all question types (except for YNQ), we extracted the head noun from questions of the form "What is NP ..?". Those nouns are essentially the denominalization of the question type. The nouns extracted included "way", "method", "procedure", "process" for the type PRC, "reason", "advantage" for RSN, and "organization", "restaurant" for ENT. For the types DEF and MNR, we also extracted the main verb from questions of the form "How/What does NP V ..?". Such verbs included "work", "mean" for DEF, and "affect" and "form" for MNR.
Then, for the nouns and verbs extracted for each question type, we applied the sense disambiguation algorithm used in (Resnik, 1997) and derived semantic classes (or nodes in the WordNet trees) which were their abstract generalizations. For each word in a set, we traversed the WordNet tree upward through the hypernym links from the nodes which corresponded to the first two senses of the word, and assigned each ancestor a value equal to the inverse of the distance (i.e., the number of links traversed) from the original node. Then we accumulated the values over all ancestors, and selected the ones (excluding the top nodes) whose value was above a threshold. For example, the semantic classes selected for the type PRC were "know-how" (an ancestor of "way" and "method") and "activity" (an ancestor of "procedure" and "process").

By applying the procedure above for all question types, we obtained a total of 112 semantic classes. These constitute the initial semantic set.
3.3 Renement
Thenalfeature sets, LEXandLEXSEM,were
derived by further rening the initial sets. The
main purpose of renement was to reduce the
union of initial lexical and semantic sets (a to-
tal of 350 + 112 = 462 features) and derive
LEXSEM. It was done by taking the features
which appeared in more than half of the deci-
sion trees inducedby C5.0 duringthe iterations
of DCV.
4
Then we applied the same procedure
to the initial lexical set (350 features) and de-
rived LEX. Now both sets were (sub) optimal
subsets, with whichwe could make a fair com-
parison. Therewere117features/wordsand164
features selected for LEX andLEXSEMrespec-
tively.
Our renement method is similar to (Cardie,
1993) in that it selects features by removing
ones that did not appear in a decision tree.
The dierence is that, in our method, each de-
cision tree is induced from a strict subset of
the domains of the dataset. Therefore, by tak-
ing the intersection of multiple such trees, we
caneectivelyextract featuresthataredomain-
independent, thus transferable to other unseen
domains. Our method is also computationally
4
Wehave in fact experimented various threshold val-
ues. It turned out that .5 produced the best accuracy.
Table 2: Classication accuracy (%) on the
training set by using reduced feature sets
Feature set # features C5.0 PEBLS
Initial lex 350 76.7 71.8
LEX (reduced) 117 77.4 74.5
Initial lex + sem 462 76.7 71.8
LEXSEM (reduced) 164 77.7 74.7
less expensive and feasible, given the numberof
features expected to be in the reduced set (over
a hundred by our intuition), than other fea-
ture subset selection techniques, most of which
require expensive search through model space
(suchaswrapper approach (John et al., 1994)).
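The pruning criterion (keep a feature only if it appears in more than half of the decision trees induced across the DCV folds) can be sketched as below; the function name and input layout are hypothetical.

```python
from collections import Counter

def refine(trees_features, n_folds):
    """trees_features: one set per DCV fold, holding the features that actually
    appear in the decision tree induced on that fold. A feature survives only
    if it is used in more than half of the trees."""
    counts = Counter(f for used in trees_features for f in used)
    return {f for f, c in counts.items() if c > n_folds / 2}
```

With 5 folds, for example, a feature must appear in at least 3 of the 5 trees to survive the pruning.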
Table 2 shows the classification accuracy measured by DCV on the training set. The increase in accuracy after the refinement was minimal using C5.0 (from 76.7 to 77.4 for LEX, from 76.7 to 77.7 for LEXSEM), as expected. But the increase using PEBLS was rather significant (from 71.8 to 74.5 for LEX, from 71.8 to 74.7 for LEXSEM). This result agreed with the findings in (Cardie, 1993), and confirmed that LEX and LEXSEM were indeed (sub)optimal. However, the difference between LEX and LEXSEM was not statistically significant for either algorithm (from 77.4 to 77.7 by C5.0, from 74.5 to 74.7 by PEBLS; p-values were .23 and .41 respectively, obtained by applying the t-test to the accuracy produced by all iterations of DCV, with a null hypothesis that the mean accuracy of LEXSEM was higher than that of LEX). This means the semantic features did not help improve the classification accuracy.

As we inspected the results, we discovered that, out of the 164 features in LEXSEM, 32 were semantic features, and they did occur in 33% of the training examples (1671/5105 ≈ .33). However, in most of those examples, the key terms were already represented by lexical features, thus the semantic features did not add any more information to help determine the question type. As an example, the sentence "What are the dates of the upcoming Jewish holidays?" was represented by the lexical features "what", "be", "of" and "date", and a semantic feature "time-unit" (an ancestor of "date").

The 117 words in LEX are listed in the Appendix at the end of this paper.
Table 3: Classification accuracy (%) on the test sets

                          FAQFinder        AskJeeves
Feature set  # features   C5.0   PEBLS    C5.0   PEBLS
LEX          117          67.8   66.6     77.3   73.9
LEXSEM       164          67.5   67.1     73.7   71.1
3.4 External Test Sets
To further investigate the effect of semantic features, we tested LEX and LEXSEM with two external test sets: one set consisting of 620 questions taken from the FAQFinder user log, and another set consisting of 3485 questions taken from the AskJeeves (http://www.askjeeves.com) user log. Both datasets contained questions from a wide range of domains, and therefore served as an excellent indicator of the domain transferability of our two feature sets.
Table 3 shows the results. For the FAQFinder data, LEX and LEXSEM produced comparable accuracy using both C5.0 and PEBLS. But for the AskJeeves data, LEXSEM did consistently worse than LEX with both classifiers. This means the additional semantic features were interacting with the lexical features.
We speculate the reason to be the following. Compared to the FAQFinder data, the AskJeeves data was gathered from a much wider audience, and the questions spanned a broader range of domains. Many terms in the questions came from a vocabulary considerably larger than that of our training set. Therefore, the data contained quite a few words whose hypernym links lead to a semantic feature in LEXSEM but which did not fall into the question type keyed by the feature. For instance, the AskJeeves question "What does Hanukah mean?" was mis-classified as type TME by using LEXSEM. This was because "Hanukah" in WordNet is encoded as a hyponym of "time period". On the other hand, LEX did not include "Hanukah", and thus correctly classified the question as type DEF.
4 Related Work
Recently, with the need to incorporate user preferences in information retrieval, several studies have classified documents by genre. For instance, (Finn et al., 2002) used machine learning techniques to identify subjective (opinion) documents among newspaper articles. To determine which features adapt well to unseen domains, they compared three kinds of features: words, part-of-speech statistics and manually selected meta-linguistic features. They concluded that part-of-speech performed the best with regard to domain transfer. However, not only were their feature sets pre-determined, their features were distinct from the words in the documents (or the features were the entire words themselves), thus no feature subset selection was performed.

(Wiebe, 2000) also used machine learning techniques to identify subjective sentences. She focused on adjectives as an indicator of subjectivity, and used corpus statistics and lexical semantic information to derive adjectives that yielded high precision.
5 Conclusions and Future Work
In this paper, we showed that semantic features did not enhance lexical features in the representation of questions for the purpose of question type classification. While semantic features allow for generalization, they also seemed to do more harm than good in some cases by interacting with lexical features. This indicates that question terminology is indeed strongly lexical, and suggests that enumeration of the words which appear in typical, idiomatic question phrases would be more effective than semantics.

For future work, we are planning to experiment with synonyms. The use of synonyms is another way of increasing the coverage of question terminology; while semantic features try to achieve it by generalization, synonyms do it by lexical expansion. Our plan is to use the synonyms obtained from very large corpora reported in (Lin, 1998). We are also planning to compare the (lexical and semantic) features we derived automatically in this work with manually selected features. In our previous work, manually selected (lexical) features showed slightly better performance on the training data but no significant difference on the test data. We plan to manually pick out semantic as well as lexical features, and apply them to the current data.
References
R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. 1997. Question answering from frequently asked question files: Experiences with the FAQFinder system. AI Magazine, 18(2).
C. Cardie. 1993. Using decision trees to improve case-based learning. In Proceedings of the 10th International Conference on Machine Learning (ICML-93).
S. Cost and S. Salzberg. 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10(1).
A. Finn, N. Kushmerick, and B. Smyth. 2002. Genre classification and domain transfer for information filtering. In Proceedings of the European Colloquium on Information Retrieval Research, Glasgow.
S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, and P. Morarescu. 2000. Falcon: Boosting knowledge for answer engines. In Proceedings of TREC-9.
E. Hovy, L. Gerber, U. Hermjakob, C. Lin, and D. Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the DARPA Human Language Technologies (HLT).
G. John, R. Kohavi, and K. Pfleger. 1994. Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning (ICML-94).
K. Kessler, G. Nunberg, and H. Schutze. 1997. Automatic detection of text genre. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97).
D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL-98).
S. Lytinen and N. Tomuro. 2002. The use of question types to match questions in FAQFinder. In Papers from the 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.
G. Miller. 1990. WordNet: An online lexical database. International Journal of Lexicography, 3(4).
R. Quinlan. 1994. C4.5: Programs for Machine Learning. Morgan Kaufmann.
P. Resnik. 1997. Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics, Washington, D.C.
N. Tomuro and S. Lytinen. 2001. Selecting features for paraphrasing question sentences. In Proceedings of the Workshop on Automatic Paraphrasing at NLP Pacific Rim 2001 (NLPRS-2001), Tokyo, Japan.
E. Voorhees. 2000. The TREC-9 question answering track report. In Proceedings of TREC-9.
J. Wiebe. 2000. Learning subjective adjectives from corpora. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-2000), Austin, Texas.
Y. Yang and J. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML-97).