<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1410"> <Title>Question Terminology and Representation for Question Type Classification</Title>
<Section position="3" start_page="0" end_page="1" type="metho">
<SectionTitle> 5 FAQ questions which are ranked the highest </SectionTitle>
<Paragraph position="0"> by the system's similarity measure. Currently FAQFinder incorporates question type as one of the four metrics in measuring the similarity between the user's question and FAQ questions. In the present implementation, the system uses a small set of manually selected words to determine the type of a question. The goal of our work here is to derive optimal features which would produce improved classification accuracy. The other three metrics are vector similarity, semantic similarity and coverage (Lytinen and Tomuro, 2002).</Paragraph>
<Paragraph position="1"> Our 12 question types are:
1. DEF (definition)
2. REF (reference)
3. TME (time)
4. LOC (location)
5. ENT (entity)
6. RSN (reason)
7. PRC (procedure)
8. MNR (manner)
9. DEG (degree)
10. ATR (atrans)
11. INT (interval)
12. YNQ (yes-no)
Descriptive definitions of these types are found in (Tomuro and Lytinen, 2001). Table 1 shows example FAQ questions which we had used to develop the question types. Note that our question types are general question categories. They are aimed to cover a wide variety of questions entered by the FAQFinder users.</Paragraph>
</Section>
<Section position="4" start_page="1" end_page="5" type="metho">
<SectionTitle> 3 Selection of Feature Sets </SectionTitle>
<Paragraph position="0"> In our current work, we utilized two feature sets: one set consisting of lexical features only (LEX), and another set consisting of a mixture of lexical features and semantic concepts (LEXSEM).</Paragraph>
<Paragraph position="1"> Obviously, there are many known keywords, idioms and fixed expressions commonly observed in question sentences. However, categorization of some of our 12 question types seems to depend on open-class words, for instance, "What does mpg mean?" (DEF) and "What does Belgium import and export?" (REF). To distinguish those types, semantic features seem effective. Semantic features could also be useful as back-off features since they allow for generalization. For example, in WordNet (Miller, 1990), the noun "know-how" is encoded as a hypernym of "method", "methodology", "solution" and "technique". By selecting such abstract concepts as semantic features, we can cover a variety of paraphrases even for fixed expressions, and supplement the coverage of lexical features.</Paragraph>
<Paragraph position="2"> We selected the two feature sets in the following two steps. In the first step, using a dataset of 5105 example questions taken from 485 FAQ files/domains, we first manually tagged each question by question type, and then automatically derived the initial lexical set and the initial semantic set. Then in the second step, we refined those initial sets by pruning irrelevant features and derived two subsets: LEX from the initial lexical set and LEXSEM from the union of the lexical and semantic sets.</Paragraph>
<Paragraph position="3"> To evaluate the various subsets tried during the selection steps, we applied two machine learning algorithms: C5.0 (the commercial version of C4.5) and PEBLS. For PEBLS, we used k = 3 and a majority voting scheme in all experiments in our current work. We also measured the classification accuracy by a procedure we call domain cross-validation (DCV). DCV is a variation of the standard cross-validation (CV) where the data is partitioned according to domains instead of random choice. To do a k-fold DCV on a set of examples from n domains, the set is first broken into k non-overlapping blocks, where each block contains examples from exactly m = n/k domains. Then in each fold, a classifier is trained with (k - 1) x m domains and tested on examples from the m unseen domains. Thus, by observing the classification accuracy of the target categories using DCV, we can measure the domain transferability: how well the features extracted from some domains transfer to other domains. Since question terminology is essentially domain-independent, DCV is a better evaluation measure than CV for our purpose.</Paragraph>
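A DCV fold generator is straightforward to sketch. The following Python fragment is a minimal illustration, assuming each training example is paired with its domain label; names such as `tagged_questions` and `induce_classifier` are ours, not part of the original system:

```python
from collections import defaultdict

def domain_cross_validation(examples, k):
    """k-fold domain cross-validation (DCV): partition by domain, not at random.

    Each fold holds out roughly m = n/k of the n domains entirely, so the
    classifier is always tested on questions from domains it never saw
    during training.
    """
    by_domain = defaultdict(list)
    for domain, example in examples:
        by_domain[domain].append(example)

    domains = sorted(by_domain)                  # the n domains
    blocks = [domains[i::k] for i in range(k)]   # k blocks of ~n/k domains each

    for held_out in blocks:
        held = set(held_out)
        test = [ex for d in held for ex in by_domain[d]]
        train = [ex for d in domains if d not in held for ex in by_domain[d]]
        yield train, test

# Usage sketch: 5-fold DCV over (domain, tagged question) pairs.
# for train, test in domain_cross_validation(tagged_questions, k=5):
#     model = induce_classifier(train)
#     evaluate(model, test)
```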
<Section position="1" start_page="2" end_page="3" type="sub_section">
<SectionTitle> 3.1 Initial Lexical Set </SectionTitle>
<Paragraph position="0"> The initial lexical set was obtained by ordering the words in the dataset by their Gain Ratio scores, then selecting the subset which produced the best classification accuracy by C5.0 and PEBLS. Gain Ratio (GR) is a metric often used in classification systems (notably in C4.5) for measuring how well a feature predicts the categories of the examples. GR is a normalized version of another metric called Information Gain (IG), which measures the informativeness of a feature by the number of bits required to encode the examples if they are partitioned into two sets, based on the presence or absence of the feature.</Paragraph>
<Paragraph position="1"> Let C denote the set of categories for which the examples are classified (i.e., target categories). Given a collection of examples S, the Gain Ratio of a feature A, GR(S, A), is defined as:

GR(S, A) = IG(S, A) / SI(S, A), where
IG(S, A) = E(S) - \sum_{v \in \{0,1\}} (|S_v| / |S|) E(S_v),
SI(S, A) = - \sum_{v \in \{0,1\}} (|S_v| / |S|) \log_2 (|S_v| / |S|),
E(S) = - \sum_{c \in C} P(c) \log_2 P(c).

Here S_1 and S_0 denote the subsets of S in which the feature A is present and absent respectively, E(S) is the entropy of the target-category distribution in S, and SI(S, A) is the split information which normalizes IG. (The description of Information Gain here is for binary partitioning; Information Gain can also be generalized to m-way partitioning, for all m >= 2.) Then, features which yield high GR values are good predictors. In previous work in text categorization, GR (or IG) has been shown to be one of the most effective methods for reducing dimensions (i.e., words to represent each text) (Yang and Pedersen, 1997).</Paragraph>
[Table 1, continued: INT "When will the sun die?"  YNQ "Is the Moon moving away from the Earth?"]
<Paragraph position="2"> Here in applying GR, there was one issue we had to consider: how to distinguish content words from non-content words. This issue arose from the uneven distribution of the question types in the dataset. Since not all question types were represented in every domain, if we chose question type as the target category, features which yield high GR values might include some domain-specific words. In effect, good predictors for our purpose are words which predict question types very well, but do not predict domains. Therefore, we defined the GR score of a word to be the combination of two values: the GR value when the target category was question type, minus the GR value when the target category was domain.</Paragraph>
<Paragraph position="3"> We computed the (modified) GR score for 1485 words which appeared more than twice in the dataset, and applied C5.0 and PEBLS. Then we gradually reduced the set by taking the top n words according to the GR scores and observed changes in the classification accuracy. Figure 3 shows the result. The evaluation was done by using the 5-fold DCV, and the accuracy percentages indicated in the figure were an average of 3 runs. The best accuracy was achieved by the top 350 words by both algorithms; the remaining words seemed to have caused overfitting, as the accuracy showed a slight decline. Thus, we took the top 350 words as the initial lexical feature set.</Paragraph>
</Section>
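To make the modified GR scoring of Section 3.1 concrete, the sketch below computes, for each candidate word, its gain ratio with respect to question type minus its gain ratio with respect to domain, following the entropy-based definitions given above. The question objects with `.tokens`, `.qtype` and `.domain` attributes are an illustrative representation of our own, not the paper's data format:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(questions, word, label_of):
    """GR of the binary feature 'word occurs in the question' w.r.t. label_of."""
    present = [label_of(q) for q in questions if word in q.tokens]
    absent = [label_of(q) for q in questions if word not in q.tokens]
    n = len(questions)
    parts = [p for p in (present, absent) if p]
    ig = entropy([label_of(q) for q in questions]) \
        - sum(len(p) / n * entropy(p) for p in parts)
    si = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)
    return ig / si if si > 0 else 0.0

def modified_gr(questions, word):
    """GR for predicting question type minus GR for predicting domain."""
    return (gain_ratio(questions, word, lambda q: q.qtype)
            - gain_ratio(questions, word, lambda q: q.domain))

# Rank candidate words (those appearing more than twice) and keep the top 350.
# ranked = sorted(vocabulary, key=lambda w: modified_gr(questions, w), reverse=True)
# initial_lexical_set = ranked[:350]
```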
<Section position="2" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 3.2 Initial Semantic Set </SectionTitle>
<Paragraph position="0"> The initial semantic set was obtained by automatically selecting some nodes in the WordNet (Miller, 1990) noun and verb trees. For each question type, we chose questions of certain structures and applied a shallow parser to extract nouns and/or verbs which appeared at a specific position. For example, for all question types (except for YNQ), we extracted the head noun from questions of the form "What is NP ..?". Those nouns are essentially the denominalization of the question type. The nouns extracted included "way", "method", "procedure", "process" for the type PRC, "reason", "advantage" for RSN, and "organization", "restaurant" for ENT. For the types DEF and MNR, we also extracted the main verb from questions of the form "How/What does NP V ..?". Such verbs included "work", "mean" for DEF, and "affect" and "form" for MNR.</Paragraph>
<Paragraph position="1"> Then, for the nouns and verbs extracted for each question type, we applied the sense disambiguation algorithm used in (Resnik, 1997) and derived semantic classes (or nodes in the WordNet trees) which were their abstract generalization. For each word in a set, we traversed the WordNet tree upward through the hypernym links from the nodes which corresponded to the first two senses of the word, and assigned each ancestor a value equal to the inverse of the distance (i.e., the number of links traversed) from the original node. Then we accumulated the values for all ancestors, and selected the ones (excluding the top nodes) whose value was above a threshold. For example, the semantic classes selected for the type PRC were "know-how" (an ancestor of "way" and "method") and "activity" (an ancestor of "procedure" and "process").</Paragraph>
<Paragraph position="2"> By applying the procedure above for all question types, we obtained a total of 112 semantic classes. This constitutes the initial semantic set.</Paragraph>
</Section>
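The ancestor-scoring step of Section 3.2 can be sketched roughly as follows, using NLTK's WordNet interface as a stand-in for the WordNet version used in the paper; the threshold value and the handling of multiple hypernym paths are our assumptions rather than details from the original:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn

def semantic_classes(words, pos=wn.NOUN, threshold=1.0):
    """Score WordNet ancestors of the given words by accumulated inverse
    hypernym distance, and keep those above a threshold (excluding roots)."""
    scores = defaultdict(float)
    for word in words:
        for synset in wn.synsets(word, pos=pos)[:2]:    # first two senses only
            frontier, dist = [synset], 0
            while frontier:
                dist += 1
                frontier = [h for s in frontier for h in s.hypernyms()]
                for ancestor in frontier:
                    scores[ancestor] += 1.0 / dist      # inverse of link distance
    return {s for s, v in scores.items()
            if v >= threshold and s.hypernyms()}        # drop the top nodes

# e.g. semantic_classes(["way", "method", "procedure", "process"]) for PRC;
# the paper reports "know-how" and "activity" as the resulting classes.
```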
<Section position="3" start_page="3" end_page="5" type="sub_section">
<SectionTitle> 3.3 Refinement </SectionTitle>
<Paragraph position="0"> The final feature sets, LEX and LEXSEM, were derived by further refining the initial sets. The main purpose of refinement was to reduce the union of the initial lexical and semantic sets (a total of 350 + 112 = 462 features) and derive LEXSEM. It was done by taking the features which appeared in more than half of the decision trees induced by C5.0 during the iterations of DCV. (We have in fact experimented with various threshold values; it turned out that .5 produced the best accuracy.)</Paragraph>
<Paragraph position="1"> Then we applied the same procedure to the initial lexical set (350 features) and derived LEX. Now both sets were (sub)optimal subsets, with which we could make a fair comparison. There were 117 features/words and 164 features selected for LEX and LEXSEM respectively. Our refinement method is similar to (Cardie, 1993) in that it selects features by removing ones that did not appear in a decision tree. The difference is that, in our method, each decision tree is induced from a strict subset of the domains of the dataset. Therefore, by taking the intersection of multiple such trees, we can effectively extract features that are domain-independent, thus transferable to other unseen domains. Our method is also computationally less expensive and more feasible, given the number of features expected to be in the reduced set (over a hundred by our intuition), than other feature subset selection techniques, most of which require expensive search through the model space (such as the wrapper approach (John et al., 1994)).</Paragraph>
<Paragraph position="2"> Table 2 shows the classification accuracy measured by DCV for the training set. The increase in accuracy after the refinement was minimal using C5.0 (from 76.7 to 77.4 for LEX, from 76.7 to 77.7 for LEXSEM), as expected. But the increase using PEBLS was rather significant (from 71.8 to 74.5 for LEX, from 71.8 to 74.7 for LEXSEM). This result agreed with the findings in (Cardie, 1993), and confirmed that LEX and LEXSEM were indeed (sub)optimal. However, the difference between LEX and LEXSEM was not statistically significant by either algorithm (from 77.4 to 77.7 by C5.0, from 74.5 to 74.7 by PEBLS; p-values were .23 and .41 respectively). The p-values were obtained by applying the t-test on the accuracies produced by all iterations of DCV, with a null hypothesis that the mean accuracy of LEXSEM was higher than that of LEX. This means the semantic features did not help improve the classification accuracy.</Paragraph>
<Paragraph position="3"> As we inspected the results, we discovered that, out of the 164 features in LEXSEM, 32 were semantic features, and they did occur in 33% of the training examples (1671/5105 ≈ .33). However, in most of those examples, key terms were already represented by lexical features, thus semantic features did not add any more information to help determine the question type. As an example, the sentence "What are the dates of the upcoming Jewish holidays?" was represented by the lexical features "what", "be", "of" and "date", and a semantic feature "time-unit" (an ancestor of "date"). The 117 words in LEX are listed in the Appendix at the end of this paper.</Paragraph>
</Section>
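The refinement of Section 3.3 amounts to a simple filter over the decision trees produced during DCV. In the sketch below, `induce_tree` and `features_used` are hypothetical stand-ins for C5.0 tree induction and for reading off the features that appear in a tree; the DCV generator is the one sketched earlier:

```python
from collections import Counter

def refine(feature_set, dcv_folds, induce_tree, features_used, threshold=0.5):
    """Keep features that appear in more than `threshold` of the decision
    trees induced on the DCV training folds (each fold excludes some domains)."""
    counts = Counter()
    n_folds = 0
    for train, _test in dcv_folds:
        tree = induce_tree(train, feature_set)   # e.g. a C4.5/C5.0-style learner
        counts.update(features_used(tree))
        n_folds += 1
    return {f for f in feature_set if counts[f] > threshold * n_folds}

# Usage sketch (names are illustrative):
# LEXSEM = refine(initial_lexical | initial_semantic,
#                 domain_cross_validation(tagged_questions, 5),
#                 induce_tree, features_used)
# LEX = refine(initial_lexical,
#              domain_cross_validation(tagged_questions, 5),
#              induce_tree, features_used)
```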
<Section position="4" start_page="5" end_page="5" type="sub_section">
<SectionTitle> 3.4 External Testsets </SectionTitle>
<Paragraph position="0"> To further investigate the effect of semantic features, we tested LEX and LEXSEM with two external testsets: one set consisting of 620 questions taken from the FAQFinder user log, and another set consisting of 3485 questions taken from the AskJeeves (http://www.askjeeves.com) user log. Both datasets contained questions from a wide range of domains, and therefore served as an excellent indicator of the domain transferability of our two feature sets.</Paragraph>
<Paragraph position="1"> Table 3 shows the results. For the FAQFinder data, LEX and LEXSEM produced comparable accuracy using both C5.0 and PEBLS. But for the AskJeeves data, LEXSEM did consistently worse than LEX with both classifiers. This means the additional semantic features were interacting with the lexical features.</Paragraph>
<Paragraph position="2"> We speculate the reason to be the following. Compared to the FAQFinder data, the AskJeeves data was gathered from a much wider audience, and the questions spanned a broad range of domains. Many terms in the questions came from a vocabulary considerably larger than that of our training set. Therefore, the data contained quite a few words whose hypernym links led to a semantic feature in LEXSEM but which did not fall into the question type keyed by that feature. For instance, a question in AskJeeves, "What does Hanukah mean?", was mis-classified as type TME by using LEXSEM. This was because "Hanukah" in WordNet is encoded as a hyponym of "time period". On the other hand, LEX did not include "Hanukah", and thus correctly classified the question as type DEF.</Paragraph>
</Section>
</Section>
</Paper>