<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1150"> <Title>Learning Question Classifiers</Title> <Section position="4" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 Learning a Question Classifier </SectionTitle> <Paragraph position="0"> Using machine learning methods for question classification is advantageous over manual methods for several reasons. The construction of a manual classifier for questions is a tedious task that requires the analysis of a large number of questions. Moreover, mapping questions into fine classes requires the use of lexical items (specific words) and therefore an explicit representation of the mapping may be very large. On the other hand, in our learning approach one can define only a small number of &quot;types&quot; of features, which are then expanded in a data-driven way to a potentially large number of features (Cumby and Roth, 2000), relying on the ability of the learning process to handle it. It is hard to imagine writing explicitly a classifier that depends on thousands or more features. Finally, a learned classifier is more flexible to reconstruct than a manual one because it can be trained on a new taxonomy in a very short time.</Paragraph> <Paragraph position="1"> One way to exhibit the difficulty in manually constructing a classifier is to consider reformulations of a question: What tourist attractions are there in Reims? What are the names of the tourist attractions in Reims? What do most tourists visit in Reims? What attracts tourists to Reims? What is worth seeing in Reims? All these reformulations target the same answer type Location. However, different words and syntactic structures make it difficult for a manual classifier based on a small set of rules to generalize well and map all these to the same answer type. Good learning methods with appropriate features, on the other hand, may not suffer from the fact that the number of potential features (derived from words and syntactic structures) is so large and would generalize and classify these cases correctly.</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.1 A Hierarchical Classifier </SectionTitle> <Paragraph position="0"> Question classification is a multi-class classification. A question can be mapped to one of 50 possible classes (We call the set of all possible class labels for a given question a confusion set (Golding and Roth, 1999)). Our learned classifier is based on the SNoW learning architecture (Carlson et al., 1999; Roth, 1998) where, in order to allow the classifier to output more than one class label, we map the classifier's output activation into a conditional probability of the class labels and threshold it.</Paragraph> <Paragraph position="1"> The question classifier makes use of a sequence of two simple classifiers (Even-Zohar and Roth, 2001), each utilizing the Winnow algorithm within SNoW. The first classifies questions into coarse classes (Coarse Classifier) and the second into fine classes (Fine Classifier). A feature extractor automatically extracts the same features for each classifier. The second classifier depends on the first in that its candidate labels are generated by expanding the set of retained coarse classes from the first into a set of fine classes; this set is then treated as the confusion set for the second classifier.</Paragraph> <Paragraph position="2"> Figure 1 shows the basic structure of the hierarchical classifier. 
During either the training or the testing stage, a question is processed along one path top-down to get classified.</Paragraph> <Paragraph position="3"> The initial confusion set of any question is the set of all coarse classes. The coarse classifier retains a subset of coarse labels, each of which is expanded through the class hierarchy into its fine classes; the union of these fine classes becomes the confusion set of the fine classifier, which in turn retains a subset of fine labels. The retained coarse and fine labels are the ultimate outputs from the whole classifier which are used in our evaluation.</Paragraph> </Section>
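To make the control flow of the cascade concrete, the following is a minimal sketch, not the authors' implementation; coarse_classifier, fine_classifier and coarse_to_fine are hypothetical stand-ins for the two SNoW classifiers and the class hierarchy:

    # Hypothetical sketch of the two-stage cascade described in Sect. 3.1.
    # In the real system both stages are SNoW (Winnow) classifiers whose
    # outputs are thresholded by the decision model of Sect. 3.3.
    def classify_question(features, coarse_classifier, fine_classifier, coarse_to_fine):
        # Stage 1: retain a small set of likely coarse classes.
        coarse_labels = coarse_classifier(features)
        # Expand each retained coarse class into its fine classes; their union
        # becomes the confusion set of the second stage.
        fine_candidates = set()
        for c in coarse_labels:
            fine_candidates.update(coarse_to_fine[c])
        # Stage 2: classify only within the reduced confusion set.
        fine_labels = fine_classifier(features, fine_candidates)
        return coarse_labels, fine_labels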
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Feature Space </SectionTitle> <Paragraph position="0"> Each question is analyzed and represented as a list of features to be treated as a training or test example for learning. We use several types of features and investigate below their contribution to the QC accuracy.</Paragraph> <Paragraph position="1"> The primitive feature types extracted for each question include words, pos tags, chunks (non-overlapping phrases) (Abney, 1991), named entities, head chunks (e.g., the first noun chunk in a sentence) and semantically related words (words that often occur with a specific question class).</Paragraph> <Paragraph position="2"> Over these primitive features (which we call &quot;sensors&quot;) we use a set of operators to compose more complex features, such as conjunctive (n-gram) and relational features, as in (Cumby and Roth, 2000; Roth and Yih, 2001). A simple script that describes the &quot;types&quot; of features used (e.g., the conjunction of two consecutive words and their pos tags) is written, and the features themselves are extracted in a data-driven way. Only &quot;active&quot; features are listed in our representation, so that despite the large number of potential features, the size of each example is small.</Paragraph> <Paragraph position="3"> Among the 6 primitive feature types, pos tags, chunks and head chunks are syntactic features, while named entities and semantically related words are semantic features. Pos tags are extracted using a SNoW-based pos tagger (Even-Zohar and Roth, 2001). Chunks are extracted using a previously learned classifier (Punyakanok and Roth, 2001; Li and Roth, 2001). The named entity classifier is also learned and makes use of the same technology developed for the chunker (Roth et al., 2002). The 'related word' sensors were constructed semi-automatically. Most question classes have a semantically related word list, and features are extracted for a class if a word in the question belongs to its list. For example, when &quot;away&quot;, which belongs to a list of words semantically related to the class distance, occurs in the sentence, the sensor Rel(distance) will be active. We note that the features from these sensors are different from those obtained using named entities, since they support a more general &quot;semantic categorization&quot; and include nouns, verbs and adjectives rather than just named entities.</Paragraph> <Paragraph position="4"> For the sake of the experimental comparison, we define six feature sets, each of which is an incremental combination of the primitive feature types. That is, Feature set 1 (denoted by Word) contains word features; Feature set 2 (Pos) contains features composed of words and pos tags, and so on; the final feature set, Feature set 6 (RelWord), contains all the feature types and is the only one that contains the related word lists. We experiment with the classifiers using different feature sets to test the influence of different features. Overall, there are about 200,000 features in the feature space of RelWord due to the generation of complex features over simple feature types. For each question, up to a couple of hundred of them are active.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Decision Model </SectionTitle> <Paragraph position="0"> For both the coarse and fine classifiers, the same decision model is used to choose class labels for a question. Given a confusion set and a question, SNoW outputs a density over the classes derived from the activation of each class. After ranking the classes in decreasing order of density values, we take as the possible class labels the top-ranked classes whose cumulative density reaches a threshold T, allowing at most five labels per question. If we treat the density of Class i as the probability that a question belongs to Class i, the decision model yields a reasonable probabilistic interpretation. We use T = 0.95 in the experiments.</Paragraph> </Section> </Section>
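As a concrete illustration of this decision model, here is a minimal sketch (a reconstruction under the reading above, not the authors' code) that selects labels from a ranked density estimate using the threshold T = 0.95 and a cap of five labels:

    def select_labels(densities, T=0.95, max_labels=5):
        """densities: mapping from class label to density (values sum to 1)."""
        ranked = sorted(densities.items(), key=lambda kv: kv[1], reverse=True)
        chosen, total = [], 0.0
        for label, p in ranked:
            chosen.append(label)
            total += p
            # Stop once the cumulative density reaches T or the cap is hit.
            if total >= T or len(chosen) == max_labels:
                break
        return chosen

    # A confident prediction yields a single label ...
    print(select_labels({"city": 0.96, "country": 0.03, "state": 0.01}))
    # ... while an ambiguous one yields several.
    print(select_labels({"city": 0.50, "country": 0.30, "state": 0.15, "other": 0.05}))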
<Section position="5" start_page="2" end_page="3" type="metho"> <SectionTitle> 4 Experimental Study </SectionTitle> <Paragraph position="0"> We designed two experiments to test the accuracy of our classifier on TREC questions. The first experiment evaluates the contribution of different feature types to the quality of the classification. Our hierarchical classifier is trained and tested using one of the six feature sets defined in Sect. 3.2 (we repeated the experiments on several different training and test sets). In the second experiment, we evaluate the advantage we get from the hierarchical classifier. We construct a multi-class classifier only for fine classes. This flat classifier takes all fine classes as its initial confusion set and classifies a question into fine classes directly. Its parameters and decision model are the same as those of the hierarchical one. By comparing this flat classifier with our hierarchical classifier in classifying fine classes, we hope to learn whether the hierarchical classifier has any advantage in performance, in addition to the advantages it might have in downstream processing and comprehensibility.</Paragraph> <Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 4.1 Data </SectionTitle> <Paragraph position="0"> Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10, which serve as our test set.</Paragraph> <Paragraph position="1"> These questions were manually labeled according to our question hierarchy. Although we allow multiple labels for one question in our classifiers, in our labeling, for simplicity, we assigned exactly one label to each question. (The annotated data and experimental results are available from http://L2R.cs.uiuc.edu/~cogcomp/.) Our annotators were requested to choose the most suitable class according to their own understanding. This methodology might cause slight problems in training, when the labels are ambiguous, since some questions are not treated as positive examples for possible classes as they should be. In training, we divide the 5,500 questions from the first three sources randomly into 5 training sets of 1,000, 2,000, 3,000, 4,000 and 5,500 questions. All 500 TREC 10 questions are used as the test set.</Paragraph> </Section> <Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.2 Evaluation </SectionTitle> <Paragraph position="0"> In this paper, we count the number of correctly classified questions under two different precision standards, P1 and P≤5. P1 corresponds to the usual definition of precision, which allows only one label for each question, while P≤5 allows multiple labels: a question is counted as correct if the true label is among the (at most five) labels output by the classifier. P≤5 reflects the accuracy of our classifier with respect to later stages in a question answering system. As the results below show, although question classes are still ambiguous, few mistakes are introduced by our classifier in this step.</Paragraph> </Section>
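Under this reading of the two standards, a minimal sketch of the evaluation (hypothetical helper names, not the authors' scripts) could look as follows, where predictions holds the ranked label list output for each question and gold the single annotated label:

    def p1(predictions, gold):
        # A question counts as correct only if its top-ranked label matches.
        correct = sum(1 for preds, g in zip(predictions, gold) if preds and preds[0] == g)
        return correct / len(gold)

    def p_leq_5(predictions, gold):
        # A question counts as correct if the true label is among the
        # (at most five) labels that were output.
        correct = sum(1 for preds, g in zip(predictions, gold) if g in preds[:5])
        return correct / len(gold)

    # Example with three questions:
    preds = [["city"], ["individual", "title"], ["animal", "speed"]]
    gold = ["city", "individual", "speed"]
    print(p1(preds, gold), p_leq_5(preds, gold))  # 0.666..., 1.0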
<Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.3 Experimental Results </SectionTitle> <Paragraph position="0"> Performance of the hierarchical classifier Table 2 shows the P≤5 precision of the hierarchical classifier when trained on 5,500 examples and tested on the 500 TREC 10 questions. The results are quite encouraging; question classification is shown to be solved effectively using machine learning techniques. It also shows the contribution of the feature sets we defined. Overall, we get a 98.80% precision for coarse classes with all the features and 95% for the fine classes.</Paragraph> <Paragraph position="1"> (Table 2: P≤5 precision of the hierarchical classifier on 500 TREC 10 questions. Training is done on 5,500 questions. Columns show the performance for different feature sets and rows show the precision for coarse and fine classes, resp. All the results are evaluated using P≤5.)</Paragraph> <Paragraph position="2"> Inspecting the data carefully, we can observe the significant contribution of the features constructed from the semantically related word sensors. It is interesting to observe that this improvement is even more significant for fine classes.</Paragraph> <Paragraph position="3"> We also measured the precision of the hierarchical classifier on training sets of different sizes to exhibit the learning curve for this problem.</Paragraph> <Paragraph position="4"> We note that the average numbers of labels output by the coarse and fine classifiers are 1.54 and 2.05 resp. (using the feature set RelWord and 5,500 training examples), which shows the decision model is accurate as well as efficient.</Paragraph> <Paragraph position="5"> Comparison of the hierarchical and the flat classifier The flat classifier consists of one classifier which is almost the same as the fine classifier in the hierarchical case, except that its initial confusion set is the whole set of fine classes. Our original hope was that the hierarchical classifier would have a better performance, given that its fine classifier only needs to deal with a smaller confusion set. However, it turns out that there is a tradeoff between this factor and the inaccuracy, albeit small, of the coarse level prediction. As the results show, there is no performance advantage for using a level of coarse classes, and the semantically appealing coarse classes do not contribute to better performance.</Paragraph> <Paragraph position="6"> Figure 2 gives some more intuition on the flat vs. hierarchical issue. We define the tendency of Class i to be confused with Class j as D_ij = (Err_ij + Err_ji) / (N_i + N_j), where Err_ij is the number of questions in Class i that are misclassified as belonging to Class j, and N_i and N_j are the numbers of questions in Class i and j resp. Figure 2 is a gray-scale map of the matrix D[n,n]: the color of the small box in position (i,j) denotes D_ij; the larger D_ij is, the darker the color is; the dotted lines separate the 6 coarse classes. D[n,n] is so sparse that most parts of the graph are blank. We can see that there is no good clustering of fine class mistakes within a coarse class, which explains intuitively why the hierarchical classifier with an additional level of coarse classes does not work much better.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 4.4 Discussion and Examples </SectionTitle> <Paragraph position="0"> We have shown that the overall accuracy of our classifier is satisfactory. Indeed, all the reformulation questions that we exemplified in Sec. 3 have been correctly classified. Nevertheless, it is constructive to consider some cases in which the classifier fails.</Paragraph> <Paragraph position="1"> Below are some examples misclassified by the hierarchical classifier.</Paragraph> <Paragraph position="2"> What French ruler was defeated at the battle of Waterloo? The correct label is individual, but the classifier, failing to relate the word &quot;ruler&quot; to a person, since it was not in any semantic list, outputs event.</Paragraph> <Paragraph position="3"> What is the speed hummingbirds fly? The correct label is speed, but the classifier outputs animal. Our feature sensors fail to determine that the focus of the question is 'speed'. This example illustrates the necessity of identifying the question focus by analyzing syntactic structures.</Paragraph> <Paragraph position="4"> What do you call a professional map drawer? The classifier returns other entities instead of equivalent term. In this case, both classes are acceptable. The ambiguity causes the classifier not to output equivalent term as the first choice.</Paragraph> </Section> </Section> </Paper>