<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1040">
<Title>Enhanced Answer Type Inference from Questions using Sequential Models</Title>
<Section position="3" start_page="315" end_page="316" type="metho"> <SectionTitle> 2 Informer overview </SectionTitle>
<Paragraph position="0"> Our key insight is that a human can classify a question based on very few tokens gleaned from skeletal syntactic information. This is certainly true of the most trivial classes (Who wrote Hamlet? or How many dogs pull a sled at Iditarod?) but is also true of more subtle clues (How much does a rhino weigh?).</Paragraph>
<Paragraph position="1"> In fact, informal experiments revealed the surprising property that only one contiguous span of tokens is adequate for a human to classify a question. E.g., in the above question, a human does not even need the how much clue once the word weigh is available. In fact, &quot;How much does a rhino cost?&quot; has an identical syntax but a completely different answer type, not revealed by how much alone. The only exceptions to the single-span hypothesis are multifunction questions like &quot;What is the name and age of ...&quot;, which should be assigned to multiple answer types. In this paper we consider questions where one type suffices.</Paragraph>
<Paragraph position="2"> Consider another question with multiple clues: Who is the CEO of IBM? In isolation, the clue who merely tells us that the answer might be a person or country or organization, while CEO is perfectly precise, rendering who unnecessary. All of the above applies a fortiori to what and which clues, which are essentially uninformative on their own, as in &quot;What is the distance between Pisa and Rome?&quot; Conventional QA systems use mild analysis on the wh-clues, and need much more sophistication on the rest of the question (e.g. inferring author from wrote, and even verb subcategorization). We submit that a single, minimal, suitably chosen contiguous span of question tokens, defined as the informer span of the question, is adequate for question classification. The informer span is very sensitive to the structure of clauses, phrases and possessives in the question, as is clear from these examples (informers italicized): &quot;What is Bill Clinton's wife's profession&quot;, and &quot;What country's president was shot at Ford's Theater&quot;. The choice of informer spans also depends on the target classification system. Initially we wished to handle definition questions separately, and marked no informer tokens in &quot;What is digitalis&quot;. However, what is is an excellent informer for the UIUC class DESC:def (description, definition).</Paragraph> </Section>
<Section position="4" start_page="316" end_page="321" type="metho"> <SectionTitle> 3 The meta-learning approach </SectionTitle>
<Paragraph position="0"> We propose a meta-learning approach (§3.1) in which the SVM can use features from the original question as well as its informer span. We show (§3.2) that human-annotated informer spans lead to large improvements in accuracy. However, we show (§3.3) that simple heuristic extraction rules commonly used in QA systems (e.g. head of noun phrase following the wh-word) cannot provide informers that are nearly as useful.
This naturally leads us to designing an informer tagger in §4.</Paragraph>
<Paragraph position="1"> Figure 1 shows our meta-learning (Chan and Stolfo, 1993) framework. The combiner is a linear multi-class one-vs-one SVM, as in the Zhang and Lee (2003) baseline. We did not use ECOC (Hacioglu and Ward, 2003) because the reported gain is less than 1%.</Paragraph>
<Paragraph position="2"> The word feature extractor selects unigrams and q-grams from the question. In our experience, q = 1 or q = 2 were best; if unspecified, all possible q-grams were used. Through tuning, we also found that the SVM &quot;C&quot; parameter (used to trade off training data fit against model complexity) must be set to 300 to reproduce their published baseline numbers.</Paragraph>
<Section position="1" start_page="316" end_page="317" type="sub_section"> <SectionTitle> 3.1 Adding informer features </SectionTitle>
<Paragraph position="0"> We propose two very simple ways to derive features from informers for use with SVMs. Initially, assume that perfect informers are known for all questions; later (§4) we study how to predict informers. Informer q-grams: This comprises all word q-grams within the informer span, for all possible q. E.g., such features enable effective exploitation of informers like length or height to classify to the NUMBER:distance class in the UIUC data.</Paragraph>
<Paragraph position="1"> Informer q-gram hypernyms: For each word or compound within the informer span that is a WordNet noun, we add all hypernyms of all senses. The intuition is that the informer (e.g. author, cricketer, CEO) is often narrower than a broad question class (HUMAN:individual). Following hypernym links up to person via WordNet produces a more reliably correlated feature.</Paragraph>
<Paragraph position="2"> Given informers, other question words might seem useless to the classifier. However, retaining regular features from other question words is an excellent idea for the following reasons.</Paragraph>
<Paragraph position="3"> First, we kept word sense disambiguation (WSD) outside the scope of this work because WSD entails computation costs, and is unlikely to be reliable on short single-sentence questions. Questions like How long ... or Which bank ... can thus become ambiguous and corrupt the informer hypernym features. Additional question words can often help nail the correct class despite the feature corruption.</Paragraph>
<Paragraph position="4"> Second, while our CRF-based approach to informer span tagging is better than obvious alternatives, it still has a 15% error rate. For the questions where the CRF prediction is wrong, features from non-informer words give the SVM an opportunity to still pick the correct question class.</Paragraph>
<Paragraph position="5"> Word features: Based on the above discussion, one boolean SVM feature is created for every word q-gram over all question tokens. In experiments, we found bigrams (q = 2) to be most effective, closely followed by unigrams (q = 1). As with informers, we can also use hypernyms of regular words as SVM features (marked &quot;Question bigrams + hypernyms&quot; in Table 2).</Paragraph> </Section>
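A minimal sketch of the §3.1 features, assuming NLTK's WordNet interface and scikit-learn are available: it builds informer q-grams, WordNet hypernyms of informer nouns, and question word q-grams, and feeds them to a linear one-vs-one SVM with C = 300. The function names, feature prefixes, spans, and toy labels below are ours, for illustration only; they are not the paper's code.

```python
# Illustrative sketch of the Section 3.1 features (not the authors' implementation).
# Assumes NLTK (with the WordNet corpus installed) and scikit-learn.
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

def qgrams(tokens, qmax=2):
    """All word q-grams for q = 1..qmax."""
    return ["_".join(tokens[i:i + q])
            for q in range(1, qmax + 1)
            for i in range(len(tokens) - q + 1)]

def informer_features(tokens, span):
    """Informer q-grams plus WordNet hypernyms of all noun senses of informer words."""
    start, end = span                                   # informer span, end exclusive
    inf = tokens[start:end]
    feats = {"INF_" + g: 1 for g in qgrams(inf, qmax=len(inf))}
    for tok in inf:
        for syn in wn.synsets(tok, pos=wn.NOUN):        # all noun senses
            for hyp in syn.closure(lambda s: s.hypernyms()):
                feats["HYP_" + hyp.name()] = 1          # e.g. person.n.01 for "ceo"
    return feats

def question_features(tokens, span):
    feats = {"Q_" + g: 1 for g in qgrams(tokens, qmax=2)}   # unigrams + bigrams
    feats.update(informer_features(tokens, span))
    return feats

# Tiny toy set: (tokens, informer span, class label); labels are illustrative.
data = [
    ("who is the ceo of ibm".split(), (3, 4), "HUM:ind"),
    ("how much does a rhino weigh".split(), (5, 6), "NUM:weight"),
    ("what is the capital city of japan".split(), (3, 5), "LOC:city"),
]
vec = DictVectorizer()
X = vec.fit_transform([question_features(t, s) for t, s, _ in data])
y = [label for _, _, label in data]
clf = SVC(kernel="linear", C=300).fit(X, y)   # SVC is one-vs-one for multi-class
```

With a linear kernel, SVC trains one-vs-one binary classifiers for the multi-class problem, matching the combiner described above; in practice the informer span would come from hand tags (§3.2) or from the CRF of §4.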
<Section position="2" start_page="317" end_page="317" type="sub_section"> <SectionTitle> 3.2 Benefits from &quot;perfect&quot; informers </SectionTitle>
<Paragraph position="0"> We first wished to test the hypothesis that providing informer spans to an SVM learner can improve classification accuracy. Over and above the class labels, we had two volunteers tag the 6000 UIUC questions with informer spans (which we call &quot;perfect&quot; informer spans). Table 2 reports accuracy with &quot;perfect&quot; informer spans and various feature encodings. Observe in Table 2 that the unigram baseline is already quite competitive with the best prior numbers, and exploiting perfect informer spans beats all known numbers. It is clear that both informer q-grams and informer hypernyms are very valuable features for question classification. The fact that no improvement was obtained over Question bigrams by adding Question hypernyms highlights the importance of choosing a few relevant tokens as informers and designing suitable features on them.</Paragraph>
<Paragraph position="1"> Table 3 (columns b and e) shows the benefits from perfect informers broken down into broad question types. Questions with what as the trigger are the biggest beneficiaries, and they also form by far the most frequent category.</Paragraph>
<Paragraph position="2"> The remaining question, one that we address in the rest of the paper, is whether we can effectively and accurately automate the process of providing informer spans to the question classifier.</Paragraph> </Section>
<Section position="3" start_page="317" end_page="318" type="sub_section"> <SectionTitle> 3.3 Informers provided by heuristics </SectionTitle>
<Paragraph position="0"> In §4 we will propose a non-trivial solution to the informer-tagging problem. Before that, we must justify that such machinery is indeed required.</Paragraph>
<Paragraph position="1"> Some leading QA systems extract words very similar in function to informers from the parse tree of the question. Some (Singhal et al., 2000) pick the head of the first noun phrase detected by a shallow parser, while others use the head of the noun phrase adjoining the main verb (Ramakrishnan et al., 2004). Yet others (Harabagiu et al., 2000; Hovy et al., 2001) use hundreds of (unpublished, to our knowledge) hand-built pattern-matching rules on the output of a full-scale parser.</Paragraph>
<Paragraph position="2"> A natural baseline is to use these extracted words, which we call &quot;heuristic informers&quot;, with an SVM just as we used &quot;perfect&quot; informers. All that remains is to make the heuristics precise; a code sketch of these rules follows this subsection.</Paragraph>
<Paragraph position="3"> How: For questions starting with how, we use the bigram starting with how unless the next word is a verb.</Paragraph>
<Paragraph position="4"> Wh: If the wh-word is not how, what or which, use the wh-word in the question as a separate feature.</Paragraph>
<Paragraph position="5"> WhNP: For questions having what and which, use the WHNP if it encloses a noun. The WHNP is the noun phrase corresponding to the wh-word, as given by a sentence parser (see §4.2).</Paragraph>
<Paragraph position="6"> NP1: Otherwise, for what and which questions, the first (leftmost) noun phrase is added to yet another feature subspace.</Paragraph>
<Paragraph position="7"> Table 3 (columns c and f) shows that even these already rather messy heuristic informers do not capture the same signal quality as &quot;perfect&quot; informers. Our findings corroborate Li and Roth (2002), who report little benefit from adding head chunk features for the fine classification task.</Paragraph>
<Paragraph position="8"> Moreover, observe that using heuristic informer features without any word features leads to rather poor performance (column c), unlike using perfect informers (column b) or even CRF-predicted informers (column d, see §4). These results clearly establish that the notion of an informer is nontrivial.</Paragraph>
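The sketch below is our paraphrase of the How / Wh / WhNP / NP1 rules above, not the authors' implementation. It assumes the question has already been tokenized and POS-tagged, and that the WHNP (if any) and the other noun phrases have been read off a parser's output; the paper places each rule's output in its own feature subspace, whereas here we simply return the chosen tokens.

```python
# Paraphrase of the Section 3.3 heuristics; illustrative only.
# tokens/tags are parallel lists (Penn Treebank POS tags); whnp and noun_phrases
# are token lists assumed to have been extracted from a parse beforehand.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
OTHER_WH = {"who", "whom", "whose", "when", "where", "why"}

def heuristic_informer(tokens, tags, whnp=None, noun_phrases=()):
    first = tokens[0].lower()
    # How: the bigram starting with "how", unless the next word is a verb.
    if first == "how":
        if len(tokens) > 1 and tags[1] not in VERB_TAGS:
            return tokens[0:2]
        return []
    # Wh: for wh-words other than how/what/which, the wh-word itself is the clue.
    if first in OTHER_WH:
        return [tokens[0]]
    # WhNP: for what/which questions, use the WHNP if it encloses a noun.
    if first in {"what", "which"}:
        if whnp and any(tag.startswith("NN")
                        for tok, tag in zip(tokens, tags) if tok in whnp):
            return list(whnp)
        # NP1: otherwise fall back to the first (leftmost) noun phrase.
        if noun_phrases:
            return list(noun_phrases[0])
    return []

# "What country is the largest producer of wheat?"
tokens = "What country is the largest producer of wheat".split()
tags = ["WP", "NN", "VBZ", "DT", "JJS", "NN", "IN", "NN"]
print(heuristic_informer(tokens, tags, whnp=["What", "country"],
                         noun_phrases=[["the", "largest", "producer"]]))
# -> ['What', 'country']
```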
<SectionTitle> 4 Using CRFs to label informers </SectionTitle>
<Paragraph position="8"> Given that informers are useful but nontrivial to recognize, the next natural question is: how can we learn to identify them automatically? From earlier sections it is clear (and we give evidence later; see Table 5) that sequence and syntax information will be important. (Table 3 key - a: question bigrams, b: perfect informers only, c: heuristic informers only, d: CRF informers only, e-g: bigrams plus perfect, heuristic and CRF informers.)</Paragraph>
<Paragraph position="10"> We will model informer span identification as a sequence tagging problem. An automaton makes probabilistic transitions between hidden states y, one of which is an &quot;informer generating state&quot;, and emits tokens x. We observe the tokens and have to guess which were produced from the &quot;informer generating state&quot;.</Paragraph>
<Paragraph position="11"> Hidden Markov models are extremely popular for such applications, but recent work has shown that conditional random fields (CRFs) (Lafferty et al., 2001; Sha and Pereira, 2003) have a consistent advantage over traditional HMMs in the face of many redundant features. We refer the reader to the above references for a detailed treatment of CRFs. Here we will regard a CRF largely as a black box.</Paragraph>
<Paragraph position="12"> To train a CRF, we need a set of state nodes, a transition graph on these nodes, and tokenized text where each token is assigned a state. Once the CRF is trained, it can assign the most likely state sequence to the tokens of a new question.</Paragraph> </Section>
<Section position="4" start_page="318" end_page="319" type="sub_section"> <SectionTitle> 4.1 State transition models </SectionTitle>
<Paragraph position="0"> We started with the common 2-state &quot;in/out&quot; model used in information extraction, shown in the left half of Figure 2. State &quot;1&quot; is the informer-generating state. Either state can be an initial and a final (double circle) state. (Figure 2: the 2-state and 3-state transition models, illustrated on &quot;What kind of an animal is Winnie the Pooh&quot;.)</Paragraph>
<Paragraph position="1"> The 2-state model can be myopic. Consider the question pair A: What country is the largest producer of wheat? B: Name the largest producer of wheat. The i±1 context of producer is identical in A and B. In B, for want of a better informer, we would want producer to be flagged as the informer, although it might refer to a country, person, animal, company, etc. But in A, country is far more precise. Any 2-state model that depends on positions i±1 to define features will fail to distinguish between A and B, and might select both country and producer in A. As we have seen with heuristic informers, polluting the informer pool can significantly hurt SVM accuracy.</Paragraph>
<Paragraph position="2"> Therefore we also use the 3-state &quot;begin/in/out&quot; (BIO) model. The initial state cannot be &quot;2&quot; in the 3-state model; all states can be final. The 3-state model allows at most one informer span. Once the 3-state model chooses country as the informer, it is unlikely to stretch state 1 up to producer.</Paragraph>
<Paragraph position="3"> There is no natural significance to using four or more states. Besides, longer-range syntax dependencies are already largely captured by the parser.</Paragraph> </Section>
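To make the difference between the two models concrete, the sketch below (our own encoding, not the paper's) writes down the allowed transitions and checks whether a state sequence is well-formed; the 3-state BIO model rejects any sequence with more than one contiguous informer span.

```python
# Illustrative encoding of the 2-state "in/out" and 3-state "begin/in/out" models
# of Section 4.1 (our sketch, not the authors' code).
# States: 0 = outside informer, 1 = inside informer, 2 = after the informer (3-state only).
TRANSITIONS = {
    "2-state": {(0, 0), (0, 1), (1, 0), (1, 1)},          # fully connected
    "3-state": {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)},  # 0 -> 1 -> 2, no way back
}
START_STATES = {"2-state": {0, 1}, "3-state": {0, 1}}      # state 2 cannot be initial

def is_valid(states, model):
    """True iff the state sequence respects the model's transition graph."""
    if not states or states[0] not in START_STATES[model]:
        return False
    return all((a, b) in TRANSITIONS[model] for a, b in zip(states, states[1:]))

# "What country is the largest producer of wheat": under the 3-state model,
# once state 1 is left after "country", it cannot be re-entered at "producer".
two_spans = [0, 1, 0, 0, 0, 1, 0, 0]   # country ... producer both tagged
print(is_valid(two_spans, "2-state"))  # True  - the 2-state model permits this
print(is_valid(two_spans, "3-state"))  # False - at most one contiguous informer span
```

In a real tagger these constraints would be imposed through the CRF's transition graph during Viterbi decoding rather than checked after the fact.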
<Section position="5" start_page="319" end_page="320" type="sub_section"> <SectionTitle> 4.2 Features from a parse of the question </SectionTitle>
<Paragraph position="0"> Sentences with similar parse trees are likely to have the informer in similar positions. This was the intuition behind Zhang et al.'s tree kernel, and is also our starting point. We used the Stanford Lexicalized Parser (Klein and Manning, 2003) to parse the question. (We assume familiarity with parse tree notation for lack of space.) Figure 3 shows a sample parse tree organized in levels. Our first step was to translate the parse tree into an equivalent multi-resolution tabular format, shown in Table 4.</Paragraph>
<Paragraph position="2"> (Table 4: the question parse of &quot;What is the capital city of Japan&quot; showing tag and num attributes; capital city is the informer span with y = 1.)</Paragraph>
<Paragraph position="3"> Cells and attributes: A labeled question comprises the token sequence x_i, i = 1, 2, ..., and the label sequence y_i, i = 1, 2, .... Each x_i leads to a column vector of observations. Therefore we use matrix notation to write down x: a table cell is addressed as x[i,ℓ], where i is the token position (column index) and ℓ is the level or row index, 1-6 in this example. (Although the parse tree can be arbitrarily deep, we found that using features from up to level ℓ = 2 was adequate.) Intuitively, much of the information required for spotting an informer can be obtained from the part of speech of the tokens and phrase/clause attachment information. Conversely, specific word information is generally sparse and misleading; the same word may or may not be an informer depending on its position. E.g., &quot;What birds eat snakes?&quot; and &quot;What snakes eat birds?&quot; have the same words but different informers. Accordingly, we observe two properties at each cell: tag: The syntactic class assigned to the cell by the parser, e.g. x[4,2].tag = NP. It is well known that POS and chunk information are major clues to informer-tagging; specifically, informers are often nouns or noun phrases.</Paragraph>
<Paragraph position="4"> num: Many heuristics exploit the fact that the first NP is known to have a higher chance of containing informers than subsequent NPs. To capture this positional information, we define the num of a cell at [i,ℓ] as one plus the number of distinct contiguous chunks to the left of [i,ℓ] with tags equal to x[i,ℓ].tag. E.g., at level 2 in the table above, the capital city forms the first NP, while Japan forms the second NP. Therefore x[7,2].num = 2.</Paragraph>
<Paragraph position="5"> In conditional models, it is notationally convenient to express features as functions on (x_i, y_i). To one unfamiliar with CRFs, it may seem strange that y_i is passed as an argument to features. At training time, y_i is indeed known, and at testing time the CRF algorithm efficiently finds the most probable sequence of y_i's using a Viterbi search. True labels are not revealed to the CRF at testing time.</Paragraph>
<Paragraph position="6"> Cell features IsTag and IsNum: E.g., the observation &quot;y_4 = 1 and x[4,2].tag = NP&quot; is captured by the statement that &quot;position 4 fires the feature IsTag_{1,NP,2}&quot; (which has a boolean value). There is an IsTag_{y,t,ℓ} feature for each (y,t,ℓ) triplet. Similarly, for every possible state y, every possible num value n (up to some maximum horizon), and every level ℓ, we define boolean features IsNum_{y,n,ℓ}. E.g., position 7 fires the feature IsNum_{2,2,2} in the 3-state model, capturing the statement &quot;x[7,2].num = 2 and y_7 = 2&quot;.</Paragraph>
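The cell features can be spelled out mechanically. The sketch below is ours: it assumes the tabular parse representation has already been built (the tag and num values shown are illustrative, not read from a real parser run), and emits the IsTag and IsNum feature names fired at one token position.

```python
# Our illustrative sketch of the IsTag / IsNum cell features of Section 4.2;
# the table layout and feature-name spelling are assumptions, not the paper's code.
# cell_table[level][i] = (tag, num) for 0-based token position i, level 1 or 2.

def cell_features(cell_table, i, y, levels=(1, 2)):
    """Boolean feature names fired at token position i when its state is y."""
    feats = []
    for level in levels:
        tag, num = cell_table[level][i]
        feats.append(f"IsTag_{y}_{tag}_{level}")
        feats.append(f"IsNum_{y}_{num}_{level}")
    return feats

# "What is the capital city of Japan" (positions 1..7), levels 1 (POS) and 2 (chunk).
# Values are for illustration; a real system would read them off the parser output.
cell_table = {
    1: [("WP", 1), ("VBZ", 1), ("DT", 1), ("NN", 1), ("NN", 1), ("IN", 1), ("NNP", 1)],
    2: [("WHNP", 1), ("VP", 1), ("NP", 1), ("NP", 1), ("NP", 1), ("PP", 1), ("NP", 2)],
}
# 0-based index 6 = position 7 (Japan); with state y = 2 in the 3-state model
# this fires, among others, IsNum_2_2_2 as described in the text.
print(cell_features(cell_table, 6, y=2))
```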
<Paragraph position="7"> Adjacent cell features IsPrevTag and IsNextTag: Context can be exploited by a CRF by coupling the state at position i with observations at positions adjacent to i (extending to larger windows did not help). To capture this, we use more boolean features: position 4 fires the feature IsPrevTag_{1,DT,1} because x[3,1].tag = DT and y_4 = 1. There is an IsPrevTag_{y,t,ℓ} and an IsNextTag_{y,t,ℓ} feature for each possible (y,t,ℓ) triple.</Paragraph>
<Paragraph position="10"> State transition features IsEdge: Position i fires the feature IsEdge_{u,v} if y_{i-1} = u and y_i = v. There is one such feature for each state pair (u,v) allowed by the transition graph. In addition, we have sentinel features IsBegin_u and IsEnd_u marking the beginning and end of the token sequence.</Paragraph> </Section>
<Section position="6" start_page="320" end_page="321" type="sub_section"> <SectionTitle> 4.3 Informer-tagging accuracy </SectionTitle>
<Paragraph position="0"> We study the accuracy of our CRF-based informer tagger with respect to human informer annotations. In the next section we will see the effect of CRF tagging on question classification.</Paragraph>
<Paragraph position="1"> There are at least two useful measures of informer-tagging accuracy. Each question has a known set I_k of informer tokens, and gets a set of tokens I_c flagged as informers by the CRF. For each question, we can grant ourselves a reward of 1 if I_c = I_k, and 0 otherwise. In §3.1, informers were regarded as a separate (high-value) bag of words.</Paragraph>
<Paragraph position="2"> Therefore, overlap between I_c and I_k would be a reasonable predictor of question classification accuracy. We use the Jaccard similarity |I_k ∩ I_c| / |I_k ∪ I_c|. Table 5 shows the effect of using diverse feature sets. (Table 6: comparison with the heuristic baseline; Jaccard accuracy expressed as %.)</Paragraph>
<Paragraph position="3"> Table 6 shows that the 3-state CRF performs much better than the 2-state CRF, especially on difficult questions with what and which. It also compares the Jaccard accuracy of informers found by the CRF vs. informers found by the heuristics described in §3.3. Again we see a clear superiority of the CRF approach.</Paragraph>
<Paragraph position="4"> Unlike the heuristic approach, the CRF approach is relatively robust to the parser emitting a somewhat incorrect parse tree, which is not uncommon. The heuristic approach picks the &quot;easy&quot; informer, who, over the better one, CEO, in &quot;Who is the CEO of IBM&quot;. Its bias toward the NP head can also be a problem, as in &quot;What country's president ...&quot;.</Paragraph> </Section>
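Both informer-tagging metrics are a few lines of code; the following sketch (function names ours) implements the exact-match reward and the Jaccard overlap used above.

```python
# Sketch of the two informer-tagging accuracy measures of Section 4.3.
def exact_match(gold, predicted):
    """Reward 1 if the predicted informer token set equals the gold set, else 0."""
    return 1.0 if set(gold) == set(predicted) else 0.0

def jaccard(gold, predicted):
    """|gold ∩ predicted| / |gold ∪ predicted|; empty vs. empty counts as a match."""
    g, p = set(gold), set(predicted)
    return 1.0 if not g and not p else len(g & p) / len(g | p)

gold = {"capital", "city"}
predicted = {"capital", "city", "Japan"}     # CRF over-extends the span by one token
print(exact_match(gold, predicted))          # 0.0
print(round(jaccard(gold, predicted), 2))    # 0.67
```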
<Section position="7" start_page="321" end_page="321" type="sub_section"> <SectionTitle> 4.4 Question classification accuracy </SectionTitle>
<Paragraph position="0"> We have already seen in §3.2 that perfect knowledge of informers can be a big help. Because the CRF can make mistakes, the margin may decrease. In this section we study this issue.</Paragraph>
<Paragraph position="1"> We used questions with human-tagged informers (§3.2) to train a CRF. The CRF was applied back on the training questions to get informer predictions, which were used to train the 1-vs-1 SVM meta-learner (§3). Using CRF-tagged rather than human-tagged informers may seem odd, but this lets the SVM learn and work around systematic errors in the CRF outputs.</Paragraph>
<Paragraph position="2"> Results are shown in columns d and g of Table 3.</Paragraph>
<Paragraph position="3"> Despite the CRF tagger having an error rate of about 15%, we obtained 86.2% SVM accuracy, which is rather close to the SVM accuracy of 88% with perfect informers.</Paragraph>
<Paragraph position="4"> The CRF-generated tags, being on the training data, might be more accurate than they would be for unseen test cases, potentially misleading the SVM. This turns out not to be a problem: clearly we are very close to the upper bound of 88%. In fact, anecdotal evidence suggests that using CRF-assigned tags actually helped the SVM.</Paragraph> </Section> </Section> </Paper>