<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1203"> <Title>Parsing and Question Classification for Question Answering</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Question Treebank </SectionTitle> <Paragraph position="0"> In question answering, it is particularly important to achieve high accuracy when parsing the questions.</Paragraph> <Paragraph position="1"> There are often several text passages that contain an answer, so if the parser does not produce a sufficiently good parse tree for some of the answer sentences, there is still a good chance that the question can be answered correctly based on other sentences containing the answer. However, when the question is analyzed incorrectly, overall failure is much more likely.</Paragraph> <Paragraph position="2"> A scenario with a question in multiple variations, as cleverly exploited by the SMU team (Harabagiu, 2000) in TREC9 for roughly 10% of the 500 original questions, is probably more of an anomaly and cannot be assumed to be typical.</Paragraph> <Paragraph position="3"> Parsing accuracy of trained parsers is known to depend significantly on stylistic similarities between training corpus and application text. In the Penn Treebank, only about half a percent of all sentences from the Wall Street Journal are (full) questions. Many of these are rhetorical, such as &quot;So what's the catch?&quot; or &quot;But what about all those non-duck ducks flapping over Washington?&quot;. Many types of questions that are common in question answering are, however, severely underrepresented. For example, there are no questions beginning with the interrogatives When or How much, and there are no para-interrogative imperative sentences starting with &quot;Name&quot;, as in &quot;Name a Gaelic language&quot;.</Paragraph> <Paragraph position="4"> This finding is not surprising, since newspaper articles focus on reporting and are therefore predominantly declarative. We consequently have to expect lower accuracy when parsing questions than when parsing declarative sentences if the parser is trained on the Penn Treebank alone. This was confirmed by preliminary question parsing accuracy tests using a parser trained exclusively on sentences from the Wall Street Journal: question parsing accuracy rates were significantly lower than for regular newspaper sentences, even though one might have expected them to be higher, given that questions, on average, tend to be only half as long as newspaper sentences.</Paragraph> <Paragraph position="5"> To remedy this shortcoming, we treebanked additional questions of the kind we would expect in question answering. At this point, we have treebanked a total of 1153 questions, drawn from the TREC evaluations as well as from book and online resources, including answers.com. The online questions cover a wider cross-section of style, including yes-no questions (of which there was only one in the TREC question set), true-false questions (none in TREC), and questions with wh-determiner phrases1 (none in TREC). The additionally treebanked questions therefore complement the TREC questions.</Paragraph>
1 &quot;What country's national anthem does the movie Casablanca close to the strains of?&quot;
<Paragraph position="6"> The questions were treebanked using the deterministic shift-reduce parser CONTEX. Stepping through a question, the (human) treebanker simply hits the return key if the proposed parse action is correct, and types in the correct action otherwise. Given that the parser predicts over 90% of all individual steps correctly, this process is quite fast, most often taking significantly less than a minute per question once the parser had been trained on the first one hundred treebanked questions.</Paragraph>
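<Paragraph> To make the annotation workflow concrete, the following is a minimal sketch of such an accept-or-correct loop. The parser interface (initial_state, propose_next_action, apply, is_final) and the action strings are hypothetical placeholders for illustration only; they are not the actual CONTEX interface.
    # Hypothetical sketch of the accept-or-correct treebanking loop described above.
    # The parser object and its methods are placeholders, not the real CONTEX API.
    def treebank_question(parser, question):
        state = parser.initial_state(question)
        actions = []
        while not parser.is_final(state):
            proposed = parser.propose_next_action(state)    # e.g. "SHIFT" or "REDUCE 2 TO NP"
            typed = input("proposed: " + proposed + "   (return = accept) ").strip()
            action = proposed if typed == "" else typed     # accept the prediction or take the correction
            state = parser.apply(state, action)
            actions.append(action)
        return actions    # the recorded action sequence defines the treebanked parse tree
</Paragraph>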
<Paragraph position="7"> The treebanking process includes a &quot;sanity check&quot; after a sentence has been treebanked. The sanity check searches the treebanked parse tree for constituents with an uncommon sub-constituent structure and flags them for human inspection. This helps to eliminate most human errors. Here is an example of a (slightly simplified) question parse tree; see section 5 for a discussion of how the trees differ from the Penn Treebank II standard.</Paragraph> <Paragraph position="8">
[1] How much does one ton of cement cost?       [SNT,PRES,Qtarget: MONETARY-QUANTITY]
    (QUANT)          [2] How much               [INTERR-ADV]
        (MOD)            [3] How                [INTERR-ADV]
        (PRED)           [4] much               [ADV]
    (SUBJ LOG-SUBJ)  [5] one ton of cement      [NP]
        (QUANT)          [6] one ton            [NP,MASS-Q]
            (PRED)           [7] one ton        [NP-N,MASS-Q]
                (QUANT)          [8] one        [CARDINAL]
                (PRED)           [9] ton        [COUNT-NOUN]
        (PRED)           [10] of cement         [PP]
            (P)              [11] of            [PREP]
            (PRED)           [12] cement        [NP]
                (PRED)           [13] cement    [NOUN]
    (PRED)           [14] does cost             [VERB,PRES]
        (AUX)            [15] does              [AUX]
        (PRED)           [16] cost              [VERB]
    (DUMMY)          [17] ?                     [QUESTION-MARK]
</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 QA Typing (&quot;Qtargets&quot;) </SectionTitle> <Paragraph position="0"> Previous research on question answering, e.g. Srihari and Li (2000), has shown that it is important to classify questions with respect to their answer types.</Paragraph> <Paragraph position="2"> For example, given the question &quot;How tall is Mt. Everest?&quot;, it is very useful to identify the answer type as a distance quantity, which allows us to narrow our answer search space considerably. We refer to such answer types as Qtargets.</Paragraph> <Paragraph position="3"> To build a very detailed question taxonomy, Gerber (2001) has categorized 18,000 online questions with respect to their answer type. From this we derived a set of currently 115 elementary Qtargets, such as distance quantity. For some questions, like &quot;Who is the owner of CNN?&quot;, the answer might be one of two or more distinct types of elementary Qtargets, such as proper-person or proper-organization for the ownership question. Including such combinations, the number of distinct Qtargets rises to 122.</Paragraph> <Paragraph position="4"> Here are some more examples:
Q1: How long would it take to get to Mars?
    Qtarget: temporal-quantity
Q2: When did Ferraro run for vice president?
    Qtarget: date, temp-loc-with-year; =temp-loc
Q3: Who made the first airplane?
    Qtarget: proper-person, proper-company; =proper-organization
Q4: Who was George Washington?
    Qtarget: why-famous-person
Q5: Name the second tallest peak in Europe.
    Qtarget: proper-mountain
</Paragraph> <Paragraph position="5"> Question 1 (Q1) illustrates that it is not sufficient to analyze the wh-group of a sentence, since &quot;how long&quot; can also be used for questions targeting a distance-quantity. Question 2 has a complex Qtarget, giving first preference to a date or a temporal location with a year, and second preference to a general temporal location, such as &quot;six years after she was first elected to the House of Representatives&quot;. The equal sign (=) indicates that sub-concepts of temp-loc, such as time, should be excluded from consideration at that preference level. Questions 3 and 4 are both who-questions, but with very different Qtargets.</Paragraph>
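<Paragraph> Purely as an illustration, the complex Qtarget of Q2 can be thought of as an ordered list of preference levels, each of which may restrict a concept so that its sub-concepts are excluded. The class and field names below are invented for this sketch and are not the paper's actual data structures.
    # Hypothetical sketch of a complex Qtarget as an ordered list of preference levels.
    from dataclasses import dataclass

    @dataclass
    class QtargetLevel:
        concepts: list                      # acceptable answer concepts at this preference level
        include_subconcepts: bool = True    # the "=" prefix in the paper's notation sets this to False

    # Q2: "When did Ferraro run for vice president?"
    #     Qtarget: date, temp-loc-with-year; =temp-loc
    q2_qtarget = [
        QtargetLevel(["date", "temp-loc-with-year"]),            # first preference
        QtargetLevel(["temp-loc"], include_subconcepts=False),   # "=temp-loc": sub-concepts such as time are excluded
    ]
</Paragraph>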
<Paragraph position="6"> Abstract Qtargets, such as the why-famous-person of question 4, can have a wide range of answer types, for example a prominent position or occupation, or the fact that the person invented or discovered something.</Paragraph> <Paragraph position="7"> Abstract Qtargets have one or more arguments that completely describe the question: &quot;Who was George Washington?&quot;, &quot;What was George Washington best known for?&quot;, and &quot;What made George Washington famous?&quot; all map to Qtarget why-famous-person, Qargs (&quot;George Washington&quot;). Below is a listing of all currently used abstract Qtargets:</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract Qtargets </SectionTitle> <Paragraph position="0"> Some of the Qtargets occurring only once were proper-American-football-sports-team, proper-planet, power-quantity, proper-ocean, season, color, phonenumber, proper-hotel and government-agency.</Paragraph> <Paragraph position="1"> The following Qtarget examples show the hierarchical structure of Qtargets.
1. Qtargets referring to constituents that have a particular semantic role with respect to their parent constituent.</Paragraph> <Paragraph position="2"> 2. Qtargets referring to marked-up constituents
Q: Name a film in which Jude Law acted.
Qtarget: (SLOT TITLE-P TRUE)
This type of Qtarget recommends constituents with slots that the parser can mark up. For example, the parser marks constituents that are quoted and consist of mostly and markedly capitalized content words as potential titles.</Paragraph> <Paragraph position="4"> The 122 Qtargets are computed based on a list of 276 hand-written rules.2 One reason why there are relatively few rules per Qtarget is that, given a semantic parse tree, the rules can be formulated at a high level of abstraction. For example, parse trees offer an abstraction from surface word order, and CONTEX's semantic ontology, which has super-concepts such as monetarily-quantifiable-abstract and sub-concepts such as income, surplus and tax, allows us to keep many tests relatively simple and general.</Paragraph>
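<Paragraph> To give a feel for what such a rule might look like, here is a hypothetical sketch of a single classification rule over a semantic parse tree. The node interface (child_with_role, sem_class) and the ontology.is_a test are invented for illustration; they are not the actual CONTEX rule formalism.
    # Hypothetical sketch of one Qtarget rule; thanks to the ontology super-concept,
    # a single test covers questions about income, surplus, tax, etc.
    def monetary_quantity_rule(question_tree, ontology):
        """E.g. 'What was the surplus of ...?' maps to MONETARY-QUANTITY."""
        subj = question_tree.child_with_role("SUBJ")
        if subj is not None and ontology.is_a(subj.sem_class, "monetarily-quantifiable-abstract"):
            return "MONETARY-QUANTITY"
        return None
</Paragraph>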
<Paragraph position="5"> For 10% of the TREC-8 and TREC-9 evaluation questions, there is no proper Qtarget in our current Qtarget hierarchy. Some of those questions could be covered by further enlarging and refining the Qtarget hierarchy, while others are hard to capture with a semantic super-category that would narrow the search space in a meaningful way.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> In the first two test runs, the system was trained on 2000 and 3000 Wall Street Journal sentences (enriched Penn Treebank). In runs three and four, we trained the parser with the same Wall Street Journal sentences, augmented by the 38 treebanked pre-TREC8 questions. For the fifth run, we further added the 200 TREC8 questions as training sentences when testing TREC9 questions, and the first 200 TREC9 questions as training sentences when testing TREC8 questions.</Paragraph> <Paragraph position="1"> For the final run, we divided the 893 TREC-8 and TREC-9 questions into 5 test subsets of about 179 questions each for a five-fold cross-validation experiment, in which the system was trained on 2000 WSJ sentences plus about 975 questions (all 1153 questions minus the approximately 179 test sentences held back for testing). In each of the 5 subtests, the system was then evaluated on the held-back test sentences, yielding a total of 893 test question sentences.</Paragraph> <Paragraph position="2"> The Wall Street Journal sentences contain a few questions, often from quotes, but not enough and not representative enough to result in an acceptable level of question parsing accuracy. While questions are typically shorter than newspaper sentences (making parsing easier), the word order is often markedly different, and constructions like preposition stranding (&quot;What university was Woodrow Wilson President of?&quot;) are much more common. The results in figure 1 show how crucial it is to include additional questions when training a parser, particularly with respect to Qtarget accuracy.3 With an additional 1153 treebanked questions as training input, parsing accuracy levels improve considerably for questions.</Paragraph>
3 At the time of the TREC9 evaluation in August 2000, only about 200 questions had been treebanked, including about half of the TREC8 questions (and obviously none of the TREC9 questions).
</Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Answer Candidate Parsing </SectionTitle> <Paragraph position="0"> A thorough question analysis is, however, only one part of question answering. In order to do meaningful matching of questions and answer candidates, the analysis of the answer candidate must reflect the depth of analysis of the question.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Semantic Parse Tree Enhancements </SectionTitle> <Paragraph position="0"> This means, for example, that when the question analyzer finds that the question &quot;How long does it take to fly from Washington to Hongkong?&quot; asks for a temporal quantity, the answer candidate analysis should identify any temporal quantities as such.</Paragraph> <Paragraph position="1"> Similarly, when the question targets the name of an airline, as in &quot;Which airlines offer flights from Washington to Hongkong?&quot;, it helps to have the parser identify proper airlines as such in an answer candidate sentence.</Paragraph>
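<Paragraph> As a purely illustrative sketch of this kind of constituent mark-up (not the in-house preprocessor described in the next paragraph), a crude temporal-quantity recognizer might look roughly like this:
    # Illustrative sketch only: mark number-plus-time-unit phrases as temporal quantities
    # so that they can match a temporal-quantity Qtarget.
    import re

    TIME_UNITS = r"(seconds?|minutes?|hours?|days?|weeks?|months?|years?)"
    TEMPORAL_QUANTITY = re.compile(r"\b(\d+(\.\d+)?|one|two|three|several)\s+" + TIME_UNITS + r"\b",
                                   re.IGNORECASE)

    def mark_temporal_quantities(sentence):
        """Return (span, 'temporal-quantity') pairs found in an answer candidate sentence."""
        return [(m.group(0), "temporal-quantity") for m in TEMPORAL_QUANTITY.finditer(sentence)]

    # mark_temporal_quantities("The flight takes about 16 hours.")
    #   -- [("16 hours", "temporal-quantity")]
</Paragraph>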
<Paragraph position="2"> For this we use an in-house preprocessor to identify constituents like the 13 types of quantities in section 3 and the various types of temporal locations. Our named entity tagger uses BBN's IdentiFinder(TM) (Kubala, 1998; Bikel, 1999), augmented by a named entity refinement module. For named entities (NEs), IdentiFinder provides three classes: location, organization and person. For better matching to our question categories, we need a finer granularity, in particular for location and organization:</Paragraph> <Paragraph position="3"> Location: proper-city, proper-country, proper-mountain, proper-island, proper-star-constellation, ...</Paragraph> <Paragraph position="4"> Organization: government-agency, proper-company, proper-airline, proper-university, proper-sports-team, proper-american-football-sports-team, ...</Paragraph> <Paragraph position="5"> For this refinement, we use heuristics that rely on both lexical clues, which work quite well for colleges, for example, since they often have &quot;College&quot; or &quot;University&quot; as their lexical heads, and lists of proper entities, which work particularly well for more limited classes of named entities like countries and government agencies. For many classes, like mountains, lexical clues (&quot;Mount Whitney&quot;, &quot;Humphreys Peak&quot;, &quot;Sassafras Mountain&quot;) and lists of well-known entities (&quot;Kilimanjaro&quot;, &quot;Fujiyama&quot;, &quot;Matterhorn&quot;) complement each other well. When no heuristic or background knowledge applies, the entity keeps its coarse-level designation (&quot;location&quot;).</Paragraph> <Paragraph position="6"> For other Qtargets, such as &quot;Which animals are the most common pets?&quot;, we rely on the SENSUS ontology4 (Knight and Luk, 1994), which for example includes a hierarchy of animals. The ontology allows us to conclude that the &quot;dog&quot; in an answer candidate sentence matches the Qtarget animal (while &quot;pizza&quot; does not).</Paragraph>
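<Paragraph> A minimal sketch of such an ontology test, with a tiny hypernym table standing in for SENSUS (the real ontology and its interface are much richer and are not shown here):
    # Toy stand-in for a SENSUS-style hypernym hierarchy; illustration only.
    HYPERNYMS = {
        "dog": "canine",
        "canine": "animal",
        "pizza": "food",
        "food": "substance",
    }

    def is_a(concept, target, hypernyms=HYPERNYMS):
        """True if concept equals target or target is reachable via hypernym links."""
        while concept is not None:
            if concept == target:
                return True
            concept = hypernyms.get(concept)
        return False

    # is_a("dog", "animal")    -- True:  "dog" satisfies the Qtarget animal
    # is_a("pizza", "animal")  -- False: "pizza" does not
</Paragraph>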
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Semantically Motivated Trees </SectionTitle> <Paragraph position="0"> The syntactic and semantic structures of a sentence often differ. When parsing sentences into parse trees or building treebanks, we therefore have to decide whether to represent a sentence primarily in terms of its syntactic structure, its semantic structure, something in between, or even both.</Paragraph> <Paragraph position="1"> We believe that an important criterion for this decision is what application the parse trees might be used for. As the following example illustrates, a semantic representation is much more suitable for question answering, where questions and answer candidates have to be matched. What counts in question answering is that question and answer match semantically. In previous research, we found that the semantic representation is also more suitable for machine translation applications, where syntactic properties of a sentence are often very language specific and therefore do not map well to another language.</Paragraph> <Paragraph position="2"> Parse trees [1] and [12] are examples of our system's structure, whereas [18] and [30] represent the same question/answer pair in the more syntactically oriented structure of the Penn treebank5 (Marcus 1993).</Paragraph>
5 UPenn is in the process of developing a new treebank format, which is more semantically oriented than the old one and closer to the CONTEX format described here.
<Paragraph position="3"> Question and answer in CONTEX format: more detail is given for tree [1].</Paragraph> <Paragraph position="4"> The &quot;semantic&quot; trees ([1] and [12]) have explicit roles for all constituents and a flatter structure at the sentence level, use traces more sparingly, separate syntactic categories from information such as tense, and group semantically related words even if they are non-contiguous at the surface level (e.g. the verb complex [8]). In trees [1] and [12], semantic roles match at the top level, whereas in [18] and [30], the semantic roles are distributed over several layers.</Paragraph> <Paragraph position="5"> Another example of the difference between syntactic and semantic structures is the choice of the head in a prepositional phrase (PP). For all PPs, such as &quot;on Nov. 11, 1989&quot;, &quot;capital of Albania&quot; and &quot;[composed] by Chopin&quot;, we always choose the noun phrase as the head, whereas syntactically it is clearly the preposition that heads a PP.</Paragraph> <Paragraph position="6"> We restructured and enriched the Penn treebank into such a more semantically oriented representation, and also treebanked the 1153 additional questions in this format.</Paragraph> </Section> </Section> </Paper>