<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1069">
  <Title>Toward Semantics-Based Answer Pinpointing</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. THE QA TYPOLOGY
</SectionTitle>
    <Paragraph position="0"> In order to perform pinpointing deeper than the word level, the system has to produce a representation of what the user is asking. Some previous work in automated question answering has categorized questions by question word or by a mixture of question word and the semantic class of the answer [11, 10]. To ensure full coverage of all forms of simple question and answer, and to be able to factor in deviations and special requirements, we are developing a QA Typology.</Paragraph>
    <Paragraph position="1"> We motivate the Typology (a taxonomy of QA types) as follows.</Paragraph>
    <Paragraph position="2"> There are many ways to ask the same thing: What is the age of the Queen of Holland? How old is the Netherlands' queen? How long has the ruler of Holland been alive? Likewise, there are many ways of delivering the same answer: about 60; 63 years old; since January 1938. Such variations form a sort of semantic equivalence class of both questions and answers.</Paragraph>
    <Paragraph position="3"> Since the user may employ any version of his or her question, and the source documents may contain any version(s) of the answer, an efficient system should group together equivalent question types and answer types. Any specific question can then be indexed into its type, from which all equivalent forms of the answer can be ascertained. These QA equivalence types can help with both query expansion and answer pinpointing.</Paragraph>
    <Paragraph position="4"> However, the equivalence is fuzzy; even slight variations introduce exceptions: who invented the gas laser? can be answered by both Ali Javan and a scientist at MIT, while what is the name of the person who invented the gas laser? requires the former only. This inexactness suggests that the QA types be organized in an inheritance hierarchy, allowing the answer requirements satisfying more general questions to be overridden by more specific ones 'lower down'.</Paragraph>
    <Paragraph position="5"> These considerations help structure the Webclopedia QA Typology. Instead of focusing on question word or semantic type of the answer, our classes attempt to represent the user's intention, including for example the classes Why-Famous (for Who was Christopher Columbus? but not Who discovered  America?, which is the QA type Proper-Person) and Abbreviation-Expansion (for What does HLT stand for?). In addition, the QA Typology becomes increasingly specific as one moves from the root downward.</Paragraph>
    <Paragraph position="6"> To create the QA Typology, we analyzed 17,384 questions and their answers (downloaded from answers.com); see (Gerber, in prep.). The Typology (Figure 2) contains 72 nodes, whose leaf nodes capture QA variations that can in many cases be further differentiated.</Paragraph>
    <Paragraph position="7"> Each Typology node has been annotated with examples and typical patterns of expression of both Question and Answer, using a simple template notation that expressed configurations of words and parse tree annotations (Figure 3). Question pattern information (specifically, the semantic type of the answer required, which we call a Qtarget) is produced by the CONTEX parser (Section 4) when analyzing the question, enabling it to output its guess(s) for the QA type. Answer pattern information is used by the Matcher (Section 5) to pinpoint likely answer(s) in the parse trees of candidate answer sentences.</Paragraph>
    <Paragraph position="8"> Question examples and question templates Who was Johnny Mathis' high school track coach? Who was Lincoln's Secretary of State? who be &lt;entity&gt;'s &lt;role&gt; Who was President of Turkmenistan in 1994? Who is the composer of Eugene Onegin? Who is the chairman of GE? who be &lt;role&gt; of &lt;entity&gt; Answer templates and actual answers &lt;person&gt;, &lt;role&gt; of &lt;entity&gt; Lou Vasquez, track coach of...and Johnny Mathis  At the time of the TREC-9 Q&amp;A evaluation, we had produced approx. 500 patterns by simply cross-combining approx. 20 Question patterns with approx. 25 Answer patterns. To our disappointment (Section 6), these patterns were both too specific and too few to identify answers frequently--when they applied, they were quite accurate, but they applied too seldom. We therefore started work on automatically learning QA patterns in parse trees (Section 7). On the other hand, the semantic class of the answer (the Qtarget) is used to good effect (Sections 4 and 6).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. PARSING
</SectionTitle>
    <Paragraph position="0"> CONTEX is a deterministic machine-learning based grammar learner/parser that was originally built for MT [6]. For English, parses of unseen sentences measured 87.6% labeled precision and 88.4% labeled recall, trained on 2048 sentences from the Penn Treebank. Over the past few years it has been extended to Japanese and Korean [7].</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Parsing Questions
</SectionTitle>
      <Paragraph position="0"> Accuracy is particularly important for question parsing, because for only one question there may be several answers in a large document collection. In particular, it is important to identify as specific a Qtarget as possible. But grammar rules</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
[Figure excerpt: typology/Qtarget node labels, including VERACITY (YES:NO, TRUE:FALSE), ENTITY, AGENT, NAME (LAST-NAME, FIRST-NAME), ORGANIZATION, GROUP-OF-PEOPLE, ANIMAL, PERSON, OCCUPATION-PERSON]
</SectionTitle>
    <Paragraph position="0"> for declarative sentences do not apply well to questions, which although typically shorter than declaratives, exhibit markedly different word order, preposition stranding (&amp;quot;What university was Woodrow Wilson President of?&amp;quot;), etc.</Paragraph>
    <Paragraph position="1"> Unfortunately for CONTEX, questions to train on were not initially easily available; the Wall Street Journal sentences contain a few questions, often from quotes, but not enough and not representative enough to result in an acceptable level of question parse accuracy. By collecting and treebanking, however, we increased the number of questions in the training data from 250 (for our TREC-9 evaluation version of Webclopedia) to 400 on Oct 16 to 975 on Dec 9. The effect is shown in Table 1. In the first test run (&amp;quot;[trained] without [additional questions]&amp;quot;), CONTEX was trained mostly on declarative sentences (2000 Wall Street Journal sentences, namely the enriched Penn Treebank, plus a few other non-question sentences such as imperatives and short phrases). In later runs (&amp;quot;[trained] with [add. questions]&amp;quot;), the system was trained on the same examples plus a subset of the 1153 questions we have treebanked at ISI (38 questions from the pre-TREC-8 test set, all 200 from TREC-8 and 693 TREC-9, and 222 others).</Paragraph>
    <Paragraph position="2"> The TREC-8 and TREC-9 questions were divided into 5 subsets, used in a five-fold cross validation test in which the system was trained on all but the test questions, and then evaluated on the test questions.</Paragraph>
    <Paragraph position="3"> Reasons for the improvement include (1) significantly more training data; (2) a few additional features, some more treebank cleaning, a bit more background knowledge etc.; and (3) the 251 test questions on Oct. 16 were probably a little bit harder on average, because a few of the TREC-9 questions initially treebanked (and included in the October figures) were selected for early treebanking because they represented particular challenges, hurting subsequent Qtarget processing.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Parsing Potential Answers
</SectionTitle>
      <Paragraph position="0"> The semantic type ontology in CONTEX was extended to include 115 Qtarget types, plus some combined types; more details in [8]. Beside the Qtargets that refer to concepts in CONTEX's concept ontology (see first example below), Qtargets can also refer to part of speech labels (first example), to constituent roles or slots of parse trees (second and third examples), and to more abstract nodes in the QA Typology (later examples). For questions with the Qtargets Q-WHY-FAMOUS, Q-WHY-FAMOUS-PERSON, Q-SYNONYM, and others, the parser also provides Qargs--information helpful for matching (final examples).</Paragraph>
      <Paragraph position="1"> Semantic ontology types (I-EN-CITY) and part of speech labels (S-PROPER-NAME):  These Qtargets are determined during parsing using 276 hand-written rules. Still, for approx. 10% of the TREC-8&amp;9 questions there is no easily determinable Qtarget (&amp;quot;What does the Peugeot company manufacture?&amp;quot;; &amp;quot;What is caliente in English?&amp;quot;). Strategies for dealing with this are under investigation. More details appear in (Hermjakob, 2001). The current accuracy of the parser on questions and resulting Qtargets sentences is shown in Table 2.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. ANSWER MATCHING
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6. RESULTS
</SectionTitle>
    <Paragraph position="0"> We entered the TREC-9 short form QA track, and received an overall Mean Reciprocal Rank score of 0.318, which put Webclopedia in essentially tied second place with two others.</Paragraph>
    <Paragraph position="1"> (The best system far outperformed those in second place.) In order to determine the relative performance of the modules, we counted how many correct answers their output contained, working on our training corpus. Table 3 shows the evolution of the system over a sample one-month period, reflecting the amount of work put into different modules. The modules QA pattern, Qtarget, Qword, and Window were all run in parallel from the same Ranker output.</Paragraph>
    <Paragraph position="2"> The same pattern, albeit with lower scores, occurred in the TREC test (Table 4). The QA patterns made only a small contribution, the Qtarget made by far the largest contribution, and, interestingly, the word-level window match lay somewhere in between.</Paragraph>
    <Paragraph position="3">  We are pleased with the performance of the Qtarget match. This shows that CONTEX is able to identify to some degree the semantic type of the desired answer, and able to pinpoint these types also in candidate answers. The fact that it outperforms the window match indicates the desirability of looking deeper than the surface level. As discussed in Section 4, we are strengthening the parser's ability to identify Qtargets.</Paragraph>
    <Paragraph position="4"> We are disappointed in the performance of the 500 QA patterns. Analysis suggests that we had too few patterns, and the ones we had were too specific. When patterns matched, they were rather accurate, both in finding correct answers and more precisely pinpointing the boundaries of answers. However, they were too sensitive to variations in phrasing. Furthermore, it was difficult to construct robust and accurate question and answer phraseology patterns manually, for several reasons. First, manual construction relies on the inventiveness of the pattern builder to foresee variations of phrasing, for both question and answer. It is however nearly impossible to think of all possible variations when building patterns.</Paragraph>
    <Paragraph position="5"> Second, it is not always clear at what level of representation to formulate the pattern: when should one specify using words? Parts of speech? Other parse tree nodes? Semantic classes? The patterns in Figure 3 include only a few of these alternatives. Specifying the wrong elements can result in non-optimal coverage. Third, the work is simply tedious. We therefore decided to try to learn QA patterns automatically.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7. TOWARD LEARNING QA PATTERNS
AUTOMATICALLY
</SectionTitle>
    <Paragraph position="0"> To learn corresponding question and answer expressions, we pair up the parse trees of a question and (each one of) its answer(s). We then apply a set of matching criteria to identify potential corresponding portions of the trees. We then use the EM algorithm to learn the strengths of correspondence combinations at various levels of representation. This work is still in progress.</Paragraph>
    <Paragraph position="1"> In order to learn this information we observe the truism that there are many more answers than questions. This holds for the two QA corpora we have access to--TREC and an FAQ website (since discontinued). We therefore use the familiar version of the Noisy Channel Model and Bayes' Rule. For each basic QA</Paragraph>
    <Paragraph position="3"> all trees (# nodes that may express a true A) / (number of nodes in tree)</Paragraph>
    <Paragraph position="5"> all QA tree pairs (number of covarying nodes in Q and A trees) / (number of nodes in A tree) As usual, many variations are possible, including how to determine likelihood of expressing a true answer; whether to consider all nodes or just certain major syntactic ones (N, NP, VP, etc.); which information within each node to consider (syntactic? semantic? lexical?); how to define 'covarying information'--node identity? individual slot value equality?; what to do about the actual answer node in the A trees; if (and how) to represent the relationships among A nodes that have been found to be important; etc. Figure 4 provides an answer parse tree that indicates likely Location nodes, determined by appropriate syntactic class, semantic type, and syntactic role in the sentence.</Paragraph>
    <Paragraph position="6"> Our initial model focuses on bags of corresponding QA parse tree nodes, and will help to indicate for a given question what type of node(s) will contain the answer. We plan to extend this model to capture structured configurations of nodes that, when matched to a question, will help indicate where in the parse tree of a potential answer sentence the answer actually lies. Such bags or structures of nodes correspond, at the surface level, to important phrases or words. However, by using CONTEX output we abstract away from the surface level, and learn to include whatever syntactic and/or semantic information is best suited for predicting likely answers.</Paragraph>
  </Section>
class="xml-element"></Paper>