<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1073">
  <Title>Question Answering using Constraint Satisfaction: QA-by-Dossier-with-Constraints</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Introducing QDC
</SectionTitle>
    <Paragraph position="0"> QA-by-Dossier-with-Constraints is an extension of our on-going work called QA-by-Dossier (QbD) (Prager et al., 2004). In QbD, definitional questions of the form &amp;quot;Who/What is X&amp;quot; are answered by asking a set of specific factoid questions about properties of X. If X is a person, for example, these auxiliary questions may be about important dates and events in the person's life-cycle, as well as his/her achievements. Likewise, question sets can be developed for other entities such as organizations, places and things.</Paragraph>
    <Paragraph position="1"> QbD employs the notion of follow-on questions.</Paragraph>
    <Paragraph position="2"> Given an answer to a first-round question, the system can ask more specific questions based on that knowledge. For example, on discovering a person's profession, it can ask occupation-specific follow-on questions: if it finds that people are musicians, it can ask what they have composed, if it finds they are explorers, then what they have discovered, and so on.</Paragraph>
    <Paragraph position="3"> QA-by-Dossier-with-Constraints extends this approach by capitalizing on the fact that a set of answers about a subject must be mutually consistent, with respect to constraints such as time and geography. The essence of the QDC approach is to return initially not just the best answer to each appropriately selected factoid question but the top n answers (we use n=5), and then to choose from this set the highest-confidence answer combination that satisfies consistency constraints.</Paragraph>
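The selection step just described can be sketched as an exhaustive search over the top-n candidates per question. This is an illustrative sketch, not the authors' implementation: the function names, the additive scoring rule, and the toy candidate lists are our assumptions.

```python
from itertools import product

def best_consistent(candidate_lists, consistent):
    """candidate_lists: one list of (value, confidence) pairs per question.
    Return the highest-scoring combination whose values satisfy the
    consistency predicate, or None if no combination does."""
    best, best_score = None, float("-inf")
    for combo in product(*candidate_lists):
        values = [v for v, _ in combo]
        if not consistent(values):
            continue
        score = sum(c for _, c in combo)  # one plausible combination rule
        if score > best_score:
            best, best_score = combo, score
    return best

# Toy top answers for "born?", "died?", "painted?" (confidences invented)
born = [("1452", 0.4), ("1519", 0.3)]
died = [("1519", 0.5), ("1452", 0.2)]
painted = [("1911", 0.6), ("1503", 0.1)]

# A painting must fall within the painter's lifetime
ok = lambda vs: int(vs[0]) < int(vs[2]) < int(vs[1])
print(best_consistent([born, died, painted], ok))
# → (('1452', 0.4), ('1519', 0.5), ('1503', 0.1))
```

Note that the low-confidence painting date wins once the high-confidence but inconsistent one is ruled out; with n=5 and a handful of questions the exhaustive product remains small.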
    <Paragraph position="4"> We illustrate this idea with the example &amp;quot;When did Leonardo da Vinci paint the Mona Lisa?&amp;quot;. Table 1 shows our system's top answers to this question, with associated confidence scores. The correct answer is &amp;quot;1503&amp;quot;, which is in 4th place, with a low confidence score. Using QA-by-Dossier, we ask two related questions: &amp;quot;When was Leonardo da Vinci born?&amp;quot; and &amp;quot;When did Leonardo da Vinci die?&amp;quot; The answers to these auxiliary questions are shown in Table 2.</Paragraph>
    <Paragraph position="5"> Given common knowledge about a person's life expectancy and that a painting must be produced while its author is alive, we observe that the best dates proposed in Table 2 consistent with one another are that Leonardo da Vinci was born in 1452, died in 1519, and painted the Mona Lisa in 1503.</Paragraph>
    <Paragraph position="6"> [The painting date of 1490 also satisfies the constraints, but with a lower confidence.] We will examine the exact constraints used a little later. This example illustrates how the use of auxiliary questions helps constrain answers to the original question, and promotes correct answers with initial low confidence scores. As a side-effect, a short dossier is produced.</Paragraph>
    <Paragraph position="7"> Table 2: Answers to &amp;quot;When was Leonardo da Vinci born?&amp;quot; and &amp;quot;When did Leonardo da Vinci die?&amp;quot;.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Reciprocal Questions
</SectionTitle>
      <Paragraph position="0"> QDC also employs the notion of reciprocal questions. These are a type of follow-on question used solely to provide constraints, and do not add to the dossier. The idea is simply to double-check the answer to a question by inverting it, substituting the first-round answer and hoping to get the original subject back. For example, to double-check &amp;quot;Sacramento&amp;quot; as the answer to &amp;quot;What is the capital of California?&amp;quot; we would ask &amp;quot;Of what state is Sacramento the capital?&amp;quot;. The reciprocal question would be asked of all of the candidate answers, and the confidences of the answers to the reciprocal questions would contribute to the selection of the optimum answer. We will discuss later how this reciprocation may be done automatically. In a separate study of reciprocal questions (Prager et al., 2004), we demonstrated an increase in precision from .43 to .95, with only a 30% drop in recall.</Paragraph>
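The reciprocal check above can be sketched as a rescoring pass: each candidate's confidence is boosted by the confidence with which the reciprocal question returns the original subject. The function names, the additive boost, and the toy data are our assumptions, not the system's actual scoring.

```python
def rescore_with_reciprocal(candidates, subject, ask_reciprocal):
    """Add to each candidate's confidence the confidence with which the
    reciprocal question returns the original subject (0.0 if it does not).
    ask_reciprocal(value) -> list of (answer, confidence) pairs."""
    rescored = []
    for value, conf in candidates:
        recip = dict(ask_reciprocal(value))
        rescored.append((value, conf + recip.get(subject, 0.0)))
    return sorted(rescored, key=lambda vc: vc[1], reverse=True)

# Toy data for "What is the capital of California?" (confidences invented)
cands = [("Los Angeles", 0.5), ("Sacramento", 0.4)]
recip_answers = {"Los Angeles": [("California", 0.05)],
                 "Sacramento": [("California", 0.9)]}
ranked = rescore_with_reciprocal(cands, "California",
                                 lambda v: recip_answers.get(v, []))
print(ranked[0][0])  # → Sacramento
```

Here the initially second-ranked answer overtakes the first because only it round-trips back to the original subject with high confidence.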
      <Paragraph position="1"> Although the reciprocal questions seem to be symmetrical and thus redundant, their power stems from the differences in the search for answers inherent in our system. The search is primarily based on the expected answer type (STATE vs. CAPITAL in the above example). This results in different document sets being passed to the answer selection module. Subsequently, the answer selection module works with a different set of syntactic and semantic relationships, and the process of asking a reciprocal question ends up looking more like the process of asking an independent one. The only difference between this and the &amp;quot;regular&amp;quot; QDC case is in the type of constraint applied to resolve the resulting answer set.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Applying QDC
</SectionTitle>
      <Paragraph position="0"> In order to automatically apply QDC during question answering, several problems need to be addressed. First, criteria must be developed to determine when this process should be invoked.</Paragraph>
      <Paragraph position="1"> Second, we must identify the set of question types that would potentially benefit from such an approach, and, for each question type, develop a set of auxiliary questions and appropriate constraints among the answers. Third, for each question type, we must determine how the results of applying constraints should be utilized.</Paragraph>
      <Paragraph position="2">  To address these questions we must distinguish between &amp;quot;planned&amp;quot; and &amp;quot;ad-hoc&amp;quot; uses of QDC. For answering definitional questions (&amp;quot;Who/what is X?&amp;quot;) of the sort used in TREC2003, in which collections of facts can be gathered by QA-by-Dossier, we can assume that QDC is always appropriate. By defining broad enough classes of entities for which these questions might be asked (e.g. people, places, organizations and things, or major subclasses of these), we can manually establish, once and for all, a set of auxiliary questions for QbD and constraints for QDC for each of these classes. This is the approach we have taken in the experiments reported here. We are currently working on automatically learning effective auxiliary questions for some of these classes.</Paragraph>
      <Paragraph position="3"> In a more ad-hoc situation, we might imagine that a simple variety of QDC will be invoked using solely reciprocal questions whenever the difference between the scores of the first and second answer is below a certain threshold.</Paragraph>
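Such a trigger is simple to state in code. This is a sketch under our own assumptions: the paper does not give a threshold value, so the one below is invented.

```python
def should_invoke_qdc(answers, gap_threshold=0.1):
    """answers: (value, score) pairs, best first. Trigger the ad-hoc
    reciprocal-question check when the race is too close to call.
    The threshold value is an assumption, not taken from the paper."""
    if len(answers) < 2:
        return False
    return answers[0][1] - answers[1][1] < gap_threshold

print(should_invoke_qdc([("1503", 0.30), ("1911", 0.28)]))        # → True
print(should_invoke_qdc([("Sacramento", 0.90), ("Fresno", 0.20)]))  # → False
```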
      <Paragraph position="4">  We posit three methods of generating auxiliary question sets:
o By hand
o Through a structured repository, such as a knowledge-base of real-world information
o Through statistical techniques tied to a machine-learning algorithm and a text corpus.</Paragraph>
      <Paragraph position="5"> We think that all three methods are appropriate, but we initially concentrate on the first for practical reasons. Most TREC-style factoid questions are about people, places, organizations, and things, and we can generate generic auxiliary question sets for each of these classes. Moreover, the purpose of this paper is to explain the QDC methodology and to investigate its value.</Paragraph>
      <Paragraph position="6">  The constraints that apply to a given situation can be naturally represented in a network, and we find it useful for visualization purposes to depict the constraints graphically. In such a graph the entities and values are represented as nodes, and the constraints and questions as edges.</Paragraph>
      <Paragraph position="7"> It is not clear how feasible, or desirable, it is to automatically develop such constraint networks (other than the simple one for reciprocal questions), since so much real-world knowledge seems to be required. To illustrate, let us look at the constraints required for the earlier example. (A more complex constraint system is used in our experiments described later.) For our Leonardo da Vinci example, the set of constraints applied can be expressed as follows1:</Paragraph>
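One plausible encoding of life-cycle constraints of this kind is given below. The numeric thresholds (minimum working age, maximum lifespan) are our assumed middle-ground values, not the paper's exact formulas.

```python
def lifecycle_consistent(born, painted, died, min_age=10, max_life=100):
    """Middle-ground life-cycle constraints of the kind described: the work
    falls within the subject's (roughly adult) lifetime and the lifespan is
    bounded. Thresholds are our assumptions, not the paper's values."""
    return born + min_age <= painted <= died and 0 < died - born <= max_life

print(lifecycle_consistent(1452, 1503, 1519))  # → True
print(lifecycle_consistent(1452, 1911, 1519))  # → False
```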
      <Paragraph position="9"> The corresponding graphical representation is shown in the accompanying figure. Although the constraints betray a certain arbitrariness, we found it a useful practice to find a middle ground between the absolute minima or maxima that the values can achieve and their likely values. Furthermore, although these constraints are manually derived for our prototype system, they are fairly general for the human life-cycle and can easily be reused for other, similar questions or for more complex dossiers. We also note that even though a constraint network might have been inspired by and centered around a particular question, once the network is established, any question employed in it could be the end-user question that triggers it.</Paragraph>
      <Paragraph position="10"> There exists the (general) problem of when more than one set of answers satisfies our constraints. Our approach is to combine the first-round scores of the individual answers to provide a score for the dossier as a whole. There are several ways to do this, and we found experimentally that it does not appear critical exactly how this is done. In the example in the evaluation we mention one particular combination algorithm.</Paragraph>
      <Paragraph position="11">  There is an unlimited number of possible constraint networks that can be constructed. We have experimented with the following: Timelines. People and even artifacts have life-cycles; the examples in this paper exploit these. (Footnote 1: Painting is only an example of an activity in these constraints. Any other achievement that is usually associated with adulthood can be used.)</Paragraph>
      <Paragraph position="12"> Geographic (&amp;quot;Where is X&amp;quot;). Neighboring entities are in the same part of the world.</Paragraph>
      <Paragraph position="13"> Kinship (&amp;quot;Who is married to X&amp;quot;). Most kinship relationships have named reciprocals, e.g. husband-wife, parent-child, and cousin-cousin. Even though these are not in practice one-to-one relationships, we can take advantage of sufficiency even if necessity is not entailed.</Paragraph>
      <Paragraph position="14"> Definitional (&amp;quot;What is X?&amp;quot;, &amp;quot;What does XYZ stand for?&amp;quot;) For good definitions, a term and its definition are interchangeable.</Paragraph>
      <Paragraph position="15"> Part-whole. Sizes of parts are no bigger than sizes of wholes. This fact can be used for populations, areas, etc.</Paragraph>
      <Paragraph position="16">  We performed a manual examination of the 500 TREC2002 questions to see for how many of these questions the QDC framework would apply. Being a manual process, these numbers provide an upper bound on how well we might expect a future automatic process to work.</Paragraph>
      <Paragraph position="17"> We noted that for 92 questions (18%) a non-trivial constraint network of the above kinds would apply. For a total of 454 questions (91%), a simple reciprocal constraint could be generated. However, for 61 of those, the reciprocal question was sufficiently non-specific that the sought reciprocal answer was unlikely to be found in a reasonably-sized hit-list. For example, the reciprocal question to &amp;quot;How did Mickey Mantle die?&amp;quot; would be &amp;quot;Who died of cancer?&amp;quot; However, we can imagine using other facts in the dossier to craft the question, e.g. &amp;quot;What famous baseball player (or Yankees player) died of cancer?&amp;quot;, giving us a much better chance of success. For the simple reciprocation, though, subtracting these doubtful instances leaves 79% of the questions appearing to be good candidates for QDC.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Test set generation
</SectionTitle>
      <Paragraph position="0"> To evaluate QDC, we had our system develop dossiers of people in the creative arts, unseen in previous TREC questions. However, we wanted to use the personalities in past TREC questions as independent indicators of appropriate subject matter.</Paragraph>
      <Paragraph position="1"> Therefore we collected all of the &amp;quot;creative&amp;quot; people in the TREC9 question set, and divided them up into groups of similar individuals.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
[Figure residue: constraint-graph node labels &amp;quot;Birthdate&amp;quot;, &amp;quot;Deathdate&amp;quot;, &amp;quot;Leonardo&amp;quot;, &amp;quot;Painting&amp;quot;]
</SectionTitle>
      <Paragraph position="0"> The groups included, for example, painters such as Vincent Van Gogh - twelve such groupings in all. For each set, we entered the individuals into Google Sets (http://labs.google.com/sets), which finds &amp;quot;similar&amp;quot; entities to the ones entered. For example, from our set of male singers it found: Elton John, Sting, Garth Brooks, James Taylor, Phil Collins, Melissa Etheridge, Alanis Morissette, Annie Lennox, Jackson Browne, Bryan Adams, Frank Sinatra and Whitney Houston.</Paragraph>
      <Paragraph position="1"> Altogether, we gathered 276 names of creative individuals this way, after removing duplicates, items that were not names of individuals, and names that did not occur in our test corpus (the AQUAINT corpus). We then used our system manually to help us develop &amp;quot;ground truth&amp;quot; for a randomly selected subset of 109 names. This ground truth served both as training material and as an evaluation key. We split the 109 names randomly into a set of 52 for training and 57 for testing. The training process used a hill-climbing method to find optimal values for three internal rejection thresholds. In developing the ground truth we might have missed some instances of assertions we were looking for, so the reported recall (and hence F-measure) figures should be considered to be upper bounds, but we believe the calculated figures are not far from the truth.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 QDC Operation
</SectionTitle>
      <Paragraph position="0"> The system first asked three questions for each subject X: In what year was X born? In what year did X die? What compositions did X have? The third of these triggers our named-entity type COMPOSITION that is used for all kinds of titled works - books, films, poems, music, plays and so on, and also quotations. Our named-entity recognizer has rules to detect works of art by phrases that are in apposition to &amp;quot;the film ...&amp;quot; or &amp;quot;the book ...&amp;quot; etc., and also captures any short phrase in quotes beginning with a capital letter. The particular question phrasing we used does not commit us to any specific creative verb. This is of particular importance since it very frequently happens in text that titled works are associated with their creators by means of a possessive or parenthetical construction, rather than subject-verb-object.</Paragraph>
      <Paragraph position="1"> The top five answers, with confidences, are returned for the born and died questions (subject to also passing a confidence threshold test). The compositions question is treated as a list question, meaning that all answers that pass a certain threshold are returned. For each such returned work Wi, two additional questions are asked: What year did X have Wi? Who had Wi? The top 5 answers to each of these are returned, again as long as they pass a confidence threshold. We added a sixth answer &amp;quot;NIL&amp;quot; to each of the date sets, with a confidence equal to the rejection threshold. (NIL is the code used in TREC ever since TREC10 to indicate the assertion that there is no answer in the corpus.) We used a two-stage constraint-satisfaction process: Stage 1: For each work Wi for subject X, we added its original confidence to the confidence of the answer X in the answer set of the reciprocal question (if it existed; otherwise we added zero). If the total did not exceed a learned threshold (.50), the work was rejected.</Paragraph>
      <Paragraph position="2"> Stage 2. For each subject, with the remaining candidate works we generated all possible combinations of the date answers. We rejected any combination that did not satisfy the following constraints:</Paragraph>
      <Paragraph position="4"> The apparent redundancy here is because of the potential NIL answers for some of the date slots.</Paragraph>
      <Paragraph position="5"> We also rejected combinations of works whose years spanned more than 100 years (in case there were no BORN or DIED dates). In performing these constraint calculations, NIL satisfied every test by fiat. The constraint network we used is depicted in the accompanying figure. We used as a test corpus the AQUAINT corpus used in TREC-QA since 2002. Since this was not the same corpus from which the test questions were generated (the Web), we acknowledged that there might be some difference in the most common spelling of certain names, but we made no attempt to correct for this. Neither did we attempt to normalize, translate or aggregate the names of the titled works that were returned, so that, for example, variant renderings of the same title (e.g. the Clavier/Klavier spellings of &amp;quot;Well-Tempered Clavier&amp;quot;) were treated as different. Since only individuals were used in the question set, we did not have instances of problems we saw in training, such as where an ensemble (such as The Beatles) created a certain piece, which in turn via the reciprocal question was found to have been written by a single person (Paul McCartney). The reverse situation was still possible, but we did not handle it. We foresee a future version of our system having knowledge of ensembles and their composition, thus removing this restriction. In general, a variety of ontological relationships could occur between the original individual and the discovered performer(s) of the work. We generated answer keys by reading the passages that the system had retrieved and from which the answers were generated, to determine &amp;quot;truth&amp;quot;. In cases of absent information in these passages, we did our own corpus searches. This of course made the evaluation of recall only relative, since we could not guarantee that we had found all existing instances.</Paragraph>
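Stage 1 of the process described above can be sketched as a threshold filter. The function names and toy data are ours; only the .50 threshold comes from the text.

```python
def stage1_filter(works, reciprocal_conf, threshold=0.50):
    """works: (title, confidence) pairs from the list question.
    reciprocal_conf(title): confidence of the original subject among the
    answers to 'Who had <title>?' (0.0 if absent). Keep a work only if the
    combined confidence exceeds the learned threshold (.50 in the paper)."""
    return [(w, c) for w, c in works if c + reciprocal_conf(w) > threshold]

works = [("Mona Lisa", 0.35), ("Starry Night", 0.40)]  # toy confidences
recip = {"Mona Lisa": 0.30}
print(stage1_filter(works, lambda w: recip.get(w, 0.0)))
# → [('Mona Lisa', 0.35)]
```

Stage 2 then enumerates date combinations for the surviving works and applies the life-cycle constraints, as in the earlier selection sketch.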
      <Paragraph position="6"> We encountered some grey areas, e.g., if a painting appeared in an exhibition or if a celebrity endorsed a product, then should the exhibition's or product's name be considered an appropriate &amp;quot;work&amp;quot; of the artist? The general perspective adopted was that we were not establishing or validating the nature of the relationship between an individual and a creative work, but rather its existence. We answered &amp;quot;yes&amp;quot; if we subjectively felt the association to be both very strong and with the individual's participation - for example, Pamela Anderson and Playboy. However, books/plays about a person or dates of performances of one's work were considered incorrect. As we shall see, these decisions would not have a big impact on the outcome.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Effect of Constraints
</SectionTitle>
      <Paragraph position="0"> The answers collected from these two rounds of questions can be regarded as assertions about the subject X. By applying constraints, two possible effects can occur to these assertions:  1. Some works can get thrown out.</Paragraph>
      <Paragraph position="1"> 2. An asserted date (which was the top candidate from its associated question) can get replaced by a candidate date originally in positions 2-6 (where sixth place is NIL). Effect #1 is expected to increase precision at the risk of worsening recall; effect #2 can go either way. We note that NIL, which is only used for dates, can be the correct answer if the desired date assertion is absent from the corpus; NIL is considered a &amp;quot;value&amp;quot; in this evaluation.</Paragraph>
      <Paragraph position="2"> By inspection, performances and other indirect works (discussed in the previous section) were usually associated with the correct artist, so our decision to remove them from consideration resulted in a decrease in both the numerator and denominator of the precision and recall calculations, resulting in a minimal effect.</Paragraph>
      <Paragraph position="3"> The results of applying QDC to the 57 test individuals are summarized in Table 3. The baseline assertions for individual X were those produced by QbD alone. Two calculations of P/R/F are made, depending on whether the averaging is done over the whole set or first by individual; the results are very similar. The QDC assertions were the same as those for QbD, but reflecting the following effects:
o Some {Wi, date} pairs were thrown out (3 out of 14 on average)
o Some dates in positions 2-6 moved up (applicable to birth, death and work dates)
The results show improvement in both precision and recall, corresponding to a 75-80% relative increase in F-measure.</Paragraph>
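The two averaging schemes mentioned can be sketched as follows: micro-averaging pools true/false positives and false negatives over all assertions, while macro-averaging computes P/R/F per individual and then averages. The counts below are invented for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro(counts):
    """counts: list of (tp, fp, fn) per individual, pooled before scoring."""
    tp, fp, fn = map(sum, zip(*counts))
    return prf(tp, fp, fn)

def macro(counts):
    """Score each individual separately, then average the three measures."""
    scores = [prf(*c) for c in counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

counts = [(8, 2, 2), (3, 1, 4)]  # toy per-individual counts
print(micro(counts))
print(macro(counts))
```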
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> This exposition of QA-by-Dossier-with-Constraints is very short and undoubtedly leaves many questions unanswered. We have not presented a precise method for computing the QDC scores.</Paragraph>
    <Paragraph position="1"> One way to formalize this process would be to treat it as evidence gathering and interpret the results in a Bayesian-like fashion. The original system confidences would represent prior probabilities reflecting the system's belief that the answers are correct. As more evidence is found, the confidences would be updated to reflect the changed likelihood that an answer is correct.</Paragraph>
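One simple instantiation of this evidence-combination idea (our own illustration, not a method given in the paper) treats each confirming auxiliary answer as an independent likelihood-ratio update on the prior confidence, working in odds space.

```python
def update(prior, likelihood_ratios):
    """Bayesian-style update: convert the prior confidence to odds, multiply
    by one likelihood ratio per piece of evidence, convert back.
    Ratios > 1 are confirming evidence; < 1 disconfirming."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# A weak initial answer (0.2) supported by two confirming auxiliary answers
print(update(0.2, [3.0, 2.0]))  # → 0.6
```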
    <Paragraph position="2"> We do not know a priori how much &amp;quot;slop&amp;quot; should be allowed in enforcing the constraints, since auxiliary questions are as likely to be answered incorrectly as the original ones. A further problem is to determine the best metric for evaluating such approaches, which is a question for QA in general.</Paragraph>
    <Paragraph position="3"> The task of generating auxiliary questions and constraint sets is a matter of active research. Even for simple questions like the ones considered here, the auxiliary questions and constraints we looked at were different and manually chosen. Hand-crafting a large number of such sets might not be feasible, but it is certainly possible to build a few for common situations, such as a person's life-cycle. More generally, QDC could be applied to situations in which a certain structure is induced by natural temporal (our Leonardo example) and/or spatial constraints, or by properties of the relation mentioned in the question (evaluation example). Temporal and spatial constraints appear general to all relevant question types, and include relations of precedence, inclusion, etc.</Paragraph>
    <Paragraph position="4"> For certain relationships, there are naturally-occurring reciprocals (if X is married to Y, then Y is married to X; if X is a child of Y then Y is a parent of X; compound-term to acronym and vice versa).</Paragraph>
    <Paragraph position="5"> Transitive relationships (e.g. greater-than, locatedin, etc.) offer the immediate possibility of constraints, but this avenue has not yet been explored.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Automatic Generation of Reciprocal Questions
</SectionTitle>
      <Paragraph position="0"> While not done in the work reported here, we are looking at generating reciprocal questions automatically. Consider the following transformations:
&amp;quot;What is the capital of California?&amp;quot; -&gt; &amp;quot;Of what state is &lt;candidate&gt; the capital?&amp;quot;
&amp;quot;What is Frank Sinatra's nickname?&amp;quot; -&gt; &amp;quot;Whose (or what person's) nickname is &lt;candidate&gt;?&amp;quot;
&amp;quot;How deep is Crater Lake?&amp;quot; -&gt; &amp;quot;What (or what lake) is &lt;candidate&gt; deep?&amp;quot;
&amp;quot;Who won the Oscar for best actor in 1970?&amp;quot; -&gt; &amp;quot;In what year did &lt;candidate&gt; win the Oscar for best actor?&amp;quot; (and/or &amp;quot;What award did &lt;candidate&gt; win in 1970?&amp;quot;)
These are precisely the transformations necessary to generate the auxiliary reciprocal questions from the given original questions and candidate answers to them. Such a process requires identifying an entity in the question that belongs to a known class, and substituting the class name for the entity. This entity is made the subject of the question, the previous subject (or trace) being replaced by the candidate answer. We are looking at parse-tree rather than string transformations to achieve this. This work will be reported in a future paper.</Paragraph>
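A parse-tree transformer is beyond a short sketch, but the string-level version of such transformations can be illustrated with patterns. The rules and names below are our illustration only; the text above explicitly prefers parse-tree transformations over string ones.

```python
import re

# Simplified pattern-based sketch; each rule pairs a question pattern
# with a reciprocal-question template taking the candidate answer.
RULES = [
    (re.compile(r"What is the capital of (?P<ent>.+)\?"),
     "Of what state is {cand} the capital?"),
    (re.compile(r"What is (?P<ent>.+)'s nickname\?"),
     "Whose nickname is {cand}?"),
]

def reciprocal(question, candidate):
    """Return a reciprocal question for the first matching rule, else None."""
    for pattern, template in RULES:
        if pattern.match(question):
            return template.format(cand=candidate)
    return None

print(reciprocal("What is the capital of California?", "Sacramento"))
# → Of what state is Sacramento the capital?
```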
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Final Thoughts
</SectionTitle>
      <Paragraph position="0"> Despite these open questions, initial trials with QA-by-Dossier-with-Constraints have been very encouraging, whether it is by correctly answering previously missed questions, or by improving confidences of correct answers. An interesting question is when it is appropriate to apply QDC. Clearly, if the base QA system is too poor, then the answers to the auxiliary questions will be useless; if the base system is highly accurate, the increase in accuracy will be negligible. Thus our approach seems most beneficial to middle-performance levels, which, by inspection of TREC results for the last 5 years, is where the leading systems currently lie.</Paragraph>
      <Paragraph position="1"> We had initially thought that use of constraints would obviate the need for much of the complexity inherent in NLP. As mentioned earlier, with the case of &amp;quot;The Beatles&amp;quot; being the reciprocal answer to the auxiliary composition question to &amp;quot;Who is Paul McCartney?&amp;quot;, we see that structured, ontological information would benefit QDC. Identifying alternate spellings and representations of the same name (e.g. Clavier/Klavier, but also taking care of variations in punctuation and completeness) is also necessary. When we asked &amp;quot;Who is Ian Anderson?&amp;quot;, having in mind the singer-flautist for the Jethro Tull rock band, we found that he is not only that, but also the community investment manager of the English conglomerate Whitbread, the executive director of the U.S. Figure Skating Association, a writer for New Scientist, an Australian medical advisor to the WHO, and the general sales manager of Houseman, a supplier of water treatment systems. Thus the problem of word sense disambiguation has returned in a particularly nasty form. To be fully effective, QDC must be configured not just to find a consistent set of properties, but a number of independent sets that together cover the highest-confidence returned answers3. Altogether, we see that some of the very problems we aimed to skirt are still present and need to be addressed. However, we have shown that even disregarding these issues, QDC was able to provide substantial improvement in accuracy.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Summary
</SectionTitle>
    <Paragraph position="0"> We have presented a method to improve the accuracy of a QA system by asking auxiliary questions for which natural constraints exist. Using these constraints, sets of mutually consistent answers can be generated. We have explored questions in the biographical area, and identified other areas of applicability. We have found that our methodology exhibits a double advantage: not only can it improve QA accuracy, but it can also return a set of mutually-supporting assertions about the topic of the original question. We have identified many open questions and areas of future work, but despite these gaps, we have shown an example scenario where QA-by-Dossier-with-Constraints can improve the F-measure by over 75%.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Footnotes
</SectionTitle>
    <Paragraph position="0"> 3. Possibly the smallest number of sets that provide such coverage.</Paragraph>
  </Section>
</Paper>