<?xml version="1.0" standalone="yes"?> <Paper uid="H01-1006"> <Title>Answering What-Is Questions by Virtual Annotation</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 RESULTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation </SectionTitle> <Paragraph position="0"> We evaluated Virtual Annotation on two sets of questions - the definitional questions from TREC9 and similar kinds of questions from the Excite query log (see http://www.excite.com). In both cases we were looking for definitional text in the TREC corpus. The TREC questions had been previously verified (by NIST) to have answers there; the Excite questions had no such guarantee. We started with 174 Excite questions of the form &quot;What is X&quot;, where X was a 1- or 2-word phrase. We removed those questions that we felt would not have been acceptable as TREC9 questions. These were questions where: o The query terms did not appear in the TREC corpus, and some may not even have been real words (e.g.</Paragraph> <Paragraph position="1"> &quot;What is a gigapop&quot;).1 37 questions.</Paragraph> <Paragraph position="2"> o The query terms were in the corpus, but there was no definition present (e.g &quot;What is a computer monitor&quot;).2 18 questions.</Paragraph> <Paragraph position="3"> o The question was not asking about the class of the term but how to distinguish it from other members of the class (e.g. &quot;What is a star fruit&quot;). 17 questions. o The question was about computer technology that emerged after the articles in the TREC corpus were written (e.g. &quot;What is a pci slot&quot;). 19 questions. o The question was very likely seeking an example, not a definition (e.g. &quot;What is a powerful adhesive&quot;). 1 question plus maybe some others - see the Discussion questions for which there is no answer in the corpus (deliberately). While it is important for a system to be able to make this distinction, we kept within the TREC9 framework for this evaluation.</Paragraph> <Paragraph position="4"> section later. How to automatically distinguish these cases is a matter for further research.</Paragraph> <Paragraph position="5"> Of the remaining 82 Excite questions, 13 did not have entries in WordNet. We did not disqualify those questions.</Paragraph> <Paragraph position="6"> For both the TREC and Excite question sets we report two evaluation measures. In the TREC QA track, 5 answers are submitted per question, and the score for the question is the reciprocal of the rank of the first correct answer in these 5 candidates, or 0 if the correct answer is not present at all. A submission's overall score is the mean reciprocal rank (MRR) over all questions. We calculate MRR as well as mean binary score (MBS) over the top 5 candidates; the binary score for a question is We see that for the 24 TREC9 definitional questions, our MRR score with VA was the same as the MBS score. This was because for each of the 20 questions where the system found a correct answer, it was in the top position.</Paragraph> <Paragraph position="7"> By comparison, our base system achieved an overall MRR score of .315 across the 693 questions of TREC9. Thus we see that with VA, the average score of definitional questions improves from below our TREC average to considerably higher. 
<Paragraph position="10"> While the percentage of definitional questions in TREC9 was quite small, we shall explain in a later section how we plan to extend our techniques to other question types.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Errors </SectionTitle> <Paragraph position="0"> The VA process is not flawless, for a variety of reasons. One is that the hierarchy in WordNet does not always exactly correspond to the way people classify the world. For example, in WordNet a dog is not a pet, so &quot;pet&quot; will never even be a candidate answer to &quot;What is a dog&quot;.</Paragraph> <Paragraph position="1"> When the question term is in WordNet, VA succeeds most of the time. One source of error is the lack of uniformity of the semantic distance between levels. For example, the parents of &quot;architect&quot; are &quot;creator&quot; and &quot;human&quot;, the latter being our system's answer to &quot;What is an architect&quot;. This is technically correct, but not very useful.</Paragraph> <Paragraph position="2"> Another error source is polysemy. This does not seem to cause problems with VA very often - indeed the co-occurrence calculations that we perform are similar to those used by [Mihalcea and Moldovan, 1999] for word sense disambiguation - but it can give rise to amusing results. For example, when asked &quot;What is an ass&quot; the system responded with &quot;Congress&quot;. &quot;Ass&quot; has four senses in WordNet, the last of which is a slang term for sex. The parent synset contains the archaic synonym congress (uncapitalized!). In the TREC corpus there are several passages containing the words ass and Congress, which lead to congress being the hypernym with the greatest score. Clearly this particular problem can be avoided by using orthography to indicate word-sense, but the general problem remains.</Paragraph> </Section> </Section> <Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 DISCUSSION AND FURTHER WORK </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Discussion </SectionTitle> <Paragraph position="0"> While we chose not to use Hearst's approach of key-phrase identification as the primary mechanism for answering What is questions, we don't reject the utility of the approach. Indeed, a combination of VA as described here with a key-phrase analysis to further filter candidate answer passages might well reduce the incidence of errors such as the one with ass mentioned in the previous section. Such an investigation remains to be done.</Paragraph> <Paragraph position="1"> We have seen that VA gives very high performance scores at answering What is questions - and we suggest it can be extended to other types - but we have not fully addressed the issue of automatically selecting the questions to which to apply it. We have used the heuristic of only looking at questions of the form &quot;What is (a/an) X&quot; where X is a phrase of one or two words. By inspection of the Excite questions, almost all of those that pass this test are looking for definitions, but some - such as &quot;What is a powerful adhesive&quot; - very probably do not. There are also a few questions that are inherently ambiguous (understanding that the questioners are not all perfect grammarians): is &quot;What is an antacid&quot; asking for a definition or a brand name?</Paragraph>
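The selection heuristic described above - accept only &quot;What is (a/an) X&quot; with X a one- or two-word phrase - can be illustrated with a short sketch; the regular expression and function name below are our own illustration, not the authors' implementation.

    import re

    # Illustrative sketch (not the authors' code) of the selection heuristic
    # described above: accept only questions of the form "What is (a/an) X",
    # where X is a one- or two-word phrase, and return X.
    WHAT_IS = re.compile(r"^what\s+is\s+(?:an?\s+)?([\w-]+(?:\s+[\w-]+)?)\s*\??$",
                         re.IGNORECASE)

    def definition_target(question):
        match = WHAT_IS.match(question.strip())
        return match.group(1) if match else None

    print(definition_target("What is an antacid?"))             # antacid
    print(definition_target("What is a powerful adhesive"))     # passes the test,
                                                                 # though it probably
                                                                 # wants an example
    print(definition_target("What flower did Van Gogh paint?")) # None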
<Paragraph position="2"> Even if it is known or assumed that a definition is required, there remains the ambiguity of the state of knowledge of the questioner. If the person has no clue what the term means, then a parent class, which is what VA finds, is the right answer. If the person knows the class but needs to know how to distinguish the object from others in the class, for example &quot;What is a star fruit&quot;, then a very different approach is required. If the question seems very specific but is phrased entirely in common words, such as the Excite question &quot;What is a yellow spotted lizard&quot;, then the only reasonable interpretation seems to be a request for a subclass of the head noun that has the given property. Finally, questions such as &quot;What is a nanometer&quot; and &quot;What is rubella&quot; are looking for a value or a more common synonym.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Other Question Types </SectionTitle> <Paragraph position="0"> The preceding discussion has centered upon What is questions and the use of WordNet, but the same principles can be applied to other question types and other ontologies. Consider the question &quot;Where is Chicago&quot;, from the training set NIST supplied for TREC8. Let us assume we can use statistical arguments to decide that, in a vanilla context, the question is about the city as opposed to the rock group, any of the city's sports teams or the University. There is still considerable ambiguity regarding the granularity of the desired answer. Is it: Cook County? Illinois? The Mid-West? The United States? North America? The Western Hemisphere? ...</Paragraph> <Paragraph position="1"> There are a number of geographical databases available, which either alone or with some data massaging can be viewed as ontologies with &quot;located within&quot; as the primary relationship. Then by applying Virtual Annotation to Where questions we can find the enclosing region that is most commonly referred to in the context of the question term. By manually applying our algorithm to &quot;Chicago&quot; and the list of geographic regions in the previous paragraph we find that &quot;Illinois&quot; wins, as expected, just beating out &quot;The United States&quot;. However, it should be mentioned that a more extensive investigation might find a different weighting scheme more appropriate for geographic hierarchies.</Paragraph> <Paragraph position="2"> The aforementioned answer of &quot;Illinois&quot; to the question &quot;Where is Chicago?&quot; might be the best answer for an American user, but for anyone else, an answer providing the country might be preferred.</Paragraph> <Paragraph position="3"> How can we expect Virtual Annotation to take this into account? The &quot;hidden variable&quot; in the operation of VA is the corpus. It is assumed that the user belongs to the intended readership of the articles in the corpus, and to the extent that this is true, the results of VA will be useful to the user.</Paragraph> <Paragraph position="4"> Virtual Annotation can also be used to answer questions that are seeking examples or instances of a class. We can use WordNet again, but this time look to hyponyms.</Paragraph>
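A rough sketch of the hyponym direction, using the NLTK interface to WordNet, is given below; this is our own illustration (the paper does not specify an implementation), and the exact output depends on the WordNet version installed.

    from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

    def hyponym_candidates(term, max_depth=3):
        # Collect lemma names of descendant (hyponym) synsets of every noun
        # sense of `term`, up to `max_depth` levels down; in VA these would be
        # scored by co-occurrence with the question terms in the corpus rather
        # than returned directly.
        candidates = set()
        for synset in wn.synsets(term, pos=wn.NOUN):
            frontier = [(synset, 0)]
            while frontier:
                node, depth = frontier.pop()
                if depth >= max_depth:
                    continue
                for hypo in node.hyponyms():
                    candidates.update(name.replace("_", " ") for name in hypo.lemma_names())
                    frontier.append((hypo, depth + 1))
        return candidates

    # For "Name a flying mammal", the hyponyms of "mammal" include "bat"
    # (via "placental"), which corpus-based scoring would then have to pick
    # out from the many other candidates.
    print("bat" in hyponym_candidates("mammal"))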
<Paragraph position="5"> These questions are more varied in syntax than the What is kind; they include, for example (from TREC9 again), &quot;Name a flying mammal.&quot;, &quot;What flower did Vincent Van Gogh paint?&quot; and &quot;What type of bridge is the Golden Gate Bridge?&quot;</Paragraph> </Section> </Section> </Paper>