<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1007"> <Title>Answering Definition Questions Using Multiple Knowledge Sources</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Answering Definition Questions </SectionTitle> <Paragraph position="0"> Our first step in answering a definition question is to extract the concept for which information is being sought--called the target term, or simply, the target. Once the target term has been found, three techniques are employed to retrieve relevant nuggets: lookup in a database created from the AQUAINT corpus, lookup in a Web dictionary followed by answer projection, and lookup directly in the AQUAINT corpus with an IR engine. Answers from the three different sources are then merged to produce the final system output. The following subsections describe each of these modules in greater detail.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Target Extraction </SectionTitle> <Paragraph position="0"> We have developed a simple pattern-based parser to extract the target term using regular expressions. If the natural language question does not fit any of our patterns, the parser heuristically extracts the last sequence of capitalized words in the question as the target.</Paragraph> <Paragraph position="1"> Our simple target extractor was tested on all definition questions from the TREC-9 and TREC-10 QA Track testsets and performed with one hundred percent accuracy on those questions. However, there were several instances where the target term was not correctly extracted from the definition questions in TREC 2003, which made it difficult for downstream modules to find relevant nuggets (see Section 3.2 for a discussion).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Database Lookup </SectionTitle> <Paragraph position="0"> The use of surface patterns for answer extraction has proven to be an effective strategy for factoid question answering (Soubbotin and Soubbotin, 2001; Brill et al., 2001; Hermjakob et al., 2002). Typically, surface patterns are applied to a candidate set of documents returned by a document or passage retriever. Although this strategy often suffers from low recall, it is generally not a problem for factoid questions, where only a single instance of the answer is required. Definition questions, however, require a system to find as many relevant nuggets as possible, making recall very important.</Paragraph> <Paragraph position="1"> To boost recall, we employed an alternative strategy: by applying the set of surface patterns offline, we were able to &quot;precompile&quot; from the AQUAINT corpus a list of nuggets about every entity mentioned within it. In essence, we have automatically constructed an immense relational database containing nuggets distilled from every article in the corpus. The task of answering definition questions then becomes a simple lookup for the relevant term. This approach is similar in spirit to the work reported by Fleischman et al. (2003) and Mann (2002), except that our system benefits from a greater variety of patterns and answers a broader range of questions.</Paragraph> <Paragraph position="2"> Our surface patterns operated both at the word and part-of-speech level. Rudimentary chunking, such as marking the boundaries of noun phrases, was performed by grouping words based on their part-of-speech tags.
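To make the mechanics concrete, the following is a minimal illustrative sketch (in Python) of how a single copular pattern of this kind might be applied to a POS-tagged sentence; the tag set, chunking rule, and pattern shown here are simplified assumptions for exposition, not the actual patterns used in our system.

# Illustrative sketch only: one copular surface pattern applied to a
# POS-tagged sentence. Noun phrases are formed by grouping adjacent
# determiner/adjective/noun tags -- the rudimentary chunking described above.

COPULAS = {"is", "are", "was", "were"}
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}

def chunk_noun_phrases(tagged):
    """Group maximal runs of NP_TAGS tokens into candidate noun phrases."""
    chunks, current = [], []
    for i, (tok, tag) in enumerate(tagged):
        if tag in NP_TAGS:
            current.append((i, tok))
        elif current:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks  # each chunk is a list of (position, token) pairs

def copular_nuggets(tagged):
    """Extract (target, nugget) pairs from 'NP is/are NP ...' constructions."""
    nuggets = []
    chunks = chunk_noun_phrases(tagged)
    for left, right in zip(chunks, chunks[1:]):
        between = tagged[left[-1][0] + 1 : right[0][0]]
        # Require exactly one copular verb between the two noun phrases.
        if len(between) == 1 and between[0][0].lower() in COPULAS:
            target = " ".join(tok for _, tok in left)
            nugget = " ".join(tok for tok, _ in tagged[right[0][0]:])
            nuggets.append((target, nugget))
    return nuggets

# Example: "A fractal is a pattern that is irregular"
tagged = [("A", "DT"), ("fractal", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("pattern", "NN"), ("that", "WDT"), ("is", "VBZ"), ("irregular", "JJ")]
print(copular_nuggets(tagged))  # [('A fractal', 'a pattern that is irregular')]

In the full system, matched nuggets are additionally expanded around their center point and stored together with the pattern type and source sentence, as described below.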
In total, we applied eleven surface patterns over the entire corpus--these are detailed in Table 1, with accompanying examples. Typically, surface patterns identify nuggets on the order of a few words. In answering definition questions, however, we decided to return responses that include additional context--there is evidence that contextual information results in higher-quality answers (Lin et al., 2003). To accomplish this, all nuggets were expanded around their center point to encompass one hundred characters. We found that this technique enhances the readability of the responses, because many nuggets seem odd and out of place without context.</Paragraph> <Paragraph position="3"> The results of applying our surface patterns to the entire AQUAINT corpus--the target, pattern type, nugget, and source sentence--are stored in a relational database.</Paragraph> <Paragraph position="4"> To answer a definition question, the target is used to query for all relevant nuggets in the database.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Dictionary Lookup </SectionTitle> <Paragraph position="0"> Another component of our system for answering definition questions utilizes an existing Web-based dictionary--dictionary definitions often supply knowledge that can be directly exploited. Previous factoid question answering systems have already demonstrated the value of semistructured resources on the Web (Lin and Katz, 2003); we believe that some of these resources can be similarly employed to answer definition questions.</Paragraph> <Paragraph position="1"> The setup of the TREC evaluations requires every answer to be paired with a supporting document; therefore, a system cannot simply return the dictionary definition of a term as its response. To address this issue, we developed answer projection techniques to &quot;map&quot; dictionary definitions back onto AQUAINT documents. Similar techniques have been employed for factoid questions, for example, in (Brill et al., 2001).</Paragraph> <Paragraph position="2"> We have constructed a wrapper around the Merriam-Webster online dictionary. To answer a question using this technique, keywords from the target term's dictionary definition and the target itself are used as the query to Lucene, a freely-available open-source IR engine. Our system retrieves the top one hundred documents returned by Lucene and tokenizes them into individual sentences, discarding candidate sentences that do not contain the target term. The remaining sentences are scored by their keyword overlap with the dictionary definition, weighted by the inverse document frequency of each keyword. All sentences with a non-zero score are retained and shortened to one hundred characters centered around the target term, if necessary.</Paragraph> <Paragraph position="3"> The following are two examples of results from our dictionary lookup component: What is the vagus nerve? Dictionary definition: either of the 10th pair of cranial nerves that arise from the medulla and supply chiefly the viscera especially with autonomic sensory and motor fibers Projected answer: The vagus nerve is sometimes called the 10th cranial nerve. It runs from the brain ...</Paragraph> <Paragraph position="4"> What is feng shui?
Dictionary definition: a Chinese geomantic practice in which a structure or site is chosen or configured so as to harmonize with the spiritual forces that inhabit it Projected answer: In case you've missed the feng shui bandwagon, it is, according to Webster's, &quot;a Chinese geomantic practice ...</Paragraph> <Paragraph position="5"> This strategy was inspired by query expansion techniques often employed in document retrieval--essentially, the dictionary definition of a term is used as the source of expansion terms. Creative use of Web-based resources combined with proven information retrieval techniques enables this component to provide high-quality responses to definition questions.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Document Lookup </SectionTitle> <Paragraph position="0"> If no answers are found by the previous two techniques, as a last resort our system employs traditional document retrieval to extract relevant nuggets. The target term is used as a Lucene query to gather a set of one hundred candidate documents. These documents are tokenized into individual sentences, and all sentences containing the target term are retained as responses (ranked by the Lucene-generated score of the document from which they came).</Paragraph> <Paragraph position="1"> These sentences are also shortened if necessary.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Answer Merging </SectionTitle> <Paragraph position="0"> The answer merging component of our system is responsible for integrating results from all three sources: database lookup, dictionary lookup, and document lookup. As previously mentioned, responses extracted using document lookup are used only if the other two methods returned no answers.</Paragraph> <Paragraph position="1"> Redundancy presents a major challenge for integrating knowledge from multiple sources. This problem is especially severe for nuggets stored in our database. Since we precompiled knowledge about every entity instance in the entire AQUAINT corpus, common nuggets are often repeated. In order to deal with this problem, we applied a simple heuristic to remove duplicate information: if two responses share more than sixty percent of their keywords, one of them is randomly discarded.</Paragraph> <Paragraph position="2"> After duplicate removal, all responses are ordered by the expected accuracy of the technique used to extract the nugget. To determine this expected accuracy, we performed a fine-grained evaluation for each surface pattern as well as the dictionary lookup strategy; we discuss these results further in Section 3.1.</Paragraph> <Paragraph position="3"> Finally, the answer merging component decides how many responses to return. Given n total responses, we calculate the final number of responses to return as n if n ≤ 10, and as 10 + √(n - 10) if n > 10. Having described the architecture of our system, we proceed to present evaluation results.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> In this section we present two separate evaluations of our system.
The first is a component analysis of our database and dictionary techniques, and the second involves our participation in the TREC 2003 Question Answering Track.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Component Evaluation </SectionTitle> <Paragraph position="0"> We evaluated the performance of each individual surface pattern and the dictionary lookup technique on 160 definition questions selected from the TREC-9 and TREC-10 QA Track testsets. Since we generated our patterns primarily by analyzing the corpus directly, rather than these questions, the questions can be considered a blind testset. The performance of our surface patterns and our dictionary lookup technique is shown in Table 3.</Paragraph> <Paragraph position="1"> Overall, database lookup retrieved approximately eight nuggets per question at an accuracy nearing 40%; dictionary lookup retrieved about 1.5 nuggets per question at an accuracy of 45%. Obviously, recall of our techniques is extremely hard to measure directly; instead, we use the prevalence of each pattern as a poor substitute. As shown in Table 3, some patterns occur frequently (e.g., e1 is and e1 appo), but others are relatively rare, such as the relative clause pattern, which yielded only six nuggets for the entire testset.</Paragraph> <Paragraph position="2"> These results represent a baseline for the performance of each technique. Our focus was not on perfecting each individual pattern and the dictionary matching algorithm, but on building a complete working system. We will discuss future improvements and refinements in Section 5.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 TREC 2003 Results </SectionTitle> <Paragraph position="0"> Our system for answering definition questions was independently and formally evaluated at the TREC 2003 Question Answering Track. For the first time, TREC evaluated definition questions in addition to factoid and list questions. Although our entry handled all three types of questions, we report only the results for the definition questions here; see (Katz et al., 2003) for a description of the other components.</Paragraph> <Paragraph position="1"> Overall, our system performed well, ranking eighth out of twenty-five groups that participated (Voorhees, 2003). Our official results for the definition sub-task are shown in Table 4, along with overall statistics for all groups. The formula used to calculate the F-measure is given in Figure 1. The b value of five indicates that recall is considered five times more important than precision; this value was set arbitrarily for the purposes of the evaluation.</Paragraph> <Paragraph position="2"> Nugget precision is computed based on a length allowance of one hundred non-whitespace characters per relevant response, because a pilot study demonstrated that it was impossible for assessors to consistently enumerate the total set of &quot;concepts&quot; contained in a system response (Voorhees, 2003). The assessors' nugget list (i.e., the ground truth) was created by considering the union of all responses returned by all participants. All relevant nuggets are divided into &quot;vital&quot; and &quot;non-vital&quot; categories, where vital nuggets are items of information that must be in a definition for it to be considered &quot;good&quot;. Non-vital nuggets may also provide relevant information, but a &quot;good&quot; definition does not need to include them.
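As an illustration of how these quantities interact, the following sketch (ours, not the official scorer) computes the nugget-based F-measure of Figure 1; the length allowance is interpreted here as one hundred non-whitespace characters per matched nugget, which is our reading of the metric in (Voorhees, 2003).

# Sketch of the TREC 2003 definition scoring metric as we understand it;
# variable names and the per-matched-nugget length allowance are our own reading.

def nugget_f_measure(vital_returned, vital_total, okay_returned,
                     response_length, beta=5.0):
    """F-measure over nuggets.

    vital_returned  -- vital nuggets matched by the response
    vital_total     -- vital nuggets in the assessors' list
    okay_returned   -- non-vital (okay) nuggets matched
    response_length -- total non-whitespace characters in the response
    """
    # Recall is a function of the vital nuggets only.
    recall = vital_returned / vital_total if vital_total else 0.0

    # Precision is length-based: each matched nugget earns an allowance of
    # 100 non-whitespace characters, beyond which the score decays.
    allowance = 100 * (vital_returned + okay_returned)
    precision = 1.0
    if response_length > allowance:
        precision = 1.0 - (response_length - allowance) / response_length

    if precision == 0.0 and recall == 0.0:
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

print(nugget_f_measure(vital_returned=4, vital_total=5, okay_returned=3,
                       response_length=2000))  # roughly 0.76

With b = 5, a long response that covers the vital nuggets still scores well despite a heavy length penalty, a property we return to in Section 4.1.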
Nugget recall is thus only a function of vital nuggets.</Paragraph> <Paragraph position="3"> The best run, with an F-measure of 0.555, was submitted by BBN (Xu et al., 2003). Their system used many of the same techniques we have described here, with one important exception--they did not precompile nuggets into a database. In their own error analysis, they cited recall as a major cause of poor performance; this is an issue specifically addressed by our approach.</Paragraph> <Paragraph position="4"> Interestingly, Xu et al. also reported an IR baseline which essentially retrieved the top 1000 sentences in the corpus that mentioned the target term (subjected to simple heuristics to remove redundant answers). This baseline technique achieved an F-measure of 0.493, which beat all other runs (except for BBN's own runs). Because the F-measure heavily favored recall over precision, simple IR techniques worked extremely well. This issue is discussed in Section 4.1.</Paragraph> <Paragraph position="5"> To identify areas for improvement, we analyzed the questions on which we did poorly and found that many of the errors can be traced back to problems with target extraction. If the target term is not correctly identified, then all subsequent modules have little chance of providing relevant nuggets. For eight questions, our system did not identify the correct target. The presence of stopwords and special characters in names was not anticipated:</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> What is Bausch & Lomb? </SectionTitle> <Paragraph position="0"> Who is Vlad the Impaler? Who is Akbar the Great? Our naive pattern-based parser extracted Lomb, Impaler, and Great as the target terms for the above questions. Fortunately, because Lomb and Impaler were rare terms, our system did manage to return relevant nuggets. However, since Great is a very common word, our nuggets for Akbar the Great were meaningless.</Paragraph> <Paragraph position="1"> The system's inability to parse certain names is related to our simple assumption that the final consecutive sequence of capitalized words in a question is likely to be the target. This simply turned out to be an incorrect assumption, as seen in the following questions: Who was Abraham in the Old Testament? What is ETA in Spain?</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> What is Friends of the Earth? </SectionTitle> <Paragraph position="0"> Our parser extracted Old Testament, Spain, and Earth as the targets for these questions, which directly resulted in the system's failure to return relevant nuggets.</Paragraph> <Paragraph position="1"> Our target extractor also had difficulty with apposition. Given the question &quot;What is the medical condition shingles?&quot;, the extractor incorrectly identified the entire phrase medical condition shingles as the target term. Finally, our policy of ignoring articles before the target term caused problems with the question &quot;What is the Hague?&quot; Since we extracted Hague as the target term, we returned answers about a British politician as well as the city in Holland. Our experiences show that while target extraction seems relatively straightforward, there are instances where a deeper linguistic understanding is necessary.</Paragraph> <Paragraph position="2"> Overall, our database and dictionary lookup techniques worked well.
For six questions (out of fifty), however, neither technique found any nuggets, and therefore our system resorted to document lookup.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Evaluation Reconsidered </SectionTitle> <Paragraph position="0"> This section takes a closer look at the setup of the definition question evaluation at TREC 2003. In particular, we examine three issues: the scoring metric, error inherent in the evaluation process, and variations in judgments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Scoring Metric </SectionTitle> <Paragraph position="0"> As defined, nugget recall is only a function of the nuggets considered &quot;vital&quot;. This, however, leads to a counter-intuitive situation where a system that returned every non-vital nugget but no vital nuggets would receive a score of zero. This certainly does not reflect the information needs of a real user--even in the absence of &quot;vital&quot; information, related knowledge might still be useful to a user. One solution might be to assign a relative weight to distinguish vital and non-vital nuggets.</Paragraph> <Paragraph position="1"> The distinction between vital and non-vital nuggets is itself somewhat arbitrary. Consider the relevant nuggets for the question &quot;What is Bausch & Lomb?&quot;: according to the official assessment, the first four nuggets in the assessors' list are vital and the fifth, &quot;based in Rochester, New York&quot;, is not. This means that the location of Bausch & Lomb's headquarters is considered less important than employee count and revenue. We disagree and also believe that &quot;based in Rochester, New York&quot; is more important than &quot;in 50 countries&quot;. Since it appears that the difference between vital and non-vital cannot be easily operationalized, there is little hope for systems to learn and exploit this distinction.</Paragraph> <Paragraph position="2"> As a reference, we decided to reevaluate our system, ignoring the distinction between vital and non-vital nuggets. The overall nugget recall is reported in Table 5.</Paragraph> <Paragraph position="3"> We also report the nugget recall of our system after fixing our target extractor to handle the variety of target terms in the testset (the &quot;fixed&quot; run). Unfortunately, our performance for the fixed run did not increase significantly because the problem associated with unanticipated targets extended beyond the target extractor. Since our surface patterns did not handle these special entities, the database did not contain relevant entries for those targets.</Paragraph> <Paragraph position="4"> Another important issue in the evaluation concerns the value of b, which sets the relative importance of precision and recall in calculating the F-measure. The top entry achieved an F-measure of 0.555, but the response length averaged 2059 non-whitespace characters per question.</Paragraph> <Paragraph position="5"> In contrast, our run with an F-measure of 0.309 averaged only 620 non-whitespace characters per answer (only two other runs in the top ten had average response lengths lower than ours; the lowest was 338). Figure 2 shows the F-measure of our system, the top run, and the IR baseline plotted against the value of b.
As can be seen, if precision and recall are considered equally important (i.e., b = 1), the performance of our system is virtually indistinguishable from that of the top system (and our system performs significantly better than the IR baseline). At the level of b = 5, it is obvious that standard IR technology works very well. The advantages of surface patterns, linguistic processing, answer fusion, and other techniques become more obvious if the F-measure is not as heavily biased towards recall.</Paragraph> <Paragraph position="6"> What is the proper value of b? As this was the first formal evaluation of definition questions, the value was set arbitrarily. However, we believe that there is no &quot;correct&quot; value of b. Instead, the relative importance of precision and recall varies dramatically from application to application, depending on the user's information need. A college student writing a term paper, for example, would most likely value recall highly, whereas the opposite would be true for a user asking questions on a PDA. We believe that these tradeoffs are worthy of further research.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Evaluation Error </SectionTitle> <Paragraph position="0"> In the TREC 2003 evaluation, we submitted three identical runs, but nevertheless received different scores for each of the runs. This situation can be viewed as a probe into the error margin of the evaluation--assessors are human and naturally make mistakes, and to ensure the quality of the evaluation we need to quantify this variation.</Paragraph> <Paragraph position="1"> Voorhees' analysis (2003) revealed that scores for pairs of identical runs differed by as much as 0.043 in F-measure.</Paragraph> <Paragraph position="2"> For the three identical runs we submitted, there was one nugget missed in our first run that was found in the other two runs, ten nuggets from six questions missed in our second run that were found in the other runs, and ten nuggets from five questions missed in our third run.</Paragraph> <Paragraph position="3"> There were also nine nuggets from seven questions that were missed in all three runs, even though they were clearly present in our answers.</Paragraph> <Paragraph position="4"> Together over our three runs, there were 48 nuggets from 13 questions that were clearly present in our responses but were not consistently recognized by the assessors. The question affected most by these discrepancies was &quot;Who is Alger Hiss?&quot;, for which we received an F-measure of 0.671 in our first run, while for the second and third runs we received a score of zero.</Paragraph> <Paragraph position="5"> If the 48 missed nuggets had been recognized by the assessors, our F-measure would have been 0.327, 0.045 higher than the score we actually received for runs b and c. This single-point investigation is not meant to contest the relative rankings of submitted runs, but simply to demonstrate the magnitude of the human error currently present in the evaluation of definition questions (presumably, all groups suffered equally from these variations).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Variations in Judgment </SectionTitle> <Paragraph position="0"> The answers to definition questions were judged by humans, and humans naturally have differing opinions as to the quality of a response.
These differences of opinion are not mistakes (unlike the issues discussed in the previous section), but legitimate variations in what assessors consider to be acceptable. These variations are compounded by the small size of the testset--only fifty questions. In a post-evaluation analysis, Voorhees (2003) determined that a score difference of at least 0.1 in F-measure is required in order for two evaluation results to be considered statistically different (at 95% confidence). A range of ±0.1 around our F-measure of 0.309 could either push our results up to fifth place or down to eleventh place.</Paragraph> <Paragraph position="1"> A major source of variation is whether or not a passage matches a particular nugget in the assessors' list (the ground truth). Obviously, the assessors are not merely doing a string comparison, but are instead performing a &quot;semantic match&quot; of the relevant concepts involved. The following passages were rejected as matches to the assessors' nuggets: Who is Al Sharpton? Nugget: Harlem civil rights leader Our answer: New York civil rights activist Who is Ari Fleischer? Nugget: Elizabeth Dole's Press Secretary Our answer: Ari Fleischer, spokesman for ... Elizabeth Dole</Paragraph> <Paragraph position="2"> What is the medical condition shingles? Nugget: tropical [sic] capsaicin relieves pain of shingles Our answer: Epilepsy drug relieves pain from ... shingles Consider the nugget for Al Sharpton: although an &quot;activist&quot; may not be a &quot;leader&quot;, and someone from New York may not necessarily be from Harlem, one might argue that the two nuggets are &quot;close enough&quot; to warrant a semantic match. The same situation is true of the other two questions. The important point here is that different assessors may judge these nuggets differently, contributing to detectable variations in score.</Paragraph> <Paragraph position="3"> Another important issue is the composition of the assessors' nugget list, which serves as &quot;ground truth&quot;. To ensure proper assessment, each nugget should ideally represent an &quot;atomic&quot; concept--which in many cases, it does not. Again consider the nugget for Al Sharpton; &quot;a Harlem civil rights leader&quot; includes the concepts that he was an important civil rights figure and that he did much of his work in Harlem. It is entirely conceivable that a response would provide one fact but not the other. How then should this situation be scored? As another example, one of the nuggets for Alexander Pope is &quot;English poet&quot;, which is clearly two separate facts.</Paragraph> <Paragraph position="4"> Another desirable characteristic of the assessors' nugget list is uniqueness--nuggets should be unique, not only in their text but also in their meaning. In the TREC 2003 testset, three questions had exact duplicate nuggets. Furthermore, there were also several questions for which multiple nuggets are nearly synonymous (or are implied by other nuggets). Because the nuggets overlap greatly with each other in the concepts they denote, consistent and reproducible evaluation results are difficult to obtain.</Paragraph> <Paragraph position="5"> Another desirable property of the ground truth is completeness, or coverage of the nuggets--which we also found to be lacking. There were many relevant items of information returned by our runs that did not make it onto the assessors' nugget list (even as non-vital nuggets).
For the question &quot;Who is Alberto Tomba?&quot;, the fact that he is Italian was not judged to be relevant. For &quot;What are fractals?&quot;, the ground truth does not contain the idea that they can be described by simple formulas, which is one of their most important characteristics. Some more examples are shown below: Aga Khan is the founder and principal shareholder of the Nation Media Group.</Paragraph> <Paragraph position="6"> The vagus nerve is sometimes known as the 10th cranial nerve.</Paragraph> <Paragraph position="7"> Alexander Hamilton was an author, a general, and a founding father.</Paragraph> <Paragraph position="8"> Andrew Carnegie established a library system in Canada.</Paragraph> <Paragraph position="9"> Angela Davis taught at UC Berkeley.</Paragraph> <Paragraph position="10"> This coverage issue also points to a deeper methodological problem with evaluating definition questions by pooling the results of all participants. Vital nuggets may be excluded simply because no system returned them.</Paragraph> <Paragraph position="11"> Unfortunately, there is no easy way to quantify this phenomenon. Clearly, evaluating answers to definition questions is a challenging task. Nevertheless, consistent, repeatable, and meaningful scoring guidelines are critical to driving the development of the field. We believe that lessons learned from our analysis can lead to a more refined evaluation in the coming years.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Future Work </SectionTitle> <Paragraph position="0"> The results of our work highlight several areas for future improvement. As mentioned earlier, target extraction is a non-trivial capability critical to the success of a system. Similarly, database lookup works only if the relevant target terms are identified and indexed while preprocessing the corpus. Both of these issues point to the need for a more robust named-entity extractor, capable of handling specialized names (e.g., &quot;Bausch & Lomb&quot;, &quot;Destiny's Child&quot;, &quot;Akbar the Great&quot;). At the same time, the named-entity extractor must not be confused by sentences such as &quot;Raytheon & Boeing are defense contractors&quot; or &quot;She gave John the Honda for Christmas&quot;.</Paragraph> <Paragraph position="1"> Another area for improvement is the accuracy of the surface patterns. In general, our patterns used only local information; we expect that expanding the context on which these patterns operate will reduce the number of false matches. As an example, consider our e1 is pattern; in one test, over 60% of irrelevant nuggets were cases where the target is the object of a preposition and not the subject of the copular verb immediately following it.</Paragraph> <Paragraph position="2"> For example, this pattern matched the question &quot;What is mold?&quot; to the sentence &quot;tools you need to look for mold are ...&quot;. If we endow our patterns with better linguistic notions of constituency, we can dramatically improve their precision. Another direction we are pursuing is the use of machine learning techniques to learn predictors of good nuggets, much like the work of Fleischman et al. (2003).
Separating &quot;good&quot; from &quot;bad&quot; nuggets fits very naturally into a binary classification task.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> In this paper, we have described a novel set of strategies for answering definition questions from multiple sources: a database of nuggets precompiled offline using surface patterns, a Web-based electronic dictionary, and documents retrieved using traditional information retrieval technology. We have also demonstrated how answers derived using multiple strategies can be smoothly integrated to produce a final set of answers. In addition, our analyses have shown the difficulty of evaluating definition questions and the inability of present metrics to accurately capture the information needs of real-world users.</Paragraph> <Paragraph position="1"> We believe that our research makes significant contributions toward the understanding of definition questions, a largely unexplored area of question answering.</Paragraph> </Section> </Paper>