Instance-Based Question Answering: A Data-Driven Approach

2 Motivation

Most existing Question Answering systems classify new questions according to static ontologies. These ontologies incorporate human knowledge about the expected answer (e.g. date, location, person), about answer type granularity (e.g. date, year, century), and very often semantic information about the question type (e.g. birth date, discovery date, death date).

While effective to some degree, these ontologies are still very small and inconsistent. Considerable manual effort is invested in building and maintaining accurate ontologies, even though answer types are arguably not always disjoint or hierarchical in nature (e.g. "Where is the corpus callosum?" expects an answer that is both a location and a body part).

The most significant drawback is that ontologies are not standard across systems, which makes individual component evaluation very difficult and re-training for new domains time-consuming.

2.1 Answer Modeling

The task of determining the answer type of a question is usually treated as a hard decision problem: questions are classified according to an answer ontology, with each question assigned a single class rather than a probability distribution over answer types. The classification (location, person's name, etc.) is usually made at the beginning of the QA process, and all subsequent efforts are focused on finding answers of that particular type. Several existing QA systems implement feedback loops (Harabagiu et al., 2000) or full-fledged planning (Nyberg et al., 2003) to allow for potential answer type re-classification.

However, most questions can have multiple answer types as well as specific answer type distributions. The following questions can each accommodate answers of type full date, year, or decade:

Question                                    Answer
When did Glenn lift off in Friendship 7?    Feb. 20, 1962
When did Glenn join NASA?                   1959
When did Glenn have long hair?              the fifties

However, it can be argued that a full date is the most likely answer type for the first question, a year for the second, and a decade for the third. In fact, although all three questions can be answered by various temporal expressions, the distributions over these expressions are quite different. Existing answer models do not usually account for these distributions, even though there is clear potential for better answer extraction and more refined answer scoring.
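To make the contrast with hard answer-type classification concrete, the sketch below shows one way a distribution over answer types could be estimated from similar questions with known answers. It is only an illustration of the idea under assumed inputs; the function name, the answer type labels, and the toy counts are invented for the example and are not taken from the system described in this paper.

```python
from collections import Counter

# Hypothetical labeled questions judged similar to a new "When did ..."
# question; the questions and labels are illustrative only.
similar_questions = [
    ("When did Glenn lift off in Friendship 7?", "full_date"),
    ("When did Apollo 11 launch?", "full_date"),
    ("When did Glenn join NASA?", "year"),
    ("When was NASA founded?", "year"),
    ("When did Glenn have long hair?", "decade"),
]

def answer_type_distribution(labeled_questions):
    """Estimate a distribution over answer types from similar questions
    with known answers, instead of committing to a single hard type."""
    counts = Counter(label for _, label in labeled_questions)
    total = sum(counts.values())
    return {atype: n / total for atype, n in counts.items()}

if __name__ == "__main__":
    dist = answer_type_distribution(similar_questions)
    for atype, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
        print(f"{atype}: {prob:.2f}")  # e.g. full_date: 0.40, year: 0.40, decade: 0.20
```

A model of this kind could then weight candidate answers during extraction and scoring by how probable their type is for the question at hand.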
2.2 Document Retrieval

When faced with a new question, QA systems usually generate a few carefully expanded queries, which produce ranked lists of documents. The retrieval step, although critical to the QA process, does not take full advantage of context information. However, similar questions with known answers do share context information, in the form of lexical and structural features present in relevant documents. For example, all questions of the type "When was X born?" find their answers in documents that often contain words such as "native" or "record", phrases such as "gave birth to X", and sometimes even specific parse trees.

Most IR research in Question Answering focuses on improving query expansion and on structuring queries to take advantage of specific document pre-processing. In addition to automatic query expansion for QA (Yang et al., 2003), queries are optimized to take advantage of expansion resources and document sources. Very often, these optimizations are performed offline, based on the type of question being asked.

Several QA systems associate this kind of information with question ontologies: upon observing questions of a certain type, specific lexical features are sought in the retrieved documents. These features are not always learned automatically in a form that can be used for query generation. Moreover, such systems are highly dependent on their specific ontologies and become harder to re-train.

2.3 Answer Extraction

Given a set of relevant documents, the answer extraction step consists of identifying snippets of text or exact phrases that answer the question. Manual approaches to answer extraction have been moderately successful in the news domain: regular expressions and rule- or pattern-based extraction are among the most efficient techniques for information extraction. However, because of the difficulty of extending them to additional types of questions, learning methods are becoming more prevalent.

Current systems (Ravichandran et al., 2003) already employ traditional information extraction and machine learning to extract answers from relevant documents. Boundary detection techniques, finite state transducers, and text passage classification are a few of the methods usually applied to this task.

The drawback shared by most statistical answer extractors is their reliance on predefined ontologies: they are often tailored to expected answer types and require type-specific resources, such as gazetteers and encyclopedias, to generate type-specific features.
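As an illustration of the pattern-based extraction techniques mentioned above, the sketch below applies hand-written regular expressions to pull answers for "When was X born?" questions out of a retrieved passage. The patterns, the function name, and the sample passage are invented for this example and are not the extractors used by the systems cited in this section.

```python
import re

# Illustrative hand-written patterns for "When was X born?" questions.
# Real systems maintain many such patterns per answer type, often backed
# by type-specific resources; these two are made up for the example.
BIRTH_DATE_PATTERNS = [
    r"{name} was born (?:on )?(?P<answer>[A-Z][a-z]+\.? \d{{1,2}}, \d{{4}})",
    r"{name} \((?P<answer>\d{{4}})\s*-",  # e.g. "X (1921 - 2016)"
]

def extract_birth_date_candidates(name, passage):
    """Return candidate answers for 'When was <name> born?' found in a
    retrieved passage, using simple pattern-based extraction."""
    candidates = []
    for template in BIRTH_DATE_PATTERNS:
        pattern = template.format(name=re.escape(name))
        for match in re.finditer(pattern, passage):
            candidates.append(match.group("answer"))
    return candidates

if __name__ == "__main__":
    passage = ("John Glenn (1921 - 2016) was an American astronaut. "
               "John Glenn was born on July 18, 1921, in Cambridge, Ohio.")
    print(extract_birth_date_candidates("John Glenn", passage))
    # ['July 18, 1921', '1921']
```

The brittleness of such patterns outside the question types they were written for is exactly the drawback that motivates the data-driven, instance-based approach pursued in this paper.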