<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1188">
<Title>Information Extraction for Question Answering: Improving Recall Through Syntactic Patterns</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> Current retrieval systems allow us to locate documents that might contain the pertinent information, but most of them leave it to the user to extract that information from a ranked list of documents.</Paragraph>
<Paragraph position="1"> Hence, the (often unwilling) user is left with a relatively large amount of text to consume. There is a need for tools that reduce the amount of text one has to read to obtain the desired information. Corpus-based question answering is designed to take a step beyond document retrieval, toward information retrieval proper. The question answering (QA) task is to find, in a large collection of data, an answer to a question posed in natural language.</Paragraph>
<Paragraph position="2"> One particular QA strategy that has proved successful on large collections uses surface patterns derived from the question to identify answers. For example, for questions like When was Gandhi born?, typical phrases containing the answer are Gandhi was born in 1869 and Gandhi (1869-1948). These examples suggest that text patterns such as &quot;name was born in birth date&quot; and &quot;name (birth year-death year)&quot;, formulated as regular expressions, can be used to select the answer phrase (a concrete sketch of this strategy is given below).</Paragraph>
<Paragraph position="3"> Similarly, such lexical or lexico-syntactic patterns can be used to extract specific information on semantic relations from a corpus offline, before actual questions are known, and store it in a repository for quick and easy access. This strategy allows one to handle some frequent question types: Who is ..., Where is ..., What is the capital of ..., etc. (Fleischman et al., 2003; Jijkoun et al., 2003).</Paragraph>
<Paragraph position="4"> A great deal of work has addressed the problem of extracting semantic relations from unstructured text. Building on this, much recent work in QA has focused on systems that extract answers from large bodies of text using simple lexico-syntactic patterns. These studies point to two distinct problems with using patterns to extract semantic information from text. First, the patterns yield only a small subset of the information that may be present in a text (the recall problem). Second, a fraction of the information that the patterns yield is unreliable (the precision problem). The precision of the extracted information can be improved significantly by using machine learning methods to filter out noise (Fleischman et al., 2003). The recall problem is usually addressed by increasing the amount of text data used for extraction, i.e., taking larger collections (Fleischman et al., 2003), or by developing more surface patterns (Soubbotin and Soubbotin, 2002).</Paragraph>
<Paragraph position="5"> Some previous studies indicate that in the setting of an end-to-end state-of-the-art QA system, with additional answer finding strategies, sanity checking, and statistical re-ranking of candidate answers, recall is more of a problem than precision (Bernardi et al., 2003; Jijkoun et al., 2003): it often seems useful to have more data rather than better data. The aim of this paper is to address the recall problem by using extraction methods that are linguistically more sophisticated than surface pattern matching.</Paragraph>
<Paragraph position="6"> Specifically, we use dependency parsing to extract syntactic relations between entities in a text that are not necessarily adjacent at the surface level. A small set of hand-built syntactic patterns allows us to detect the relevant semantic information. A comparison of the parsing-based approach to a surface-pattern-based method on a set of TREC questions about persons shows a substantial improvement in both the amount of extracted information and the number of correctly answered questions.</Paragraph>
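To ground the comparison, here is a minimal sketch of the surface-pattern strategy described above: birth-date patterns formulated as regular expressions and applied offline to fill a lookup repository. The pattern strings, the toy corpus, and the names (PATTERNS, extract_birth_facts) are illustrative assumptions, not the paper's actual pattern inventory.

    import re

    # Two surface patterns for birth facts, as regular expressions:
    #   "name was born in birth date"  and  "name (birth year-death year)"
    PATTERNS = [
        re.compile(r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<birth>\d{4})"),
        re.compile(r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*) \((?P<birth>\d{4})-\d{4}\)"),
    ]

    def extract_birth_facts(corpus):
        """Offline extraction: scan the corpus once, store name -> birth year."""
        repository = {}
        for sentence in corpus:
            for pattern in PATTERNS:
                for m in pattern.finditer(sentence):
                    repository[m.group("name")] = m.group("birth")
        return repository

    # Toy corpus (illustrative only).
    corpus = [
        "Gandhi was born in 1869 in Porbandar.",
        "Gandhi (1869-1948) led the Indian independence movement.",
    ]

    repo = extract_birth_facts(corpus)
    # At question time, "When was Gandhi born?" reduces to a table lookup:
    print(repo.get("Gandhi"))  # -> 1869

Note how brittle the match is: any intervening words (Gandhi, the Mahatma, was born ...) defeat both patterns, which is precisely the recall problem discussed above.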
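A syntactic pattern, by contrast, matches dependency relations rather than the literal word sequence, so intervening material does not break it. The sketch below uses spaCy as a stand-in dependency parser (the paper's own parser and hand-built patterns differ) with one illustrative pattern: a person as the subject of born, linked to a DATE entity in the same clause. The model name and the helper birth_relations are assumptions.

    import spacy

    # Stand-in parser; requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def birth_relations(text):
        """One hand-built syntactic pattern:
        PERSON --nsubjpass--> born, with a DATE entity in the same clause."""
        doc = nlp(text)
        facts = []
        for token in doc:
            if token.lemma_ != "bear":      # "born" lemmatizes to "bear"
                continue
            subjects = [c for c in token.children
                        if c.dep_ in ("nsubjpass", "nsubj")]
            clause = set(token.subtree)     # everything governed by "born"
            dates = [ent for ent in doc.ents
                     if ent.label_ == "DATE" and ent.root in clause]
            for subj in subjects:
                for date in dates:
                    facts.append((subj.text, date.text))
        return facts

    # Surface patterns miss this: "Gandhi" and "1869" are not adjacent.
    print(birth_relations("Gandhi, leader of the Indian independence "
                          "movement, was born in the year 1869."))

Because the subject and the date are read off the parse tree, this single pattern covers many surface realizations that would each require a separate regular expression.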
<Paragraph position="7"> In our experiments we sought to determine whether linguistically involved methods such as parsing can benefit information extraction, a field where rather shallow techniques are traditionally employed, and whether abstracting from the surface form to the syntactic structure of a text does indeed help to find more information while avoiding the time-consuming manual development of ever larger sets of surface patterns.</Paragraph>
<Paragraph position="8"> The remainder of the paper is organized as follows. In Section 2 we discuss related work on extracting semantic information. We describe our main research questions and experimental setting in Section 3. Then, in Section 4 we provide details on the extraction methods used (surface and syntactic). Sections 5 and 6 contain a description of our experiments and results, and an error analysis, respectively. We conclude in Section 7.</Paragraph>
</Section>
</Paper>