File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/p02-1006_intro.xml
Size: 3,087 bytes
Last Modified: 2025-10-06 14:01:25
<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1006">
<Title>Learning Surface Text Patterns for a Question Answering System</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Most of the recent open domain question answering systems use external knowledge and tools for answer pinpointing. These may include named entity taggers, WordNet, parsers, hand-tagged corpora, and ontology lists (Srihari and Li, 00; Harabagiu et al., 01; Hovy et al., 01; Prager et al., 01). However, at the recent TREC-10 QA evaluation (Voorhees, 01), the winning system used just one resource: a fairly extensive list of surface patterns (Soubbotin and Soubbotin, 01). The apparent power of such patterns surprised many. We therefore decided to investigate their potential by acquiring patterns automatically and measuring their accuracy.</Paragraph>
<Paragraph position="1"> It has been noted in several QA systems that certain types of answer are expressed using characteristic phrases (Lee et al., 01; Wang et al., 01). For example, for BIRTHDATEs (with questions like &quot;When was X born?&quot;), typical answers are &quot;Mozart was born in 1756.&quot; and &quot;Gandhi (1869-1948)...&quot; These examples suggest that phrases like &quot;<NAME> was born in <BIRTHDATE>&quot; and &quot;<NAME> (<BIRTHDATE>-&quot;, when formulated as regular expressions, can be used to locate the correct answer.</Paragraph>
<Paragraph position="2"> In this paper we present an approach for automatically learning such regular expressions (along with determining their precision) from the web, for given types of questions. Our method uses the machine learning technique of bootstrapping to build a large tagged corpus starting with only a few examples of QA pairs. Similar techniques have been investigated extensively in the field of information extraction (Riloff, 96).
These techniques are greatly aided by the fact that there is no need to hand-tag a corpus, while the abundance of data on the web makes it easier to determine reliable statistical estimates.</Paragraph>
<Paragraph position="3"> Our system assumes each sentence to be a simple sequence of words and searches for repeated word orderings as evidence for useful answer phrases. We use suffix trees for extracting substrings of optimal length.</Paragraph>
<Paragraph position="4"> We borrow the idea of suffix trees from computational biology (Gusfield, 97), where they are primarily used for detecting DNA sequences. Suffix trees can be processed in time linear in the size of the corpus and, more importantly, they do not restrict the length of substrings. We then test the patterns learned by our system on new, unseen questions from the TREC-10 set and evaluate their results to determine the precision of the patterns.</Paragraph>
</Section>
<!-- Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 41-47. -->
</Paper>