<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1021">
<Title>Ranking suspected answers to natural language questions using predictive annotation</Title>
<Section position="4" start_page="0" end_page="150" type="intro">
<SectionTitle>2 System description</SectionTitle>
<Paragraph position="0"> Our system (Figure 1) consists of two pieces: an IR component (GuruQA) that returns matching texts, and an answer selection component (AnSel/Werlect) that extracts and ranks potential answers from these texts.</Paragraph>
<Paragraph position="1"> This paper focuses on the process of ranking potential answers selected by the IR engine, which is itself described in (Prager et al., 1999).</Paragraph>
<Section position="1" start_page="150" end_page="150" type="sub_section">
<SectionTitle>2.1 The Information Retrieval component</SectionTitle>
<Paragraph position="0"> In the context of fact-seeking questions, we made the following observations:
* In documents that contain the answers, the query terms tend to occur in close proximity to each other.</Paragraph>
<Paragraph position="1"> * The answers to fact-seeking questions are usually phrases: &quot;President Clinton&quot;, &quot;in the Rocky Mountains&quot;, and &quot;today&quot;.
* These phrases can be categorized by a set of a dozen or so labels (Figure 2) corresponding to question types.</Paragraph>
<Paragraph position="2"> * The phrases can be identified in text by pattern matching techniques (without full NLP).</Paragraph>
<Paragraph position="3"> As a result, we defined a set of about 20 categories, each labeled with its own QA-Token, and built an IR system which deviates from the traditional model in three important aspects.
* We process the query against a set of approximately 200 question templates, which may replace some of the query words with a set of QA-Tokens, called a SYNclass. Thus &quot;Where&quot; gets mapped to &quot;PLACE$&quot;, but &quot;How long&quot; goes to &quot;@SYN(LENGTH$, DURATION$)&quot;. Some templates do not cause complete replacement of the matched string. For example, the pattern &quot;What is the population&quot; gets replaced by &quot;NUMBER$ population&quot;.</Paragraph>
<Paragraph position="4"> * Before indexing the text, we process it with Textract (Byrd and Ravin, 1998; Wacholder et al., 1997), which performs lemmatization and discovers proper names and technical terms. We added a new module (Resporator) which annotates text segments with QA-Tokens using pattern matching. Thus the text &quot;for 5 centuries&quot; matches the DURATION$ pattern &quot;for :CARDINAL _timeperiod&quot;, where :CARDINAL is the label for cardinal numbers, and _timeperiod marks a time expression.</Paragraph>
<Paragraph position="5"> * GuruQA scores text passages instead of documents. We use a simple document- and collection-independent weighting scheme: QA-Tokens get a weight of 400, proper nouns get 200, and any other word gets 100 (stop words are removed in query processing after the pattern template matching operation). The density of matching query tokens within a passage contributes a score of 1 to 99 (the highest scores occur when all matched terms are consecutive).</Paragraph>
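The passage-scoring scheme above is concrete enough to sketch in code. The Python fragment below is an illustration only: the weights (400, 200, 100) and the 1 to 99 density range come from the text, while the token representation, function name, and the exact density formula (matched terms divided by the span they cover) are assumptions, since the paper does not specify them in this section.

    # Hypothetical sketch of the passage-scoring scheme described above.
    # Weights follow the text; the density formula is an assumption and
    # only guaranteed to reach 99 when all matched terms are consecutive.
    WEIGHTS = {"qa_token": 400, "proper_noun": 200, "word": 100}

    def score_passage(passage_tokens, query_terms):
        """passage_tokens: list of (token, kind) pairs, kind in WEIGHTS;
        query_terms: set of query tokens surviving stop-word removal."""
        positions = [i for i, (tok, _) in enumerate(passage_tokens)
                     if tok in query_terms]
        if not positions:
            return 0

        # Base score: sum of weights of the distinct matched query terms.
        matched = {}
        for i in positions:
            tok, kind = passage_tokens[i]
            matched[tok] = WEIGHTS[kind]
        base = sum(matched.values())

        # Density bonus in 1..99: assumed to compare the number of matches
        # with the span they occupy.
        span = positions[-1] - positions[0] + 1
        density = max(1, round(99 * len(positions) / span))
        return base + density

    # Example: both query terms occur next to each other in the passage.
    passage = [("PLACE$", "qa_token"), ("Everest", "proper_noun"),
               ("is", "word"), ("tall", "word")]
    print(score_passage(passage, {"PLACE$", "Everest"}))  # 400 + 200 + 99 = 699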
<Paragraph position="6"> Predictive Annotation works better for Where, When, What, Which and How+adjective questions than for How+verb and Why questions, since the latter are typically not answered by phrases. However, we observed that &quot;by&quot; + the present participle would usually indicate the description of a procedure, so we instantiate a METHOD$ QA-Token for such occurrences. We have no such QA-Token for Why questions, but we do replace the word &quot;why&quot; with &quot;@SYN(result, cause, because)&quot;, since the occurrence of any of these words usually betokens an explanation.</Paragraph>
</Section>
</Section>
</Paper>
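The question-template rewriting of Section 2.1 and the METHOD$ and "why" substitutions just described can likewise be sketched in Python. Only the individual mappings quoted in the text (Where to PLACE$, How long to @SYN(LENGTH$, DURATION$), What is the population to NUMBER$ population, why to @SYN(result, cause, because)) are taken from the paper; the regex-based template format, the function names, and the treatment of any "-ing" word after "by" as a present participle are simplifying assumptions, not the system's actual implementation.

    import re

    # A few of the roughly 200 question templates mentioned above, written
    # as regex rewrites for illustration.
    QUERY_TEMPLATES = [
        (re.compile(r"\bwhat is the population\b", re.I), "NUMBER$ population"),
        (re.compile(r"\bhow long\b", re.I), "@SYN(LENGTH$, DURATION$)"),
        (re.compile(r"\bwhere\b", re.I), "PLACE$"),
        (re.compile(r"\bwhy\b", re.I), "@SYN(result, cause, because)"),
    ]

    def rewrite_query(question):
        """Replace question words/phrases with QA-Tokens or SYN classes."""
        for pattern, replacement in QUERY_TEMPLATES:
            question = pattern.sub(replacement, question)
        return question

    # Document-side annotation in the spirit of Resporator: tag a "by" +
    # present participle phrase with METHOD$ (crudely, any "-ing" word
    # following "by" counts as a present participle here).
    METHOD_PATTERN = re.compile(r"\bby\s+\w+ing\b", re.I)

    def annotate_methods(text):
        return METHOD_PATTERN.sub(lambda m: m.group(0) + " METHOD$", text)

    print(rewrite_query("Where is the Taj Mahal"))    # PLACE$ is the Taj Mahal
    print(annotate_methods("by boiling the mixture")) # by boiling METHOD$ the mixture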