File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1187_metho.xml
Size: 24,596 bytes
Last Modified: 2025-10-06 14:08:46
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1187"> <Title>Web-Based List Question Answering</Title> <Section position="3" start_page="1" end_page="2" type="metho"> <SectionTitle> 2 Design Considerations </SectionTitle> <Paragraph position="0"> Our goal is to find as many distinct exact answers on the Web as possible. This requires us to: * perform effective and exhaustive search; and * extract distinct answers.</Paragraph> <Paragraph position="1"> In order to perform effective search, we employ question transformation to get effectual web queries. However, this is not a trivial task. If the query is too general, too many documents may be retrieved and the system would not have sufficient resources to scan through all of them. If the query is too specific, no pages may be retrieved.</Paragraph> <Paragraph position="2"> Given millions of web pages returned by search engines, our strategy is to divide-and-conquer by first identify Collection Pages (CP) that contain a list of answer instances. For example, for the question &quot;What breeds of dog have won the &quot;Best in Show&quot; award at the Westminster Dog Show?&quot;, we can find a Collection Page as shown in Figure 1 (top). Such a web page is a very good resource of answers. In general, we observe that there is a large number of named entities of the type desired appearing in a Collection Page, typically in a list or table. Our intuition is that if we can find a Collection Page that contains almost all the answers, then the rest of the work is simply to extract answers from it or related web pages by wrapper rule induction.</Paragraph> <Paragraph position="3"> Another kind of &quot;good&quot; web page is a Topic Page, that contains just one answer instance (Figure 1, bottom). It typically contains many named entities, which correspond to our original query terms and some other named entities of the answer target type.</Paragraph> <Paragraph position="4"> Given the huge amount of web data, there will be many Topic Pages that refer to the same answer instance. There is hence a need to group the pages and to identify a pertinent and distinctive page in order to represent a distinct answer.</Paragraph> <Paragraph position="5"> Table 2: Web Page Classes The rest of the top returned web pages could be either relevant or irrelevant to the question. In summary, we need to classify web pages into four classes: Collection Page, Topic Page, Relevant Page, and Irrelevant Page (Table 2), based on their functionality and contribution in finding list answers. Based on the above considerations, we propose a general framework to find list answers on the Web using the following steps: a) Retrieve a good set of web documents.</Paragraph> <Paragraph position="6"> b) Identify Collection Pages and distinct Topic Pages as main resources of answers.</Paragraph> <Paragraph position="7"> c) Perform clustering on other web pages based on their similarities to distinct Topic Pages to form clusters that correspond to distinct answer instances. null d) Extract answers from Collection Pages and Topic Page clusters.</Paragraph> <Paragraph position="8"> 3 Question Transformation and Web Page</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> Retrieval </SectionTitle> <Paragraph position="0"> Agichtein et al. (2001) presented a technique on learning search engine specific query transformations for question answering. 
A set of transformation rules is learned from a training corpus and applied to the questions at search time. Related work can also be found in Kwok et al. (2001), where the user's question is processed by a parser to learn its syntactic structure, and various query modulation techniques are applied to the initial question to get high-quality results for later answer extraction.</Paragraph> <Paragraph position="1"> FADA performs question parsing to identify key question words and the expected answer type. It extracts several sets of words from the original question and identifies the detailed question classes. It then formulates a number of queries by combining the known facets with heuristic patterns for list questions.</Paragraph> <Paragraph position="2"> We perform both shallow and full parsing on a question, followed by Named Entity Recognition (NER), to get the known query facets and their types. The shallow parser we used is the free online memory-based chunker, and the full parser is MINIPAR.</Paragraph> <Paragraph position="3"> Both parsers are very efficient and usually parse 300 words within a second. The question parsing procedure is as follows: a) Remove head words. The head words in a question can be wh-question words and leading verbs. The list of head words includes &quot;who, what, when, where, which, how, how much, how many, list, name, give, provide, tell&quot;, etc. Removing them enables us to get the correct subject/object relation and verb in the question. For example, for the question &quot;What breeds of dog have won the 'Best in Show' award at the Westminster Dog Show?&quot;, after removing the head word, the question becomes &quot;breeds of dog have won the 'Best in Show' award at the Westminster Dog Show&quot;.</Paragraph> <Paragraph position="4"> b) Detect the subject and object of the remaining question segments by shallow parsing. For example, we parse the above question and obtain its chunked structure.</Paragraph> <Paragraph position="6"> From the parsed sentence, we want to get the logical subject as the sentence subject or its immediate modifiers. Here we have the logical subject &quot;breeds of dog&quot;, the verb &quot;won&quot;, and the logical object &quot;the 'Best in Show' award&quot;. If the resulting logical subject/object is the term &quot;that&quot;, as in the parsed query for &quot;U.S. entertainers that later became politicians&quot;, we take the noun or noun phrase before the clause as the logical subject/object. Hence, we have the logical subject &quot;entertainers&quot;, the action &quot;became&quot;, and the logical object &quot;politician&quot;.</Paragraph> <Paragraph position="7"> c) Extract all the noun phrases from the remaining question segments, which are usually prepositional phrases or clauses, as potential descriptions.</Paragraph> <Paragraph position="8"> For the &quot;dog breeds&quot; example, we get the description phrase &quot;the Westminster Dog Show&quot;.</Paragraph> <Paragraph position="10"> d) Apply named entity recognition to the resulting description phrases using NEParser, a fine-grained named entity recognizer used in our TREC-12 system (Yang et al., 2003). It assigns tags such as &quot;person&quot;, &quot;location&quot;, &quot;time&quot;, &quot;date&quot;, and &quot;number&quot;. For the &quot;dog breed&quot; example, &quot;Westminster&quot; gets the location tag.</Paragraph> <Paragraph position="11"> After the above analysis, we obtain all the known facets provided in the original question. We then make use of this knowledge to form web queries that retrieve the right set of pages.
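To make the parsing procedure above (steps a to c) concrete, the following is a minimal, illustrative sketch in Python. It is not the actual FADA implementation: the head-word list is taken from the text, but the regular-expression heuristics stand in for the memory-based chunker and MINIPAR, the NER step d) is omitted, and the function names are hypothetical.

# Illustrative sketch of FADA-style question parsing (steps a-c above).
# The heuristics below are naive stand-ins for the memory-based chunker
# and MINIPAR actually used by the system.
import re

HEAD_WORDS = ["how many", "how much", "who", "what", "when", "where",
              "which", "how", "list", "name", "give", "provide", "tell"]

def remove_head_words(question):
    """Step a): strip leading wh-words and imperative verbs."""
    q = question.strip().rstrip("?").lower()
    stripped = True
    while stripped:
        stripped = False
        for hw in HEAD_WORDS:
            if q.startswith(hw + " "):
                q = q[len(hw) + 1:]
                stripped = True
                break
    return q

def extract_facets(question):
    """Steps b) and c): a rough subject / action / object / description split.
    A real system would rely on shallow and full parsing instead."""
    q = remove_head_words(question)
    # Hypothetical verb heuristic; a chunker would supply this information.
    m = re.search(r"\b(?:have\s+|has\s+)?(won|became|become|visited|wrote)\b", q)
    if m is None:
        return {"subject": q, "action": "", "object": "", "descriptions": []}
    subject = q[:m.start()].strip()
    rest = q[m.end():].strip()
    # Step c): treat trailing prepositional phrases as description phrases.
    pieces = re.split(r"\s+(?:at|during|from)\s+", rest, maxsplit=1)
    return {"subject": subject, "action": m.group(1),
            "object": pieces[0].strip(),
            "descriptions": [p.strip() for p in pieces[1:] if p.strip()]}

if __name__ == "__main__":
    q = ("What breeds of dog have won the 'Best in Show' award "
         "at the Westminster Dog Show?")
    print(extract_facets(q))
    # {'subject': 'breeds of dog', 'action': 'won',
    #  'object': "the 'best in show' award",
    #  'descriptions': ['the westminster dog show']}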
Forming good web queries is a crucial task in dealing with the Web. One of the query transformation rules is given as follows: (list|directory|category|top|favorite)? (:|of)? <subj> <action>? <object>? <description1>? <description2>? ... <descriptionN>?</Paragraph> <Paragraph position="12"> The rule starts the query optionally with leading words (list, directory, category), optionally followed by a colon or &quot;of&quot;, followed by the subject phrase (<subj>), optionally followed by the action (<action>), the object (<object>), and the description phrases (<description1>...<descriptionN>). In the above pattern, &quot;?&quot; denotes an optional element, &quot;...&quot; denotes omitted elements, and &quot;|&quot; denotes alternatives. For example, for the &quot;dog breed&quot; question, we form the queries &quot;breed of dog won best in show Westminster Dog Show&quot;, &quot;directory breed of dog best in show Westminster Dog Show&quot;, &quot;list breed of dog won best in show&quot;, etc.</Paragraph> <Paragraph position="13"> Transforming the initial natural language question into good queries can dramatically improve the chances of finding good answers. FADA submits these queries to well-known search engines (Google, AltaVista, Yahoo) to get the top 1,000 Web pages per search engine per query. Here we attempt to retrieve a large number of web pages to serve our goal of finding All Distinct Answers. Usually, a large number of the web pages are redundant because they come from the same URL addresses. We remove the redundant web pages using their URL addresses as a guide. We also filter out files whose formats are neither HTML nor plain text and those that are too short or too long. Hence, the size of the resulting document set for each question varies from a few thousand to tens of thousands.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Web Page Classification </SectionTitle> <Paragraph position="0"> In order to group the web pages returned by the search engines into the four categories discussed earlier, it is crucial to find a good set of features to represent the web pages. Many techniques, such as tf.idf (Salton and Buckley, 1988) and stop word lists, have been proposed to extract lexical features to help document clustering. However, they do not work well for question answering. As pointed out by Ye et al. (2003) in their discussion of the person/organization finding task, given two resume pages about different persons, it is highly likely that they will be grouped into one cluster because they share many similar words and phrases. On the other hand, it is difficult to group together a news page and a resume page about the same target entity, due to the diversity in subject matter, word choice, literary style, and document format. To overcome this problem, they used mostly named entity and link information as the basis for clustering. Compared to their task, our task of finding good web documents containing answers is much more complex. The features are more heterogeneous, and it is more difficult to choose those that reflect the essential characteristics of list answers.</Paragraph> <Paragraph position="1"> In our approach, we obtain the query words through subject/object detection and named entity recognition. We found that a Collection Page contains a large number of named entities of the same type, typically within a list or table.
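To illustrate the query transformation rule introduced in the previous section, the following sketch shows how queries could be instantiated from the extracted facets. The facet dictionary layout, the leading-word choices, and the cap on the number of queries are assumptions made for this example rather than details of FADA itself.

# Illustrative instantiation of the query transformation rule
# (list|directory|category|top|favorite)? (:|of)? subj action? object? descriptions?
# The layout of the `facets` dictionary is an assumption of this sketch.
from itertools import product

LEADING_WORDS = ["", "list", "directory", "category", "top", "favorite"]

def formulate_queries(facets, max_queries=10):
    subj = facets.get("subject", "")
    action = facets.get("action", "")
    obj = facets.get("object", "")
    descs = " ".join(facets.get("descriptions", []))

    queries = []
    # Enumerate combinations of the optional pattern elements.
    for lead, use_action, use_obj, use_desc in product(
            LEADING_WORDS, (True, False), (True, False), (True, False)):
        parts = [lead,
                 subj,
                 action if use_action else "",
                 obj if use_obj else "",
                 descs if use_desc else ""]
        query = " ".join(p for p in parts if p).strip()
        if query and query not in queries:
            queries.append(query)
    return queries[:max_queries]

if __name__ == "__main__":
    facets = {"subject": "breed of dog", "action": "won",
              "object": "best in show award",
              "descriptions": ["Westminster Dog Show"]}
    for q in formulate_queries(facets):
        print(q)
    # e.g. "breed of dog won best in show award Westminster Dog Show",
    #      "list breed of dog won best in show award"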
Like a Collection Page, a Topic Page also typically contains a group of named entities, which correspond either to our original query terms or to the answer target type. Therefore, named entities play an important role in expressing semantics and should be used to reflect the content of the pages.</Paragraph> <Paragraph position="2"> The Web track in past TREC conferences shows that URL, HTML structure, anchor text, hyperlinks, and document length tend to contain important heuristic clues for web clustering and information retrieval (Craswell and Hawking, 2002). We have found that a Topic Page is highly likely to repeat the subject in its URL, its title, or at the beginning of the page. In general, if the subject appears in important locations, such as the HTML tags <title>, <H1> and <H2>, or appears frequently, then the corresponding page is likely to be a Topic Page whose topic is the answer target.</Paragraph> <Paragraph position="3"> Following the above discussion, we design a set of 29 features based on Known Named Entity Type, Answer Named Entity Type, ordinary Named Entities, list, table, URL, HTML structure, anchor text, hyperlinks, and document length to represent the web pages. Table 3 lists the features used in our system.</Paragraph> <Paragraph position="4"> In the table and subsequent sections, NE refers to Named Entity.</Paragraph> <Paragraph position="5"> We trained two classifiers: the Collection Page classifier and the Topic Page classifier. The former classifies web pages into Collection Pages and non-collection pages, while the latter further classifies the non-collection pages into Topic Pages and Others.</Paragraph> <Paragraph position="6"> Both classifiers are implemented using the C4.5 decision tree (Quinlan, 1993). We used 50 list questions from TREC-10 and TREC-11 for training and the TREC-12 list questions for testing. We parse the questions, formulate web queries, and collect web pages using the algorithm described in Section 2.</Paragraph> <Paragraph position="7"> Table 3 (excerpt): |Known_NE| is the number of NEs belonging to a known NE type in the question; in the &quot;dog breed&quot; example, it is the number of Location NEs, since &quot;Westminster&quot; is identified as a Location. The number of NEs belonging to other NE types is also counted; in the &quot;dog breed&quot; example, it is the total number of Time and Breed NEs. |Answer_NE| is the number of NEs belonging to the expected answer type; in the &quot;dog breed&quot; example, it is the number of Breed NEs. Web page classification enables us to get Collection Pages, Topic Pages, and the rest of the pages. Our experiments on the TREC-12 list questions showed that we can achieve a classification precision of 91.1% and 92% for Collection Pages and Topic Pages, respectively.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Finding Answer Sources </SectionTitle> <Paragraph position="0"> Based on web page classification, we form the initial sets of Collection Pages (CPSet), Topic Pages (TPSet), and other pages (OtherSet). In order to boost recall, we first use the outgoing links of Collection Pages to find more Topic Pages. These outgoing pages are potential Topic Pages but do not necessarily appear among the top returned web documents. Our subsequent tests reveal that the new Topic Pages introduced by links from Collection Pages increase the overall answer recall by 23%. The new Topic Page set becomes: TPSet' = TPSet + {outgoing pages of CPs}. Second, we select distinct Topic Pages.
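The following is a minimal sketch of this selection step and of the subsequent dispatch of Relevant Pages. It assumes a pairwise similarity function sim() of the kind defined in the next paragraph and uses the thresholds quoted in the text; the helper answer_ne_count() and the greedy ordering are assumptions of the sketch, not the exact procedure used in FADA.

# Illustrative selection of distinct Topic Pages and dispatch of Relevant
# Pages.  `sim(a, b)` is assumed to return a similarity in [0, 1] combining
# Known_NE / Answer_NE overlap, URL similarity, and link similarity;
# `answer_ne_count(p)` is a hypothetical helper returning the number of
# answer-type named entities in page p.

def select_distinct_topic_pages(tp_pages, sim, answer_ne_count, t_h=0.75):
    """Greedy de-duplication: pages are visited in decreasing order of
    answer-type NE count, so a near-duplicate is always demoted in favour
    of the richer page that was kept earlier."""
    kept, demoted = [], []
    for page in sorted(tp_pages, key=answer_ne_count, reverse=True):
        if any(sim(page, k) > t_h for k in kept):
            demoted.append(page)      # near-duplicate of a kept page
        else:
            kept.append(page)         # new distinct Topic Page (cluster seed)
    return kept, demoted

def dispatch_relevant_pages(other_pages, seeds, sim, t=0.55):
    """Assign each remaining page to its most similar cluster seed if the
    similarity exceeds t; pages below the threshold are left unassigned."""
    clusters = {id(seed): [seed] for seed in seeds}
    for page in other_pages:
        best = max(seeds, key=lambda s: sim(page, s), default=None)
        if best is not None and sim(page, best) > t:
            clusters[id(best)].append(page)
    return list(clusters.values())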
We compare the page similarity between each pair of Topic Pages.</Paragraph> <Paragraph position="2"> Here the page similarity function sim() is a linear combination of the overlaps in Known_NE and Answer_NE, URL similarity, and link similarity. The threshold T_h is preset at 0.75 and may be overridden by the user.</Paragraph> <Paragraph position="4"> |Answer_NE(tp_i)| denotes the number of named entities of the answer type in Topic Page tp_i.</Paragraph> <Paragraph position="5"> For those pairs with high similarity, we keep the page that contains more named entities of the answer type in TPSet' and move the other into OtherSet. The resulting Topic Pages in TPSet' are distinct and will be used as cluster seeds in the next step.</Paragraph> <Paragraph position="6"> Third, we identify and dispatch Relevant Pages from OtherSet into the appropriate clusters based on their similarities to the cluster seeds.</Paragraph> <Paragraph position="8"> Here the dispatch threshold t is preset at 0.55, and sim() is defined as above. Each cluster corresponds to a distinct answer instance. The Topic Page provides the main facts about that answer instance, while the Relevant Pages provide supporting material for it. The average ratio of correct clustering is 54.1% in our experiments.</Paragraph> <Paragraph position="9"> Through web page clustering, we avoid early answer redundancy and have a higher chance of finding distinct answers on the noisy Web.</Paragraph> </Section> <Section position="6" start_page="2" end_page="5" type="metho"> <SectionTitle> 6 Answer Extraction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="4" type="sub_section"> <SectionTitle> 6.1 HTML Source Page Cleaning </SectionTitle> <Paragraph position="0"> Many HTML web pages contain common HTML mistakes, including missing or unmatched tags, end tags in the wrong order, missing quotes around attributes, a missing / in end tags, and missing > closing tags. We use HtmlTidy to clean up the web pages before classification and clustering. FADA also uses an efficient technique to remove advertisements. We periodically update the advertiser blacklist from Accs-Net (http://www.accs-net.com/hosts/get_hosts.html), a site that specializes in creating such blacklists. If a link address matches an entry in the blacklist, the HTML portion that contains the link is removed.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 6.2 Answer Extraction from CP </SectionTitle> <Paragraph position="0"> Collection Pages are very good answer resources for list QA. However, to extract the &quot;exact&quot; answers from such a resource page, we need to perform wrapper rule induction to extract the useful content. There is a large body of related work on content extraction, which enables us to process only extracted content rather than cluttered data coming directly from the Web. Gupta et al. (2003) parse HTML documents into a Document Object Model tree and extract the main content of a web page by removing link lists and empty tables. In contrast, our link list extractor finds all link lists, that is, table cells or lists for which the ratio of the number of links to the number of non-linked words is greater than a specific threshold. We have written separate extractors for each answer target type.
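As a rough illustration of the link-list extractor just described, the sketch below flags table cells and lists whose ratio of links to non-linked words exceeds a threshold. The BeautifulSoup-based parsing and the threshold value are assumptions of this sketch; the specific ratio used in FADA is not stated in the text.

# Illustrative link-list detector: flag table cells and lists whose ratio of
# links to non-linked words exceeds a threshold.  The threshold value here is
# an arbitrary example, not the ratio used in FADA.
from bs4 import BeautifulSoup

def find_link_lists(html, ratio_threshold=2.0):
    soup = BeautifulSoup(html, "html.parser")
    link_lists = []
    for block in soup.find_all(["td", "ul", "ol"]):
        links = block.find_all("a")
        linked_words = sum(len(a.get_text().split()) for a in links)
        total_words = len(block.get_text().split())
        non_linked_words = max(total_words - linked_words, 1)  # avoid divide-by-zero
        if len(links) / non_linked_words > ratio_threshold:
            link_lists.append(block)
    return link_lists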
The answers obtained in Collection Pages are then &quot;projected&quot; onto the TREC AQUAINT corpus to get the TREC answers (Brill et al., 2001).</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 6.3 Answer Extraction from TP Cluster </SectionTitle> <Paragraph position="0"> Having the web pages clustered for a question, especially when the clusters nicely match distinct answers, facilitates the task of extracting the possible answers based on the answer target type. We do this by first analyzing the main Topic Page in each cluster. If we find multiple passages containing different answer candidates in the same Topic Page, we select the answer candidate from the passage that has the greatest variety of NE types, since such a passage is likely to be a comprehensive description of the different facets of the question topic. The answer found in the Topic Page is then &quot;projected&quot; onto the QA corpus to get the TREC answers, as with the Collection Pages. If no TREC answer can be found based on the Topic Page, we go to the next most relevant page in the same cluster to search for the answer. The process is repeated until either an answer from the cluster is found in the TREC corpus or all Relevant Pages in the cluster have been exhausted.</Paragraph> <Paragraph position="1"> For the question &quot;Which countries did the first lady Hillary Clinton visit?&quot;, we extract the Locations after performing Named Entity analysis on each cluster and obtain 38 country names as answers.</Paragraph> <Paragraph position="2"> The recall is much higher than that of the best performing system in TREC-12 (Harabagiu et al., 2003), which found 26 out of 44 answers.</Paragraph> <Paragraph position="3"> 7 Evaluation on TREC-12 Question Set We used the 37 TREC-12 list questions to test the overall performance of our system and compared the answers we found in the TREC AQUAINT corpus (after answer projection (Brill et al., 2001)) with the answers provided by NIST.</Paragraph> </Section> <Section position="4" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 7.1 Tests of Web Page Classification </SectionTitle> <Paragraph position="0"> As described in Section 4, the web pages are classified into three classes: Collection Pages, Topic Pages, and Others.</Paragraph> <Paragraph position="1"> Table 4 shows the performance of this classification. We then redistribute the classified pages: the outgoing pages from CPs go into the TP collection, and the Relevant Pages are grouped as supporting material into clusters, each based on a distinct Topic Page. The performance of web page classification influences the later clustering and answer finding tasks. Table 4 shows that we achieve an overall classification average precision of 0.897 and average recall of 0.851. This performance is adequate to support the subsequent steps of finding complete answers.</Paragraph> </Section> <Section position="5" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 7.2 Performance and Effects of Web Page Clustering </SectionTitle> <Paragraph position="0"> Relevant Pages are put into clusters to provide supporting material for a particular answer instance. The performance of Relevant Page dispatch/clustering is 54.1%.
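For reference, the list QA scores reported in this and the following subsections are instance-level F measures. A minimal sketch of how such a measure can be computed per question is given below; the macro-averaging over questions is an assumption of the sketch rather than a detail taken from the paper.

# Sketch of an instance-level F measure for a list question: precision and
# recall are computed over distinct answer instances, then combined as a
# harmonic mean.  Averaging across questions is an assumption of this sketch.

def list_f_measure(returned, known):
    """`returned` and `known` are sets of distinct answer strings."""
    correct = len(returned.intersection(known))
    if correct == 0 or not returned or not known:
        return 0.0
    precision = correct / len(returned)
    recall = correct / len(known)
    return 2 * precision * recall / (precision + recall)

def average_f(per_question):
    """`per_question` is a list of (returned_set, known_set) pairs."""
    scores = [list_f_measure(r, k) for r, k in per_question]
    return sum(scores) / len(scores) if scores else 0.0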
We also test different clustering thresholds for the web page clustering defined in Section 5.</Paragraph> <Paragraph position="1"> We use the F1 measure of the TREC-12 list QA results as the basis for comparing different clustering threshold combinations and select the combination that gives the best performance.</Paragraph> </Section> <Section position="6" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 7.3 Overall Performance </SectionTitle> <Paragraph position="0"> Table 6 compares a baseline list question answering system with FADA. The baseline is the system we used in the TREC-12 QA task (Yang et al., 2003). It extends the traditional IR/NLP approach for factoid QA to perform list QA, as is done in most other TREC-12 systems.</Paragraph> <Paragraph position="1"> It achieves an average F of 0.319 and is ranked 2nd in the list QA task.</Paragraph> <Paragraph position="2"> We test two variants of FADA: one without the outgoing pages from CPs as potential TPs (FADA1), and one with them (FADA2). The two variants are used to evaluate the effect of CPs on the list QA task. The results of these two variants of FADA on the TREC-12 list task are presented in Table 6. Even without the benefit of the outgoing pages from CPs for finding potential answers, FADA1 boosts the average recall by 30% and the average F by 16.6% compared to the baseline. The large improvement in recall is encouraging because it is crucial for a list QA system to find a complete set of answers, which is how list QA differs from factoid QA. By taking advantage of the outgoing pages from CPs, FADA2 further improves performance to an average recall of 0.422 and an average F of 0.464. It outperforms the best TREC-12 QA system (Voorhees, 2003) by 19.6% in average F score.</Paragraph> <Paragraph position="3"> From Table 6, we find that the outgoing pages from the Collection Pages (or resource pages) contribute much to the answer finding task. They give rise to an improvement in recall of 22.7% over FADA1, which does not take advantage of the outgoing pages. We think this is mainly due to the characteristics of the TREC-12 questions. Most questions ask about well-known things: famous events, people, and organizations. For this kind of question, we can easily find a Collection Page that contains tabulated answers, since there are web sites that host and maintain such information. For instance, the &quot;Westminster Dog Show&quot; has an official web site. However, for those questions that lack Collection Pages, such as &quot;Which countries did the first lady Hillary Clinton visit?&quot;, we still need to rely more on Topic Pages and Relevant Pages.</Paragraph> <Paragraph position="4"> With the emphasis on answer completeness and uniqueness, FADA uses a large set of documents obtained from the Web to find answers. Compared to the baseline system, this results in a drop in average answer precision, although both recall and F are significantly improved. This is because we seek most answers directly from the noisy Web, whereas in the baseline system the Web is merely used to form new queries and the answers are found in the TREC AQUAINT corpus. We are still working to find a good balance between precision and recall.</Paragraph> <Paragraph position="5"> The idea behind the FADA system is simple: since Web knowledge helps in answering factoid questions, why not list questions? Our approach in FADA demonstrates that this is possible.
We believe that list QA should benefit even more than factoid QA from using Web knowledge.</Paragraph> </Section> </Section> </Paper>