<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0803"> <Title>Extracting Key Phrases to Disambiguate Personal Name Queries in Web Search</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The Internet has grown into a collection of billions of web pages, and web search engines are an important interface to this vast body of information. We send simple text queries to search engines and retrieve web pages. However, because queries are often ambiguous, a search engine may return many irrelevant pages. In the case of personal name queries, we may receive web pages about other people with the same name (namesakes). For example, if we search Google for Jim Clark, even among the top 100 results we find at least eight different Jim Clarks. The two most popular namesakes, Jim Clark the Formula One world champion and Jim Clark the founder of Netscape (26 pages), cover the majority of the pages. What if we are interested only in the Formula One world champion and want to filter out the pages about the other Jim Clarks? One solution is to modify our query by adding a phrase such as Formula One or racing driver to the name Jim Clark.</Paragraph> <Paragraph position="1"> This paper presents an automatic method to extract such phrases from the Web. We follow a three-stage approach. In the first stage, we represent each document containing the ambiguous name by a term-entity model, as described in Section 5.2. We define a contextual similarity metric, based on snippets returned by a search engine, to calculate the similarity between term-entity models. In the second stage, we cluster the documents using this similarity metric. In the final stage, we select key phrases from the clusters that uniquely identify each namesake.</Paragraph> </Section> </Paper>