<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0803">
  <Title>Extracting Key Phrases to Disambiguate Personal Name Queries in Web Search</Title>
  <Section position="4" start_page="0" end_page="17" type="metho">
    <SectionTitle>
2 Applications
</SectionTitle>
    <Paragraph position="0"> Two tasks that can readily benefit from automatically extracted key phrases to disambiguate personal names are query suggestion and social network extraction. In query suggestion (Gauch and Smith, 1991), the search engine returns a set of phrases to the user alongside with the search results. The user can then modify the original query using these phrases to narrow down the search.</Paragraph>
    <Paragraph position="1"> Query suggestion helps the users to easily navigate through the result set. For personal name queries, the key phrases extracted by our algorithm can be used as suggestions to reduce the ambiguity and narrow down the search on a particular namesake.</Paragraph>
    <Paragraph position="2"> Social networking services (SNSs) have been given much attention on the Web recently. As a kind of online applications, SNSs can be used  to register and share personal information among friends and communities. There have been recent attempts to extract social networks using the information available on the Web 2(Mika, 2004; Matsuo et al., 2006). In both Matsuo's (2006) and Mika's (2004) algorithms, each person is represented by a node in the social network and the strength of the relationship between two people is represented by the length of the edge between the corresponding two nodes. As a measure of the strength of the relationship between two people A and B, these algorithms use the number of hits obtained for the query A AND B. However, this approach fails when A or B has namesakes because the number of hits in these cases includes the hits for the namesakes. To overcome this problem, we could include phrases in the query that uniquely identify A and B from their namesakes.</Paragraph>
  </Section>
  <Section position="5" start_page="17" end_page="18" type="metho">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> Person name disambiguation can be seen as a special case of word sense disambiguation (WSD) (Schutze, 1998; McCarthy et al., 2004) problem which has been studied extensively in Natural Language Understanding. However, there are several fundamental differences between WSD and person name disambiguation. WSD typically concentrates on disambiguating between 2-4 possible meanings of the word, all of which are a priori known. However, in person name disambiguation in Web, the number of different namesakes can be much larger and unknown. From a resource point of view, WSD utilizes sense tagged dictionaries such as WordNet, whereas no dictionary can provide information regarding different namesakes for a particular name.</Paragraph>
    <Paragraph position="1"> The problem of person name disambiguation has been addressed in the domain of research paper citations (Han et al., 2005), with various supervised methods proposed for its solution. However, citations have a fixed format compared to free text on the Web. Fields such as co-authors, title, journal name, conference name, year of publication can be easily extracted from a citation and provide vital information to the disambiguation process.</Paragraph>
    <Paragraph position="2"> Research on multi-document person name resolution (Bagga and Baldwin, 1998; Mann and Yarowsky, 2003; Fleischman and Hovy, 2004) focuses on the related problem of determining if 2http://flink.sematicweb.org/. The system won the 1st place at the Semantic Web Challenge in ISWC2004.</Paragraph>
    <Paragraph position="3"> two instances with the same name and from different documents refer to the same individual.</Paragraph>
    <Paragraph position="4"> Bagga and Baldwin (1998) first perform within-document coreference resolution to form coreference chains for each entity in each document.</Paragraph>
    <Paragraph position="5"> They then use the text surrounding each reference chain to create summaries about each entity in each document. These summaries are then converted to a bag of words feature vector and are clustered using standard vector space model often employed in IR. The use of simplistic bag of words clustering is an inherently limiting aspect of their methodology. On the other hand, Mann and Yarowsky (2003) proposes a richer document representation involving automatically extracted features. However, their clustering technique can be basically used only for separating two people with the same name. Fleischman and Hovy (2004) constructs a maximum entropy classifier to learn distances between documents that are then clustered.</Paragraph>
    <Paragraph position="6"> Their method requires a large training set.</Paragraph>
    <Paragraph position="7"> Pedersen et al. (2005) propose an unsupervised approach to resolve name ambiguity by representing the context of an ambiguous name using second order context vectors derived using singular value decomposition (SVD) on a co-occurrence matrix. They agglomeratively cluster the vectors using cosine similarity. They evaluate their method only on a conflated dataset of pseudonames, which begs the question of how well such a technique would fair on a more real-world challenge. Li et al. (2005) propose two approaches to disambiguate entities in a set of documents: a supervisedly trained pairwise classifier and an unsupervised generative model. However, they do not evaluate the effectiveness of their method in Web search.</Paragraph>
    <Paragraph position="8"> Bekkerman and McCallum (2005) present two unsupervised methods for finding web pages referring to a particular person: one based on link structure and another using Agglomerative/Conglomerative Double Clustering (A/CDC). Their scenario focuses on simultaneously disambiguating an existing social network of people, who are closely related. Therefore, their method cannot be applied to disambiguate an individual whose social network (for example, friends, colleagues) is not known. Guha and Grag (2004) present a re-ranking algorithm to disambiguate people. The algorithm requires a user to select one of the returned pages as a starting point. Then,  through comparing the person descriptions, the algorithm re-ranks the entire search results in such a way that pages referring to the same person described in the user-selected page are ranked higher. A user needs to browse the documents in order to find which matches the user's intended referent, which puts an extra burden on the user.</Paragraph>
    <Paragraph position="9"> None of the above mentioned works attempt to extract key phrases to disambiguate person name queries, a contrasting feature in our work.</Paragraph>
  </Section>
  <Section position="6" start_page="18" end_page="18" type="metho">
    <SectionTitle>
4 Data Set
</SectionTitle>
    <Paragraph position="0"> We select three ambiguous names (Micheal Jackson, William Cohen and Jim Clark) that appear in previous work in name resolution. For each name we query Google with the name and download top 100 pages. We manually classify each page according to the namesakes discussed in the page.</Paragraph>
    <Paragraph position="1"> We ignore pages which we could not decide the namesake from the content. We also remove pages with images that do not contain any text. No pages were found where more than one namesakes of a name appear. For automated pseudo-name evaluation purposes, we select four names (Bill Clinton, Bill Gates, Tom Cruise and Tiger Woods) for conflation, who we presumed had one vastly predominant sense. We download 100 pages from Google for each person. We replace the name of the per-son by &amp;quot;person-X&amp;quot; in the collection, thereby introducing ambiguity. The structure of our dataset is shown in Table 1.</Paragraph>
  </Section>
  <Section position="7" start_page="18" end_page="20" type="metho">
    <SectionTitle>
5 Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
5.1 Problem Statement
</SectionTitle>
      <Paragraph position="0"> Given a collection of documents relevant to an ambiguous name, we assume that each document in the collection contains exactly one namesake of the ambiguous name. This is a fair assumption considering the fact that although namesakes share a common name, they specializes in different fields and have different Web appearances. Moreover, the one-to-one association between documents and people formed by this assumption, let us model the person name disambiguation problem as a one of hard-clustering of documents.</Paragraph>
      <Paragraph position="1"> The outline of our method is as following; Given a set of documents representing a group of people with the same name, we represent each document in the collection using a Term-Entity model (section 5.2). We define a contextual similarity metric (section 5.4) and then cluster (section 5.5) the term-entity models using the contextual similarity between them. Each cluster is considered to be representing a different namesake.</Paragraph>
      <Paragraph position="2"> Finally, key phrases that uniquely identify each namesake are selected from the clusters. We perform experiments at each step of our method to evaluate its performance.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
5.2 Term-Entity Model
</SectionTitle>
      <Paragraph position="0"> The first step toward disambiguating a personal name is to identify the discriminating features of one person from another. In this paper we propose Term-Entity models to represent a person in a document. null Definition. A term-entity model T(A), representing a person A in a document D, is a boolean expression of n literals a1,a2,...,an. Here, a boolean literal ai is a multi-word term or a named entity extracted from the document D.</Paragraph>
      <Paragraph position="1"> For simplicity, we only consider boolean expressions that combine the literals through AND operator.</Paragraph>
      <Paragraph position="2"> The reasons for using terms as well as named entities in our model are two fold. Firstly, there are multi-word phrases such as secretary of state, racing car driver which enable us to describe a person uniquely but not recognized by named entity taggers. Secondly, automatic term extraction (Frantzi and Ananiadou, 1999) can be done using statistical methods and does not require extensive linguistic resources such as named entity dictionaries, which may not be available for some domains.</Paragraph>
    </Section>
    <Section position="3" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
5.3 Creating Term-Entity Models
</SectionTitle>
      <Paragraph position="0"> We extract terms and named entities from each document to build the term-entity model for that document. For automatic multi-word term extraction, we use the C-value metric proposed by Frantzi et al. (1999). Firstly, the text from which we need to extract terms is tagged using a part of speech tagger. Then a linguistic filter and a stop words list constrain the word sequences that  States&amp;quot; are allowed as genuine multi-word terms. The linguistic filter contains a predefined set of patterns of nouns, adjectives and prepositions that are likely to be terms. The sequences of words that remain after this initial filtering process (candidate terms) are evaluated for their termhood (likeliness of a candidate to be a term) using C-value. C-value is built using statistical characteristics of the candidate string, such as, total frequency of occurrence of the candidate string in the document, the frequency of the candidate string as part of other longer candidate strings, the number of these longer candidate terms and the length of candidate string (in number of words). We select the candidates with higher C-values as terms (see (Frantzi and Ananiadou, 1999) for more details on C-value based term extraction).</Paragraph>
      <Paragraph position="1"> To extract entities for the term-entity model, the documents were annotated by a named entity tagger 3. We select personal names, organization names and location names to be included in the term-entity model.</Paragraph>
    </Section>
    <Section position="4" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
5.4 Contextual Similarity
</SectionTitle>
      <Paragraph position="0"> We need to calculate the similarity between term-entity models derived from different documents, in order to decide whether they belong to the same namesake or not. WordNet 4 based similarity metrics have been widely used to compute the semantic similarity between words in sense dis- null States&amp;quot; ambiguation tasks (Banerjee and Pedersen, 2002; McCarthy et al., 2004). However, most of the terms and entities in our term-entity models are proper names or multi-word expressions which are not listed in WordNet.</Paragraph>
      <Paragraph position="1"> Sahami et al. (2005) proposed the use of snippets returned by a Web search engine to calculate the semantic similarity between words. A snippet is a brief text extracted from a document around the query term. Many search engines provide snippets alongside with the link to the original document. Since snippets capture the immediate surrounding of the query term in the document, we can consider a snippet as the context of a query term. Using snippets is also efficient because we do not need to download the source documents.</Paragraph>
      <Paragraph position="2"> To calculate the contextual similarity between two terms (or entities), we first collect snippets for each term (or entity) and pool the snippets into a combined &amp;quot;bag of words&amp;quot;. Each collection of snippets is represented by a word vector, weighted by the normalized frequency (i.e., frequency of a word in the collection is divided by the total number of words in the collection). Then, the contextual similarity between two phrases is defined as the inner product of their snippet-word vectors.</Paragraph>
      <Paragraph position="3"> Figures 1 and 2 show the distribution of most frequent words in snippets for the queries &amp;quot;George Bush&amp;quot;, &amp;quot;Tiger Woods&amp;quot; and &amp;quot;President of the United States&amp;quot;. In Figure 1 we observe the words &amp;quot;george&amp;quot; and &amp;quot;bush&amp;quot; appear in snippets for the query &amp;quot;President of the United States&amp;quot;, whereas in Figure 2 none of the high frequent words appears in snippets for both queries. Contextual  similarity calculated as the inner product between word vectors is 0.2014 for &amp;quot;George Bush&amp;quot; and &amp;quot;President of the United States&amp;quot;, whereas the same is 0.0691 for &amp;quot;Tiger Woods&amp;quot; and &amp;quot;President of the United States&amp;quot;. We define the similarity sim(T(A),T(B)), between two term-entity</Paragraph>
      <Paragraph position="5"> Here, |ai |represents the vector that contains the frequency of words that appear in the snippets for term/entity ai. Contextual similarity between terms/entities ai and bj, is defined as the inner product |ai|*|bj|. Without a loss of generality we assume n [?] m in formula 1.</Paragraph>
    </Section>
    <Section position="5" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
5.5 Clustering
</SectionTitle>
      <Paragraph position="0"> We use Group-average agglomerative clustering (GAAC) (Cutting et al., 1992), a hybrid of single-link and complete-link clustering, to group the documents that belong to a particular namesake.</Paragraph>
      <Paragraph position="1"> Initially, we assign a separate cluster for each of the documents in the collection. Then, GAAC in each iteration executes the merger that gives rise to the cluster G with the largest average correla-</Paragraph>
      <Paragraph position="3"> Here, |G |denotes the number of documents in the merged cluster G; u and v are two documents in G and sim(T(u),T(v)) is given by equation 1.</Paragraph>
      <Paragraph position="4"> Determining the total number of clusters is an important issue that directly affects the accuracy of disambiguation. We will discuss an automatic method to determine the number of clusters in section 6.3.</Paragraph>
    </Section>
    <Section position="6" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
5.6 Key Phrase Selection
</SectionTitle>
      <Paragraph position="0"> GAAC process yields a set of clusters representing each of the different namesakes of the ambiguous name. To select key phrases that uniquely identify each namesake, we first pool all the terms and entities in all term-entity models in each cluster.</Paragraph>
      <Paragraph position="1"> For each cluster we select the most discriminative terms/entities as the key phrases that uniquely identify the namesake represented by that cluster from the other namesakes. We achieve this in two steps. In the first step, we reduce the number of terms/entities in each cluster by removing terms/entities that also appear in other clusters.</Paragraph>
      <Paragraph position="2"> In the second step, we select the terms/entities in each cluster according to their relevance to the ambiguous name. We compute the contextual similarity between the ambiguous name and each term/entity and select the top ranking terms/entities from each cluster.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>