<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1002">
  <Title>Using Encyclopedic Knowledge for Named Entity Disambiguation</Title>
  <Section position="3" start_page="9" end_page="10" type="intro">
    <SectionTitle>
2 Wikipedia - A Wiki Encyclopedia
</SectionTitle>
    <Paragraph position="0"> Wikipedia is a free online encyclopedia written collaboratively by volunteers, using wiki software that allows almost anyone to add and change articles. It is a multilingual resource: there are about 200 language editions with varying levels of coverage. Wikipedia is also a dynamic and quickly growing resource, and articles about newsworthy events are often added within days of their occurrence. As an example, the September 2005 version contains 751,666 articles, around 180,000 more articles than four months earlier. The work in this paper is based on the English version from May 2005, which contains 577,860 articles.</Paragraph>
    <Paragraph position="1"> Each article in Wikipedia is uniquely identified by its title - a sequence of words separated by underscores, with the first word always capitalized. Typically, the title is the most common name for the entity described in the article. When the name is ambiguous, it is further qualified with a parenthetical expression. For instance, the article on John Williams the composer has the title John Williams (composer).</Paragraph>
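The title convention above can be illustrated with a minimal sketch. The helper name name_to_title and its qualifier parameter are hypothetical and not part of the paper or of any Wikipedia tool; the sketch only assumes the rules stated in the text (underscores between words, first word capitalized, optional parenthetical qualifier).

```python
def name_to_title(name: str, qualifier: str = "") -> str:
    """Build a Wikipedia-style title from a name (hypothetical helper)."""
    words = name.split()
    if not words:
        raise ValueError("name must contain at least one word")
    # Capitalize only the first character of the first word; keep the rest as given.
    words[0] = words[0][0].upper() + words[0][1:]
    title = "_".join(words)
    if qualifier:
        # Ambiguous names are further qualified with a parenthetical expression.
        title += "_(" + qualifier + ")"
    return title


print(name_to_title("John Williams", "composer"))
```

Under these assumptions, the call shown builds the title for the composer example, with the qualifier appended in parentheses.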
    <Paragraph position="2"> Because each article describes a specific entity or concept, the remainder of the paper sometimes uses the term 'entity' interchangeably to refer to both the article and the corresponding entity. Also, let E denote the entire set of entities from Wikipedia. For any entity e ∈ E, e.title is the title name of the corresponding article, and e.T is the text of the article.</Paragraph>
    <Paragraph position="3"> In general, there is a many-to-many correspondence between names and entities. This relation is captured in Wikipedia through redirect and disambiguation pages, as described in the next two sections.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
2.1 Redirect Pages
</SectionTitle>
      <Paragraph position="0"> A redirect page exists for each alternative name that can be used to refer to an entity in Wikipedia.</Paragraph>
      <Paragraph position="1"> The name is transformed (using underscores for spaces) into a title whose article contains a redirect link to the actual article for that entity. For example, John Towner Williams is the full name of the composer John Williams. It is therefore an alternative name for the composer, and consequently the article with the title John Towner Williams is just a pointer to the article for John Williams (composer). An example entry with a considerably higher number of redirect pages is United States. Its redirect pages correspond to acronyms (U.S.A., U.S., USA, US), Spanish translations (Los Estados Unidos, Estados Unidos), misspellings (Untied States) or synonyms (Yankee land).</Paragraph>
      <Paragraph position="2"> For any given Wikipedia entity e ∈ E, let e.R be the set of all names that redirect to e.</Paragraph>
    </Section>
    <Section position="2" start_page="9" end_page="10" type="sub_section">
      <SectionTitle>
2.2 Disambiguation Pages
</SectionTitle>
      <Paragraph position="0"> Another useful structure is that of disambiguation pages, which are created for ambiguous names, i.e. names that denote two or more entities in Wikipedia. For example, the disambiguation page for the name John Williams lists 22 associated entities. Therefore, besides the non-ambiguous names that come from redirect pages, additional aliases can be found by looking for all disambiguation pages that list a particular Wikipedia entity. In his philosophical article "On Sense and Reference" (Frege, 1999), Gottlob Frege gave a famous argument to show that sense and reference are distinct. In his example, the planet Venus may be referred to using the phrases "morning star" and "evening star". This theoretical example is nicely captured in practice in Wikipedia by two disambiguation pages, Morning Star and Evening Star, both listing Venus as a potential referent.</Paragraph>
      <Paragraph position="1"> For any given Wikipedia entity e ∈ E, let e.D be the set of names whose disambiguation pages contain a link to e.</Paragraph>
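As a rough illustration of how the two name sets could be assembled, the sketch below inverts a redirect table and a disambiguation table into per-entity sets of names. The function build_name_sets and the toy dictionaries are assumptions made for this example; only a few of the names mentioned in the text are included, and the second John Williams entry is purely illustrative.

```python
from collections import defaultdict

# Toy data: alternative name mapped to the title it redirects to.
redirects = {
    "John Towner Williams": "John Williams (composer)",
    "USA": "United States",
    "U.S.": "United States",
}

# Toy data: ambiguous name mapped to the entities listed on its disambiguation page.
disambiguation_pages = {
    "John Williams": ["John Williams (composer)", "John Williams (guitarist)"],
    "Morning Star": ["Venus"],
    "Evening Star": ["Venus"],
}


def build_name_sets(redirects, disambiguation_pages):
    """Return two dicts keyed by entity: redirect names and disambiguation names."""
    redirect_names = defaultdict(set)  # plays the role of the redirect name set
    disambig_names = defaultdict(set)  # plays the role of the disambiguation name set
    for name, target in redirects.items():
        redirect_names[target].add(name)
    for name, entities in disambiguation_pages.items():
        for entity in entities:
            disambig_names[entity].add(name)
    return redirect_names, disambig_names


redirect_names, disambig_names = build_name_sets(redirects, disambiguation_pages)
print(sorted(redirect_names["United States"]))
print(sorted(disambig_names["Venus"]))
```

Taken together, the two dictionaries give one possible in-memory view of the many-to-many correspondence between names and entities.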
    </Section>
    <Section position="3" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
2.3 Categories
</SectionTitle>
      <Paragraph position="0"> Every article in Wikipedia is required to have at least one category. As shown in Table 1, John Williams (composer) is associated with a set of categories, among them Star Wars music, Film score composers, and 20th century classical composers. Categories allow articles to be placed into one or more topics. These topics can be further categorized by associating them with one or more parent categories. In Table 1, Venus is shown as both an article title and a category. As a category, it has one direct parent, Planets of the Solar System, which in turn belongs to two more general categories, Planets and Solar System. Thus, categories form a directed acyclic graph, allowing multiple categorization schemes to co-exist simultaneously. In total, there are 59,759 categories in Wikipedia.</Paragraph>
      <Paragraph position="1"> For a given Wikipedia entity e ∈ E, let e.C be the set of categories to which e belongs (i.e. e's immediate categories and all their ancestors in the Wikipedia taxonomy).</Paragraph>
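The set of immediate categories plus all their ancestors can be computed with a simple graph traversal. The sketch below is one possible way to do it, not the paper's implementation; the function all_categories and both dictionaries are hypothetical, and the assumption that the article Venus is filed directly under the category Venus is made only for illustration.

```python
def all_categories(entity, entity_categories, parent_categories):
    """Return the entity's immediate categories together with all their ancestors."""
    seen = set()
    stack = list(entity_categories.get(entity, []))
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # a category reachable along several paths is visited once
        seen.add(category)
        stack.extend(parent_categories.get(category, []))
    return seen


# Assumption for illustration: the article Venus belongs to the category Venus.
entity_categories = {"Venus": ["Venus"]}

# Parent links taken from the example in the text.
parent_categories = {
    "Venus": ["Planets of the Solar System"],
    "Planets of the Solar System": ["Planets", "Solar System"],
}

print(sorted(all_categories("Venus", entity_categories, parent_categories)))
```

Tracking visited categories keeps the traversal linear in the size of the graph even when several paths lead to the same ancestor, which is expected in a directed acyclic graph.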
    </Section>
    <Section position="4" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
2.4 Hyperlinks
</SectionTitle>
      <Paragraph position="0"> Articles in Wikipedia often contain mentions of entities that already have a corresponding article. When contributing authors mention an existing Wikipedia entity inside an article, they are required to link at least its first mention to the corresponding article, by using links or piped links.</Paragraph>
      <Paragraph position="1"> Both types of links are exemplified in the following wiki source code of a sentence from the article on Italy: "The [[Vatican City|Vatican]] is now an independent enclave surrounded by [[Rome]]".</Paragraph>
      <Paragraph position="2"> The string from the second link ("Rome") denotes the title of the referenced article. The same string is also used in the display version. If the author wants another string displayed (e.g., "Vatican" instead of "Vatican City"), then the alternative string is included in a piped link, after the title string.</Paragraph>
      <Paragraph position="3"> Consequently, the display string for the aforementioned example is: "The Vatican is now an independent enclave surrounded by Rome". As described later in Section 4, the hyperlinks can provide useful training examples for a named entity disambiguator.</Paragraph>
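The link syntax described above can be parsed with a short regular expression. The following is a simplified sketch that assumes links contain no nested brackets or other markup; LINK_PATTERN, extract_links, and render_display are names introduced here for illustration and are not taken from the paper or from MediaWiki.

```python
import re

# Matches [[Title]] or [[Title|display text]]; simplified, ignores nested markup.
LINK_PATTERN = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(wiki_source):
    """Return (display string, article title) pairs for each link in the source."""
    pairs = []
    for match in LINK_PATTERN.finditer(wiki_source):
        title = match.group(1)
        # A plain link displays its title; a piped link displays the text after the pipe.
        display = match.group(2) or title
        pairs.append((display, title))
    return pairs


def render_display(wiki_source):
    """Replace every link with its display string."""
    return LINK_PATTERN.sub(lambda m: m.group(2) or m.group(1), wiki_source)


source = "The [[Vatican City|Vatican]] is now an independent enclave surrounded by [[Rome]]"
print(extract_links(source))
print(render_display(source))
```

Each extracted pair couples a surface string with the title of the entity it refers to, which is the kind of labeled mention the text says can serve as a training example.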
    </Section>
  </Section>
</Paper>