<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0202">
  <Title>Is Hillary Rodham Clinton the President? Disambiguating Names across Documents Yael RAVIN</Title>
  <Section position="4" start_page="10" end_page="11" type="intro">
    <SectionTitle>
2 Current System for Cross-Document Coreference
</SectionTitle>
    <Paragraph position="0"> The choice of a canonical string as the identifier for equivalence groups within each document is very important for later merging across documents. The document-based canonical string should be explicit enough to distinguish between different named entities, yet normalized enough to aggregate all mentions of the same entity across documents. Canonical strings of human names consist of the following parts, if found: first name, middle name, last name, and suffix (e.g., Jr.). Professional or personal titles and nicknames are not included, as these are less permanent features of people's names and may vary across documents. Identical canonical strings with the same entity type (e.g., PR) are merged across documents. For example, in the [NIST93] collection, Alan Greenspan has the following variants across documents -- Federal Reserve Chairman Alan Greenspan, Mr.</Paragraph>
    <Paragraph position="1"> Greenspan, Greenspan, Federal Reserve Board Chairman Alan Greenspan, Fed Chairman Alan Greenspan -- but a single canonical string: Alan Greenspan.</Paragraph>
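The construction of a canonical string from the variants found in one document can be sketched as follows. This is a minimal illustration, not the system's implementation: the word list TITLE_WORDS, the function names, and the rule that the variant with the most remaining words counts as the most explicit are all assumptions made for the example.

```python
# Hypothetical title words; the source does not enumerate the titles it strips.
TITLE_WORDS = {"mr.", "mrs.", "ms.", "dr.", "president", "chairman",
               "fed", "federal", "reserve", "board"}


def strip_titles(mention: str) -> str:
    """Drop title words, keeping the remaining name parts in order."""
    kept = [w for w in mention.split() if w.lower() not in TITLE_WORDS]
    return " ".join(kept)


def canonical_string(variants: list[str]) -> str:
    """Pick the most explicit title-free variant of one equivalence group.

    Assumption: "most explicit" is approximated by the largest word count.
    """
    stripped = [strip_titles(v) for v in variants]
    return max(stripped, key=lambda s: len(s.split()))


variants = [
    "Federal Reserve Chairman Alan Greenspan",
    "Mr. Greenspan",
    "Greenspan",
    "Federal Reserve Board Chairman Alan Greenspan",
    "Fed Chairman Alan Greenspan",
]
print(canonical_string(variants))  # prints: Alan Greenspan
```

A word-level title list is deliberately naive here; it only serves to show why all five variants collapse to one identifier that can then be matched across documents.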
    <Paragraph position="2"> The current aggregation also merges near-identical canonical strings: it normalizes over hyphens, slashes and spaces to merge canonical names such as Allied-Signal and Allied Signal, PC-TV and PC/TV. It normalizes over &quot;empty&quot; words (People's Liberation Army and People Liberation Army; Leadership Conference on Civil Rights and Leadership Conference of Civil Rights). Finally, it merges identical stemmed words of sufficient length (Communications Decency Act and Communication Decency Act).</Paragraph>
    <Paragraph position="3"> Normalization is not allowed for people's names, to avoid combining names such as Smithberg and Smithburg.</Paragraph>
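One way to picture these normalizations is as a merge key: two canonical strings are near-identical when they map to the same key. The sketch below is illustrative only. The empty-word list, the length threshold, the crude plural-stripping stemmer, and the use of PR as the label for person names are assumptions; the source does not specify them.

```python
import re

# Hypothetical settings: the source names neither its empty-word list,
# its stemmer, nor its length threshold.
EMPTY_WORDS = {"of", "on", "the", "s"}  # "s" is what remains of a possessive
MIN_STEM_LENGTH = 6


def crude_stem(word: str) -> str:
    """Strip a final 's' from sufficiently long words (stand-in for a real stemmer)."""
    if len(word) >= MIN_STEM_LENGTH and word.endswith("s"):
        return word[:-1]
    return word


def merge_key(canonical: str, entity_type: str) -> str:
    """Return the key under which near-identical canonical strings collide."""
    if entity_type == "PR":
        return canonical  # no normalization for people's names
    # Split on hyphens, slashes, whitespace and apostrophes.
    words = re.split(r"[-/\s']+", canonical.lower())
    kept = [crude_stem(w) for w in words if w and w not in EMPTY_WORDS]
    return "".join(kept)


print(merge_key("Allied-Signal", "ORG") == merge_key("Allied Signal", "ORG"))
print(merge_key("Smithberg", "PR") == merge_key("Smithburg", "PR"))
```

The first comparison prints True and the second prints False: organization names collapse onto one key, while person names are compared verbatim.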
    <Paragraph position="4"> Merging of identical names with different entity types is controlled by a table of aggregatable types. For example, PR? can merge with PL, as in Beverly Hills [PR?] and Beverly Hills [PL]. But ORG and PL cannot merge, so Boston [ORG] does not merge with Boston [PL]. As a further precaution, no aggregation occurs if the merge is ambiguous, that is, if a canonical name could potentially merge with more than one other canonical name. For example, President Clinton could be merged with Bill Clinton, Chelsea Clinton, or Hillary Rodham Clinton.</Paragraph>
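The two safeguards in this paragraph, a table of aggregatable types and a refusal to merge under ambiguity, might be sketched as below. Only the PR?/PL pair and the ORG/PL exclusion come from the text. The table layout, the function names, and the shared-last-word test used to decide which names "could potentially merge" are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical table: only the PR?/PL entry is taken from the text.
MERGEABLE_TYPES = {frozenset({"PR?", "PL"})}


def types_can_merge(a: str, b: str) -> bool:
    """Identical types always merge; different types only if listed in the table."""
    return a == b or frozenset({a, b}) in MERGEABLE_TYPES


def could_merge(name: str, other: str) -> bool:
    """Assumed compatibility test: the two names share their last word."""
    return name.split()[-1] == other.split()[-1]


def merge_target(name: str, known: list[str]) -> Optional[str]:
    """Return the unique compatible name, or None if there is none or several."""
    matches = [k for k in known if could_merge(name, k)]
    return matches[0] if len(matches) == 1 else None


clintons = ["Bill Clinton", "Chelsea Clinton", "Hillary Rodham Clinton"]
print(types_can_merge("PR?", "PL"))                 # prints: True
print(types_can_merge("ORG", "PL"))                 # prints: False
print(merge_target("President Clinton", clintons))  # prints: None
```

Returning None for President Clinton mirrors the conservative policy described above: with three compatible candidates, no merge is attempted.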
    <Paragraph position="5"> To prevent erroneous aggregation of different entities, we currently do not aggregate over different canonical strings. We keep the canonical place New York (city or state) distinct from the canonical New York City and New York State. Similarly, with human names: Jerry O.</Paragraph>
    <Paragraph position="6"> Williams in one document is separate from Jerry Williams in another; or, more significantly, Jerry Lewis from one document is distinct from Jerry Lee Lewis from another. We are conservative with company names too, preferring to keep the canonical name Allegheny International and its variants separate from the canonical name Allegheny Ludlum and its variant, Allegheny Ludlum Corp. Even with such conservative criteria, aggregation over documents is quite drastic. The name dictionary for 20MB of WSJ text contains 120,257 names before aggregation and 42,033 names after.</Paragraph>
    <Paragraph position="7"> But conservative aggregation is not always right. We have identified several problems with our current algorithm that our new algorithm promises to handle.</Paragraph>
    <Paragraph position="8"> 1) Failure to merge -- entities, particularly famous people or places, are often referred to by different canonical strings in different documents.</Paragraph>
    <Paragraph position="9"> Consider, for example, some of the canonical strings identified for President Clinton in our  Because of our decision not to merge under ambiguity (as mentioned above), our final list of names includes many names that should have been further aggregated.</Paragraph>
    <Paragraph position="10"> 2) Failure to split -- there is insufficient intra-document evidence for splitting &quot;names&quot; that are combinations of two or more component names, such as ABC, Paramount and Disney, or B.</Paragraph>
    <Paragraph position="11"> Brown of Dallas County Judicial District Court. Note that splitting is complex: sometimes even humans are undecided, for combinations such as Boston Consulting Group in San Francisco.</Paragraph>
    <Paragraph position="12"> 3) False merge -- due to an implementation decision, the current aggregation does not involve a second pass over the intra-document vocabulary. This means that canonical names are aggregated depending on the order in which documents are analyzed. As a result, canonical names with different entity types are merged when they are encountered if the merge seems unambiguous at the time, even though names encountered subsequently may invalidate it.</Paragraph>
  </Section>
</Paper>