<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1076">
  <Title>Weakly Supervised Learning for Cross-document Person Name Disambiguation Supported by Information Extraction</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Learning Using Automatically Constructed Corpora
</SectionTitle>
    <Paragraph position="0"> This section presents our machine learning scheme to estimate the conditional probabilities</Paragraph>
    <Paragraph position="2"> is in the form of $\{f_a\}$, we re-formulate the two conditional probabilities as</Paragraph>
    <Paragraph position="4"> The learning scheme makes use of automatically constructed large corpora. The rationale is illustrated in the figure below. The symbol + represents a positive instance, namely, a mention pair that refers to the same entity. The symbol - represents a negative instance, i.e., a mention pair that refers to different entities.</Paragraph>
    <Paragraph position="5"> As shown in the figure, two training corpora are automatically constructed. Corpus I contains mention pairs that share the same name, drawn from the most frequently mentioned names in the document pool. We observe that frequently mentioned person names in the news domain are fairly unambiguous, so Corpus I contains mainly positive instances.</Paragraph>
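    <Paragraph> The construction of Corpus I described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each mention is a (name, mention_id) tuple and that "most frequently mentioned" means the top_k names by mention count; the function name, the tuple format, the cutoff, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_corpus_one(mentions, top_k=2):
    """Pair up mentions that share one of the top_k most frequent names.

    mentions: list of (name, mention_id) tuples (hypothetical format).
    Returns a list of (mention_id, mention_id) pairs.
    """
    # Count how often each name is mentioned in the document pool.
    counts = Counter(name for name, _ in mentions)
    frequent = {name for name, _ in counts.most_common(top_k)}

    # Group the mention ids of the frequent names.
    by_name = defaultdict(list)
    for name, mention_id in mentions:
        if name in frequent:
            by_name[name].append(mention_id)

    # Every pair of mentions with the same frequent name is a candidate
    # positive instance (a small fraction may still be negative).
    pairs = []
    for name in sorted(by_name):
        pairs.extend(combinations(by_name[name], 2))
    return pairs


# Toy data, invented for this example.
mentions = [
    ("John Smith", "d1"), ("John Smith", "d2"), ("John Smith", "d3"),
    ("Mary Jones", "d1"), ("Mary Jones", "d4"),
    ("Rare Name", "d5"),
]
corpus_one = build_corpus_one(mentions, top_k=2)
```

Pairs built this way are only presumed positive; a same-name pair can still refer to two different people, which is why a later filtering step is needed.</Paragraph>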
    <Paragraph position="6"> Corpus II contains mention pairs of different person names; these pairs overwhelmingly correspond to negative instances (with statistically negligible exceptions). Thus, typical patterns of negative instances can be learned from Corpus II. We use these patterns to filter out the negative instances in Corpus I. The purified Corpus I can then be used to learn patterns for positive instances. The algorithm is formulated as follows.</Paragraph>
    <Paragraph position="7"> Following the observation that different names usually refer to different entities, it is safe to derive Eq. (4).</Paragraph>
    <Paragraph position="8"> , we can derive the following relation (Eq. 5): Based on our data analysis, there is no observable difference between the linguistic expressions involving frequently mentioned person names and those involving occasionally occurring ones. Therefore, constructing the corpora from frequently mentioned names does not limit the applicability of the learned model to person names in general.</Paragraph>
    <Paragraph position="9">  f can be automatically computed using Corpus I and Corpus II. Only X requires manual truthing. Because X is context independent, the required truthing is very limited (in our experiment, only 100 truthed mention pairs were used). The details of corpus construction and truthing will be presented in the next section.</Paragraph>
  </Section>
</Paper>