<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1130">
  <Title>Fine Grained Classification of Named Entities</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Data Set Generation
</SectionTitle>
    <Paragraph position="0"> A large corpus of person instances was compiled from a TREC9 database consisting of articles from the Associated Press and the Wall Street Journal. The data were word-tokenized, stemmed using the Porter stemming algorithm (Porter, 1980), part-of-speech tagged using Brill's tagger (Brill, 1994), and named-entity tagged using BBN's IdentiFinder (Bikel, 1999). Person instances were classified into one of eight categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, and police. These eight categories were chosen because of their high frequency in the corpus and because of their usefulness in applications such as Question Answering. A training set of roughly 25,000 person instances was then created using a partially automated classification system.</Paragraph>
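A minimal preprocessing sketch along these lines, using NLTK components as stand-ins for the tools named above (Brill's tagger and BBN's IdentiFinder are not assumed to be available; all identifiers below are illustrative, not the authors' pipeline):

    # Illustrative preprocessing sketch; NLTK's tagger and NE chunker stand in
    # for Brill's tagger and IdentiFinder. Requires the nltk data packages
    # 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words'.
    import nltk
    from nltk.stem import PorterStemmer

    CATEGORIES = ["athlete", "politician/government", "clergy", "businessperson",
                  "entertainer/artist", "lawyer", "doctor/scientist", "police"]

    def preprocess(sentence):
        stemmer = PorterStemmer()
        tokens = nltk.word_tokenize(sentence)          # word tokenization
        stems = [stemmer.stem(t) for t in tokens]      # Porter stemming
        pos = nltk.pos_tag(tokens)                     # part-of-speech tagging
        ner = nltk.ne_chunk(pos)                       # named-entity chunking (PERSON, ...)
        return tokens, stems, pos, ner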
    <Paragraph position="1"> In generating the training data automatically, we first attempted to use the simple tagging method described for location names in (Fleischman, 2001). This method involved collecting lists of instances of each category from the Internet and using those lists to classify person names found by IdentiFinder. Although robust for location names, this method proved inadequate for persons: in a sample of 300, over 25% of the instances were found to be incorrect. This is because the same name often refers to multiple individuals (e.g., &quot;Paul Simon&quot; refers to a politician, an entertainer, and a Belgian scientist).</Paragraph>
    <Paragraph position="2"> In order to avoid this problem, we implemented a simple bootstrapping procedure in which a seed data set of 100 instances of each of the eight categories was hand tagged and used to generate a decision list classifier with the C4.5 algorithm (Quinlan, 1993), using the word frequency and topic signature features described below. This simple classifier was then run over a large corpus, and classifications with a confidence score above a 90% threshold were collected. These confident instances were then compared to the lists collected from the Internet, and an instance was included in the final training set only if the two sources agreed. This procedure produced a large training set with very few misclassified instances (over 99% of the instances in a sample of 300 were found to be correct). A validation set of 1000 instances from this set was then hand tagged to assure proper classification.</Paragraph>
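The bootstrapping-plus-agreement filter might be sketched as follows, with scikit-learn's DecisionTreeClassifier standing in for C4.5 and with hypothetical helpers (extract_features, inst.name, internet_lists); this illustrates the procedure rather than reproducing the original implementation:

    # Bootstrapping sketch: seed classifier + agreement filter (illustrative only).
    from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

    def bootstrap(seed_X, seed_y, candidate_instances, internet_lists, threshold=0.90):
        clf = DecisionTreeClassifier().fit(seed_X, seed_y)  # seed: 100 hand-tagged per class
        training_set = []
        for inst in candidate_instances:                    # large untagged corpus
            x = extract_features(inst)                      # hypothetical featurizer
            probs = clf.predict_proba([x])[0]
            label = clf.classes_[probs.argmax()]
            # Keep only confident classifications that also agree with the Internet lists.
            if probs.max() >= threshold and inst.name in internet_lists.get(label, set()):
                training_set.append((inst, label))
        return training_set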
    <Paragraph position="3"> A consequence of using this method for data generation is that the training set created is not a random sample of person instances in the real world. Rather, the training set is highly skewed, including only those instances that are both easy enough to classify using a simple classifier and common enough to be included in lists found on the Internet. To examine the generalizability of classifiers trained on such data, a held out data set of 1300 instances, also from the AP and WSJ, was collected and hand tagged.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3. Features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Word Frequency Features
</SectionTitle>
      <Paragraph position="0"> Each instance in the text is paired with a set of features that represents how often the words surrounding the target instance occur with a specific sub-categorization in the training set. For example, in example sentence 2 in the introduction, the word &quot;introduce&quot; occurs immediately before the person instance. The feature set describing this instance would thus include eight different features, each denoting the frequency with which &quot;introduce&quot; occurred in the training set immediately preceding an instance of a politician, a businessperson, an entertainer, etc. The feature set includes these eight different frequencies for 10 distinct word positions (totaling 80 features per instance). The positions used include the three individual words before the occurrence of the instance, the three individual words after the instance, the two-word bigrams immediately before and after the instance, and the three-word trigrams immediately before and after the instance. [Figure: word frequency features for example 2, above, showing the frequency with which an n-gram appears in the training data in a specific position relative to instances of a specific category.]</Paragraph>
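A sketch of how these 80 features might be computed, assuming a table counts[(position, n-gram, category)] of training-set frequencies; the names, and the truncation of n-grams at sentence edges, are illustrative assumptions:

    # 80 word-frequency features: 10 n-gram positions x 8 categories (illustrative).
    def ngram_positions(tokens, i):
        """n-grams at the 10 positions around the instance at index i
        (n-grams are truncated at sentence boundaries)."""
        w = lambda a, b: " ".join(tokens[max(a, 0):max(b, 0)])
        return {
            "prev3": w(i-3, i-2), "prev2": w(i-2, i-1), "prev1": w(i-1, i),
            "next1": w(i+1, i+2), "next2": w(i+2, i+3), "next3": w(i+3, i+4),
            "bigram_prev": w(i-2, i), "bigram_next": w(i+1, i+3),
            "trigram_prev": w(i-3, i), "trigram_next": w(i+1, i+4),
        }

    def word_frequency_features(tokens, i, counts, categories):
        feats = []
        for pos_key, ngram in ngram_positions(tokens, i).items():
            for cat in categories:                      # 8 categories
                feats.append(counts.get((pos_key, ngram, cat), 0))
        return feats                                    # 10 x 8 = 80 features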
      <Paragraph position="1"> These word frequency features provide information similar to the binary word features that are often used in text categorization (Yang, 1997) with only a fraction of the dimensionality. Such reduced dimensionality feature sets can be preferable when classifying very small texts (Fleischman, in preparation).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Topic Signature Features
</SectionTitle>
      <Paragraph position="0"> Inspection of the data made clear the need for semantic information during classification. We therefore created features that use topic signatures for each of the person subcategories. A topic signature, as described in (Lin and Hovy, 2000), is a list of terms that can be used to signal the membership of a text in the relevant topic or category. Each term in a text is given a topic signature score that indicates its ability to signal that the text is in a relevant category (the higher the score, the more that term is indicative of that category). The topic signatures are automatically generated for each specific term by computing the likelihood ratio (l-score) between two hypotheses (Dunning, 1993). The first hypothesis (h1) is that the probability (p1) that the text is in the relevant category, given a specific term, is equivalent to the probability (p2) that the text is in the relevant category, given any other term (h1: p1=p2). The second hypothesis (h2) is that these two probabilities are not equivalent, and that p1 is much greater than p2 (h2: p1&gt;&gt;p2). Calculating this likelihood ratio, -2 log(L(h1)/L(h2)), for each term and each category gives a list of all the terms in a document set, with scores indicating how strongly the presence of a term in a document signals that the document belongs to a specific category.</Paragraph>
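A common formulation of this likelihood-ratio computation (the exact counting scheme used in the paper is not specified here, so the counts below are assumptions):

    # One standard form of Dunning's (1993) log-likelihood ratio, -2 log(L(h1)/L(h2)),
    # for a term t and category c, using binomial likelihoods over word counts.
    import math

    def _log_l(k, n, p):
        """Binomial log-likelihood of k successes in n trials with success prob p."""
        p = min(max(p, 1e-12), 1 - 1e-12)      # clamp to avoid log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    def llr(k1, n1, k2, n2):
        """k1/n1: count of t and total words in category-c text;
        k2/n2: the same quantities outside category c."""
        p1, p2 = k1 / n1, k2 / n2              # separate estimates under h2 (p1 >> p2)
        p = (k1 + k2) / (n1 + n2)              # pooled estimate under h1 (p1 = p2)
        return 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                    - _log_l(k1, n1, p) - _log_l(k2, n2, p))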
      <Paragraph position="1"> [Figure: topic signatures for two categories.]</Paragraph>
      <Paragraph position="2"> In creating topic signature features for the subcategorization of persons, we created a database of topic signatures generated from the training set (see Figure 2).</Paragraph>
      <Paragraph position="3"> Each sentence from the training set was treated as a unique document, and the classification of the instance contained in that sentence was treated as the relevant topic. (To avoid noise, we used only those sentences in which each person instance was of the same category.) We implemented the algorithm described in (Lin and Hovy, 2000) with the addition of a cutoff, such that the topic signature for a term is included only if the p1/p2 for that term is greater than the mean p1/p2 over all terms. This modification was made to better satisfy the assumption that p1 is much greater than p2. A weighted sum was then computed for each of the eight person subcategories according to the formula below:</Paragraph>
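The formula itself did not survive extraction; a plausible reconstruction from the variable definitions given below (an assumption, not the paper's verbatim equation) is:

    \mathit{TopicSig}_{\mathit{Type}} \;=\; \sum_{n=1}^{N} \frac{\textit{l-score}_{n,\mathit{Type}}}{\big(\textit{distance from instance}_{n}\big)^{2}}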
      <Paragraph position="4"> where N is the number of words in the sentence, l-score(n, Type) is the topic signature score of word n for topic Type, and distance from instance (n) is the number of words between word n and the instance. These topic signature scores are calculated for each of the eight subcategories. The eight resulting topic signature features convey semantic information about the overall context in which each instance occurs. The topic signature scores are weighted by the inverse square of their distance, under the (not always true) assumption that the farther a word is from an instance, the less information it bears on classification. This weighting is particularly important when instances of different categories occur in the same sentence (e.g., &quot;...of those donating to Bush's campaign was actor Arnold Schwarzenegger...&quot;).</Paragraph>
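As a sketch of this weighting, assuming a lookup table topic_sigs mapping (word, category) pairs to l-scores built from the training set (the table name and function are hypothetical):

    # Inverse-square distance weighting of topic signature scores (illustrative).
    # Assumes the instance occupies a single token position i.
    def topic_signature_features(tokens, i, topic_sigs, categories):
        feats = {}
        for cat in categories:                           # 8 subcategories
            score = 0.0
            for n, word in enumerate(tokens):
                if n == i:                               # skip the instance itself
                    continue
                score += topic_sigs.get((word, cat), 0.0) / (n - i) ** 2
            feats[cat] = score
        return feats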
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.3 WordNet Features
</SectionTitle>
      <Paragraph position="0"> A natural limitation of the topic signature features is their inability to give weight to related and synonymous terms that do not appear in the training data. To address this limitation, we took advantage of the online resource WordNet (Fellbaum, 1998). The WordNet hypernym tree was expanded for each word surrounding the instance, and each word in the tree was given a score based on the topic signature database generated from the training data. These scores were weighted by the inverse of their height in the tree and summed, similarly to the procedure in (Resnik, 1993). The resulting sums, one per word surrounding the instance, are then combined according to the distance-weighting process described above. This produces a distinct WordNet feature for each of the eight classes and is described by the equation below, where the variables are as above and M is the number of words in the WordNet hypernym tree. These WordNet features supplement the coverage of the topic signatures generated from the training data by including synonyms that may not have appeared in that data set. Further, the features include information gained from the hypernyms themselves (e.g., the hypernym of &quot;Congress&quot; is &quot;legislature&quot;). The final hypernym scores are weighted by the inverse of their height in the tree to reduce the effect of concepts that may be too general (e.g., at the top of the hypernym tree for &quot;Congress&quot; is &quot;group&quot;). To avoid noise due to inappropriate word senses, we only used data from senses that matched the word's part of speech.</Paragraph>
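The referenced equation is likewise missing from this extraction; a plausible reconstruction from the stated variables (again an assumption, with height(m) the height of hypernym m in word n's tree) is:

    \mathit{WordNetScore}_{\mathit{Type}} \;=\; \sum_{n=1}^{N} \frac{1}{\big(\textit{distance from instance}_{n}\big)^{2}} \sum_{m=1}^{M} \frac{\textit{l-score}_{m,\mathit{Type}}}{\textit{height}_{m}}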
      <Paragraph position="1"> These eight WordNet features add to the above features for a total of 96 features.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4. Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Experiment 1: Held out data
</SectionTitle>
      <Paragraph position="0"> To examine the generalizability of classifiers trained on the automatically generated data, a C4.5 decision tree classifier (Quinlan, 1993) was trained and tested on the held out test set described above.</Paragraph>
      <Paragraph position="1"> Initial results revealed that, due to differing contexts, instances of the same name in a single text would often be classified into different subcategories. To deal with this problem, we augmented the classifier with another program, MemRun, which standardizes the subcategorization of instances based on their most frequent classification. Developed and tested in (Fleischman, 2001), MemRun is based upon the hypothesis that by looking at all the classifications an instance has received throughout the test set, an &quot;average&quot; sub-categorization can be computed that offers a better guess than a low confidence individual classification.</Paragraph>
      <Paragraph position="2"> MemRun operates in two rounds. In the first round, each instance of the test set is evaluated using the decision tree, and a classification hypothesis is generated. If the confidence level of this hypothesis is above a certain threshold (THRESH 1), then the hypothesis is entered into the temporary database (see Figure 3) along with the degree of confidence of that hypothesis, and the number of times that hypothesis has been received.</Paragraph>
      <Paragraph position="3"> Because subsequent occurrences of person instances frequently differ orthographically from their initial occurrence (e.g., &quot;George Bush&quot; followed by &quot;Bush&quot;), a simple algorithm was devised for surface reference disambiguation. The algorithm keeps a record of the initial full-name usages of all person instances in a text. When partial references to an instance are later encountered in the text, as determined by simple regular expression matching, they are entered into the MemRun database as further occurrences of the original instance. This record of full-name references is cleared after each text is examined, to avoid possible instance confusions (e.g., &quot;George W. Bush&quot; and &quot;George Bush Sr.&quot;). This simple algorithm operates on the assumption that human authors, wishing to avoid confusion, will not use partial references for different individuals with the same last name in the same text. (The algorithm does not address definite descriptions and pronominal references, because they are not classified by IdentiFinder as person names and are thus not marked for fine-grained classification in the test set.)</Paragraph>
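A sketch of this surface-reference matching, under the stated assumptions (per-text scope, simple regular-expression matching); function and variable names are illustrative:

    # Per-document surface reference matching (illustrative, not the authors' code).
    import re

    def resolve_surface_references(person_mentions):
        """person_mentions: person-name strings in document order for ONE text."""
        full_names, resolved = [], []
        for mention in person_mentions:
            # A partial reference matches an earlier, longer full-name mention.
            match = next((f for f in full_names if len(f) > len(mention)
                          and re.search(r"\b" + re.escape(mention) + r"\b", f)), None)
            if match:                              # e.g. "Bush" -> earlier "George Bush"
                resolved.append(match)
            else:
                if mention not in full_names:
                    full_names.append(mention)     # record a new full-name usage
                resolved.append(mention)
        return resolved                            # the record is discarded per text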
      <Paragraph position="4">  When all of the instances in the data set are examined, the round is complete.</Paragraph>
      <Paragraph position="5"> In MemRun's second round, the data set is reexamined, and hypothesis classifications are again produced. If the confidence of one of these hypotheses is below a second threshold (THRESH 2), then the hypothesis is ignored and the database value is used. (The ability to ignore the database's suggestion in the second round allows instances with the same name (e.g., &quot;Paul Simon&quot;) to receive different classifications in different contexts.)</Paragraph>
      <Paragraph position="6">  In this experiment, the entries in the database are compared and the most frequent entry (i.e., the max classification based on confidence level multiplied by the increment) is returned.</Paragraph>
      <Paragraph position="7"> When all instances have been again examined, the round is complete.</Paragraph>
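The two rounds might be sketched as follows, with hypothetical threshold values and a classify() stand-in for the decision tree that returns a (label, confidence) pair:

    # MemRun sketch: round 1 fills the database, round 2 backs off to it (illustrative).
    from collections import defaultdict

    def memrun(instances, classify, THRESH_1=0.8, THRESH_2=0.8):
        db = defaultdict(float)                          # (name, label) -> accumulated weight
        for inst in instances:                           # round 1
            label, conf = classify(inst)
            if conf >= THRESH_1:
                db[(inst.name, label)] += conf           # confidence-weighted vote
        results = []
        for inst in instances:                           # round 2
            label, conf = classify(inst)
            if conf < THRESH_2:                          # low confidence: fall back to the database
                votes = {lbl: w for (name, lbl), w in db.items() if name == inst.name}
                if votes:
                    label = max(votes, key=votes.get)    # most frequent / most confident entry
            results.append((inst, label))
        return results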
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Experiment 2: Learning Algorithms
</SectionTitle>
      <Paragraph position="0"> Having examined the generalizability of classifiers trained on automatically generated data, we turn to the question of appropriate learning algorithms for the task. We chose to examine five different learning algorithms. Along with C4.5, we examined a feed-forward neural network with 50 hidden units, a k-Nearest Neighbors implementation (k=1) (Witten &amp; Frank, 1999), a Support Vector Machine implementation using a linear kernel (Witten &amp; Frank, 1999), and a naive Bayes classifier using discretized attributes and feature subset selection (Kohavi &amp; Sommerfield, 1996). For each classifier, comparisons were based on results from the validation set (~1000 instances) described above.</Paragraph>
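A sketch of such a comparison using scikit-learn stand-ins for the cited implementations (e.g., GaussianNB in place of the discretized naive Bayes, MLPClassifier for the feed-forward network); the substitutions are assumptions, not the original Witten &amp; Frank or C4.5 code:

    # Classifier comparison on the validation set (scikit-learn stand-ins, illustrative).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import GaussianNB

    MODELS = {
        "decision tree (C4.5 analogue)": DecisionTreeClassifier(),
        "feed-forward NN, 50 hidden units": MLPClassifier(hidden_layer_sizes=(50,)),
        "k-NN, k=1": KNeighborsClassifier(n_neighbors=1),
        "linear-kernel SVM": LinearSVC(),
        "naive Bayes (Gaussian stand-in)": GaussianNB(),
    }

    def compare(X_train, y_train, X_val, y_val):
        for name, model in MODELS.items():
            acc = model.fit(X_train, y_train).score(X_val, y_val)
            print(f"{name}: {acc:.3f}")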
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Experiment 3: Feature sets
</SectionTitle>
      <Paragraph position="0"> To examine the effectiveness of the individual types of features, a C4.5 decision tree classifier (Quinlan, 1993) was trained on the 25,000 instance data set described above using all possible combinations of the three feature sets. The performance was ascertained on the validation set described above.</Paragraph>
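A sketch of this ablation, training the decision tree on every non-empty combination of the three feature groups and assuming the 96-column feature layout described in Section 3 (the column ranges and names below are hypothetical):

    # Feature-group ablation over all 7 non-empty combinations (illustrative).
    from itertools import combinations
    from sklearn.tree import DecisionTreeClassifier

    GROUPS = {                        # hypothetical column ranges into the 96-feature vectors
        "word_freq": list(range(0, 80)),
        "topic_sig": list(range(80, 88)),
        "wordnet":   list(range(88, 96)),
    }

    def feature_set_ablation(X_train, y_train, X_val, y_val):
        """X_* are NumPy arrays whose columns follow the GROUPS layout above."""
        names = list(GROUPS)
        for r in range(1, len(names) + 1):
            for combo in combinations(names, r):          # all 7 non-empty combinations
                cols = [c for g in combo for c in GROUPS[g]]
                clf = DecisionTreeClassifier().fit(X_train[:, cols], y_train)
                print(combo, round(clf.score(X_val[:, cols], y_val), 3))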
    </Section>
  </Section>
</Paper>