<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1043">
  <Title>A Bootstrapping Approach to Named Entity Classification Using Successive Learners</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Parsing-based NE Rule Learning
</SectionTitle>
    <Paragraph position="0"> The training of the first NE learner has three major properties: (i) the use of concept-based seeds, (ii) support from the parser, and (iii) representation as a decision list.</Paragraph>
    <Paragraph position="1"> This new bootstrapping approach is based on the observation that there is an underlying concept for any proper name type and this concept can be easily expressed by a set of common nouns or pronouns, similar to how concepts are defined by synsets in WordNet (Beckwith 1991).</Paragraph>
    <Paragraph position="2"> Concept-based seeds are conceptually equivalent to the proper name types that they represent. These seeds can be provided by a user intuitively. For example, a user can use pill, drug, medicine, etc. as concept-based seeds to guide the system in learning rules to tag MEDICINE names.</Paragraph>
    <Paragraph position="3"> This process is fairly intuitive, creating a favorable environment for configuring the NE system to the types of names sought by the user.</Paragraph>
    <Paragraph position="4"> An important characteristic of concept-based seeds is that they occur much more often than proper name seeds, hence they are effective in guiding the non-iterative NE bootstrapping.</Paragraph>
    <Paragraph position="5"> A parser is necessary for concept-based NE bootstrapping. This is due to the fact that concept-based seeds only share pattern similarity with the corresponding NEs at structural level, not at string sequence level. For example, at string sequence level, PERSON names are often preceded by a set of prefixing title words Mr./Mrs./Miss/Dr. etc., but the corresponding common noun seeds man/woman etc. cannot appear in such patterns.</Paragraph>
    <Paragraph position="6"> However, at structural level, the concept-based seeds share the same or similar linguistic patterns (e.g. Subject-Verb-Object patterns) with the corresponding types of proper names.</Paragraph>
    <Paragraph position="7"> The rationale behind using concept-based seeds in NE bootstrapping is similar to that for parsing-based word clustering (Lin 1998): conceptually similar words occur in structurally similar context. In fact, the anaphoric function of pronouns and common nouns to represent antecedent NEs indicates the substitutability of proper names by the corresponding common nouns or pronouns. For example, this man can be substituted for the proper name John Smith in almost all structural patterns.</Paragraph>
    <Paragraph position="8"> Following the same rationale, a bootstrapping approach is applied to the semantic lexicon acquisition task [Thelen &amp; Riloff. 2002].</Paragraph>
    <Paragraph position="9"> The InfoXtract parser supports dependency parsing based on the linguistic units constructed by our shallow parser (Srihari et al. 2003). Five types of the decoded dependency relationships are used for parsing-based NE rule learning. These are all directional, binary dependency links between  linguistic units: (1) Has_Predicate: from logical subject to verb e.g. He said she would want him to join. Gc6 he: Has_Predicate(say) she: Has_Predicate(want) him: Has_Predicate(join) (2) Object_Of : from logical object to verb e.g. This company was founded to provide new telecommunication services. Gc6 company: Object_Of(found) service: Object_Of(provide) (3) Has_Amod: from noun to its adjective modifier e.g. He is a smart, handsome young man. Gc6 man: Has_AMod(smart) man: Has_AMod(handsome) man: Has_AMod(young) (4) Possess: from the possessive noun-modifier to head noun e.g. His son was elected as mayor of the city. Gc6 his: Possess(son) city: Possess(mayor) (5) IsA: equivalence relation from one NP to another NP e.g. Microsoft spokesman John Smith is a popular man. Gc6 spokesman: IsA(John Smith)  The concept-based seeds used in the experiments are:  1. PER: he, she, his, her, him, man, woman 2. LOC: city, province, town, village 3. ORG: company, firm, organization, bank, airline, army, committee, government, school, university 4. PRO: car, truck, vehicle, product, plane,  aircraft, computer, software, operating system, data-base, book, platform, network Note that the last target tag PRO (PRODUCT) is beyond the MUC NE standards: we added this NE type for the purpose of testing the system's capability in supporting user-defined NE types. From the parsed corpus in the repository, all instances of the concept-based seeds associated with one or more of the five dependency relations are retrieved: 821,267 instances in total in our experiment. Each seed instance was assigned a concept tag corresponding to NE. For example, each instance of he is marked as PER. The marked instances plus their associated parsing relationships form an annotated NE corpus, as shown below:  It is noteworthy that the PER tag dominates the corpus due to the fact that the pronouns he and she occur much more often than the seeded common nouns. So the proportion of NE types in the instances of concept-based seeds is not the same as the proportion of NE types in the proper name instances. For example, considering a running text containing one instance of John Smith and one instance of a city name Rochester, it is more likely that John Smith will be referred to by he/him than Rochester by (the) city. Learning based on such a corpus is biased towards PER as the answer.</Paragraph>
    <Paragraph position="10"> To correct this bias, we employ the following modification scheme for instance count. Suppose there are a total of</Paragraph>
    <Paragraph position="12"> instances, then in the process of rule accuracy evaluation, the involved instance count for any NE type will be adjusted by the coefficient</Paragraph>
    <Paragraph position="14"> the number of the training instances of PER is ten times that of PRO, then when evaluating a rule accuracy, any positive/negative count associated with PER will be discounted by 0.1 to correct the bias.</Paragraph>
    <Paragraph position="15"> A total of 1,290 parsing-based NE rules are learned, with accuracy higher than 0.9. The following are sample rules of the learned decision  Due to the unique equivalence nature of the IsA relation, the above bootstrapping procedure can hardly learn IsA-based rules. Therefore, we add the following IsA-based rules to the top of the decision list: IsA(seed)Gc6 tag of the seed, for example:  In this step, we use the parsing-based first learner to tag a raw corpus in order to train the second NE learner.</Paragraph>
    <Paragraph position="16"> One issue with the parsing-based NE rules is modest recall. For incoming documents, approximately 35%-40% of the proper names are associated with at least one of the five parsing relations. Among these proper names associated with parsing relations, only ~5% are recognized by the parsing-based NE rules.</Paragraph>
    <Paragraph position="17"> So we adopted the strategy of applying the parsing-based rules to a large corpus (88 million words), and let the quantity compensate for the sparseness of tagged instances. A repository level consolidation scheme is also used to improve the recall.</Paragraph>
    <Paragraph position="18"> The NE classification procedure is as follows. From the repository, all the named entity candidates associated with at least one of the five parsing relationships are retrieved. An NE candidate is defined as any chunk in the parsed corpus that is marked with a proper name Part-Of-Speech (POS) tag (i.e. NNP or NNPS). A total of 1,607,709 NE candidates were retrieved in our experiment. A small sample of the retrieved NE candidates with the associated parsing relationships are shown below:  ............</Paragraph>
    <Paragraph position="19"> After applying the decision list to the above the NE candidates, 33,104 PER names, 16,426 LOC names, 11,908 ORG names and 6,280 PRO names were extracted.</Paragraph>
    <Paragraph position="20"> It is a common practice in the bootstrapping research to make use of heuristics that suggest conditions under which instances should share the same answer. For example, the one sense per discourse principle is often used for word sense disambiguation (Gale et al. 1992). In this research, we used the heuristic one tag per domain for multi-word NE in addition to the one sense per discourse principle. These heuristics were found to be very helpful in improving the performance of the bootstrapping algorithm for the purpose of both increasing positive instances (i.e. tag propagation) and decreasing the spurious instances (i.e. tag elimination). The following are two examples to show how the tag propagation and elimination scheme works.</Paragraph>
    <Paragraph position="21"> Tyco Toys occurs 67 times in the corpus, and 11 instances are recognized as ORG, only one instance is recognized as PER. Based on the heuristic one tag per domain for multi-word NE, the minority tag of PER is removed, and all the 67 instances of Tyco Toys are tagged as ORG.</Paragraph>
    <Paragraph position="22"> Three instances of Postal Service are recognized as ORG, and two instances are recognized as PER. These tags are regarded as noise, hence are removed by the tag elimination scheme.</Paragraph>
    <Paragraph position="23"> The tag propagation/elimination scheme is adopted from (Yarowsky 1995). After this step, a total of 386,614 proper names were recognized, including 134,722 PER names, 186,488 LOC names, 46,231 ORG names and 19,173 PRO names. The overall precision was ~90%. The benchmark details will be shown in Section 6. The extracted proper name instances then led to the construction of a fairly large training corpus sufficient for training the second NE learner. Unlike manually annotated running text corpus, this corpus consists of only sample string sequences containing the automatically tagged NE instances and their left and right neighboring words within the same sentence. The two neighboring words are always regarded as common words while constructing the corpus. This is based on the observation that the proper names usually do not occur continuously without any punctuation in between.</Paragraph>
    <Paragraph position="24"> A small sample of the automatically constructed corpus is shown below:  ............</Paragraph>
    <Paragraph position="25"> This corpus is used for training the second NE learner based on evidence from string sequences, to be described in Section 5 below.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5 String Sequence-based NE Learning
</SectionTitle>
    <Paragraph position="0"> String sequence-based HMM learning is set as our final goal for NE bootstrapping because of the demonstrated high performance of this type of NE taggers.</Paragraph>
    <Paragraph position="1"> In this research, a bi-gram HMM is trained based on the sample strings in the annotated corpus constructed in section 4. During the training, each sample string sequence is regarded as an independent sentence. The training process is similar to (Bikel 1997).</Paragraph>
    <Paragraph position="2"> The HMM is defined as follows: Given a word  f denotes a single token feature which will be defined below), the goal for the NE tagging task is to find the optimal NE tag sequence n210 ttttsequence T G16= , which maximizes the conditional probability sequence)W |sequence Pr(T (Bikel 1997). By Bayesian equality, this is equivalent to maximizing the joint probability sequence) Tsequence,Pr(W . This joint probability can be computed by bi-gram HMM as follows:  where V denotes the size of the vocabulary, the back-off coefficients l's are determined using the Witten-Bell smoothing algorithm. The quantities  are computed by the maximum likelihood estimation.</Paragraph>
    <Paragraph position="3"> We use the following single token feature set for HMM training. The definitions of these features are the same as in (Bikel 1997).</Paragraph>
    <Paragraph position="5"> twoDigitNum, fourDigitNum, containsDigitAndAlpha, containsDigitAndDash, containsDigitAndSlash, containsDigitAndComma, containsDigitAndPeriod, otherNum, allCaps, capPeriod, initCap, lowerCase, other.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
6 Benchmarking and Discussion
</SectionTitle>
    <Paragraph position="0"> Two types of benchmarks were measured: (i) the quality of the automatically constructed NE corpus, and (ii) the performance of the HMM NE tagger. The HMM NE tagger is considered to be the resulting system for application. The benchmarking shows that this system approaches the performance of supervised NE tagger for two of the three proper name NE types in MUC, namely, PER NE and LOC NE.</Paragraph>
    <Paragraph position="1"> We used the same blind testing corpus of 300,000 words containing 20,000 PER, LOC and ORG instances that were truthed in-house originally for benchmarking the existing supervised NE tagger (Srihari, Niu &amp; Li 2000).</Paragraph>
    <Paragraph position="2"> This has the benefit of precisely measuring performance degradation from the supervised learning to unsupervised learning. The performance of our supervised NE tagger using the MUC scorer is shown in Table 1.</Paragraph>
    <Paragraph position="3">  To benchmark the quality of the automatically constructed corpus (Table 2), the testing corpus is first processed by our parser and then saved into the repository. The repository level NE classification scheme, as discussed in section 4, is applied. From the recognized NE instances, the instances occurring in the testing corpus are compared with the answer key.</Paragraph>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
PERSON 94.3%
LOCATION 91.7%
ORGANIZATION 88.5%
</SectionTitle>
    <Paragraph position="0"> To benchmark the performance of the HMM tagger, the testing corpus is parsed. The noun chunks with proper name POS tags (NNP and NNPS) are extracted as NE candidates. The preceding word and the succeeding word of the NE candidates are also extracted. Then we apply the HMM to the NE candidates with their neighboring context. The NE classification results are shown in  Compared with our existing supervised NE tagger, the degradation using the presented bootstrapping method for PER NE, LOC NE, and ORG NE are 5%, 6%, and 34% respectively.</Paragraph>
    <Paragraph position="1"> The performance for PER and LOC are above 80%, approaching the performance of supervised learning. The reason for the low recall of ORG (~50%) is not difficult to understand. For PERSON and LOCATION, a few concept-based seeds seem to be sufficient in covering their sub-types (e.g. the sub-types COUNTRY, CITY, etc for LOCATION). But there are hundreds of sub-types of ORG that cannot be covered by less than a dozen concept-based seeds, which we used. As a result, the recall of ORG is significantly affected. Due to the same fact that ORG contains many more sub-types, the results are also noisier, leading to lower precision than that of the other two NE types. Some threshold can be introduced, e.g.</Paragraph>
    <Paragraph position="2"> perplexity per word, to remove spurious ORG tags in improving the precision. As for the recall issue, fortunately, in a real-life application, the organization type that a user is interested in usually is in a fairly narrow spectrum. We believe that the performance will be better if only company names or military organization names are targeted.</Paragraph>
    <Paragraph position="3"> In addition to the key NE types in MUC, our system is able to recognize another NE type, namely, PRODUCT (PRO) NE. We instructed our truthing team to add this NE type into the testing corpus which contains ~2,000 PRO instances.</Paragraph>
    <Paragraph position="4"> Table 4 shows the performance of the HMM on the PRO tag.</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
TYPE PRECISION RECALL F-MEASURE
</SectionTitle>
    <Paragraph position="0"> PRODUCT 67.3% 72.5% 69.8% Similar to the case of ORG NEs, the number of concept-based seeds is found to be insufficient to cover the variations of PRO subtypes. So the performance is not as good as PER and LOC NEs.</Paragraph>
    <Paragraph position="1"> Nevertheless, the benchmark shows the system works fairly effectively in extracting the user-specified NEs. It is noteworthy that domain knowledge such as knowing the major sub-types of the user-specified NE type is valuable in assisting the selection of appropriate concept-based seeds for performance enhancement.</Paragraph>
    <Paragraph position="2"> The performance of our HMM tagger is comparable with the reported performance in (Collins &amp; Singer 1999). But our benchmarking is more extensive as we used a much larger data set (20,000 NE instances in the testing corpus) than theirs (1,000 NE instances).</Paragraph>
  </Section>
class="xml-element"></Paper>