<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1061"> <Title>Teaching a Weaker Classifier: Named Entity Recognition on Upper Case Text</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> A considerable amount of work has been done in recent years on NERs, partly due to the Message Understanding Conferences (MUC-6, 1995; MUC-7, 1998). Machine learning systems such as BBN's IdentiFinder (Bikel, Schwartz, and Weischedel, 1999) and Borthwick's MENE (Borthwick, 1999) have shown that machine learning NERs can achieve performance comparable to that of systems using hand-coded rules. Bikel, Schwartz, and Weischedel (1999) have also shown how mixed case text can be automatically converted to upper case SNOR or OCR format to train NERs to work on such formats. There is also some work on unsupervised learning for mixed case named entity recognition (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999). Collins and Singer (1999) investigated named entity classification using AdaBoost, CoBoost, and the EM algorithm. However, features were extracted using a parser, and performance was evaluated differently (the classes were person, organization, location, and noise). Cucerzan and Yarowsky (1999) built a cross-language NER, and the performance on English was low compared to supervised single-language NERs such as IdentiFinder. We suspect that it will be hard for purely unsupervised methods to perform as well as supervised ones.</Paragraph> <Paragraph position="1"> Seeger (2001) gave a comprehensive summary of recent work in learning with labeled and unlabeled data. There is much recent research on co-training, such as (Blum and Mitchell, 1998; Collins and Singer, 1999; Pierce and Cardie, 2001). Most co-training methods involve using two classifiers built on different sets of features. Instead of using distinct sets of features, Goldman and Zhou (2000) used different classification algorithms to do co-training. Blum and Mitchell (1998) showed that in order for PAC-like guarantees to hold for co-training, features should be divided into two disjoint sets satisfying: (1) each set is sufficient for a classifier to learn a concept correctly; and (2) the two sets are conditionally independent of each other. Each set of features can be used to build a classifier, resulting in two independent classifiers, A and B. Classifications by A on unlabeled data can then be used to further train classifier B, and vice versa. Intuitively, the independence assumption is there so that the classifications of A are informative to B. When the independence assumption is violated, the decisions of A may not be informative to B. In this case, the positive effect of having more data may be offset by the negative effect of introducing noise into the data (classifier A might not always be correct).</Paragraph> <Paragraph position="2"> Nigam and Ghani (2000) investigated the difference in performance with and without a feature split, and showed that co-training with a feature split gives better performance. However, the comparison they made is between co-training and self-training. In self-training, only one classifier is used to tag unlabeled data, after which the more confidently tagged data is reused to train the same classifier.</Paragraph>
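To make the co-training procedure discussed above concrete, the following is a minimal sketch (not from any of the cited papers); the two feature views, the scikit-learn-style classifiers with fit/predict_proba, and the confidence threshold are all assumptions made for illustration.

```python
# Minimal co-training sketch (illustrative only, not the implementation of any cited paper).
# Assumes two feature "views" of the same examples and classifiers with
# fit / predict / predict_proba, e.g. scikit-learn style estimators.

def cotrain(clf_a, clf_b, labeled_a, labeled_b, labels,
            unlabeled_a, unlabeled_b, rounds=10, threshold=0.95):
    """Each round, confident labels from one view become training data for both."""
    X_a, X_b, y = list(labeled_a), list(labeled_b), list(labels)
    pool = list(range(len(unlabeled_a)))
    for _ in range(rounds):
        clf_a.fit(X_a, y)
        clf_b.fit(X_b, y)
        newly_added = []
        for i in pool:
            proba_a = max(clf_a.predict_proba([unlabeled_a[i]])[0])
            proba_b = max(clf_b.predict_proba([unlabeled_b[i]])[0])
            if proba_a >= threshold or proba_b >= threshold:
                # Trust the more confident view's prediction.
                label = (clf_a.predict([unlabeled_a[i]])[0]
                         if proba_a >= proba_b
                         else clf_b.predict([unlabeled_b[i]])[0])
                X_a.append(unlabeled_a[i])
                X_b.append(unlabeled_b[i])
                y.append(label)
                newly_added.append(i)
        pool = [i for i in pool if i not in newly_added]
        if not newly_added:
            break
    return clf_a, clf_b
```

The loop also makes the risk noted above visible: once one view mislabels an example confidently, that noise is fed back into both classifiers.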
<Paragraph position="3"> Many natural language processing problems do not show the natural feature split displayed by the web page classification task studied in previous co-training work. Our work does not really fall under the paradigm of co-training. Instead of co-operation between two classifiers, we use a stronger classifier to teach a weaker one. In addition, it exhibits the following differences: (1) the features are not at all independent (upper case features can be seen as a subset of the mixed case features); and (2) the additional features available to the mixed case system will never be available to the upper case system. Co-training often involves combining the two different sets of features to obtain a final system that outperforms either system alone. In our context, however, the upper case system will never have access to some of the case-based features available to the mixed case system.</Paragraph> <Paragraph position="4"> For this reason, it is unreasonable to expect the performance of the upper case NER to match that of the mixed case NER. However, we still manage to achieve a considerable reduction of errors between the two NERs when they are tested on the official MUC-6 and MUC-7 test data.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> We use the maximum entropy framework to build two classifiers: an upper case NER and a mixed case NER. The upper case NER does not have access to case information in the training and test data, and hence cannot make use of all the features used by the mixed case NER. We will first describe how the mixed case NER is built. More details of this mixed case NER and its performance are given in (Chieu and Ng, 2002). Our approach is similar to the MENE system of (Borthwick, 1999). Each word is assigned a name class based on its features.</Paragraph> <Paragraph position="1"> Each name class N is subdivided into 4 sub-classes, i.e., N_begin, N_continue, N_end, and N_unique. Hence, there is a total of 29 classes (7 name classes x 4 sub-classes + 1 not-a-name class).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Maximum Entropy </SectionTitle> <Paragraph position="0"> The maximum entropy framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from training data, expressing some relationship between features and outcome. The probability distribution that satisfies the above property is the one with the highest entropy. It is unique, agrees with the maximum-likelihood distribution, and has the exponential form $p(o \mid h) = \frac{1}{Z(h)} \prod_{j=1}^{k} \alpha_j^{f_j(h,o)}$, where $o$ refers to the outcome, $h$ the history (or context), and $Z(h)$ is a normalization function. In addition, each feature function $f_j(h,o)$ is a binary function. For example, in predicting if a word belongs to a word class, $o$ is either true or false, and $h$ refers to the surrounding context of the word.</Paragraph> <Paragraph position="1"> The parameters $\alpha_j$ are estimated by a procedure called Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972). This is an iterative method that improves the estimation of the parameters at each iteration.</Paragraph> </Section>
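As a concrete illustration of the exponential form above, the toy Python sketch below evaluates p(o | h) for a hand-built model; the outcomes, feature functions, and parameter values are invented for illustration and are not the features or weights of the actual system.

```python
# Toy maximum entropy model in the exponential form
#   p(o | h) = (1 / Z(h)) * prod_j alpha_j ** f_j(h, o)
# with binary feature functions f_j; in practice the alphas are estimated by GIS.

def prob(outcome, history, features, alphas, outcomes):
    """features: list of binary functions f_j(history, outcome) returning 0 or 1."""
    def unnorm(o):
        score = 1.0
        for f, alpha in zip(features, alphas):
            if f(history, o):
                score *= alpha
        return score
    z = sum(unnorm(o) for o in outcomes)   # normalization Z(h)
    return unnorm(outcome) / z

# Hypothetical example: is the current token the beginning of a person name?
outcomes = ["person_begin", "not-a-name"]
features = [
    lambda h, o: int(o == "person_begin" and h["prev_word"] in {"Mr.", "Dr."}),
    lambda h, o: int(o == "person_begin" and h["init_caps"]),
]
alphas = [3.5, 2.0]   # made-up parameter values

h = {"prev_word": "Mr.", "init_caps": True}
print(prob("person_begin", h, features, alphas, outcomes))  # 0.875 for this toy model
```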
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Features for Mixed Case NER </SectionTitle> <Paragraph position="0"> The features we used can be divided into 2 classes: local and global. Local features are based on neighboring tokens, as well as the token itself. Global features are extracted from other occurrences of the same token in the whole document.</Paragraph> <Paragraph position="1"> Features in the maximum entropy framework are binary. Feature selection is implemented using a feature cutoff: features seen less than a small count during training will not be used. We group the features used into feature groups. Each group can be made up of many binary features. For each token $w$, zero, one, or more of the features in each group are set to 1.</Paragraph> <Paragraph position="2"> The local feature groups are: Non-Contextual Feature: This feature is set to 1 for all tokens. This feature imposes constraints that are based on the probability of each name class during training.</Paragraph> <Paragraph position="3"> Zone: MUC data contains SGML tags, and a document is divided into zones (e.g., headlines and text zones). The zone to which a token belongs is used as a feature. For example, in MUC-6, there are four zones (TXT, HL, DATELINE, DD). Hence, for each token, one of the four features zone-TXT, zone-HL, zone-DATELINE, or zone-DD is set to 1, and the other 3 are set to 0.</Paragraph> <Paragraph position="4"> Case and Zone: If the token $w$ starts with a capital letter (initCaps), then an additional feature (initCaps, zone) is set to 1. If it is made up of all capital letters, then (allCaps, zone) is set to 1. If it contains both upper and lower case letters, then (mixedCaps, zone) is set to 1. A token that is allCaps will also be initCaps. This group consists of (3 x total number of possible zones) features.</Paragraph> <Paragraph position="5"> Case and Zone of $w_{+1}$ and $w_{-1}$: Similarly, if $w_{+1}$ (or $w_{-1}$) is initCaps, a corresponding feature (initCaps, zone)$_{NEXT}$ (or (initCaps, zone)$_{PREV}$) is set to 1, etc.</Paragraph> <Paragraph position="6"> Token Information: This group consists of 10 features based on the string $w$, as listed in Table 1. For example, if a token starts with a capital letter and ends with a period (such as Mr.), then the feature InitCapPeriod is set to 1, etc.</Paragraph> <Paragraph position="7"> First Word: This feature group contains only one feature, firstword. If the token is the first word of a sentence, then this feature is set to 1. Otherwise, it is set to 0.</Paragraph> <Paragraph position="8"> Lexicon Feature: The string of the token $w$ is used as a feature. This group contains a large number of features (one for each token string present in the training data). At most one feature in this group will be set to 1. If $w$ is seen infrequently during training (less than a small count), then $w$ will not be selected as a feature and all features in this group are set to 0.</Paragraph> <Paragraph position="9"> Lexicon Feature of Previous and Next Token: The string of the previous token $w_{-1}$ and the next token $w_{+1}$ is used together with the initCaps information of $w$. If $w$ has initCaps, then a feature (initCaps, $w_{+1}$)$_{NEXT}$ is set to 1. If $w$ is not initCaps, then (not-initCaps, $w_{+1}$)$_{NEXT}$ is set to 1. The same applies for $w_{-1}$. In the case where the next token $w_{+1}$ is a hyphen, then $w_{+2}$ is also used as a feature: (initCaps, $w_{+2}$)$_{NEXT}$ is set to 1. This is because in many cases, the use of hyphens can be considered to be optional (e.g., &quot;third-quarter&quot; or &quot;third quarter&quot;).</Paragraph>
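The sketch below illustrates how a few of the local feature groups described above could be computed for a single token; it is not the authors' implementation, and the regular expressions, the ContainsDigit feature name, and the token/zone inputs are assumptions made for illustration.

```python
import re

# Illustrative extraction of a few local feature groups for one token.
# Returns the set of binary feature names that are "set to 1".

def local_features(tokens, i, zone):
    w = tokens[i]
    feats = {"non-contextual", f"zone-{zone}"}

    # Case and Zone features for w (a token that is allCaps is also initCaps).
    if w[0].isupper():
        feats.add(f"(initCaps,{zone})")
    if w.isupper():
        feats.add(f"(allCaps,{zone})")
    if any(c.isupper() for c in w) and any(c.islower() for c in w):
        feats.add(f"(mixedCaps,{zone})")

    # A few Token Information features (Table 1 lists ten in total).
    if re.fullmatch(r"[A-Z][a-zA-Z]*\.", w):
        feats.add("InitCapPeriod")
    if re.fullmatch(r"[A-Z]", w):
        feats.add("OneCap")
    if any(c.isdigit() for c in w):
        feats.add("ContainsDigit")      # hypothetical name for a digit feature

    # First Word feature.
    if i == 0:
        feats.add("firstword")

    # Lexicon feature of the next token, conditioned on the case of w.
    if i + 1 < len(tokens):
        prefix = "initCaps" if w[0].isupper() else "not-initCaps"
        feats.add(f"({prefix},{tokens[i+1]})-NEXT")

    return feats

print(local_features(["Mr.", "Barry", "spoke", "yesterday"], 1, "TXT"))
```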
<Paragraph position="10"> Out-of-Vocabulary: We derived a lexicon list from WordNet 1.6, and words that are not found in this list have a feature out-of-vocabulary set to 1.</Paragraph> <Paragraph position="11"> Dictionaries: Due to the limited amount of training material, name dictionaries have been found to be useful in the named entity task. The sources of our dictionaries are listed in Table 2. A token $w$ is tested against the words in each of the four lists of location names, corporate names, person first names, and person last names. If $w$ is found in a list, the corresponding feature for that list will be set to 1. For example, if Barry is found in the list of person first names, then the feature PersonFirstName will be set to 1. Similarly, the tokens $w_{+1}$ and $w_{-1}$ are tested against each list, and if found, a corresponding feature will be set to 1. For example, if $w_{+1}$ is found in the list of person first names, the feature PersonFirstName$_{NEXT}$ is set to 1.</Paragraph> <Paragraph position="12"> Month Names, Days of the Week, and Numbers: If $w$ is one of January, February, . . . , December, then the feature MonthName is set to 1. If $w$ is one of Monday, Tuesday, . . . , Sunday, then the feature DayOfTheWeek is set to 1. If $w$ is a number string (such as one, two, etc.), then the feature NumberString is set to 1.</Paragraph> <Paragraph position="13"> Suffixes and Prefixes: This group contains only two features: Corporate-Suffix and Person-Prefix. Two lists, Corporate-Suffix-List (for corporate suffixes) and Person-Prefix-List (for person prefixes), are collected from the training data. For a token $w$ that is in a consecutive sequence of initCaps tokens $(w_{-m}, \ldots, w_{+n})$, if any of the tokens from $w_{+1}$ to $w_{+n}$ is found in Corporate-Suffix-List, then the feature Corporate-Suffix is set to 1. If any of the tokens from $w_{-m-1}$ to $w_{-1}$ is found in Person-Prefix-List, then the feature Person-Prefix is set to 1. Note that we check for $w_{-m-1}$, the word preceding the consecutive sequence of initCaps tokens, since person prefixes like Mr., Dr., etc. are not part of person names, whereas corporate suffixes like Corp., Inc., etc. are part of corporate names.</Paragraph> <Paragraph position="14"> The global feature groups are: InitCaps of Other Occurrences: There are 2 features in this group, checking whether the first occurrence of the same word in an unambiguous position (non-first-words in the TXT or TEXT zones) in the same document is initCaps or not-initCaps. For a word whose initCaps might be due to its position rather than its meaning (in headlines, the first word of a sentence, etc.), the case information of other occurrences might be more accurate than its own.</Paragraph> <Paragraph position="15"> Corporate Suffixes and Person Prefixes of Other Occurrences: With the same Corporate-Suffix-List and Person-Prefix-List used in the local features, for a token $w$ seen elsewhere in the same document with one of these suffixes (or prefixes), another feature Other-CS (or Other-PP) is set to 1.</Paragraph>
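The Suffixes-and-Prefixes check is the most involved of the dictionary-style features, so here is a small illustrative sketch; the two word lists are tiny made-up stand-ins for the lists collected from the training data, and the helper names are hypothetical.

```python
# Illustrative Suffixes-and-Prefixes check for a token inside a consecutive
# sequence of initCaps tokens (w_-m, ..., w_+n). Not the authors' code.

CORPORATE_SUFFIX_LIST = {"Corp.", "Inc.", "Ltd."}   # stand-in lists
PERSON_PREFIX_LIST = {"Mr.", "Mrs.", "Dr."}

def init_caps(tok):
    return tok[:1].isupper()

def suffix_prefix_features(tokens, i):
    """Return the subset of {Corporate-Suffix, Person-Prefix} set to 1 for tokens[i]."""
    if not init_caps(tokens[i]):
        return set()
    # Expand to the maximal consecutive initCaps sequence containing tokens[i].
    start = i
    while start > 0 and init_caps(tokens[start - 1]):
        start -= 1
    end = i
    while end + 1 < len(tokens) and init_caps(tokens[end + 1]):
        end += 1

    feats = set()
    # Corporate suffixes are part of the name: look within the sequence, after w.
    if any(t in CORPORATE_SUFFIX_LIST for t in tokens[i + 1:end + 1]):
        feats.add("Corporate-Suffix")
    # Person prefixes are not part of the name: include the word just before the sequence.
    if any(t in PERSON_PREFIX_LIST for t in tokens[max(start - 1, 0):i]):
        feats.add("Person-Prefix")
    return feats

sent = ["Mr.", "Barry", "Smith", "joined", "Acme", "Corp.", "yesterday"]
print(suffix_prefix_features(sent, 2))   # token "Smith" -> {'Person-Prefix'}
print(suffix_prefix_features(sent, 4))   # token "Acme"  -> {'Corporate-Suffix'}
```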
<Paragraph position="16"> Acronyms: Words made up of all capitalized letters in the text zone will be stored as acronyms (e.g., IBM). The system will then look for sequences of initial capitalized words that match the acronyms found in the whole document. Such sequences are given additional features A_begin, A_continue, or A_end, and the acronym itself is given a feature A_unique. For example, if &quot;FCC&quot; and &quot;Federal Communications Commission&quot; are both found in a document, then &quot;Federal&quot; has A_begin set to 1, &quot;Communications&quot; has A_continue set to 1, &quot;Commission&quot; has A_end set to 1, and &quot;FCC&quot; has A_unique set to 1.</Paragraph> <Paragraph position="17"> Sequence of Initial Caps: In the sentence &quot;Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement.&quot;, an NER may mistake &quot;Even News Broadcasting Corp.&quot; as an organization name. However, it is unlikely that other occurrences of &quot;News Broadcasting Corp.&quot; in the same document also co-occur with &quot;Even&quot;. This group of features attempts to capture such information. For every sequence of initial capitalized words, its longest substring that occurs in the same document is identified. For this example, since the sequence &quot;Even News Broadcasting Corp.&quot; only appears once in the document, its longest substring that occurs in the same document is &quot;News Broadcasting Corp.&quot;. In this case, &quot;News&quot; has an additional feature I_begin set to 1, &quot;Broadcasting&quot; has an additional feature I_continue set to 1, and &quot;Corp.&quot; has an additional feature I_end set to 1.</Paragraph> <Paragraph position="18"> Unique Occurrences and Zone: This group of features indicates whether the word $w$ is unique in the whole document. $w$ needs to be in initCaps to be considered for this feature. If $w$ is unique, then a feature (Unique, Zone) is set to 1, where Zone is the document zone where $w$ appears.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Features for Upper Case NER </SectionTitle> <Paragraph position="0"> All features used for the mixed case NER are used by the upper case NER, except those that require case information.</Paragraph> <Paragraph position="1"> Among the local features, Case and Zone, InitCapPeriod, and OneCap are not used by the upper case NER. Among the global features, only Other-CS and Other-PP are used for the upper case NER, since the other global features require case information. For Corporate-Suffix and Person-Prefix, as the sequence of initCaps tokens is not available in upper case text, only the next word (previous word) is tested for Corporate-Suffix (Person-Prefix).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Testing </SectionTitle> <Paragraph position="0"> During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., person_begin followed by location_unique). To eliminate such sequences, we define a transition probability between word classes $P(c_i \mid c_j)$ to be equal to 1 if the sequence is admissible, and 0 otherwise. The probability of the classes $c_1, \ldots, c_n$ assigned to the words in a sentence $s$ in a document $D$ is then defined as $P(c_1, \ldots, c_n \mid s, D) = \prod_{i=1}^{n} P(c_i \mid s, D) \times P(c_i \mid c_{i-1})$, where $P(c_i \mid s, D)$ is determined by the maximum entropy classifier. A dynamic programming algorithm is then used to select the sequence of word classes with the highest probability.</Paragraph> </Section> </Section>
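Section 3.4's decoding step amounts to a Viterbi-style dynamic program over admissible class sequences. Below is a minimal sketch; maxent_prob stands in for the trained per-token classifier and admissible for the transition check, both hypothetical placeholders.

```python
# Viterbi-style decoding over word classes, following Section 3.4:
#   P(c_1..c_n | s, D) = prod_i  P(c_i | s, D) * P(c_i | c_{i-1}),
# where P(c_i | c_{i-1}) is 1 for admissible transitions and 0 otherwise.
# maxent_prob(i, c) is a hypothetical stand-in for the trained classifier.

def decode(n_tokens, classes, maxent_prob, admissible):
    """Return the most probable admissible class sequence for one sentence."""
    # best[c] = (probability, path) of the best sequence ending in class c.
    best = {c: (maxent_prob(0, c), [c]) for c in classes}
    for i in range(1, n_tokens):
        new_best = {}
        for c in classes:
            candidates = [
                (p * maxent_prob(i, c), path + [c])
                for prev, (p, path) in best.items()
                if admissible(prev, c)   # inadmissible transitions contribute probability 0
            ]
            if candidates:
                new_best[c] = max(candidates, key=lambda x: x[0])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```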
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Teaching Process </SectionTitle> <Paragraph position="0"> The teaching process is illustrated in Figure 2. This process can be divided into the following steps: Training NERs. First, a mixed case NER (MNER) is trained from some initial corpus $T$, manually tagged with named entities. This corpus is also converted to upper case in order to train another upper case NER (UNER). UNER is required by our method of example selection.</Paragraph> <Paragraph position="1"> Baseline Test on Unlabeled Data. Apply the trained MNER to some unlabeled mixed case texts to produce mixed case texts that are machine-tagged with named entities (text-mner-tagged). Convert the original unlabeled mixed case texts to upper case, and similarly apply the trained UNER to these texts to obtain upper case texts machine-tagged with named entities (text-uner-tagged).</Paragraph> <Paragraph position="2"> Example Selection. Compare text-mner-tagged and text-uner-tagged and select the tokens for which the classification by MNER differs from that of UNER. The class assigned by MNER is considered to be correct, and will be used as new training data. These tokens are collected into a set $T'$.</Paragraph> <Paragraph position="3"> Retraining for Final Upper Case NER. Both $T$ and $T'$ are used to retrain an upper case NER. However, tokens from $T$ are given a weight of 2 (i.e., each token is used twice in the training data), and tokens from $T'$ a weight of 1, since $T$ is more reliable than $T'$ (human-tagged versus machine-tagged).</Paragraph> </Section> </Paper>
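The four-step teaching process above can be summarized in the following sketch; train_ner, tag_tokens, and to_upper are hypothetical helpers, and the weighting is implemented by the simple duplication described in the retraining step.

```python
# Sketch of the teaching process in Section 4 (illustrative only).
# train_ner, tag_tokens, and to_upper are hypothetical helpers; corpora are
# lists of (word, name_class) tokens, and to_upper maps a token list to its
# upper-cased copy.

def teach_upper_case_ner(T, unlabeled_tokens, train_ner, tag_tokens, to_upper):
    # Step 1: train MNER on T and an initial UNER on the upper-cased version of T.
    mner = train_ner(T)
    uner = train_ner(to_upper(T))

    # Step 2: tag the unlabeled text with both NERs (UNER sees it upper-cased).
    mner_tags = tag_tokens(mner, unlabeled_tokens)
    uner_tags = tag_tokens(uner, to_upper(unlabeled_tokens))

    # Step 3: keep the tokens where the two NERs disagree, trusting MNER's class.
    T_prime = [(word, m_class)
               for (word, m_class), (_, u_class) in zip(mner_tags, uner_tags)
               if m_class != u_class]

    # Step 4: retrain the upper case NER on upper-cased T (weight 2, i.e. each
    # token appears twice) plus upper-cased T' (weight 1).
    training_data = 2 * to_upper(T) + to_upper(T_prime)
    return train_ner(training_data)
```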