<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1025"> <Title>Named Entity Recognition: A Maximum Entropy Approach Using Global Information</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> The system described in this paper is similar to the MENE system of (Borthwick, 1999). It uses a maximum entropy framework and classifies each word given its features. Each name class a3 is subdivided into 4 sub-classes, i.e., N begin, N continue, N end, and N unique. Hence, there is a total of 29 classes (7 name classes a0 4 sub-classes a1 1 not-a-name class).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Maximum Entropy </SectionTitle> <Paragraph position="0"> The maximum entropy framework estimates probabilities based on the principle of making as few assumptions as possible, other than the constraints imposed. Such constraints are derived from training data, expressing some relationship between features and outcome. The probability distribution that satisfies the above property is the one with the highest entropy. It is unique, agrees with the maximum-likelihood distribution, and has the exponential form (Della Pietra et al., 1997):</Paragraph> <Paragraph position="2"> where a15 refers to the outcome, a2 the history (or context), and a7 a1a20a2 a9 is a normalization function. In addition, each feature function a30 a11 a1a8a2 a11a14a15a18a9 is a binary function. For example, in predicting if a word belongs to a word class, a15 is either true or false, and a2 refers to the surrounding context:</Paragraph> <Paragraph position="4"> a11 are estimated by a procedure called Generalized Iterative Scaling (GIS) (Darroch and Ratcliff, 1972). This is an iterative method that improves the estimation of the parameters at each iteration. We have used the Java-based opennlp maximum entropy package1.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Testing </SectionTitle> <Paragraph position="0"> During testing, it is possible that the classifier produces a sequence of inadmissible classes (e.g., person begin followed by location unique). To eliminate such sequences, we define a transition probability between word classes a0 a1 a17a2a1 a5a8a17 a11 a9 to be equal to 1 if the sequence is admissible, and 0 otherwise. The probability of the classes a17 a14 a11a4a3a5a3a5a3 a11a14a17a7a6 assigned to the words in a sentence a8 in a document</Paragraph> <Paragraph position="2"> where a0 a1a4a17 a1 a5a8 a11a14a13a16a9 is determined by the maximum entropy classifier. A dynamic programming algorithm is then used to select the sequence of word classes with the highest probability.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Feature Description </SectionTitle> <Paragraph position="0"> The features we used can be divided into 2 classes: local and global. Local features are features that are based on neighboring tokens, as well as the token itself. Global features are extracted from other occurrences of the same token in the whole document.</Paragraph> <Paragraph position="1"> The local features used are similar to those used in BBN's IdentiFinder (Bikel et al., 1999) or MENE (Borthwick, 1999). 
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Feature Description </SectionTitle>
<Paragraph position="0"> The features we used can be divided into 2 classes: local and global. Local features are features that are based on neighboring tokens, as well as the token itself. Global features are extracted from other occurrences of the same token in the whole document.</Paragraph>
<Paragraph position="1"> The local features used are similar to those used in BBN's IdentiFinder (Bikel et al., 1999) or MENE (Borthwick, 1999). However, to classify a token w, while Borthwick uses tokens from w_{-2} to w_{+2} (from two tokens before to two tokens after w), we used only the tokens w_{-1}, w, and w_{+1}. Even with local features alone, MENERGI outperforms MENE (Borthwick, 1999). This might be because our features are more comprehensive than those used by Borthwick. In IdentiFinder, there is a priority in the feature assignment, such that if one feature is used for a token, another feature lower in priority will not be used. In the maximum entropy framework, there is no such constraint: multiple features can be used for the same token.</Paragraph>
<Paragraph position="2"> Feature selection is implemented using a feature cutoff: features seen less than a small count during training will not be used. We group the features used into feature groups. Each feature group can be made up of many binary features. For each token w, zero, one, or more of the features in each feature group are set to 1.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Local Features </SectionTitle>
<Paragraph position="0"> The local feature groups are: Non-Contextual Feature: This feature is set to 1 for all tokens.</Paragraph>
<Paragraph position="1"> Zone: MUC data contains SGML tags, and a document is divided into zones (e.g., headlines and text zones). The zone to which a token belongs is used as a feature. For example, in MUC-6, there are four zones (TXT, HL, DATELINE, DD). Hence, for each token, one of the four features zone-TXT, zone-HL, zone-DATELINE, or zone-DD is set to 1, and the other 3 are set to 0.</Paragraph>
<Paragraph position="2"> Case and Zone: If the token w starts with a capital letter (initCaps), then an additional feature (initCaps, zone) is set to 1. If it is made up of all capital letters, then (allCaps, zone) is set to 1. If it starts with a lower case letter, and contains both upper and lower case letters, then (mixedCaps, zone) is set to 1. A token that is allCaps will also be initCaps. This group consists of (3 \times total number of possible zones) features.</Paragraph>
<Paragraph position="3"> Case and Zone of w_{+1} and w_{-1}: Similarly, if w_{+1} (or w_{-1}) is initCaps, a feature (initCaps, zone)_NEXT (or (initCaps, zone)_PREV) is set to 1, etc.</Paragraph>
<Paragraph position="4"> Token Information: This group consists of 10 features based on the string w, as listed in Table 1. For example, if a token starts with a capital letter and ends with a period (such as Mr.), then the feature InitCapPeriod is set to 1, etc.</Paragraph>
<Paragraph position="5"> First Word: This feature group contains only one feature, firstword. If the token is the first word of a sentence, then this feature is set to 1. Otherwise, it is set to 0.</Paragraph>
<Paragraph position="6"> Lexicon Feature: The string of the token w is used as a feature. This group contains a large number of features (one for each token string present in the training data). At most one feature in this group will be set to 1. If w is seen infrequently during training (less than a small count), then w will not be selected as a feature and all features in this group are set to 0.</Paragraph>
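As an illustration of how these binary feature groups might be generated, the sketch below derives a few of the local features described above (Zone, Case and Zone, First Word, and the Lexicon feature with its cutoff). The helper names and the cutoff value are assumptions for the example, not the paper's code.

```python
# Sketch of local feature extraction for a single token. Feature-name
# formatting and the cutoff value are illustrative choices.
from collections import Counter

CUTOFF = 3  # assumed: lexicon features seen fewer than 3 times are dropped

def case_of(token):
    if token.isupper():
        return "allCaps"
    if token[0].isupper():
        return "initCaps"
    return None

def local_features(token, zone, is_first_word, lexicon_counts):
    feats = set()
    feats.add("zone-" + zone)                     # Zone group
    case = case_of(token)
    if case:
        feats.add("(%s, %s)" % (case, zone))      # Case and Zone group
        if case == "allCaps":                     # an allCaps token is also initCaps
            feats.add("(initCaps, %s)" % zone)
    if is_first_word:
        feats.add("firstword")                    # First Word group
    if lexicon_counts.get(token, 0) >= CUTOFF:
        feats.add("lexicon=" + token)             # Lexicon group, subject to the cutoff
    return feats

counts = Counter({"Corp.": 57, "McCann": 2})
print(local_features("McCann", "TXT", False, counts))
# {'zone-TXT', '(initCaps, TXT)'} -- McCann falls below the lexicon cutoff
```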
<Paragraph position="7"> Lexicon Feature of Previous and Next Token: The string of the previous token w_{-1} and the next token w_{+1} is used with the initCaps information of w. If w has initCaps, then a feature (initCaps, w_{+1})_NEXT is set to 1, and similarly for w_{-1}. If w_{+1} is a hyphen, then w_{+2} is also used as a feature: (initCaps, w_{+2})_NEXT is set to 1. This is because in many cases, the use of hyphens can be considered to be optional (e.g., third-quarter or third quarter).</Paragraph>
<Paragraph position="10"> Out-of-Vocabulary: We derived a lexicon list from WordNet 1.6, and words that are not found in this list have a feature out-of-vocabulary set to 1.</Paragraph>
<Paragraph position="11"> Dictionaries: Due to the limited amount of training material, name dictionaries have been found to be useful in the named entity task. The importance of dictionaries in NER has been investigated in the literature (Mikheev et al., 1999). The sources of our dictionaries are listed in Table 2. For all lists except locations, the lists are processed into a list of tokens (unigrams). The location list is processed into a list of unigrams and bigrams (e.g., New York). For locations, tokens are matched against unigrams, and sequences of two consecutive tokens are matched against bigrams. A list of words occurring more than 10 times in the training data is also collected (commonWords). Only tokens with initCaps not found in commonWords are tested against each list in Table 2. If they are found in a list, then a feature for that list will be set to 1. For example, if Barry is not in commonWords and is found in the list of person first names, then the feature PersonFirstName will be set to 1. Similarly, the tokens w_{+1} and w_{-1} are tested against each list, and if found, a corresponding feature will be set to 1. For example, if w_{+1} is found in the list of person first names, the feature PersonFirstName_NEXT is set to 1. Month Names, Days of the Week, and Numbers: If w is initCaps and is one of January, February, . . . , December, then the feature MonthName is set to 1. If w is one of Monday, Tuesday, . . . , Sunday, then the feature DayOfTheWeek is set to 1. If w is a number string (such as one, two, etc.), then the feature NumberString is set to 1.</Paragraph>
<Paragraph position="12"> Suffixes and Prefixes: This group contains only two features: Corporate-Suffix and Person-Prefix.</Paragraph>
<Paragraph position="13"> Two lists, Corporate-Suffix-List (for corporate suffixes) and Person-Prefix-List (for person prefixes), are collected from the training data. For corporate suffixes, a list of tokens cslist that occur frequently as the last token of an organization name is collected from the training data. Frequency is calculated by counting the number of distinct previous tokens that each token has (e.g., if Electric Corp. is seen 3 times, and Manufacturing Corp. is seen 5 times during training, and Corp. is not seen with any other preceding tokens, then the "frequency" of Corp. is 2); a sketch of this computation is given below. The most frequently occurring last words of organization names in cslist are compiled into a list of corporate suffixes, Corporate-Suffix-List. A Person-Prefix-List is compiled in an analogous way. For MUC-6, for example, Corporate-Suffix-List is made up of {ltd., associates, inc., co, corp, ltd, inc, committee, institute, commission, university, plc, airlines, co., corp.} and Person-Prefix-List is made up of {succeeding, mr., rep., mrs., secretary, sen., says, minister, dr., chairman, ms.}.
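The "frequency" computation just described can be sketched as follows, mirroring the Electric Corp. / Manufacturing Corp. example from the text. The threshold used to keep the "most frequently occurring" last words is an assumption, since the paper does not give one.

```python
# "Frequency" of a candidate corporate suffix = number of DISTINCT tokens
# that precede it as the last word of an organization name in training data.
from collections import defaultdict

org_names = ["Electric Corp."] * 3 + ["Manufacturing Corp."] * 5

preceding = defaultdict(set)
for name in org_names:
    tokens = name.split()
    if len(tokens) >= 2:
        preceding[tokens[-1]].add(tokens[-2])  # record the distinct previous token

frequency = {suffix: len(prevs) for suffix, prevs in preceding.items()}
print(frequency)  # {'Corp.': 2} -- Corp. is seen after 2 distinct tokens

THRESHOLD = 2  # assumed cutoff for "most frequently occurring"
corporate_suffix_list = [s for s, f in frequency.items() if f >= THRESHOLD]
```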
For a token w that is in a consecutive sequence of initCaps tokens (w_{-m}, \ldots, w, \ldots, w_{+n}), if any of the tokens from w_{+1} to w_{+n} is in Corporate-Suffix-List, then a feature Corporate-Suffix is set to 1. If any of the tokens from w_{-m-1} to w_{-1} is in Person-Prefix-List, then another feature Person-Prefix is set to 1. Note that we check w_{-m-1}, the word preceding the consecutive sequence of initCaps tokens, since person prefixes like Mr., Dr., etc. are not part of person names, whereas corporate suffixes like Corp., Inc., etc. are part of corporate names.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Global Features </SectionTitle>
<Paragraph position="0"> Context from the whole document can be important in classifying a named entity. A name already mentioned previously in a document may appear in abbreviated form when it is mentioned again later.</Paragraph>
<Paragraph position="1"> Previous work deals with this problem by correcting inconsistencies between the named entity classes assigned to different occurrences of the same entity (Borthwick, 1999; Mikheev et al., 1998). We often encounter sentences that are highly ambiguous in themselves, without some prior knowledge of the entities concerned. For example: The McCann family . . . (3) In sentence (1), McCann can be a person or an organization. Sentences (2) and (3) help to disambiguate one way or the other. If all three sentences are in the same document, then even a human will find it difficult to classify McCann in (1) as either person or organization, unless there is some other information provided.</Paragraph>
<Paragraph position="2"> The global feature groups are: InitCaps of Other Occurrences (ICOC): There are 2 features in this group, checking for whether the first occurrence of the same word in an unambiguous position (non-first-words in the TXT or TEXT zones) in the same document is initCaps or not-initCaps. For a word whose initCaps might be due to its position rather than its meaning (in headlines, first word of a sentence, etc.), the case information of other occurrences might be more accurate than its own. For example, in the sentence that starts with "Bush put a freeze on . . .", because Bush is the first word, the initial caps might be due to its position (as in "They put a freeze on . . ."). If somewhere else in the document we see "restrictions put in place by President Bush", then we can be surer that Bush is a name.</Paragraph>
<Paragraph position="3"> Corporate Suffixes and Person Prefixes of Other Occurrences (CSPP): If McCann has been seen as Mr. McCann somewhere else in the document, then one would like to give person a higher probability than organization. On the other hand, if it is seen as McCann Pte. Ltd., then organization will be more probable. With the same Corporate-Suffix-List and Person-Prefix-List used in the local features, for a token w seen elsewhere in the same document with one of these suffixes (or prefixes), another feature Other-CS (or Other-PP) is set to 1.</Paragraph>
<Paragraph position="4"> Acronyms (ACRO): Words made up of all capitalized letters in the text zone will be stored as acronyms (e.g., IBM). The system will then look for sequences of initial capitalized words that match the acronyms found in the whole document. Such sequences are given additional features of A begin, A continue, or A end, and the acronym is given a feature A unique. For example, if FCC and Federal Communications Commission are both found in a document, then Federal has A begin set to 1, Communications has A continue set to 1, Commission has A end set to 1, and FCC has A unique set to 1.</Paragraph>
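A sketch of the ACRO matching using the FCC example above. The paper does not specify the exact matching rule, so this assumes an acronym matches a sequence of initCaps words whose initials spell it out in order.

```python
# Assign A_begin / A_continue / A_end to an initCaps word sequence whose
# initials spell a stored acronym; the acronym itself would get A_unique.
# The initials-in-order matching rule is an assumption.

def acro_features(acronym, sequence):
    """Map token index -> feature name if `sequence` expands `acronym`."""
    if len(sequence) < 2:
        return {}
    initials = "".join(word[0] for word in sequence).upper()
    if initials != acronym:
        return {}
    feats = {0: "A_begin", len(sequence) - 1: "A_end"}
    for i in range(1, len(sequence) - 1):
        feats[i] = "A_continue"
    return feats

sequence = ["Federal", "Communications", "Commission"]
print(acro_features("FCC", sequence))
# {0: 'A_begin', 2: 'A_end', 1: 'A_continue'}; FCC itself gets A_unique set to 1
```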
<Paragraph position="5"> Sequence of Initial Caps (SOIC): In the sentence "Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement.", an NER system may mistake Even News Broadcasting Corp. for an organization name. However, it is unlikely that other occurrences of News Broadcasting Corp. in the same document also co-occur with Even. This group of features attempts to capture such information. For every sequence of initial capitalized words, its longest substring that occurs in the same document as a sequence of initCaps is identified. For this example, since the sequence Even News Broadcasting Corp. only appears once in the document, its longest substring that occurs in the same document is News Broadcasting Corp. In this case, News has an additional feature I begin set to 1, Broadcasting has an additional feature I continue set to 1, and Corp. has an additional feature I end set to 1.</Paragraph>
<Paragraph position="6"> Unique Occurrences and Zone (UNIQ): This group of features indicates whether the word w is unique in the whole document. w needs to be in initCaps to be considered for this feature. If w is unique, then a feature (Unique, Zone) is set to 1, where Zone is the document zone where w appears. As we will see from Table 3, not much improvement is derived from this feature.</Paragraph> </Section> </Section> </Paper>