<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0423">
  <Title>Named Entity Recognition with a Maximum Entropy Approach</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Feature Representation
</SectionTitle>
    <Paragraph position="0"> We present two systems: a system ME1 that does not make use of any external knowledge base other than the training data, and a system ME2 that makes use of additional features derived from name lists. ME1 is used for both English and German. For German, however, for features that made use of the word string, the lemma (provided in the German training and test data) is used instead of the actual word.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Lists derived from training data
</SectionTitle>
      <Paragraph position="0"> The training data is first preprocessed to compile a number of lists that are used by both ME1 and ME2. These lists are derived automatically from the training data.</Paragraph>
      <Paragraph position="1"> Frequent Word List (FWL) This list consists of words that occur in more than 5 different documents.</Paragraph>
      <Paragraph position="2"> Useful Unigrams (UNI) For each name class, words that precede the name class are ranked using correlation metric (Chieu and Ng, 2002a), and the top 20 are compiled into a list.</Paragraph>
      <Paragraph position="3"> Useful Bigrams (UBI) This list consists of bigrams of words that precede a name class. Examples are &amp;quot;CITY OF&amp;quot;, &amp;quot;ARRIVES IN&amp;quot;, etc. The list is compiled by taking bigrams with higher probability to appear before a name class than the unigram itself (e.g., &amp;quot;CITY OF&amp;quot; has higher probability to appear before a location than &amp;quot;OF&amp;quot;). A list is collected for each name class. We have attempted to use bigrams that appear after a name class, but for English at least, we have been unable to compile any such meaningful bigrams. A possible explanation is that in writing, people tend to explain with bigrams such as &amp;quot;CITY OF&amp;quot; before mentioning the name itself.</Paragraph>
      <Paragraph position="4"> Useful Word Suffixes (SUF) For each word in a name class, three-letter suffixes with high correlation metric score are collected. This is especially important for the MISC class, where suffixes such as &amp;quot;IAN&amp;quot; and &amp;quot;ISH&amp;quot; often appear.</Paragraph>
      <Paragraph position="5"> Useful Name Class Suffixes (NCS) A suffix list is compiled for each name class. These lists capture tokens that frequently terminate a particular name class. For example, the ORG class often terminates with tokens such as INC and COMMITTEE, and the MISC class often terminates with CUP, OPEN, etc.</Paragraph>
      <Paragraph position="6"> Function Words (FUN) Lower case words that occur within a name class. These include &amp;quot;van der&amp;quot;, &amp;quot;of&amp;quot;, etc.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Local Features
</SectionTitle>
      <Paragraph position="0"> The basic features used by both ME1 and ME2 can be divided into two classes: local and global (Chieu and Ng, 2002b). Local features of a token w are those that are derived from the sentence containing w. Global features are derived by looking up other occurrences of w within the same document.</Paragraph>
      <Paragraph position="1"> In this paper, w[?]i refers to the ith word before w, and w+i refers to the ith word after w. The features used are similar to those used in (Chieu and Ng, 2002b). Local features include: First Word, Case, and Zone For English, each document is segmented by simple rules into 4 zones: headline (HL), author (AU), dateline (DL), and text (TXT). To identify the zones, a DL sentence is first identified using a regular expression. The system then looks for an AU sentence that occurs before DL using another regular expression. All sentences other than AU that occur before the DL sentence are then taken to be in the HL zone. Sentences after the DL sentence are taken to be in the TXT zone. If no DL sentence can be found in a document, then the first sentence of the document is taken as HL, and the rest as TXT. For German, the first sentence of each document is taken as HL, and the rest as TXT. Zone is used as part of the following features: If w starts with a capital letter (i.e., initCaps), and it is the first word of a sentence, a feature (firstword-initCaps, zone) is set to 1. If it is initCaps but not the first word, a feature (initCaps, zone) is set to 1. If it is the first word but not initCaps, (firstword-notInitCaps, zone) is set to 1. If it is made up of all capital letters, then (allCaps, zone) is set to 1. If it starts with a lower case letter, and contains both upper and lower case letters, then (mixedCaps, zone) is set to 1. A token that is allCaps will also be initCaps.</Paragraph>
      <Paragraph position="2"> Case and Zone of w+1 and w[?]1 Similarly, if w+1 (or w[?]1) is initCaps, a feature (initCaps, zone)NEXT (or (initCaps, zone)PREV ) is set to 1, etc.</Paragraph>
      <Paragraph position="3"> Case Sequence Suppose both w[?]1 and w+1 are init-Caps. Then if w is initCaps, a feature I is set to 1, else a feature NI is set to 1.</Paragraph>
      <Paragraph position="4"> Token Information These features are based on the string w, such as contains-digits, contains-dollar-sign, etc (Chieu and Ng, 2002b).</Paragraph>
      <Paragraph position="5"> Lexicon Feature The string of w is used as a feature.</Paragraph>
      <Paragraph position="6"> This group contains a large number of features (one for each token string present in the training data).</Paragraph>
      <Paragraph position="7"> Lexicon Feature of Previous and Next Token The string of the previous token w[?]1 and the next token w+1 is used with the initCaps information of w. If w has init-Caps, then a feature (initCaps, w+1)NEXT is set to 1. If w is not initCaps, then (not-initCaps, w+1)NEXT is set to  1. Same for w[?]1.</Paragraph>
      <Paragraph position="8"> Hyphenated Words Hyphenated words w of the form s1-s2 have a feature U-U set to 1 if both s1 and s2 are initCaps. If s1 is initCaps but not s2, then the features U=s1, L=s2, and U-L are set to 1. If s2 is initCaps but not s1, then the features U=s2, L=s1, and L-U are set to 1.</Paragraph>
      <Paragraph position="9">  Within Quotes/Brackets Sequences of tokens within quotes or brackets have a feature to indicate that they are within quotes. We found this feature useful for MISC class, where names such as movie names often appear within quotes.</Paragraph>
      <Paragraph position="10"> Rare Words If w is not found in FWL, then this feature is set to 1.</Paragraph>
      <Paragraph position="11"> Bigrams If (w[?]2,w[?]1) is found in UBI for the name class nc, then the feature BI-nc is set to 1.</Paragraph>
      <Paragraph position="12"> Word Suffixes If w has a 3-letter suffix that can be found in SUF for the name class nc, then the feature SUF-nc is set to 1.</Paragraph>
      <Paragraph position="13"> Class Suffixes For w in a consecutive sequence of initCaps tokens (w,w+1,...,w+n), if any of the tokens from w+1 to w+n is found in the NCS list of the name class nc, then the feature NCS-nc is set to 1.</Paragraph>
      <Paragraph position="14"> Function Words If w is part of a sequence found in FUN, then this feature is set to 1.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Global Features
</SectionTitle>
      <Paragraph position="0"> The global features include: Unigrams If another occurrence of w in the same document has a previous word wp that can be found in UNI, then these words are used as features Otheroccurrence-prev=wp. null Bigrams If another occurrence of w has the feature BI-nc set to 1, then w will have the feature OtherBInc set to 1.</Paragraph>
      <Paragraph position="1"> Class Suffixes If another occurrence of w has the feature NCS-nc set to 1, then w will have the feature OtherNCS-nc set to 1.</Paragraph>
      <Paragraph position="2"> InitCaps of Other Occurrences This feature checks for whether the first occurrence of the same word in an unambiguous position (non first-words in the TXT zone) in the same document is initCaps or not. For a word whose initCaps might be due to its position rather than its meaning (in headlines, first word of a sentence, etc), the case information of other occurrences might be more accurate than its own.</Paragraph>
      <Paragraph position="3"> Acronyms Words made up of all capitalized letters in the text zone will be stored as acronyms (e.g., IBM). The system will then look for sequences of initial capitalized words that match the acronyms found in the whole document. Such sequences are given additional features of A begin, A continue, or A end, and the acronym is given a feature A unique. For example, if FCC and Federal Communications Commission are both found in a document, then Federal has A begin set to 1, Communications has A continue set to 1, Commission has A end set to 1, and FCC has A unique set to 1.</Paragraph>
      <Paragraph position="4"> Sequence of InitCaps In the sentence Even News Broadcasting Corp., noted for its accurate reporting, made the erroneous announcement., a NER may mistake Even News Broadcasting Corp. as an organization name.</Paragraph>
      <Paragraph position="5"> However, it is unlikely that other occurrences of News Broadcasting Corp. in the same document also co-occur with Even. This group of features attempts to capture such information. For every sequence of initial capitalized words, its longest substring that occurs in the same document as a sequence of initCaps is identified. For this example, since the sequence Even News Broadcasting Corp. only appears once in the document, its longest sub-string that occurs in the same document is News Broadcasting Corp. In this case, News has an additional feature of I begin set to 1, Broadcasting has an additional feature of I continue set to 1, and Corp. has an additional feature of I end set to 1.</Paragraph>
      <Paragraph position="6"> Name Class of Previous Occurrences The name class of previous occurrences of w is used as a feature, similar to (Zhou and Su, 2002). We use the occurrence where w is part of the longest name class phrase (name class with the most number of tokens). For example, if w is the second token in a person name class phrase of 5 tokens, then a feature 2Person5 is set to 1. During training, the name classes are known. During testing, the name classes are the ones already assigned to tokens in the sentences already processed.</Paragraph>
      <Paragraph position="7"> This last feature makes the order of processing important. As HL sentences usually contain less context, they are processed after the other sentences.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Name List
</SectionTitle>
      <Paragraph position="0"> In additional to the above features used by both ME1 and ME2, ME2 uses additional features derived from name lists compiled from a variety of sources. These sources are the Internet and the list provided by the organizers of this shared task. The list is a mapping of sequences of words to name classes. An example of an entry in the list is &amp;quot;JOHN KENNEDY : PERSON&amp;quot;. Words that are part of a sequence of words mapped to a name class nc will have a feature CLASS=nc set to 1. Another list of weekdays and month names is also used in the same way. For ME2, we have also manually added additional entries into the automatically compiled NCS lists.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>