<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0424">
  <Title>Language Independent NER using a Maximum Entropy Tagger</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The ME Tagger
</SectionTitle>
    <Paragraph position="0"> The ME tagger is based on the POS tagger of Ratnaparkhi (1996) and is described in Curran and Clark (2003). The tagger uses models of the form:</Paragraph>
    <Paragraph position="2"> p(y|x) = (1/Z(x)) exp( Σi λi fi(x,y) ), where y is the tag, x is the context, the fi(x,y) are the features with associated weights λi, and Z(x) is a normalising constant. The probability of a tag sequence y1...yn given a sentence w1...wn is approximated as follows:</Paragraph>
    <Paragraph position="4"> p(y1...yn|w1...wn) ≈ Πi p(yi|xi), where xi is the context for word wi. The tagger uses beam search to find the most probable sequence given the sentence.</Paragraph>
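The factorised model and beam-search decoding described above can be sketched as follows. This is a minimal illustration over hypothetical per-word tag distributions, not the paper's implementation; in the real tagger each distribution p(y|xi) would come from the ME model.

```python
import math

def beam_search(sent_probs, beam_width=3):
    """Find the most probable tag sequence under the factorised model
    p(y1..yn | w1..wn) ~ prod_i p(yi | xi), keeping only the top
    `beam_width` partial sequences at each word."""
    beam = [([], 0.0)]  # (tag sequence so far, log probability)
    for probs in sent_probs:  # probs: dict tag -> p(tag | context)
        candidates = []
        for tags, logp in beam:
            for tag, p in probs.items():
                candidates.append((tags + [tag], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][0]

# Hypothetical per-word tag distributions for a three-word sentence.
sent = [{"O": 0.6, "I-PER": 0.4},
        {"O": 0.2, "I-PER": 0.8},
        {"O": 0.9, "I-PER": 0.1}]
print(beam_search(sent))  # -> ['O', 'I-PER', 'O']
```

Because the contexts here are fixed, the beam result coincides with the greedy choice; in the real tagger the context xi includes previously assigned tags, which is what makes the beam useful.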
    <Paragraph position="5"> The features are binary valued functions which pair a tag with various elements of the context; for example:</Paragraph>
    <Paragraph position="7"> fi(x,y) = 1 if y is the paired tag and the condition on the context x holds, and 0 otherwise; the condition on the context is called a contextual predicate.</Paragraph>
    <Paragraph position="8"> Generalised Iterative Scaling (GIS) is used to estimate the values of the weights. The tagger uses a Gaussian prior over the weights (Chen et al., 1999) which allows a large number of rare, but informative, features to be used without overfitting.</Paragraph>
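A compact sketch of GIS weight estimation for a conditional model, on an illustrative toy problem (not the tagger's implementation, and without the Gaussian prior). The correction feature required by strict GIS is omitted here, as is common in practice:

```python
import math

def gis(data, feats, labels, iters=100):
    """Generalised Iterative Scaling for p(y|x) proportional to
    exp(sum_i lambda_i * f_i(x, y)).
    data: list of (x, gold_y) pairs; feats: binary feature functions
    f(x, y).  C is the maximum feature sum over events."""
    C = max(sum(f(x, y) for f in feats) for x, _ in data for y in labels)
    lam = [0.0] * len(feats)

    def p(y, x):
        scores = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, feats)))
                  for yy in labels}
        return scores[y] / sum(scores.values())

    # Empirical feature counts from the training data.
    emp = [sum(f(x, y) for x, y in data) for f in feats]
    for _ in range(iters):
        # Expected counts under the current model.
        exp = [sum(p(y, x) * f(x, y) for x, _ in data for y in labels)
               for f in feats]
        # GIS update: lambda_i += (1/C) * log(empirical / expected).
        lam = [l + math.log(m / e) / C if m > 0 and e > 0 else l
               for l, m, e in zip(lam, emp, exp)]
    return lam, p

# Toy data: one feature that fires when y == 'A'; 'A' is gold 2/3 of the time.
labels = ['A', 'B']
feats = [lambda x, y: 1.0 if y == 'A' else 0.0]
data = [(0, 'A'), (1, 'A'), (2, 'B')]
lam, p = gis(data, feats, labels)
print(round(p('A', 0), 3))  # -> 0.667 (matches the empirical rate)
```

At convergence the expected count of each feature matches its empirical count, which is the defining property of the ME solution.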
    <Paragraph position="9"> Condition: freq(wi) &lt; 5. Contextual predicates: X is a prefix of wi, |X| ≤ 4; X is a suffix of wi, |X| ≤ 4; wi contains a digit; wi contains an uppercase character; wi contains a hyphen.</Paragraph>
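The rare-word predicates listed above can be sketched as a simple extraction function (an illustrative sketch, not the tagger's code; predicate names are invented here):

```python
def rare_word_predicates(word, max_len=4):
    """Contextual predicates for rare words (freq < 5 in the training
    data): prefixes and suffixes up to length 4, plus simple
    orthographic tests for digits, uppercase and hyphens."""
    preds = []
    for n in range(1, min(len(word), max_len) + 1):
        preds.append(('prefix', word[:n]))
        preds.append(('suffix', word[-n:]))
    if any(c.isdigit() for c in word):
        preds.append(('has_digit',))
    if any(c.isupper() for c in word):
        preds.append(('has_upper',))
    if '-' in word:
        preds.append(('has_hyphen',))
    return preds

print(rare_word_predicates('Mid-town'))
```

For 'Mid-town' this yields prefixes up to 'Mid-', suffixes up to 'town', and the uppercase and hyphen predicates.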
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Data
</SectionTitle>
    <Paragraph position="0"> We used three data sets: the English and German data for the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003) and the Dutch data for the CoNLL-2002 shared task (Tjong Kim Sang, 2002). Each word in the data sets is annotated with a named entity tag plus POS tag, and the words in the German and English data also have a chunk tag. Our system does not currently exploit the chunk tags.</Paragraph>
    <Paragraph position="1"> There are four types of entities to be recognised: persons, locations, organisations, and miscellaneous entities not belonging to the other three classes. The 2002 data uses the IOB-2 format, in which a B-XXX tag indicates the first word of an entity of type XXX and I-XXX is used for subsequent words in an entity of type XXX. The tag O indicates words outside of a named entity. The 2003 data uses IOB-1, a variant of IOB-2, in which I-XXX is used for all words in an entity, including the first word, unless the first word separates contiguous entities of the same type, in which case B-XXX is used.</Paragraph>
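The difference between the two schemes can be made concrete with a small conversion function (a sketch written for this summary, not part of the paper):

```python
def iob1_to_iob2(tags):
    """Convert IOB-1 tags (where B-XXX appears only between contiguous
    same-type entities) to IOB-2 (where every entity-initial word is
    tagged B-XXX)."""
    out = []
    for i, tag in enumerate(tags):
        if tag.startswith('I-'):
            prev = tags[i - 1] if i > 0 else 'O'
            # Entity-initial in IOB-1: previous tag is O or a different type.
            if prev == 'O' or prev[2:] != tag[2:]:
                tag = 'B-' + tag[2:]
        out.append(tag)
    return out

print(iob1_to_iob2(['I-PER', 'I-PER', 'O', 'I-ORG', 'B-ORG']))
# -> ['B-PER', 'I-PER', 'O', 'B-ORG', 'B-ORG']
```

Note that the final B-ORG in the input marks a second organisation entity immediately following the first; it is already entity-initial and passes through unchanged.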
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Feature Set
</SectionTitle>
    <Paragraph position="0"> Table 1 lists the contextual predicates used in our baseline system, which are based on those used in the Curran and Clark (2003) CCG supertagger. The first set of features applies to rare words, i.e. those appearing fewer than 5 times in the training data. The first two kinds of features encode prefixes and suffixes of up to four characters, and the remaining rare word features encode other morphological characteristics. These features are important for tagging unknown and rare words. The remaining features are the word, POS tag, and NE tag history features, using a window size of 2. Note that the NEi−2NEi−1 feature is a composite feature of both the previous and previous-previous NE tags.</Paragraph>
    <Paragraph position="1"> Table 2 lists the additional contextual predicates used in our final system. These orthographic features have proved useful in other NER systems, for example Carreras et al. (2002), Borthwick (1999) and Zhou and Su (2002). Some of the rows in Table 2 describe sets of contextual predicates. The wi is only digits predicates apply to words consisting entirely of digits; they encode the length of the digit string, with separate predicates for lengths 1-4 and a single predicate for lengths greater than 4. Titlecase applies to words with an initial uppercase letter followed by all lowercase letters (e.g. Mr). Mixedcase applies to words with mixed lower- and uppercase (e.g. CityBank). The length predicates encode the number of characters in the word from 1 to 15, with a single predicate for lengths greater than 15.</Paragraph>
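The orthographic predicates described above can be sketched as follows (an illustrative sketch; the predicate names and length caps encode the text's "greater than 4" and "greater than 15" buckets as 5 and 16 respectively):

```python
def ortho_predicates(word):
    """Orthographic contextual predicates: digit-string length,
    titlecase, mixedcase, and word length, with capped lengths."""
    preds = []
    if word.isdigit():
        preds.append(('digits_len', min(len(word), 5)))  # 5 means '>4'
    if word[:1].isupper() and word[1:].islower() and len(word) > 1:
        preds.append(('titlecase',))
    elif any(c.isupper() for c in word[1:]) and any(c.islower() for c in word):
        preds.append(('mixedcase',))
    preds.append(('length', min(len(word), 16)))  # 16 means '>15'
    return preds

print(ortho_predicates('CityBank'))  # -> [('mixedcase',), ('length', 8)]
print(ortho_predicates('Mr'))       # -> [('titlecase',), ('length', 2)]
```

'CityBank' fails the titlecase test because of the internal uppercase letter, so it falls through to mixedcase.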
    <Paragraph position="2"> The next set of contextual predicates encode extra information about NE tags in the current context. The memory NE tag predicate (see e.g. Malouf (2002)) records the NE tag most recently assigned to the current word. Because the tagger uses beam search, tags can only be recorded from previous sentences. This memory is cleared at the beginning of each document.</Paragraph>
    <Paragraph position="3"> The unigram predicates (see e.g. Tsukamoto et al. (2002)) encode the most probable tag for the next words in the window. The unigram probabilities are relative frequencies obtained from the training data. This feature tells us something about the likely NE tag of the next word before reaching it.</Paragraph>
    <Paragraph position="4"> Most systems use gazetteers to encode information about personal and organisation names, locations and trigger words. There is considerable variation in the size of the gazetteers used. Some studies found that gazetteers did not improve performance (e.g. Malouf (2002)) whilst others gained significant improvement using gazetteers and triggers (e.g. Carreras et al. (2002)). Our system incorporates only English and Dutch first name and last name gazetteers as shown in Table 6. These gazetteers are used for predicates applied to the current, previous and next word in the window.</Paragraph>
    <Paragraph position="5"> Collins (2002) includes a number of interesting contextual predicates for NER. One feature we have adapted encodes whether the current word is seen more frequently lowercase than uppercase in a large external corpus. This feature is useful for disambiguating beginning-of-sentence capitalisation and for tagging sentences which are entirely capitalised. The frequency counts were obtained from 1 billion words of English newspaper text collected by Curran and Osborne (2002).</Paragraph>
    <Paragraph position="6"> Collins (2002) also describes a mapping from words to word types which groups words with similar orthographic forms into classes. This involves mapping characters to classes and merging adjacent characters of the same type.</Paragraph>
    <Paragraph position="7"> For example, Moody becomes Aa, A.B.C. becomes A.A.A. and 1,345.05 becomes 0,0.0. The classes are used to define unigram, bigram and trigram contextual predicates over the window.</Paragraph>
    <Paragraph position="8"> We have also defined additional composite features which are a combination of atomic features; for example, a feature which is active for mid-sentence titlecase words seen more frequently as lowercase than uppercase in a large external corpus.</Paragraph>
  </Section>
</Paper>