File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-0428_metho.xml
Size: 10,146 bytes
Last Modified: 2025-10-06 14:08:27
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0428"> <Title>Named Entity Recognition with Character-Level Models</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Character-Level HMM </SectionTitle> <Paragraph position="0"> Figure 1 shows a graphical model representation of our character-level HMM. Characters are emitted one at a time, and there is one state per character. Each state's identity depends only on the previous state. Each character's identity depends on both the current state and on the previous n-1 characters. In addition to this HMM view, it may also be convenient to think of the local emission models as type-conditional n-gram models. Indeed, the character emission model in this section is directly based on the n-gram proper-name classification engine described in (Smarr and Manning, 2002). The primary addition is the state-transition chaining, which allows the model to do segmentation as well as classification.</Paragraph> <Paragraph position="1"> When using character-level models for word-evaluated tasks, one would not want multiple characters inside a single word to receive different labels. This can be avoided in two ways: by explicitly locking state transitions inside words, or by careful choice of transition topology. In our current implementation, we do the latter.</Paragraph> <Paragraph position="2"> Each state is a pair (t, k), where t is an entity type (such as PERSON, and including an other type) and k indicates the length of time the system has been in state t. Therefore, a state like (PERSON, 2) indicates the second letter inside a person phrase. The final letter of a phrase is a following space (we insert one if there is none), and the state is a special final state like (PERSON, F). Additionally, once k reaches our n-gram history order, it stays there.</Paragraph> [Figure 1: A character-level HMM. Character nodes are observations and s nodes are entity types.] <Paragraph position="4"> We then use empirical, unsmoothed estimates for state-state transitions. This annotation and estimation enforces consistent labellings in practice. For example, (PERSON, 2) can only transition to the next state (PERSON, 3) or the final state (PERSON, F). Final states can only transition to beginning states, like (other, 1).</Paragraph> <Paragraph position="5"> For emissions, we must estimate a quantity of the form P(c_0 | c_-(n-1), ..., c_-1, s). We used an n-gram model of fixed order n.2 The n-gram estimates are smoothed via deleted interpolation.</Paragraph> <Paragraph position="8"> Given this model, we can do Viterbi decoding in the standard way. To be clear on what this model does and does not capture, we consider a few examples (_ indicates a space). First, we might be asked for P(e | to_Denv, LOC, 5). In this case, we know both that we are in the middle of a location that begins with Denv and also that the preceding context was to_. In essence, encoding k into the state lets us distinguish the beginnings of phrases, which lets us model trends like named entities (all the classes besides other) generally starting with capital letters in English. Second, we may be asked for quantities like P(_ | Denver, LOC, F), which allows us to model the ends of phrases. Here we have a slight complexity: by the notation, one would expect such emissions to have probability 1, since nothing else can be emitted from a final state. In practice, we have a special stop symbol in our n-gram counts, and the probability of emitting a space from a final state is the probability of the n-gram having chosen the stop character.3 One can also view the entire process as a hierarchical HMM (Fine et al., 1998), where the n-gram model generates the entire phrase, followed by a tier pop up to the phrase transition tier.</Paragraph>
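<Paragraph> To make the transition topology concrete, the following is a minimal sketch (ours, not the paper's implementation) of the (t, k) state space and its legal successors. The entity type inventory, the history order N, and all function names are illustrative assumptions.</Paragraph>

    # Minimal sketch (ours) of the (type, k) transition topology from Section 2.
    # The entity types and the history order N are illustrative assumptions.

    ENTITY_TYPES = ["PERSON", "LOCATION", "ORGANIZATION", "other"]
    N = 6            # assumed n-gram history order; k is capped at N
    FINAL = "F"      # the special final state, emitted as the phrase-ending space

    def successors(state):
        """States reachable from (t, k) under the topology, which enforces
        consistent labellings: inside a phrase we may only continue the same
        type or finish it; final states may only start a new phrase."""
        t, k = state
        if k == FINAL:
            return [(t2, 1) for t2 in ENTITY_TYPES]   # e.g. (PERSON, F) -> (other, 1)
        return [(t, min(k + 1, N)),                   # once k reaches N, it stays there
                (t, FINAL)]

    # (PERSON, 2) can only continue to (PERSON, 3) or end at (PERSON, F):
    assert successors(("PERSON", 2)) == [("PERSON", 3), ("PERSON", FINAL)]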
<Paragraph position="9"> Using this model, we tested two variants: one in which preceding context was discarded (for example, P(e | Denv, LOC, 5) rather than P(e | to_Denv, LOC, 5)), and one using context as outlined above. For comparison, we also built a first-order word-level HMM; the results are shown in Table 1. We give F1 both per-category and overall. The word-level model and the (context-disabled) character-level model are intended as a rough minimal pair, in that the only information crossing phrase boundaries was the entity type, isolating the effects of character- vs. word-level modeling (a more precise minimal pair is examined in section 3).</Paragraph> <Paragraph position="12"> Switching to the character model raised the overall score greatly, from 74.5% to 82.2%. On top of this, context helped, but substantially less, bringing the total to 83.2%.</Paragraph> <Paragraph position="13"> We also tried to incorporate gazetteer information by adding n-gram counts from gazetteer entries to the training counts that back the above character emission model. However, this reduced performance (by 2.0% with context on). The supplied gazetteers appear to have been built from the training data, so they do not increase coverage, and they provide only a flat distribution of name phrases whose empirical distributions are in fact very spiked.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Character-Feature Based Classifier </SectionTitle> <Paragraph position="0"> Given the amount of improvement from using a model backed by character n-grams instead of word n-grams, the immediate question is whether this benefit is complementary to the benefit from features which have traditionally been of use in word-level systems, such as syntactic context features, topic features, and so on.</Paragraph> <Paragraph position="1"> To test this, we constructed a maxent classifier which locally classifies single words, without modeling the entity type sequences s.4 These local classifiers map a feature representation of each word position to entity types, such as PERSON.5 We present a hill-climb over feature sets for the English development set data in Table 2.</Paragraph> <Paragraph position="2"> First, we tried only the local word as a feature; the result was that each word was assigned its most common class in the training data. The overall F-score was 52.29%, well below the official CoNLL baseline of 71.18%.6 We next added n-gram features: specifically, we framed each word with special start and end symbols, and then added every contiguous substring to the feature list. Note that this subsumes the entire-word features. Using the substring features alone scored 73.10%, already breaking the CoNLL baseline, though still below the no-context HMM, which better models the context inside phrases. Adding a current tag feature gave a score of 74.17%.</Paragraph>
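<Paragraph> The substring feature extraction just described is simple to state concretely. The following is a minimal sketch under our own assumptions; in particular, the start/end marker characters '^' and '$' and the feature naming scheme are ours, not the paper's.</Paragraph>

    # Minimal sketch (ours) of the substring features from Section 3: frame the
    # word with start/end markers, then take every contiguous substring.

    def substring_features(word):
        framed = "^" + word + "$"
        feats = set()
        for i in range(len(framed)):
            for j in range(i + 1, len(framed) + 1):
                feats.add("substr=" + framed[i:j])
        return feats

    # The whole framed word '^Denver$' is itself one of the substrings, so these
    # features subsume the entire-word feature; '^D' captures an initial capital
    # and 'ver$' a word suffix.
    feats = substring_features("Denver")
    assert "substr=^Denver$" in feats and "substr=^D" in feats and "substr=ver$" in feats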
<Paragraph position="3"> At this point, the bulk of the outstanding errors were plausibly attributable to insufficient context information. Adding even just the previous and next words and tags as (atomic) features raised performance to 82.39%. More complex joint context features, which paired the current word and tag with the previous and next words and tags, raised the score further to 83.09%, nearly to the level of the HMM, still without actually having any model of previous classification decisions.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A Character-Based CMM </SectionTitle> <Paragraph position="0"> In order to include state sequence features, which allow the classifications at various positions to interact, we have to abandon classifying each position independently.</Paragraph> <Paragraph position="1"> Sequence-sensitive features can be included by chaining our local classifiers together and performing joint inference, i.e., by building a conditional Markov model (CMM), also known as a maximum entropy Markov model (McCallum et al., 2000).</Paragraph> <Paragraph position="2"> Previous classification decisions are clearly relevant: for example, the sequence Grace Road is a single location, not a person's name adjacent to a location (which is the erroneous output of the model in section 3). Adding features representing the previous classification decision (s_-1) raised the score 2.35% to 85.44%. We found that knowing the previous word was an other wasn't particularly useful without also knowing its part-of-speech (e.g., a preceding preposition might indicate a location).</Paragraph> <Paragraph position="3"> Joint tag-sequence features, along with longer-distance sequence and tag-sequence features, gave 87.21%.</Paragraph> <Paragraph position="4"> The remaining improvements involved a number of other features which directly targeted observed error types. These included letter type pattern features (for example, 20-month would become d-x for digit-lowercase, and Italy would become Xx for mixed case). This improved performance substantially, for example by allowing the system to detect ALL CAPS regions. Table 3 shows an example of a local decision for Grace in the context at Grace Road, using all of the features defined to date. Note that the evidence against Grace as a name completely overwhelms the n-gram and word preference for PERSON. Other features included second-previous and second-next words (when the previous or next words were very short) and a marker for capitalized words whose lowercase forms had also been seen. The final system also contained some simple error-driven post-processing. In particular, repeated sub-elements (usually last names) of multi-word person names were given type PERSON, and a crude heuristic restoration of B- prefixes was performed. In total, this final system had an F-score of 92.31% on the English development set. Table 4 gives a more detailed breakdown of this score, and also gives the results of this system on the English test set and on both German data sets.</Paragraph>
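<Paragraph> As an illustration of the letter type pattern features, the following minimal sketch (ours) maps characters to classes and collapses repeated runs, reproducing the two examples quoted above; any behavior beyond those two examples (such as the treatment of punctuation) is our assumption.</Paragraph>

    # Minimal sketch (ours) of the letter type pattern features from Section 4:
    # classify each character (X = uppercase, x = lowercase, d = digit,
    # anything else kept as-is) and collapse adjacent repeats.

    def letter_pattern(word):
        classes = []
        for ch in word:
            if ch.isdigit():
                c = "d"
            elif ch.isupper():
                c = "X"
            elif ch.islower():
                c = "x"
            else:
                c = ch              # punctuation such as '-' survives verbatim
            if not classes or classes[-1] != c:
                classes.append(c)   # collapse runs of the same class
        return "".join(classes)

    assert letter_pattern("20-month") == "d-x"   # digit-lowercase, as in the text
    assert letter_pattern("Italy") == "Xx"       # mixed case, as in the text
    assert letter_pattern("NATO") == "X"         # an ALL CAPS token collapses to one X

</Section> </Paper>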