<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1118">
  <Title>Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition</Title>
  <Section position="4" start_page="152" end_page="152" type="metho">
    <SectionTitle>
3 SYSTEM ARCHITECTURE:
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="152" end_page="152" type="sub_section">
      <SectionTitle>
Histories and Futures
</SectionTitle>
      <Paragraph position="0"> MENE consists of a set of C++ and Perl modules which form a wrapper around a publicly available M.E. toolkit (Ristad, 1998) which computes the values of the α parameters of equation 2 from a pair of training files created by MENE. MENE's flexibility is due to its object-based treatment of the three essential components of a maximum entropy system: histories, futures, and features (Borthwick et al., 1997).</Paragraph>
      <Paragraph position="1"> History objects in MENE act as containers for a list of &amp;quot;history views&amp;quot;. The history view classes each represent a different type of information about the history object. When the features attempt to determine whether or not they fire on a given history, they request an appropriate history view object from the history object and then query the history view object to determine whether their firing conditions are satisfied. Note that these history views generally hold information about a limited window around the current token. If the current token is denoted as w0, then our model only holds information about tokens w-1 ... w1 for all history views except the lexical ones. For these views, the window is w-2 ... w2.</Paragraph>
      <Paragraph position="2"> Future objects, on the other hand, are trivial in that their only piece of data is an integer indicating which of the 29 members of the future space they represent.</Paragraph>
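      <Paragraph position="3"> To make this container relationship concrete, the following is a minimal sketch in Python. It is purely illustrative: MENE itself is a set of C++ and Perl modules, and every class and method name below is an assumption rather than the system's actual interface.
# Illustrative sketch of history objects as containers of history views.
class HistoryView:
    """One type of information about the current position, e.g. a lexical,
    section, dictionary, or external-system view."""
    def __init__(self, name, data):
        self.name = name          # e.g. 'lexical', 'section'
        self.data = data          # view-specific payload

class History:
    """Container for the history views computed at one token position."""
    def __init__(self, views):
        self._views = {view.name: view for view in views}

    def get_view(self, name):
        # Features request the view they need and then query it themselves.
        return self._views[name]

class Future:
    """Trivial object: an integer index into the 29-member future space."""
    def __init__(self, index):
        self.index = index        # which of the 29 futures this represents

# A feature asks the history for an appropriate view and checks its firing
# condition against both that view and the proposed future.
def example_feature(history, future):
    section_view = history.get_view('section')
    return section_view.data == 'preamble' and future.index == 0
</Paragraph>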
    </Section>
  </Section>
  <Section position="5" start_page="152" end_page="153" type="metho">
    <SectionTitle>
4 FEATURES
</SectionTitle>
    <Paragraph position="0"> Features are implemented as binary valued functions which query the history and future objects to determine whether or not they &amp;quot;fire&amp;quot;. In the following sections, we will look at each of MENE's feature classes in turn.</Paragraph>
    <Section position="1" start_page="152" end_page="153" type="sub_section">
      <SectionTitle>
4.1 Binary Features
</SectionTitle>
      <Paragraph position="0"> While all of MENE's features have binary-valued output, the &amp;quot;binary&amp;quot; features are features whose associated history-view can be considered to be either on or off for a given token. Examples are &amp;quot;the token begins with a capitalized letter&amp;quot; or &amp;quot;the token is a four-digit number&amp;quot;. Equation 1 gives an example of a binary feature. The 11 binary history-views used by MENE's binary features are very similar to those used in BBN's Nymble/Identifinder system (Bikel et al., 1997), with two exceptions:</Paragraph>
      <Paragraph position="1"> * Nymble used a feature for &amp;quot;significant&amp;quot; (i.e. non-sentence-beginning) capitalization. We didn't include this, believing that MENE could make these judgments from the surrounding lexical content.</Paragraph>
      <Paragraph position="2"> * Nymble's features were non-overlapping, i.e. the all-cap feature took precedence over the initial-cap feature. Given two features a and b, when the (history, future) space on which feature b activates must be a subset of the space for feature a, it can be shown that the M.E. model will yield the same results whether a and b are included as features or if (a - b) and b are features. Consequently, MENE allows all features to fire in overlapping cases. For instance, in MENE the initial-cap feature activates on the histories &amp;quot;Clinton&amp;quot;, &amp;quot;IBM&amp;quot;, and &amp;quot;ValuJet&amp;quot;, while in Nymble the feature would only be active on &amp;quot;Clinton&amp;quot; because the &amp;quot;All-Cap&amp;quot; feature would take precedence on &amp;quot;IBM&amp;quot; and an &amp;quot;Initial-and-internal-cap&amp;quot; feature would take precedence on &amp;quot;ValuJet&amp;quot;.</Paragraph>
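      <Paragraph position="3"> The overlapping behaviour described above can be illustrated with a small Python sketch (illustrative only; the function names and the simplified token-level firing conditions are assumptions, not MENE's actual feature implementations):
# Hypothetical capitalization features which, unlike Nymble's mutually
# exclusive features, are all allowed to fire on the same token.
def initial_cap(token, future):
    return token[:1].isupper() and future == 'person_start'

def all_cap(token, future):
    return token.isalpha() and token == token.upper() and future == 'person_start'

def initial_and_internal_cap(token, future):
    has_internal_cap = any(c.isupper() for c in token[1:])
    return token[:1].isupper() and has_internal_cap and future == 'person_start'

for tok in ['Clinton', 'IBM', 'ValuJet']:
    fired = [f.__name__ for f in (initial_cap, all_cap, initial_and_internal_cap)
             if f(tok, 'person_start')]
    print(tok, fired)
# Clinton fires only initial_cap; IBM fires all three; ValuJet fires
# initial_cap and initial_and_internal_cap.
</Paragraph>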
    </Section>
    <Section position="2" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
4.2 Lexical Features
</SectionTitle>
      <Paragraph position="0"> To create a lexical history view, the tokens at w-2 ... w2 are compared with a vocabulary and their vocabulary indices are recorded. For a given training corpus, we define the vocabulary to be all tokens with a count of three or more. Words not found in the vocabulary are assigned a distinguished &amp;quot;Unknown&amp;quot; index. Lexical feature example:</Paragraph>
      <Paragraph position="2"> A more subtle feature picked up by MENE: preceding word is &amp;quot;to&amp;quot; and future is &amp;quot;location_unique&amp;quot;. Given the domain of the MUC-7 training data (aviation disasters), &amp;quot;to&amp;quot; is a weak indicator, but a real one. This is an example of a feature which MENE can make use of but which the constructor of a hand-coded system would probably regard as too risky to incorporate. This feature, in conjunction with other weak features, can allow MENE to pick up names that other systems might miss.</Paragraph>
      <Paragraph position="3"> As discussed later, these features are automatically acquired and the system can attain a very high level of performance using these features alone. This is encouraging since these lexical features are not dependent on any external knowledge source or linguistic intuition and thus are completely portable to new domains.</Paragraph>
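      <Paragraph position="4"> A minimal sketch of how the vocabulary and a lexical history view might be computed (Python, illustrative only; the count threshold of three and the w-2 ... w2 window are from the text, while the function names and data structures are assumptions):
from collections import Counter

UNKNOWN = 0  # distinguished index for out-of-vocabulary tokens

def build_vocabulary(training_tokens):
    # Keep every token that occurs at least three times in the training corpus.
    counts = Counter(training_tokens)
    vocab = {}
    for token, count in counts.items():
        if count >= 3:
            vocab[token] = len(vocab) + 1   # indices 1..V; 0 is reserved
    return vocab

def lexical_history_view(tokens, position, vocab):
    # Record the vocabulary index of each token in the w-2 ... w2 window.
    view = {}
    for offset in (-2, -1, 0, 1, 2):
        i = position + offset
        if i in range(len(tokens)):
            view[offset] = vocab.get(tokens[i], UNKNOWN)
    return view

def make_lexical_feature(offset, word, predicted_future, vocab):
    # E.g. the weak but real indicator mentioned above: the preceding word
    # (offset -1) is 'to' and the proposed future is 'location_unique'.
    word_index = vocab.get(word, UNKNOWN)
    def feature(view, future):
        return view.get(offset) == word_index and future == predicted_future
    return feature
</Paragraph>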
    </Section>
    <Section position="3" start_page="153" end_page="153" type="sub_section">
      <SectionTitle>
4.3 Section Features
</SectionTitle>
      <Paragraph position="0"> The New York Times articles which constituted the MUC-7 test and training corpora were composed of six distinct sections including &amp;quot;Date&amp;quot;, &amp;quot;Preamble&amp;quot;, and &amp;quot;Text&amp;quot;. Section features activate according to which of these sections the current token is in. Example feature:</Paragraph>
      <Paragraph position="2"> Activation example: CLINTON WARNS HUSSEIN ABOUT IRAQI DEFIANCE. Note that, assuming that this headline is in the preamble, the above feature will fire on all of these words. Of course, this feature's prediction will only be correct on &amp;quot;CLINTON&amp;quot; and &amp;quot;HUSSEIN&amp;quot;.</Paragraph>
      <Paragraph position="3"> Section features establish the background probability of the occurrence of the different futures. For instance, in NYU's evaluation system, the value assigned to the feature which predicts &amp;quot;other&amp;quot; given a current section of &amp;quot;main body of text&amp;quot; is 7.9 times stronger than the feature which predicts &amp;quot;person_unique&amp;quot; in the same section. Thus the system predicts &amp;quot;other&amp;quot; by default. On the other hand, in the preamble (which contains headline, author, etc. information), the feature predicting &amp;quot;other&amp;quot; is much weaker in most cases. It is only about 2.6 times as strong as &amp;quot;organization_start&amp;quot; and &amp;quot;organization_end&amp;quot;, for instance.</Paragraph>
    </Section>
    <Section position="1" start_page="153" end_page="154" type="sub_section">
      <SectionTitle>
4.4 Dictionary Features
</SectionTitle>
      <Paragraph position="0"> Multi-word dictionaries are a key element of MENE.</Paragraph>
      <Paragraph position="1"> Each entry in a MENE dictionary consists of a term which is one or more tokens long. Dictionaries can be case-sensitive or not on a dictionary-by-dictionary basis. A pre-processing step summarizes the information in the dictionary on a token-by-token basis by assigning to every token one of the following five tags for each dictionary: start, continue, end, unique, other. I.e. if &amp;quot;British Airways&amp;quot; was in our dictionary, a dictionary feature would see the phrase &amp;quot;on British Airways Flight 962&amp;quot; as &amp;quot;other, start, end, other, other&amp;quot;. Table 1 lists the dictionaries used by MENE in the MUC-7 evaluation. Below is an example of a dictionary feature:</Paragraph>
      <Paragraph position="3"> &amp;quot;Richard&amp;quot; is in the first name dictionary.</Paragraph>
      <Paragraph position="4"> Note that, similar to the case of overlapping binary features, we don't have to worry about words appearing in the dictionary which are commonly used in another sense. I.e. we can leave dangerous-looking names like &amp;quot;April&amp;quot; in the first-name dictionary because whenever the first-name feature fires on &amp;quot;April&amp;quot;, the lexical and date-dictionary features for &amp;quot;April&amp;quot; will also fire and, assuming that the use of April as &amp;quot;date&amp;quot; exceeded the use of April as person_start or person_unique, we can expect that the lexical feature will have a high enough α value to outweigh the first-name-dictionary feature. This was confirmed in our test runs: no instance of &amp;quot;April&amp;quot; was tagged as a name, including one case, &amp;quot;The death of Ron Brown in April in a similar plane crash ...&amp;quot; which could be thought of as somewhat tricky because the month was not followed by a specific date. Note that the system isn't foolproof: if a &amp;quot;dangerous&amp;quot; dictionary word appeared in only one dictionary and did not appear often enough in the training corpus to be included in the vocabulary, but did appear in the test corpus, we would probably mistag it.</Paragraph>
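      <Paragraph position="5"> The token-by-token pre-processing step can be sketched as follows (Python, illustrative only; the five tags and the &amp;quot;British Airways&amp;quot; example are from the text, while the greedy longest-match strategy and all names are assumptions):
def tag_tokens(tokens, dictionary, case_sensitive=False):
    # dictionary is a set of multi-token terms, each term a tuple of tokens.
    if case_sensitive:
        entries, norm = dictionary, tokens
    else:
        entries = {tuple(w.lower() for w in e) for e in dictionary}
        norm = [t.lower() for t in tokens]
    tags = ['other'] * len(tokens)
    i = 0
    while i != len(tokens):
        # Look for the longest dictionary term starting at position i.
        match_len = 0
        for length in range(len(tokens) - i, 0, -1):
            if tuple(norm[i:i + length]) in entries:
                match_len = length
                break
        if match_len == 0:
            i += 1
        elif match_len == 1:
            tags[i] = 'unique'
            i += 1
        else:
            tags[i] = 'start'
            tags[i + match_len - 1] = 'end'
            for j in range(i + 1, i + match_len - 1):
                tags[j] = 'continue'
            i += match_len
    return tags

airline_dict = {('British', 'Airways')}
print(tag_tokens('on British Airways Flight 962'.split(), airline_dict))
# ['other', 'start', 'end', 'other', 'other']
</Paragraph>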
    </Section>
    <Section position="2" start_page="154" end_page="154" type="sub_section">
      <SectionTitle>
4.5 External System Features
</SectionTitle>
      <Paragraph position="0"> For NYU's official entry in the MUC-7 evaluation, MENE took in the output of an enhanced version of the more traditional, hand-coded &amp;quot;Proteus&amp;quot; named-entity tagger which we entered in MUC-6 (Grishman, 1995). In addition, subsequent to the evaluation, the University of Manitoba (Lin, 1998) and IsoQuest, Inc. (Krupka and Hausman, 1998) shared with us the outputs of their systems on our training corpora as well as on various test corpora. The output sent to us was the standard MUC-7 output, so our collaborators didn't have to do any special processing for us. These systems were incorporated into MENE as simply three more history views by the following two-step process: 1. Each system's output is tokenized by MENE's tokenizer and cross-system tokenization discrepancies are resolved.</Paragraph>
      <Paragraph position="1"> 2. The tag assigned to each token by each system is noted. This tag will be one of the 29 tags mentioned above (i.e. person_start, location_continue, etc.). The result of all this is that the &amp;quot;futures&amp;quot; produced by the three external systems become three &amp;quot;external system histories&amp;quot; for MENE. Here is an example feature:</Paragraph>
      <Paragraph position="3"> Proteus has correctly tagged &amp;quot;Richard&amp;quot;.</Paragraph>
      <Paragraph position="4"> It is important to note that MENE has features which predict a different future than the future predicted by the external system. This can be seen as the process by which MENE learns the errors which the external system is likely to make. An example of this is that on the evaluation system the feature which predicted person_unique given a tag of person_unique by Proteus had only a 76% higher weight than the feature which predicted person_start given person_unique. In other words, Proteus had a tendency to chop off multi-word names at the first word. MENE learned this and made it easy to override Proteus in this way. In fact, an analysis of the differences between the Proteus output and the MENE + Proteus output turned up a significant number of instances in which MENE extended or contracted name boundaries in this way. Given proper training data, MENE can pinpoint and selectively correct the weaknesses of a hand-coded system.</Paragraph>
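      <Paragraph position="5"> The way an external system's output becomes one more history view, including features that deliberately predict a different future than the external tag, can be sketched as follows (Python, illustrative only; the person_unique/person_start example follows the discussion above, but the data structures and names are assumptions):
# The tag assigned by an external system (e.g. Proteus) to the current token
# is stored as just another history view; separate features pair that view
# with agreeing and with disagreeing futures.
def make_external_feature(system_name, system_tag, predicted_future):
    def feature(history, future):
        return history[system_name] == system_tag and future == predicted_future
    return feature

# One feature agrees with Proteus; the other captures Proteus's tendency to
# chop multi-word names off at the first word.
agree    = make_external_feature('proteus', 'person_unique', 'person_unique')
disagree = make_external_feature('proteus', 'person_unique', 'person_start')

history = {'proteus': 'person_unique', 'section': 'main_body'}
print(agree(history, 'person_unique'))    # True
print(disagree(history, 'person_start'))  # True: training may still give this
                                          # feature a substantial weight
</Paragraph>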
    </Section>
  </Section>
  <Section position="7" start_page="154" end_page="155" type="metho">
    <SectionTitle>
5 Compound Features
</SectionTitle>
    <Paragraph position="0"> MENE currently has no direct ability to learn compound features or &amp;quot;patterns&amp;quot;: the &amp;quot;history&amp;quot; side of a lexical feature activates based on only a single word, for instance. A sort of pattern-like ability comes into the system from multiple features firing at once.</Paragraph>
    <Paragraph position="1"> I.e. to predict that &amp;quot;York&amp;quot; in the name &amp;quot;New York&amp;quot; is the end of a location, we will have two features firing: one predicts location_end when token-1 is &amp;quot;new&amp;quot;. The other predicts location_end when token0 is &amp;quot;york&amp;quot;.</Paragraph>
    <Paragraph position="2"> Nevertheless, it is possible that compound features would behave differently from two simultaneously firing &amp;quot;atomic&amp;quot; features. We integrated this into the model in an ad hoc manner for the external system features, where we constructed features which essentially query the external system history and the section history simultaneously to determine whether they fire. I.e. a particular feature might fire if Proteus predicts person_start, the current section is &amp;quot;main body of text&amp;quot;, and the future is &amp;quot;person_start&amp;quot;. This allows MENE to assign a lower α to a Proteus prediction in the preamble vs. a prediction in the main body of text. Proteus, like many hand-coded systems, is more accurate in the main body of the text than in headline-type material. We found that this compound feature gave the system slightly higher performance than we got when we just used section features and external system features separately.</Paragraph>
    <Paragraph position="3"> It seems reasonable that adding an ability to handle fully general compound features (i.e. feature A fires if features B and C both fire) would improve system performance based on this limited experiment. In addition to allowing us to predict futures based on multi-word patterns, it would also let us use other promising combinations of features such as distinguishing between capitalization in a headline vs. in the main body of the text. Unfortunately, this experiment will have to wait until we deploy a more sophisticated method of feature selection, as discussed in the next section.</Paragraph>
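    <Paragraph position="4"> The ad hoc external-system/section compound features described above amount to a conjunction of two history-view queries, as in the following sketch (Python, illustrative only; all names are assumptions):
# A compound feature queries the external-system view and the section view at
# the same time, so a Proteus prediction in the preamble can receive a
# different weight than the same prediction in the main body of text.
def proteus_person_start_in_body(history, future):
    return (history['proteus'] == 'person_start'
            and history['section'] == 'main_body'
            and future == 'person_start')

def proteus_person_start_in_preamble(history, future):
    return (history['proteus'] == 'person_start'
            and history['section'] == 'preamble'
            and future == 'person_start')

# Because these are distinct features, maximum entropy training can assign the
# preamble variant a lower weight, reflecting Proteus's weaker accuracy on
# headline-style material.
</Paragraph>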
  </Section>
  <Section position="8" start_page="155" end_page="155" type="metho">
    <SectionTitle>
6 FEATURE SELECTION
</SectionTitle>
    <Paragraph position="0"> Features are chosen by a very simple method. All possible features from the classes we want included in our model are put into a &amp;quot;feature pool&amp;quot;. For instance, if we want lexical features in our model which activate on a range of token-2 ... token2, our vocabulary has a size of V, and we have 29 futures, we will add (5 · (V + 1) · 29) lexical features to the pool.</Paragraph>
    <Paragraph position="1"> The V + 1 term comes from the fact that we include all words in the vocabulary plus the unknown word; the factor of 5 corresponds to the five token positions in the window and the factor of 29 to the size of the future space.</Paragraph>
    <Paragraph position="2"> From this pool, we then select all features which fire at least three times on the training corpus. Note that this algorithm is entirely free of human intervention. Once the modeler has selected the classes of features, MENE will both select all the relevant features and train the features to have the proper weightings.</Paragraph>
    <Paragraph position="3"> We deviate from this basic algorithm in three ways: 1. We exclude features which activate on some sort of &amp;quot;default&amp;quot; value of a history view. Many history views have some sort of default value which they display for the vast majority of tokens. For instance, a first-name-dictionary history view would say that the current token is not a name in over 99% of the cases. Rather than adding features which activate both when the token in question is and when it is not a first name, we only include features which activate when the token is a first name. A feature which activated when a token was not a first name, while theoretically not harmful, would have practical disadvantages. First of all, the feature would probably be redundant, because if the frequency of a future given a first-name-dictionary hit is constrained (by equation 4), then the future frequency given a non-hit is also implicitly constrained. Secondly, since this feature would fire on nearly every token, it would slow down run-time performance. Finally, while maximum entropy models are designed to handle feature overlap, a very high degree of overlap requires more iterations of the maximum entropy estimation routine and can lead to numerical difficulties (Ristad, 1998).</Paragraph>
    <Paragraph position="4"> 2. Features which predict the future &amp;quot;other&amp;quot; have to fire six times to be included in the model rather than three. Experiments showed that doing this had no impact on performance and reduced the size of the model by about 20%.</Paragraph>
    <Paragraph position="5"> 3. As another way of reducing the model size, lexical features which activate on token-2 and token2 are excluded if they predict &amp;quot;other&amp;quot;. Like the previous heuristic, this is based on the idea that features predicting named entities are more useful than features predicting the default.</Paragraph>
    <Paragraph position="6"> Note that this method of feature selection would probably break down if we tried to incorporate general compound features into our model as described in the previous section. The model currently has about 24,000 features when trained on 350 articles of text. If we even considered all pairs of features as potential compound features, the O(n^2) compound features which we could build from our atomic features would undoubtedly yield an unacceptable slowdown in the model's performance. Clearly a more sophisticated feature selection routine such as the ones in (Berger et al., 1996) or (Berger and Printz, 1998) would be required in this case.</Paragraph>
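    <Paragraph position="7"> A minimal sketch of the basic selection pass described at the start of this section (Python, illustrative only; the firing-count threshold of three is from the text, and the representation of candidate features and training events is an assumption):
def select_features(candidate_features, training_events, min_count=3):
    # training_events is a list of (history, future) pairs, one per token of
    # the training corpus; a feature is kept if it fires at least min_count
    # times over those events.
    selected = []
    for feature in candidate_features:
        fire_count = sum(1 for history, future in training_events
                         if feature(history, future))
        if fire_count >= min_count:
            selected.append(feature)
    return selected
</Paragraph>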
  </Section>
  <Section position="9" start_page="155" end_page="156" type="metho">
    <SectionTitle>
7 DECODING and VITERBI SEARCH
</SectionTitle>
    <Paragraph position="0"> After having trained the features of an M.E. model and assigned the proper weight (α values) to each of the features, decoding (i.e. &amp;quot;marking up&amp;quot;) a new piece of text is a fairly simple process: 1. Tokenize the text.</Paragraph>
    <Paragraph position="1">  2. Compute each of the history views by looking up words in the dictionary, checking the output of the external systems, checking whether words are capitalized or not, etc.</Paragraph>
    <Paragraph position="2"> 3. For each article of the text: (a) For each token of the text, check each feature to see whether it fires, and combine the α values of the firing features according to equation 2. This will give us a conditional probability for each of the 29 futures for each token in the article.</Paragraph>
    <Paragraph position="3"> (b) Run a Viterbi search to find the highest  probability legal path through the lattice of conditional probabilities.</Paragraph>
    <Paragraph position="4"> The Viterbi search is necessary because simply taking the highest-probability future assigned to each token would result in incompatible assignments. For instance, an assignment of [person_start, location_end] to two consecutive tokens would be invalid. The Viterbi search finds the highest probability path in which no token's future is an invalid successor to the future of the previous token, as defined by a table of all such invalid transitions (an approach similar to that of (Sekine et al., 1998)).</Paragraph>
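    <Paragraph position="5"> A compressed sketch of steps 3(a) and 3(b) (Python, illustrative only): the per-token conditional probabilities are assumed to take the standard conditional maximum entropy form, a product of the α values of the firing features normalized over the futures, and the small transition set stands in for the real table of invalid future pairs.
def conditional_probs(features, alphas, history, futures):
    # Multiply the alpha values of the features that fire on (history, future)
    # and normalize over all futures to get a conditional distribution.
    scores = {}
    for f in futures:
        product = 1.0
        for feature, alpha in zip(features, alphas):
            if feature(history, f):
                product *= alpha
        scores[f] = product
    z = sum(scores.values())
    return {f: s / z for f, s in scores.items()}

def viterbi(probs, futures, invalid):
    # probs[t][f]: conditional probability of future f at token t.
    # invalid: set of (previous_future, next_future) pairs that may not occur.
    best = [{f: probs[0][f] for f in futures}]
    back = [{}]
    for t in range(1, len(probs)):
        best.append({})
        back.append({})
        for f in futures:
            candidates = [(best[t - 1][p] * probs[t][f], p)
                          for p in futures if (p, f) not in invalid]
            score, prev = max(candidates)
            best[t][f] = score
            back[t][f] = prev
    # Trace back the highest probability legal path.
    last = max(futures, key=lambda f: best[-1][f])
    path = [last]
    for t in range(len(probs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
</Paragraph>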
  </Section>
</Paper>