<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1018">
  <Title>NYU: Description of the MENE Named Entity System as Used in MUC-7</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
MAXIMUM ENTROPY
</SectionTitle>
    <Paragraph position="0"> Given a tokenization of a test corpus and a set of n #28for MUC-7, n = 7#29 tags which de#0Cne the name categories of the task at hand, the problem of named entity recognition can be reduced to the problem of assigning one of 4n+ 1 tags to each token. For any particular tag x from the set of n tags, we could be in one of 4 states: x start, x continue, x end, and x unique. In addition, a token could be tagged as #5Cother&amp;quot; to indicate that it is not part of a named entity. For instance, wewould tag the phrase #5BJerry Lee Lewis #0Dew to Paris#5D as #5Bperson start, person continue, person end, other, other, location unique#5D. This approachis essentially the same as #5B7#5D.</Paragraph>
    <Paragraph position="1"> The 29 tags of MUC-7 form the space of #5Cfutures&amp;quot; for a maximum entropy formulation of our N.E.</Paragraph>
    <Paragraph position="2"> problem. A maximum entropy solution to this, or any other similar problem allows the computation of p#28fjh#29 for any f from the space of possible futures, F, for every h from the space of possible histories, H. A #5Chistory&amp;quot; in maximum entropy is all of the conditioning data which enables you to make a decision among the space of futures. In the named entity problem, this could be broadly viewed as all information derivable from the test corpus relative to the current token #28i.e. the token whose tag you are trying to determine#29.</Paragraph>
    <Paragraph position="3"> The computation of p#28fjh#29 in M.E. is dependent on a set of binary-valued #5Cfeatures&amp;quot; which, hopefully, are helpful in making a prediction about the future. For instance, one of our features is</Paragraph>
    <Paragraph position="5"> if current token capitalized#28h#29 = true and f = location start</Paragraph>
    <Paragraph position="7"> Given a set of features and some training data, the maximum entropy estimation process produces a model in which every feature g i has associated with it a parameter #0B i . This allows us to compute the conditional probabilityby combining the parameters multiplicatively as follows:</Paragraph>
    <Paragraph position="9"> according to the M.E. model will equal the empirical expectation of g i in the training corpus.</Paragraph>
    <Paragraph position="10"> More complete discussions of M.E., including a description of the M.E. estimation procedure and references to some of the many new computational linguistics systems which are successfully using M.E. can be found in the following useful introduction: #5B5#5D. As many authors have remarked, though, the key thing about M.E. is that it allows the modeler to concentrate on #0Cnding the features that characterize the problem while letting the M.E. estimation routine worry about assigning relativeweights to the features.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SYSTEM ARCHITECTURE
</SectionTitle>
    <Paragraph position="0"> MENE consists of a set of C++ and Perl modules which forms a wrapper around an M.E. toolkit #5B6#5D which computes the values of the alpha parameters of equation 2 from a pair of training #0Cles created by MENE. MENE's #0Dexibility is due to the fact that it can incorporate just about any binary-valued feature which is a function of the history and future of the current token. In the following sections, we will discuss each of MENE's feature classes in turn.</Paragraph>
    <Paragraph position="1"> Binary Features While all of MENE's features have binary-valued output, the #5Cbinary&amp;quot; features are features whose #5Chistory&amp;quot; can be considered to be either on or o#0B for a given token. Examples are #5Cthe token begins with a capitalized letter&amp;quot; or #5Cthe token is a four-digit number&amp;quot;. The binary features which MENE uses are very similar to those used in BBN's Nymble system #5B1#5D. Figure 1 gives an example of a binary feature.</Paragraph>
    <Paragraph position="2">  A more subtle feature picked up by MENE: preceding word is #5Cto&amp;quot; and future is #5Clocation unique&amp;quot;. Given the domain of the MUC-7 training data, #5Cto&amp;quot; is a weak indicator, but a real one. This is an example of a feature which MENE can make use of but which the constructor of a hand-coded system would probably regard as too risky to incorporate. This feature, in conjunction with other weak features, can allow MENE to pick up names that other systems might miss.</Paragraph>
    <Paragraph position="3"> The bulk of MENE's power comes from these lexical features. Aversion of the system which stripped out all features other than section and lexical features achieved a dry run F-score of 88.13. This is very encouraging because these features are completely portable to new domains since they are acquired with absolutely no human intervention or reference to external knowledge sources.</Paragraph>
    <Paragraph position="4"> Section Features MENE has features which make predictions based on the current section of the article, like #5CDate&amp;quot;, #5CPreamble&amp;quot;, and #5CText&amp;quot;. Since section features #0Cre on every token in a given section, they havevery low precision, but they play a key role by establishing the background probability of the occurrence of the di#0Berent futures. For instance, in NYU's evaluation system, the alpha value assigned to the feature which predicts #5Cother&amp;quot; given a current section of #5Cmain body of text&amp;quot; is 7.9 times stronger than the feature which predicts #5Cperson unique&amp;quot; in the same section. Thus the system predicts #5Cother&amp;quot; by default. Dictionary Features Multi-word dictionaries are an important element of MENE. A pre-processing step summarizes the information in the dictionary on a token-by-token basis by assigning to every token one of the following #0Cve tags for each dictionary: start, continue, end, unique, other. I.e. if #5CBritish Airways&amp;quot; was in our dictionary,a dictionary feature would see the phrase #5Con British Airways Flight 962&amp;quot; as #5Cother, start, end, other, other&amp;quot;. The following table lists the dictionaries used by MENE in the MUC-7 evaluation:  Note that we don't havetoworry about words appearing in the dictionary which are commonly used in another sense. I.e. we can leave dangerous-looking names like #5CStorm&amp;quot; in the #0Crst-name dictionary because whenever the #0Crst-name feature #0Cres on Storm, the lexical feature for Storm will also #0Cre and, assuming that the use of Storm as #5Cother&amp;quot; exceeded the use of Storm as person start, we can expect that the lexical feature will have a high enough alpha value to outweigh the dictionary feature.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
External Systems Features
</SectionTitle>
      <Paragraph position="0"> For NYU's o#0Ecial entry in the MUC-7 evaluation, MENE took in the output of a signi#0Ccantly enhanced version of the traditional, hand-coded #5CProteus&amp;quot; named-entity tagger whichweentered in MUC-6 #5B2#5D. In addition, subsequenttotheevaluation, the University of Manitoba #5B4#5D and IsoQuest, Inc. #5B3#5D shared with us the outputs of their systems on our training corpora as well as on various test corpora. The output sent to us was the standard MUC-7 output, so our collaborators didn't havetodoany special processing for us.</Paragraph>
      <Paragraph position="1"> These systems were incorporated into MENE by a fairly simple process of token alignment which resulted in the #5Cfutures&amp;quot; produced by the three external systems become three di#0Berent #5Chistories&amp;quot; for MENE. The external system features can query this data in a windowofw  #0F Correctly predicts: Richard M. Nixon, in a case where Proteus has correctly tagged #5CRichard&amp;quot;. It is important to note that MENE has features which predict a di#0Berent future than the future predicted by the external system. This can be seen as the process by which MENE learns the errors which the external system is likely to make. An example of this is that on the evaluation system the feature which predicted person unique given a tag of person unique by Proteus had only a 76#25 higher weight than the feature which predicted person start given person unique. In other words, Proteus had a tendency to chop o#0B multi-word names at the #0Crst word. MENE learned this and made it easy to override Proteus in this way. Given proper training data, MENE can pinpoint and selectively correct the weaknesses of a handcoded system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="2" type="metho">
    <SectionTitle>
FEATURE SELECTION
</SectionTitle>
    <Paragraph position="0"> Features are chosen byavery simple method. All possible features from the classes wewant included in our model are put into a #5Cfeature pool&amp;quot;. For instance, if we want lexical features in our model which activate on a range of token  , our vocabulary has a size of V , and wehave 29 futures, we will add #285 #01 #28V +1#29#0129#29 lexical features to the pool. The V + 1 term comes from the fact that we include all words in the vocabulary plus the unknown word. From this pool, we then select all features which #0Cre at least three times on the training corpus. Note that this algorithm is entirely free of human intervention. Once the modeler has selected the classes of features, MENE will both select all the relevant features and train the features to have the proper weightings.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
DECODING
</SectionTitle>
    <Paragraph position="0"> After having trained the features of an M.E. model and assigned the proper weight #28alpha values#29 to eachof the features, decoding #28i.e. #5Cmarking up&amp;quot;#29 a new piece of text is a fairly simple process of tokenizing the text and doing various preprocessing steps like looking up words in the dictionaries. Then for each token wecheck each feature to whether if #0Cres and combine the alpha values of the #0Cring features according to equation 2. Finally, we run a viterbi searchto#0Cnd the highest probability path through the lattice of conditional probabilities which doesn't produce anyinvalid tag sequences #28for instance we can't produce the sequence #5Bperson start, location end#5D#29. Further details on the viterbi search can be found in #5B7#5D.</Paragraph>
  </Section>
class="xml-element"></Paper>