File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/w98-1118_intro.xml
Size: 3,800 bytes
Last Modified: 2025-10-06 14:06:45
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1118"> <Title>Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition</Title> <Section position="3" start_page="152" end_page="152" type="intro"> <SectionTitle> 2 MAXIMUM ENTROPY </SectionTitle>
<Paragraph position="0"> Given a tokenization of a test corpus and a set of n (for MUC-7, n = 7) tags which define the name categories of the task at hand, the problem of named entity recognition can be reduced to the problem of assigning one of 4n + 1 tags to each token. For any particular tag x from the set of n tags, we could be in one of 4 states: x_start, x_continue, x_end, and x_unique. In addition, a token could be tagged as &quot;other&quot; to indicate that it is not part of a named entity. For instance, we would tag the phrase [Jerry Lee Lewis flew to Paris] as [person_start, person_continue, person_end, other, other, location_unique]. This approach is essentially the same as that of (Sekine et al., 1998).</Paragraph>
<Paragraph position="1"> The 29 tags of MUC-7 form the space of &quot;futures&quot; for a maximum entropy formulation of our N.E. problem. A maximum entropy solution to this, or any other similar problem, allows the computation of p(f|h) for any f from the space of possible futures, F, for every h from the space of possible histories, H. A &quot;history&quot; in maximum entropy is all of the conditioning data which enables you to make a decision among the space of futures. In the named entity problem, we could reformulate this in terms of finding the probability of f associated with the token at index t in the test corpus as: p(f \mid h_t) = p(f \mid \text{information derivable from the test corpus relative to token } t). The computation of p(f|h) in M.E. is dependent on a set of &quot;features&quot; which, hopefully, are helpful in making a prediction about the future. Like most current M.E. modeling efforts in computational linguistics, we restrict ourselves to features which are binary functions of the history and future. For instance, one of our features is</Paragraph>
<Paragraph position="2"> g(h, f) = \begin{cases} 1 &amp; \text{if current-token-capitalized}(h) = \text{true and } f = \text{location\_start} \\ 0 &amp; \text{otherwise} \end{cases} </Paragraph>
<Paragraph position="3"> Here &quot;current-token-capitalized(h)&quot; is a binary function which returns true if the &quot;current token&quot; of the history h (the token whose tag we are trying to determine) has an initial capital letter.</Paragraph>
<Paragraph position="4"> Given a set of features and some training data, the maximum entropy estimation process produces a model in which every feature g_i has associated with it a parameter \alpha_i. This allows us to compute the conditional probability as follows (Berger et al., 1996):</Paragraph>
<Paragraph position="5"> p(f \mid h) = \frac{\prod_i \alpha_i^{g_i(h, f)}}{Z_\alpha(h)}, \qquad Z_\alpha(h) = \sum_{f'} \prod_i \alpha_i^{g_i(h, f')} </Paragraph>
<Paragraph position="6"> The maximum entropy estimation technique guarantees that for every feature g_i, the expected value of g_i according to the M.E. model will equal the empirical expectation of g_i in the training corpus. In other words:</Paragraph>
<Paragraph position="7"> \sum_{h, f} \tilde{P}(h, f)\, g_i(h, f) = \sum_{h} \tilde{P}(h) \sum_{f} P_{ME}(f \mid h)\, g_i(h, f) </Paragraph>
<Paragraph position="8"> Here \tilde{P} is an empirical probability and P_{ME} is the probability assigned by the M.E. model.</Paragraph>
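The tag-assignment scheme described in the opening paragraph can be made concrete with a short sketch. The following Python fragment is illustrative only (it is not the authors' code, and the function and variable names are our own); it maps entity-annotated spans onto the 4n + 1 tags and reproduces the [Jerry Lee Lewis flew to Paris] example.

```python
# Illustrative sketch (not from the paper): map entity-annotated spans onto the
# 4n + 1 tag scheme (x_start / x_continue / x_end / x_unique / other).

def spans_to_tags(tokens, entities):
    """tokens: list of strings; entities: list of (start, end, category) spans,
    with end exclusive. Returns one tag per token."""
    tags = ["other"] * len(tokens)
    for start, end, category in entities:
        if end - start == 1:
            # single-token name
            tags[start] = f"{category}_unique"
        else:
            # multi-token name: mark the first, interior, and last tokens
            tags[start] = f"{category}_start"
            for i in range(start + 1, end - 1):
                tags[i] = f"{category}_continue"
            tags[end - 1] = f"{category}_end"
    return tags

tokens = ["Jerry", "Lee", "Lewis", "flew", "to", "Paris"]
entities = [(0, 3, "person"), (5, 6, "location")]
print(spans_to_tags(tokens, entities))
# ['person_start', 'person_continue', 'person_end', 'other', 'other', 'location_unique']
```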
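Likewise, a minimal sketch of a binary feature and of the conditional probability p(f|h) of Berger et al. (1996) may help. It assumes the alpha_i weights have already been estimated; the feature, the abbreviated future set, and the dictionary representation of the history are hypothetical stand-ins, not the system described in the paper.

```python
# Illustrative sketch (not the authors' implementation): binary features g_i(h, f)
# and the maximum entropy conditional probability
#     p(f | h) = prod_i alpha_i^{g_i(h, f)} / Z_alpha(h).
from math import prod

# Abbreviated, hypothetical future space for the example.
FUTURES = ["person_start", "person_continue", "person_end",
           "location_start", "location_unique", "other"]

def current_token_capitalized(h):
    """True if the token whose tag we are predicting begins with a capital letter."""
    return h["current_token"][:1].isupper()

def g_cap_location_start(h, f):
    """Binary feature: fires only when the current token is capitalized
    and the candidate future is location_start."""
    return 1 if current_token_capitalized(h) and f == "location_start" else 0

def p(f, h, model):
    """model: list of (feature_function, alpha) pairs produced by M.E. estimation."""
    def unnormalized(future):
        return prod(alpha ** g(h, future) for g, alpha in model)
    z = sum(unnormalized(future) for future in FUTURES)   # Z_alpha(h)
    return unnormalized(f) / z

history = {"current_token": "Paris"}
model = [(g_cap_location_start, 2.5)]   # alpha value chosen arbitrarily for illustration
print(round(p("location_start", history, model), 3))      # 0.333
```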
<Paragraph position="9"> More complete discussions of M.E. as applied to computational linguistics, including a description of the M.E. estimation procedure, can be found in (Berger et al., 1996) and (Della Pietra et al., 1995). The following are some additional references which are useful as introductions and examples of applications: (Ratnaparkhi, 1997b), (Ristad, 1998), (Jaynes, 1996). As many authors have remarked, though, the most useful thing about maximum entropy modeling is that it allows the modeler to concentrate on finding the features that characterize the problem while letting the M.E. estimation routine worry about assigning the relative weights to the features.</Paragraph> </Section> </Paper>