<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0430">
  <Title>Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Models for many natural language tasks benefit from the flexibility to use overlapping, non-independent features.</Paragraph>
    <Paragraph position="1"> For example, the need for labeled data can be drastically reduced by taking advantage of domain knowledge in the form of word lists, part-of-speech tags, character n-grams, and capitalization patterns. While it is difficult to capture such inter-dependent features with a generative probabilistic model, conditionally-trained models, such as conditional maximum entropy models, handle them well. There has been significant work with such models for greedy sequence modeling in NLP (Ratnaparkhi, 1996; Borthwick et al., 1998).</Paragraph>
    <Paragraph position="2"> Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models, a special case of which corresponds to conditionally-trained finite state machines. While based on the same exponential form as maximum entropy models, they have efficient procedures for complete, non-greedy finite-state inference and training. CRFs have recently shown empirical success in POS tagging (Lafferty et al., 2001), noun phrase segmentation (Sha and Pereira, 2003) and Chinese word segmentation (McCallum and Feng, 2003).</Paragraph>
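The contrast between greedy and complete (non-greedy) finite-state inference can be illustrated with a minimal sketch. It assumes a linear-chain model whose per-position and transition log-scores are already given; the names viterbi, greedy, emission and transition are illustrative and do not come from the paper, and no training is shown.

```python
from typing import Dict, List, Tuple


def viterbi(
    emission: List[Dict[str, float]],
    transition: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence under additive log-scores.

    emission[t][y] is the score of label y at position t; transition[(a, b)]
    is the score of moving from label a to label b (missing pairs score 0).
    """
    if not emission:
        return []
    # best[t][y]: score of the best path that ends in label y at position t.
    best = [{y: emission[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emission)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transition.get((prev, y), 0.0) + emission[t].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def greedy(
    emission: List[Dict[str, float]],
    transition: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Left-to-right decoding that commits to each label immediately."""
    path: List[str] = []
    for scores in emission:
        prev = path[-1] if path else None
        path.append(max(
            labels,
            key=lambda y: scores.get(y, 0.0)
            + (transition.get((prev, y), 0.0) if prev is not None else 0.0),
        ))
    return path


# Toy scores (assumed values): the first token looks slightly more like "O",
# but the second token is strongly "PER" and PER-to-PER transitions are favored.
labels = ["O", "PER"]
emission = [{"O": 1.0, "PER": 0.9}, {"O": 0.0, "PER": 2.0}]
transition = {("O", "PER"): -3.0, ("PER", "PER"): 1.0}
print(greedy(emission, transition, labels))
print(viterbi(emission, transition, labels))
```

With these toy scores the greedy decoder commits to O for the first token and ends with ['O', 'O'], while Viterbi decoding recovers ['PER', 'PER'], the sequence with the higher total score.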
    <Paragraph position="3"> Given these models' great flexibility to include a wide array of features, an important remaining question is which features should be used. For example, in some cases capturing a word tri-gram is important; however, there is not sufficient memory or computation to include all word tri-grams. As the number of overlapping atomic features increases, the difficulty and importance of constructing only certain feature combinations grow.</Paragraph>
    <Paragraph position="4"> This paper presents a feature induction method for CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase log-likelihood, the approach builds on that of Della Pietra et al. (1997), but is altered to work with conditional rather than joint probabilities, and with a mean-field approximation and other additional modifications that improve efficiency specifically for a sequence model. In comparison with traditional approaches, automated feature induction offers both improved accuracy and significant reduction in feature count; it enables the use of richer, higher-order Markov models, and offers more freedom to guess liberally about which atomic features may be relevant to a task.</Paragraph>
    <Paragraph position="5"> Feature induction methods still require the user to create the building-block atomic features. Lexicon membership tests are particularly powerful features in natural language tasks. The question is where to get lexicons that are relevant for the particular task at hand. This paper describes WebListing, a method that obtains seeds for the lexicons from the labeled data, then uses the Web, HTML formatting regularities and a search engine service to significantly augment those lexicons. For example, based on the appearance of Arnold Palmer in the labeled data, we gather from the Web a large list of other golf players, including Tiger Woods (a phrase that is difficult to detect as a name without a good lexicon).</Paragraph>
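A lexicon membership test of the kind described above can be sketched as follows. This is only an illustration of how a lexicon might be turned into atomic features; it does not implement WebListing itself, and the function name atomic_features, the feature names, and the tiny GOLFER lexicon are assumptions made for the example.

```python
from typing import Dict, List, Set


def atomic_features(
    tokens: List[str], lexicons: Dict[str, Set[str]], max_len: int = 3
) -> List[Set[str]]:
    """Return one set of atomic feature names per token.

    Lexicon entries are lowercase phrases of up to max_len words; every token
    covered by a matching phrase receives the feature IN_ plus the lexicon name.
    """
    feats: List[Set[str]] = [set() for _ in tokens]
    lowered = [tok.lower() for tok in tokens]
    for i, tok in enumerate(tokens):
        # A simple capitalization pattern as a second kind of atomic feature.
        if tok[:1].isupper():
            feats[i].add("CAPITALIZED")
    for start in range(len(tokens)):
        # Try every phrase of 1..max_len tokens that begins at this position.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            phrase = " ".join(lowered[start:end])
            for name, entries in lexicons.items():
                if phrase in entries:
                    for i in range(start, end):
                        feats[i].add("IN_" + name)
    return feats


# Hypothetical lexicon; in the paper such lists would be gathered from the Web.
lexicons = {"GOLFER": {"arnold palmer", "tiger woods"}}
for token, feats in zip(["Tiger", "Woods", "won"],
                        atomic_features(["Tiger", "Woods", "won"], lexicons)):
    print(token, sorted(feats))
```

Here both Tiger and Woods receive the IN_GOLFER feature because the two-word phrase matches a lexicon entry, even though neither word alone is listed.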
    <Paragraph position="6"> We present results on the CoNLL-2003 named entity recognition (NER) shared task, consisting of news articles with tagged entities PERSON, LOCATION, ORGANIZATION and MISC. The data is quite complex; for example, the English data includes foreign person names (such as Yayuk Basuki and Innocent Butare), a wide diversity of locations (including sports venues such as The Oval, and rare location names such as Nirmal Hriday), many types of organizations (from company names such as 3M, to acronyms for political parties such as KDP, to location names used to refer to sports teams such as Cleveland), and a wide variety of miscellaneous named entities (from software such as Java, to nationalities such as Basque, to sporting competitions such as 1,000 Lakes Rally).</Paragraph>
    <Paragraph position="7"> On this, our first attempt at an NER task, with just a few person-weeks of effort and little work on development-set error analysis, our method currently obtains an overall English F1 of 84.04% on the test set by using CRFs, feature induction and Web-augmented lexicons. German F1 using very limited lexicons is 68.11%.</Paragraph>
  </Section>
</Paper>