<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1034">
  <Title>A Hybrid Approach for Named Entity and Sub-Type Tagging*</Title>
  <Section position="1" start_page="0" end_page="247" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper presents a hybrid approach for named entity (NE) tagging which combines Maximum Entropy Model (MaxEnt), Hidden Markov Model (HMM) and handcrafted grammatical rules. Each has innate strengths and weaknesses; the combination results in a very high precision tagger. MaxEnt includes external gazetteers in the system. Sub-category generation is also discussed.</Paragraph>
    <Paragraph position="1"> Introduction Named entity (NE) tagging is a task in which location names, person names, organization names, monetary amounts, time and percentage expressions are recognized and classified in unformatted text documents. This task provides important semantic information, and is a critical first step in any information extraction system. Intense research has been focused on improving NE tagging accuracy using several different techniques. These include rule-based systems \[Krupka 1998\], Hidden Markov Models (HMM) \[Bikel et al. 1997\] and Maximum Entropy Models (MaxEnt) \[Borthwick 1998\]. A system based on manual rules may provide the best performance; however these require painstaking intense skilled labor.. Furthermore, shifting domains involves significant effort and may result in performance degradation. The strength of HMM models lie in their capacity for modeling local contextual information. HMMs have been widely used in continuous speech recognition, part-of-speech tagging, OCR, etc., and are generally regarded as the most successful statistical modelling paradigm in these domains. MaxEnt is a powerful tool to be used in situations where several ambiguous information sources need to be combined. Since statistical techniques such as HMM are only as good as the data they are trained on, they are required to use back-off models to compensate for unreliable statistics. In contrast to empirical back-off models used in HMMs, MaxEnt provides a systematic method by which a statistical model consistent with all obtained knowledge can be trained. \[Borthwick et al. 1998\] discuss a technique for combining the output of several NE taggers in a black box fashion by using MaxEnt. They demonstrate the superior performance of this system; however, the system is computationally inefficient since many taggers need to be run.</Paragraph>
    <Paragraph position="2"> In this paper we propose a hybrid method for NE tagging which combines all the modelling techniques mentioned above. NE tagging is a complex task and high-performance systems are required in order to be practically usable.</Paragraph>
    <Paragraph position="3"> Furthermore, the task demonstrates characteristics that can be exploited by all three techniques. For example, time and monetary expressions are fairly predictable and hence processed most efficiently with handcrafted grammar rules. Name, location and organization entities are highly variable and thus lend themselves to statistical training algorithms such as HMMs. Finally, many conflicting pieces of information regarding the class of a tag are * This work was supported in part by the SBIR grant F30602-98-C-0043 from Air Force Research Laboratory (AFRL)/IFED.</Paragraph>
    <Paragraph position="4">  frequently present. This includes information from less than perfect gazetteers. For this, a MaxEnt approach works well in utilizing diverse sources of information in determining the final tag. The structure of our system is shown in</Paragraph>
    <Paragraph position="6"> The first module is a rule-based tagger containing pattern match rules, or templates, for time, date, percentage, and monetary expressions. These tags include the standard MUC tags \[Chinchor 1998\], as well as several other sub-categories defined by our organization. More details concerning the sub-categories are presented later. The pattern matcher is based on Finite State Transducer (FST) technology \[Roches &amp; Schabes 1997\] that has been implemented in-house. The subsequent modules are focused on location, person and organization names. The second module assigns tentative person and location tags based on external person and location gazetteers. Rather than relying on simple lookup of the gazetteer which is very error prone, this module employs MaxEnt to build a statistical model that incorporates gazetteers with common contextual information. The core module of the system is a bigram-based HMM \[Bikel et a1.1997\]. Rules designed to correct errors in NE segmentation are incorporated into a constrained HMM network. These rules serve as constraints on the HMM model and enable it to utilize information beyond bigrams and remove obvious errors due to the limitation of the training corpus. HMM generates the standard MUC tags, person, location and organization. Based on MaxEnt, the last module derives sub-categories such as city, airport, government, etc. from the basic tags.</Paragraph>
    <Paragraph position="7"> Section 1 describes the FST rule module.</Paragraph>
    <Paragraph position="8"> Section 2 discusses combining gazetteer information using MaxEnt. The constrained HMM is described in Section 3. Section 4 discusses sub-type generation by MaxEnt. The experimental results and conclusion are presented finally.</Paragraph>
  </Section>
class="xml-element"></Paper>