<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2207">
  <Title>A Hybrid Approach for the Acquisition of Information Extraction Patterns</Title>
  <Section position="2" start_page="0" end_page="48" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Traditionally, Information Extraction (IE) identifies domain-specific events, entities, and relations among entities and/or events, with goals such as populating relational databases, providing event-level indexing in news stories, and feeding link discovery applications.</Paragraph>
    <Paragraph position="1"> By and large, the identification and selective extraction of relevant information is built around a set of domain-specific linguistic patterns. For example, for a &quot;financial market change&quot; domain one relevant pattern is &lt;NOUN fall MONEY to MONEY&gt;. When this pattern is matched on the text &quot;London gold fell $4.70 to $308.35&quot;, a change of $4.70 is detected for the financial instrument &quot;London gold&quot;.</Paragraph>
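The pattern match described above can be illustrated with a minimal sketch. It assumes a plain regular expression over raw text as a stand-in for the pattern &lt;NOUN fall MONEY to MONEY&gt;; an actual IE system would match over parsed, entity-tagged text. The names MONEY, FALL_PATTERN, and match_fall are hypothetical and do not come from the paper.

```python
import re

# Assumption: a surface-level regex approximates the pattern
# NOUN fall MONEY to MONEY. Real IE patterns operate on syntactic
# and named-entity annotations, not on raw strings.
MONEY = r"\$\d+(?:,\d{3})*(?:\.\d+)?"
FALL_PATTERN = re.compile(
    r"(.+?)\s+(?:fell|falls|fall)\s+(" + MONEY + r")\s+to\s+(" + MONEY + r")"
)


def match_fall(sentence):
    """Return (instrument, change, new_value), or None when nothing matches."""
    found = FALL_PATTERN.search(sentence)
    if found is None:
        return None
    instrument, change, new_value = found.groups()
    return instrument.strip(), change, new_value


print(match_fall("London gold fell $4.70 to $308.35"))
```

On the example sentence, the first captured group plays the role of the financial instrument and the second the detected change.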
    <Paragraph position="2"> Domain-specific patterns are either hand-crafted or acquired automatically (Riloff, 1996; Yangarber et al., 2000; Yangarber, 2003; Stevenson and Greenwood, 2005). To minimize annotation costs, some of the latter approaches use lightly supervised bootstrapping algorithms that require as input only a small set of documents annotated with their corresponding category label. The focus of this paper is to improve such lightly supervised pattern acquisition methods. Moreover, we focus on robust bootstrapping algorithms that can handle real-world document collections, which contain many domains.</Paragraph>
    <Paragraph position="3"> Although a rich literature covers bootstrapping methods applied to natural language problems (Yarowsky, 1995; Riloff, 1996; Collins and Singer, 1999; Yangarber et al., 2000; Yangarber, 2003; Abney, 2004), several questions remain unanswered when these methods are applied to syntactic or semantic pattern acquisition. In this paper we answer two of these questions: (1) Can pattern acquisition be improved with text categorization techniques? Bootstrapping-based pattern acquisition algorithms can also be regarded as incremental text categorization (TC), since in each iteration documents containing certain patterns are assigned the corresponding category label. Although TC is obviously not the main goal of pattern acquisition methodologies, it is nevertheless an integral part of the learning algorithm: each iteration of the acquisition algorithm depends on the previous assignments of category labels to documents. Hence, if the quality of the proposed TC solution is poor, the quality of the acquired patterns will suffer.</Paragraph>
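The dependence of each iteration on earlier label assignments can be made concrete with a toy sketch. It is not the algorithm proposed in the paper and does not implement co-training: documents are modeled as sets of pattern strings, and the scoring rule (the fraction of a pattern's documents that are already labeled) is an assumption chosen only for illustration. The function name bootstrap and its parameters are hypothetical.

```python
def bootstrap(docs, seed_patterns, iterations=2):
    """Toy bootstrapping loop over documents given as sets of pattern strings."""
    accepted = set(seed_patterns)
    for _ in range(iterations):
        # Text categorization step: a document receives the category label
        # when it contains at least one accepted pattern.
        labeled = [d for d in docs if d.intersection(accepted)]
        # Pattern selection step (assumed criterion): score each candidate by
        # the fraction of the documents containing it that are labeled.
        scores = {}
        for pattern in set().union(*docs) - accepted:
            containing = [d for d in docs if pattern in d]
            hits = sum(1 for d in containing if d in labeled)
            scores[pattern] = hits / len(containing)
        if not scores:
            break
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(scores), key=scores.get)
        accepted.add(best)
    return accepted


docs = [{"a", "b"}, {"b", "c"}, {"d"}]
print(sorted(bootstrap(docs, {"a"})))
```

In this toy run, pattern "c" can only be selected after "b" has been accepted and the second document has been labeled, so a labeling mistake in one iteration would propagate to later pattern choices.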
    <Paragraph position="4"> Motivated by this observation, we introduce a co-training-based algorithm (Blum and Mitchell, 1998) that uses a text categorization algorithm as reinforcement for pattern acquisition. We show, using both a direct and an indirect evaluation, that the combination of the two methodologies always improves the quality of the acquired patterns.</Paragraph>
    <Paragraph position="5"> (2) Which pattern selection strategy is best? While most bootstrapping-based algorithms follow the same framework, they vary significantly in what they consider the most relevant patterns in each bootstrapping iteration. Several approaches have been proposed in the context of word sense disambiguation (Yarowsky, 1995), named entity (NE) classification (Collins and Singer, 1999), pattern acquisition for IE (Riloff, 1996; Yangarber, 2003), or dimensionality reduction for text categorization (TC) (Yang and Pedersen, 1997). However, it is not clear which selection approach is the best for the acquisition of syntactico-semantic patterns. To answer this question, we have implemented a modular pattern acquisition architecture where several of these ranking strategies are implemented and evaluated. The empirical study presented in this paper shows that a strategy previously proposed for feature ranking for NE recognition outperforms algorithms designed specifically for pattern acquisition.</Paragraph>
    <Paragraph position="6"> The paper is organized as follows: Section 2 introduces the bootstrapping framework used throughout the paper. Section 3 introduces the data collections. Section 4 describes the direct and indirect evaluation procedures. Section 5 presents a detailed empirical evaluation of the proposed system. Section 6 concludes the paper.</Paragraph>
  </Section>
</Paper>