<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1659">
  <Title>Unsupervised Information Extraction Approach Using Graph Mutual Reinforcement</Title>
  <Section position="4" start_page="501" end_page="501" type="intro">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> Most previous work on Information Extraction (IE) has focused on supervised learning. Relation Detection and Characterization (RDC) was introduced in the Automatic Content Extraction (ACE) program (ACE, 2004). The approaches proposed for the ACE RDC task, such as kernel methods (Zelenko et al., 2002) and Maximum Entropy methods (Kambhatla, 2004), require large human-annotated corpora tagged with relation instances.</Paragraph>
    <Paragraph position="1"> However, human-annotated instances are limited, expensive, and time-consuming to obtain, owing to the shortage of experienced human annotators and low inter-annotator agreement.</Paragraph>
    <Paragraph position="2"> Some previous work adopted weakly supervised or unsupervised learning approaches.</Paragraph>
    <Paragraph position="3"> These approaches do not need large tagged corpora, but they do need seed examples or seed extraction patterns. Their major drawback is this dependency on handcrafted seed examples or seed patterns, which may limit generalization. Some of these approaches are summarized here. (Brin, 1998) presented an approach for extracting authorship information from book descriptions on the World Wide Web. The technique is based on dual iterative pattern relation extraction, wherein a set of relation instances and a set of patterns are iteratively constructed. This approach has two major drawbacks: it uses handcrafted seed examples to extract more examples similar to those seeds, and it uses a lexicon as the main source for extracting information. (Blum and Mitchell, 1998) proposed an approach based on co-training that uses unlabeled data in a particular setting. They exploit the fact that, for some problems, each example can be described by multiple representations.</Paragraph>
    <Paragraph position="4"> (Riloff &amp; Jones, 1999) presented the Meta-Bootstrapping algorithm, which uses an unannotated training data set and a set of seeds to learn a dictionary of extraction patterns and a domain-specific semantic lexicon. Other work, such as (Thelen &amp; Riloff, 2002) and (Lin et al., 2003), tried to exploit the duality of patterns and their extractions to infer the semantic class of words.</Paragraph>
    <Paragraph position="5"> (Muslea et al., 1999) introduced an inductive algorithm that generates extraction rules from user-labeled training examples. This approach suffers from the labeled-data bottleneck.</Paragraph>
    <Paragraph position="6"> (Agichtein et al., 2000) presented an approach that uses seed examples to generate initial patterns and to iteratively obtain further patterns. Ad hoc measures were then deployed to estimate the relevance of the newly obtained patterns. This approach has two major drawbacks: its dependency on seed examples limits its ability to generalize, and estimating pattern relevance requires ad hoc measures.</Paragraph>
    <Paragraph position="7"> (Hasegawa et al., 2004) introduced an unsupervised approach to relation extraction based on clustering the context words between named entities; this approach depends on an ad hoc similarity measure between context phrases and focuses on certain types of relations.</Paragraph>
    <Paragraph position="8"> (Etzioni et al., 2005) proposed a system for building lists of named entities found on the web. Their system uses a set of eight domain-independent extraction patterns to generate candidate facts.</Paragraph>
    <Paragraph position="9"> All the approaches proposed so far either require a large amount of labeled data or depend on seed patterns (or examples), which results in limited generalization.</Paragraph>
  </Section>
</Paper>