<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2045">
  <Title>Unsupervised Feature Selection for Relation Extraction</Title>
  <Section position="2" start_page="0" end_page="262" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Relation extraction is the task of finding relationships between pairs of entities in text.</Paragraph>
    <Paragraph position="1"> There has been considerable work on supervised learning of relation patterns, using corpora which have been annotated to indicate the information to be extracted (e.g. (Califf and Mooney, 1999; Zelenko et al., 2002)). A range of extraction models have been used, including both symbolic rules and statistical models such as HMMs and kernel methods.</Paragraph>
    <Paragraph position="2"> These methods have been particularly successful in some specific domains. However, manually tagging large amounts of training data is very time-consuming; furthermore, it is difficult to port an extraction system across different domains.</Paragraph>
    <Paragraph position="3"> Due to the limitations of supervised methods, some weakly supervised (or semi-supervised) approaches have been suggested (Brin, 1998; Eugene and Luis, 2000; Sudo et al., 2003). One common characteristic of these algorithms is that they require a set of initial seeds to be predefined for each relation, from which they bootstrap to acquire further instances of that relation. However, it is not easy to select representative seeds that yield good results.</Paragraph>
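The shared seed-bootstrapping scheme can be illustrated with a toy DIPRE-style sketch. The corpus, seed pair, and string-matching heuristics below are illustrative assumptions for exposition, not the actual machinery of the cited systems:

```python
def bootstrap(corpus, seed_pairs, rounds=2):
    """Toy bootstrapping sketch: from seed entity pairs, harvest the
    patterns (contexts) connecting them, then apply those patterns to
    extract new pairs, and repeat for a fixed number of rounds."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Harvest patterns: the text between a known pair's two entities.
        for sent in corpus:
            for a, b in list(pairs):
                if a in sent and b in sent and sent.index(a) < sent.index(b):
                    start = sent.index(a) + len(a)
                    patterns.add(sent[start:sent.index(b)].strip())
        # Apply patterns: any "X <pattern> Y" match yields a new pair
        # (naive tokenization -- a real system would use NER spans).
        for sent in corpus:
            for pat in patterns:
                if pat and f" {pat} " in sent:
                    left, right = sent.split(f" {pat} ", 1)
                    if left.split() and right.split():
                        pairs.add((left.split()[-1], right.split()[0]))
    return pairs, patterns
```

Starting from one seed such as ("Google", "YouTube") over acquisition sentences, the harvested pattern "acquired" then pulls in new pairs like ("Microsoft", "Skype") — which also illustrates the fragility the paragraph above notes: a poorly chosen seed harvests unrepresentative patterns.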
    <Paragraph position="4"> Hasegawa et al. put forward an unsupervised approach for relation extraction from large text corpora (Hasegawa et al., 2004). First, they adopted a hierarchical clustering method to cluster the contexts of entity pairs. Second, after context clustering, they selected the most frequent words in the contexts to represent the relation that holds between the entities. However, the approach has its limitations. First, the similarity threshold for the clusters, like the appropriate number of clusters, is difficult to predefine. Second, the representative words selected by frequency alone tend to obscure the distinctions between clusters.</Paragraph>
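A minimal sketch of those two steps — grouping entity-pair contexts by similarity, then labeling each cluster with its most frequent words. The greedy centroid merge, cosine measure, whitespace tokenizer, and threshold value are simplifying assumptions, not the exact algorithm of Hasegawa et al.; note how the hand-set `threshold` embodies the first limitation described above:

```python
from collections import Counter

def bow(text):
    """Bag-of-words vector for one entity-pair context."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    num = sum(a[w] * b[w] for w in a if w in b)
    if num == 0:
        return 0.0
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5
    return num / (norm(a) * norm(b))

def cluster_contexts(contexts, threshold=0.2):
    """Greedy agglomeration: merge each context into the first cluster
    whose centroid is similar enough, else start a new cluster."""
    clusters = []  # each cluster is a list of bag-of-words vectors
    for ctx in contexts:
        v = bow(ctx)
        for cl in clusters:
            centroid = sum(cl, Counter())
            if cosine(v, centroid) >= threshold:
                cl.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def label_by_frequency(cluster, k=2):
    """Label a cluster with its k most frequent context words --
    the step whose labels can fail to discriminate between clusters."""
    total = sum(cluster, Counter())
    return [w for w, _ in total.most_common(k)]
```

On contexts like "acquired a stake" / "acquired shares" versus "headquartered in" / "based in", this yields two clusters, the first labeled "acquired"; with frequent function words left in, the same frequency-based labels quickly become uninformative.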
    <Paragraph position="5"> To solve the above problems, we present a novel unsupervised method based on model order selection and discriminative label identification. To achieve model order identification, a stability-based criterion is used to automatically estimate the number of clusters. To remove noisy feature words during clustering, feature selection is conducted by optimizing a trace-based criterion, subject to constraints, in an unsupervised manner. Furthermore, after relation clustering, we employ discriminative category matching (DCM) to find typical and discriminative words to represent the different relation types.</Paragraph>
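The idea behind stability-based model order selection can be sketched as follows: for each candidate number of clusters k, cluster the data several times under randomized perturbation and keep the k whose clusterings agree with each other most. The co-membership agreement measure, the random-restart perturbation, and the toy 1-D k-means below are simplifying assumptions for illustration, not the paper's exact criterion:

```python
import random
from itertools import combinations

def comembership_agreement(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree about
    being in the same cluster versus different clusters."""
    idx_pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in idx_pairs)
    return agree / len(idx_pairs)

def kmeans_1d(points, k, rng, iters=20):
    """Toy randomized 1-D k-means, used as the pluggable clusterer."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def stability_score(points, k, cluster_fn, runs=10, seed=0):
    """Average pairwise agreement between clusterings produced by
    repeated randomized runs at model order k."""
    rng = random.Random(seed)
    labelings = [cluster_fn(points, k, rng) for _ in range(runs)]
    scores = [comembership_agreement(a, b)
              for a, b in combinations(labelings, 2)]
    return sum(scores) / len(scores)

def select_order(points, cluster_fn, k_range=range(2, 6)):
    """Pick the k whose clusterings are most stable."""
    return max(k_range, key=lambda k: stability_score(points, k, cluster_fn))
```

On two well-separated 1-D groups, k=2 reproduces the same partition under every restart (agreement 1.0), while larger k splits the groups differently from run to run, so the stable order is selected without being fixed in advance — the point of replacing a hand-tuned similarity threshold.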
  </Section>
</Paper>