<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1122">
  <Title>Named Entity Discovery Using Comparable News Articles</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recently, Named Entity (NE) recognition has been getting more attention as a basic building block for practical natural language applications. A Named Entity tagger identifies proper expressions such as names, locations and dates in sentences. We are trying to extend this to an Extended Named Entity tagger, which additionally identifies some common nouns such as disease names or products. We believe that identifying these names is useful for many applications such as information extraction or question answering (Sekine et al., 2002).</Paragraph>
    <Paragraph position="1"> Normally a Named Entity tagger uses lexical or contextual knowledge to spot names which appear in documents. One of the major problems of this task is data sparseness. Names appear very frequently in regularly updated documents such as news articles or web pages. They are, however, much more varied than common nouns, and they change continuously. Since it is hard to construct a set of pre-defined names by hand, corpus-based approaches are usually used for building such taggers.</Paragraph>
    <Paragraph position="2"> However, as Zipf's law indicates, most of the names which occupy a large portion of the vocabulary are rarely used. So it is hard for Named Entity tagger developers to keep up with a contemporary set of words, even when a large number of documents are provided for learning. There still might be a &quot;wild&quot; noun which doesn't appear in the corpora. Several attempts have been made to tackle this problem by using unsupervised learning techniques, which make vast amounts of unannotated text available for use. Strzalkowski and Wang (1996) and Collins and Singer (1999) tried to obtain either lexical or contextual knowledge from a seed given by hand. They trained the two different kinds of knowledge alternately at each iteration of training. Yangarber et al. (2002) tried to discover names with a similar method. However, these methods still suffer in the situation where the number of occurrences of a certain name is rather small.</Paragraph>
  </Section>
</Paper>