<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0435">
  <Title>Memory-Based Named Entity Recognition using Unannotated Data</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a memory-based approach to learning names in English and German newspaper text.</Paragraph>
    <Paragraph position="1"> The first system used no unannotated data: it relied only on the provided training material and a number of gazetteers.</Paragraph>
    <Paragraph position="2"> The gazetteers improved performance on the German task, but not on the English task. Type-token generalization helped for neither English nor German.</Paragraph>
    <Paragraph position="3"> The second system used unannotated data, but only for the English task. The extra data were used in two ways. First, additional gazetteers were derived from the corpus by exploiting conjunctions: if one string in a conjunction of capitalized strings is recognized as a certain type of name, the other strings are assumed to be of the same type and are stored in a new gazetteer. This list was then used to construct an additional feature for training the machine learning algorithm. Second, the system counted how often each word form in the additional corpus is capitalized and how often it is not. These counts were used as another feature for the learning algorithm.</Paragraph>
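    <Paragraph position="4"> The two uses of unannotated data described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the regular expression for conjunctions (two ASCII-capitalized strings joined by "and" or "or"), the function names expand_gazetteer and capitalization_counts, and the dictionary-based gazetteer are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a capitalized token is an ASCII uppercase letter plus word characters.
CAP_TOKEN = r"\b[A-Z]\w*"
# One or more capitalized tokens in a row, e.g. "Tony Blair".
CAP_STRING = rf"{CAP_TOKEN}(?:\s+{CAP_TOKEN})*"
# Assumption: a conjunction is two capitalized strings joined by "and" or "or".
CONJUNCTION = re.compile(rf"({CAP_STRING})\s+(?:and|or)\s+({CAP_STRING})")


def expand_gazetteer(sentences, gazetteer):
    """Return new name-to-type entries inferred from conjunctions.

    If one side of a conjunction is already in the gazetteer, the other
    side is assumed to be a name of the same type.
    """
    derived = {}
    for sentence in sentences:
        for left, right in CONJUNCTION.findall(sentence):
            for known, other in ((left, right), (right, left)):
                if known in gazetteer and other not in gazetteer:
                    derived[other] = gazetteer[known]
    return derived


def capitalization_counts(tokens):
    """Count, per lowercased word form, capitalized and uncapitalized uses."""
    capitalized, uncapitalized = Counter(), Counter()
    for token in tokens:
        # Skip empty tokens and tokens starting with a digit or punctuation.
        if not token[:1].isalpha():
            continue
        target = capitalized if token[0].isupper() else uncapitalized
        target[token.lower()] += 1
    return capitalized, uncapitalized


# Hypothetical example data, not taken from the shared task corpora.
seed = {"Germany": "LOC", "Tony Blair": "PER"}
corpus = [
    "Talks between Germany and France resumed.",
    "He met Tony Blair and Gerhard Schroeder in Berlin.",
]
new_entries = expand_gazetteer(corpus, seed)
caps, non_caps = capitalization_counts(["The", "bank", "Bank", "the", "Bank"])
```

    In this sketch, new_entries holds the names "France" and "Gerhard Schroeder" with the types of their conjunction partners, and the two counters could be turned into a single feature such as a ratio. A sentence-initial capitalized word directly before a name would be absorbed into the matched string, so a real system would need more careful tokenization.</Paragraph>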
  </Section>
</Paper>