<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1061">
  <Title>Teaching a Weaker Classifier: Named Entity Recognition on Upper Case Text</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
LONGTIME HOUSE STAFFER AND AN EXPERT IN SECURITIES LAWS, IS A LEADING
CANDIDATE TO BE CHAIRWOMAN OF THE SECURITIES AND EXCHANGE COMMISSION
IN THE CLINTON ADMINISTRATION.
</SectionTitle>
    <Paragraph position="0"> can also be applied on transcribed text from automatic speech recognizers in Speech Normalized Orthographic Representation (SNOR) format, or from optical character recognition (OCR) output. For the English language, a word starting with a capital letter often designates a named entity. Upper case NERs do not have case information to help them to distinguish named entities from non-named entities. When data is sparse, many named entities in the test data would be unknown words. This makes upper case named entity recognition more difficult than mixed case. Even a human would experience greater difficulty in annotating upper case text than mixed case text (Figure 1).</Paragraph>
    <!-- Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 481-488. -->
    <Paragraph position="1"> We propose using a mixed case NER to &quot;teach&quot; an upper case NER, by making use of unlabeled mixed case text. With the abundance of mixed case unlabeled texts available in so many corpora and on the Internet, it will be easy to apply our approach to improve the performance of NER on upper case text. Our approach does not satisfy the usual assumptions of co-training (Blum and Mitchell, 1998).</Paragraph>
    <Paragraph position="2"> Intuitively, however, one would expect some information to be gained from mixed case unlabeled text, where case information is helpful in pointing out new words that could be named entities. We show empirically that such an approach can indeed improve the performance of an upper case NER.</Paragraph>
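The teaching idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the "teacher" is a toy capitalization heuristic standing in for a trained mixed case NER, and the step of converting the teacher-labeled text to upper case to obtain training data for the upper case NER is one plausible reading of the approach.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def toy_mixed_case_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for a trained mixed case NER.

    Tags any capitalized token that is not sentence-initial as an entity.
    A real teacher would be a statistical model, not this heuristic.
    """
    return [
        "ENT" if i > 0 and tok[:1].isupper() else "O"
        for i, tok in enumerate(tokens)
    ]


def make_upper_case_training_data(
    sentences: List[List[str]], teacher: Tagger
) -> List[List[Tuple[str, str]]]:
    """Label mixed case sentences with the teacher, then discard case.

    The result pairs upper-cased tokens with the teacher's labels, so it
    can serve as extra (automatically labeled) training data for an
    upper case NER. This conversion step is an assumption of the sketch.
    """
    labeled = []
    for tokens in sentences:
        labels = teacher(tokens)  # teacher sees the case information
        labeled.append([(tok.upper(), lab) for tok, lab in zip(tokens, labels)])
    return labeled


# Hypothetical unlabeled mixed case sentence.
unlabeled = [["The", "chairman", "met", "Clinton", "in", "Philadelphia", "."]]
auto_labeled = make_upper_case_training_data(unlabeled, toy_mixed_case_tagger)
```

In this sketch the upper case learner never sees the capitalization cue directly; it only sees the labels that the teacher derived from it, which is the intuition the paragraph above appeals to.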
    <Paragraph position="3"> In Section 5, we show that for MUC-6, this way of using unlabeled text can bring a relative reduction in errors of 38.68% between the upper case and mixed case NERs. For MUC-7 the relative reduction in errors is 22.49%.</Paragraph>
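One plausible reading of "relative reduction in errors between the upper case and mixed case NERs" is the fraction of the performance gap between the two systems that is closed by the additional training. The sketch below computes that quantity; the function name, the interpretation, and the scores in the usage example are assumptions for illustration only and are not taken from the paper.

```python
def relative_gap_reduction(
    mixed_score: float, upper_before: float, upper_after: float
) -> float:
    """Fraction of the mixed-vs-upper case gap closed by the improvement.

    All three arguments are scores on the same scale (for example
    F-measure in percent). This formula is an assumed interpretation
    of the reported figure, not a definition quoted from the source.
    """
    gap_before = mixed_score - upper_before
    gap_after = mixed_score - upper_after
    if gap_before == 0:
        raise ValueError("no gap between the two systems to reduce")
    return (gap_before - gap_after) / gap_before


# Hypothetical scores: the gap shrinks from 10 points to 6 points.
reduction = relative_gap_reduction(
    mixed_score=90.0, upper_before=80.0, upper_after=84.0
)
print(f"{reduction:.2%}")
```

Under this reading, a value such as 38.68% would mean that a little over a third of the original gap between the upper case and mixed case systems has been removed.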
  </Section>
</Paper>