File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/94/c94-2108_intro.xml

Size: 2,470 bytes

Last Modified: 2025-10-06 14:05:42

<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2108">
  <Title>CONTENT CHARACTERIZATION USING WORD SHAPE TOKENS</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> There are many text processing tasks that we would like to accomplish, such as document classification, text database structuring, matching documents with queries, and topic characterization. The field of computational linguistics has developed a variety of techniques for accomplishing these tasks for text documents represented by character codes (e.g., ASCII). However, many documents for which we would like to use our automated techniques are not stored online in character-coded format, but instead exist only on paper. Optical character recognition (OCR) is a technique for converting scanned document images into character codes. By using OCR, document images can be converted into a form amenable to existing text processing techniques. However, OCR is expensive, slow, and often inaccurate. Because of these drawbacks, we would like to avoid OCR if we can, or at the least, postpone using OCR until we are confident that a document warrants detailed processing. In other words, we would like a high-bandwidth document processing system that is sensitive enough to detect desired document features.</Paragraph>
    <Paragraph position="1"> Our document understanding goals at the Fuji Xerox Palo Alto Laboratory include language determination (Nakayama and Spitz, 1993; Sibun and Spitz, forthcoming), content characterization, and style characterization. Toward these goals, we are developing a set of methods for extracting information from document images that do not depend on OCR. We have been working toward our goal of inexpensive content characterization by adapting a part-of-speech tagger to process word shape tokens rather than character-coded words. Part-of-speech tagging is a technique that has been developed and refined over the past several years, and it provides an inexpensive, fast, and reliable source of information for recognizing noun phrases and other syntax-related text features which help characterize a document's content.</Paragraph>
    <Paragraph position="2"> In this paper, we describe how we combine our technology for determining word shape tokens with text-tagging technology. We are developing systems that can</Paragraph>
  </Section>
</Paper>