<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1094">
  <Title>Encoding standards for large text resources: The Text Encoding Initiative</Title>
  <Section position="2" start_page="0" end_page="574" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The past few years have seen a burst of activity in the development of statistical methods which, applied to massive text data, have in turn enabled the development of increasingly comprehensive and robust models of language structure and use. Such models are increasingly recognized as an invaluable resource for natural language processing (NLP) tasks, including machine translation.</Paragraph>
    <Paragraph position="1"> The upsurge of interest in empirical methods for language modelling has led inevitably to a need for massive collections of texts of all kinds, including text collections which span genre, register, spoken and written data, etc., as well as domain- or application-specific collections, and, especially, multi-lingual collections with parallel translations. In the latter half of the 1980s, very few appropriate or adequately large text collections existed for use in computational linguistics research, especially for languages other than English.</Paragraph>
    <Paragraph position="2"> Consequently, several efforts to collect and disseminate large mono- and multi-lingual text collections have recently been established, including the ACL Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), which has developed a multilingual, partially parallel corpus, the U.S. Linguistic Data Consortium (LDC), RELATOR and MULTEXT in Europe, etc. (see Armstrong-Warwick, 1993). It is widely recognized that such efforts constitute only a beginning for the necessary data collection and dissemination efforts, and that considerable work to develop adequately large and appropriately constituted textual resources still remains.</Paragraph>
    <Paragraph position="3"> The demand for extensive reusability of large text collections in turn requires the development of standardized encoding formats for this data. It is no longer realistic to distribute data in ad hoc formats, since the effort and resources required to clean up and reformat the data for local use are at best costly, and in many cases prohibitive. Because much existing and potentially available data was originally formatted for the purposes of printing, the information explicitly represented in the encoding concerns a particular physical realization of a text rather than its logical structure (which is of greater interest for most NLP applications), and the correspondence between the two is often difficult or impossible to establish without substantial work.</Paragraph>
    <Paragraph position="4"> Further, as data become more and more available and the use of large text collections becomes more central to NLP research, general and publicly available software to manipulate the texts is being developed which, to be itself reusable, also requires the existence of a standard encoding format.</Paragraph>
    <Paragraph position="5"> A standard encoding format adequate for representing textual data for NLP research must be (1) capable of representing the different kinds of information across the spectrum of text types and languages potentially of interest to the NLP research community, including prose, technical documents, newspapers, verse, drama, letters, dictionaries, lexicons, etc.; (2) capable of representing different levels of information, including not only physical characteristics and logical structure (as well as other more complex phenomena such as intra- and inter-textual references, alignment of parallel elements, etc.), but also interpretive or analytic annotation which may be added to the data (for example, markup for part of speech, syntactic structure, etc.); (3) application independent, that is, it must provide the required flexibility and generality to enable, possibly simultaneously, the explicit encoding of potentially disparate types of information within the same text, as well as accommodate all potential types of processing. The development of such a suitably flexible and comprehensive encoding system is a substantial intellectual task, demanding (just to start) the development of suitably complex models for the various text types as well as an overall model of text and an architecture for the encoding scheme that is to embody it.</Paragraph>
  </Section>
</Paper>