<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1102"> <Title>Encoding Linguistic Corpora</Title> <Section position="3" start_page="11" end_page="12" type="intro"> <SectionTitle> 2 CES Overview </SectionTitle> <Paragraph position="0"> The development of the CES involves the following steps: (1) analysis of the needs of corpus-based NLP research, both in terms of the kinds and degree of annotation required and the requirements for efficient processing, accessibility, etc.; and (2) analysis of general properties and configuration of corpora, the relevant structural and logical features of component text types, and the design of encoding mechanisms that can represent all required elements and features while accommodating the requirements determined in (1).</Paragraph> <Paragraph position="1"> The CES applies to monolingual corpora including texts from a variety of western and eastern European languages, as well as multi-lingual corpora and parallel corpora comprising texts in any of these languages.</Paragraph> <Paragraph position="2"> The term &quot;corpus&quot; here refers to any collection of linguistic data, whether or not it is selected or structured according to some design criteria. According to this definition, a corpus can potentially contain any text type, including not only prose, newspapers, poetry, drama, etc., but also word lists, dictionaries, etc. The CES is also intended to cover transcribed spoken data.</Paragraph> <Paragraph position="3"> The CES distinguishes between primary data, which is &quot;unannotated&quot; data in electronic form, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc., and linguistic annotation, which comprises information generated and added to the primary data as a result of some linguistic analysis.
The CES covers the encoding of objects in the primary data that are seen to be relevant to corpus-based work in language engineering research and applications, including: (1) Document-wide markup: bibliographic description of the document, encoding description, etc.</Paragraph> <Paragraph position="4"> (2) Gross structural markup: - structural units of text, such as volume, chapter, etc., down to the level of paragraph; also footnotes, titles, headings, tables, figures, etc.</Paragraph> <Paragraph position="5"> - normalization to recommended character sets and entities (3) Markup for sub-paragraph structures: - sentences, quotations - words - abbreviations, names, dates, terms, cited words, etc.</Paragraph> <Paragraph position="6"> In addition, the CES covers encoding conventions for linguistic annotation of text and speech, currently including morpho-syntactic tagging and parallel text alignment. We intend to extend the CES in the near future to cover speech annotation, including prosody, phonetic transcription, alignment of levels of speech analysis, etc.; discourse elements; terminology; and lexicon encoding.</Paragraph> <Paragraph position="7"> Markup types (2) and (3) above include text elements down to the level of paragraph, which is the smallest unit that can be identified language-independently, as well as sub-paragraph structures which are usually signaled (sometimes ambiguously) by typography in the text and which are language-dependent.</Paragraph> <Paragraph position="8"> Document-wide markup and markup for linguistic annotation provide &quot;extra-textual&quot; information: the former provides information about the provenance, form, content and encoding of the text, and the latter enriches the text with the results of some linguistic analysis.
As such, both add information about the text rather than identify constituent elements.</Paragraph> <Paragraph position="9"> The CES is intended to cover those areas of corpus encoding on which there exists consensus among the language engineering community, or on which consensus can be easily achieved. Areas where no consensus can be reached (for example, sense tagging) are not treated at this time.</Paragraph> </Section> </Paper>