<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1102">
  <Title>Encoding Linguistic Corpora</Title>
  <Section position="4" start_page="12" end_page="13" type="metho">
    <SectionTitle>
3 Levels of Conformance
</SectionTitle>
    <Paragraph position="0"> The CES provides a TEI-conformant Document Type Definition (DTD) for three levels of encoding for primary data together with its documentation (the &amp;quot;cesDoc DTD&amp;quot;): Level 1: the minimum encoding level required for CES conformance, requiring markup for gross document structure (major text divisions), down to the level of the paragraph. Specifically, the following must be fulfilled:  paragraph level is included. Note, however, that for Level 1 CES conformance, paragraph-level markup need not be refined. For example, all carriage returns may be converted automatically to &lt;p&gt; (paragraph) tags; identifying instances where a carriage return signals a list, a long quote, etc. is not required.</Paragraph>
    <Paragraph position="1"> It is also recommended that there should be no information loss for sub-paragraph elements. Sub-paragraph elements identified in the original by special typography not directly representable in the SGML encoded version (e.g., distinction by font such as italics, vs. distinction by capital letters or quote marks, which is directly representable in the encoded version) should be marked, typically using a &lt;hi&gt; (&amp;quot;highlighted&amp;quot;) tag.</Paragraph>
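    <Paragraph> The automatic Level 1 conversion mentioned above can be illustrated with a minimal Python sketch. It assumes plain-text input in which each line break marks a paragraph; the function name lines_to_paragraphs and the use of a body container element are choices made for this illustration and are not taken from the CES documentation.

```python
import xml.etree.ElementTree as ET


def lines_to_paragraphs(raw_text):
    """Wrap every non-empty line of raw_text in a p element.

    This mirrors the unrefined Level 1 conversion: no attempt is made
    to detect lists, long quotes, or other paragraph-level elements.
    """
    body = ET.Element("body")  # container element assumed for the sketch
    for line in raw_text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines rather than emit empty paragraphs
            paragraph = ET.SubElement(body, "p")
            paragraph.text = stripped
    return body


sample = "First paragraph.\n\nSecond paragraph.\n- an item that stays a plain paragraph\n"
body = lines_to_paragraphs(sample)
print(len(body.findall("p")))
```

As in the Level 1 description, a line that in fact begins a list is still emitted as an ordinary paragraph; refining such cases is left to the higher conformance levels.</Paragraph>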
    <Paragraph position="2"> Level 2: requires that paragraph-level elements are correctly marked, and that (where possible) the function of rendition information at the sub-paragraph level is determined and the elements marked accordingly. Specific requirements are: * The requirements for a Level 1 document are satisfied.</Paragraph>
    <Paragraph position="3"> * If a sub-paragraph element is marked, every occurrence of that element has been identified and marked in the text.</Paragraph>
    <Paragraph position="4"> * SGML entities replace all special characters (e.g., &amp;mdash;, &amp;pound;, etc.).</Paragraph>
    <Paragraph position="5"> * Quotation marks are removed and either replaced by appropriate standard SGML entities, or represented in a rend attribute on a &lt;q&gt; or &lt;quote&gt; tag.</Paragraph>
    <Paragraph position="6"> * The document validates against the cesDoc DTD, using an SGML parser such as sgmls.</Paragraph>
    <Paragraph position="7"> It is further recommended that all paragraph level elements (lists, quotes, etc.) are correctly identified, and, where possible, &lt;hi&gt; tags are resolved to more precise tags (foreign, term, etc.). Level 3: the most restrictive and refined level of markup for primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph level tagging. Conformance to this level demands: * Requirements for a Level 2 document are satisfied.</Paragraph>
    <Paragraph position="8"> * All paragraph level elements (lists, quotes, etc.) are correctly identified. * Where possible, &lt;hi&gt; tags are resolved to more precise tags (foreign, term, etc.). * The following sub-paragraph elements have been identified and marked (either with explicit tags such as &lt;abbr&gt;, &lt;num&gt;, etc. or with user-defined morpho-syntactic tags):</Paragraph>
    <Paragraph position="9">  - abbreviations - numbers - names - foreign words and phrases  * Where s-units and dialogue are tagged, the &lt;p&gt; - &lt;s&gt; - &lt;q&gt; hierarchy must be followed.</Paragraph>
    <Paragraph position="10"> * The encoding for all elements including and below the level of the paragraph has been validated for a 10 percent sample of the text. Note: this does not include morpho-syntactic tagging, if present.</Paragraph>
    <Paragraph position="11"> * The document validates against the cesDoc DTD, using an SGML parser such as sgmls.</Paragraph>
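    <Paragraph> One way to select a 10 percent sample for validation is sketched below in Python. The text above does not say how the sample is to be drawn; sampling by paragraph identifier, the function name sample_for_validation, and the fixed random seed are all assumptions made for illustration.

```python
import random


def sample_for_validation(element_ids, percent=10, seed=0):
    """Return a reproducible random sample of element identifiers."""
    if not element_ids:
        return []
    # Integer ceiling division, so that small texts still yield one element.
    size = max(1, (len(element_ids) * percent + 99) // 100)
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return sorted(rng.sample(element_ids, size))


paragraph_ids = ["p%d" % n for n in range(1, 51)]  # hypothetical identifiers
to_check = sample_for_validation(paragraph_ids)
print(len(to_check))
```

Fixing the seed is a design choice that lets a second reviewer re-draw the same sample; a project may equally prefer a contiguous stretch of text.</Paragraph>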
  </Section>
  <Section position="5" start_page="13" end_page="13" type="metho">
    <SectionTitle>
4 Data Architecture
</SectionTitle>
    <Paragraph position="0"> The CES adopts a strategy whereby annotation information is not merged with the original, but rather retained in separate SGML documents (with different DTDs) and linked to the original or other annotation documents.</Paragraph>
    <Paragraph position="1"> Linkage between original and annotation documents is accomplished using the TEI addressing mechanisms for element linkage.</Paragraph>
    <Paragraph position="2"> The CES linkage specifications are currently being updated to conform to XML (Maler &amp; DeRose, 1998).</Paragraph>
    <Paragraph position="3"> The hyper-document comprising each text in the corpus and its annotations consists of several documents. The base or &amp;quot;hub&amp;quot; document is the unannotated document containing only primary data markup. The hub document is &amp;quot;read only&amp;quot; and is not modified in the annotation process. Each annotation document is a proper SGML document with a DTD, containing annotation information linked to its appropriate location in the hub document or another annotation document.</Paragraph>
    <Paragraph position="4"> All annotation documents are linked to the SGML original (containing the primary data) or other annotation documents using one-way links. The exception is the output of the aligner for parallel texts, which consists of an SGML document containing only two-way links associating locations in two documents in different languages. The two linked documents contain the relevant structural information, such as sentence or word boundaries. The overall architecture is given in Figure 1.</Paragraph>
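    <Paragraph> The hub-and-annotation arrangement can be modelled roughly as follows. This Python sketch is an illustration only: the class names HubDocument and AnnotationDocument, their fields, and the sample identifiers are hypothetical, and the actual CES documents are SGML files linked with TEI addressing mechanisms rather than in-memory objects.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from types import MappingProxyType


@dataclass(frozen=True)
class HubDocument:
    """Primary data only; frozen to mimic the read-only hub."""
    name: str
    elements: MappingProxyType  # maps element id to text content

    def has_id(self, element_id):
        return element_id in self.elements


@dataclass
class AnnotationDocument:
    """Annotations with one-way links into the hub or another annotation document."""
    name: str
    target: object  # a HubDocument or another AnnotationDocument
    annotations: dict = field(default_factory=dict)  # own id to (target id, payload)

    def has_id(self, element_id):
        return element_id in self.annotations

    def add(self, own_id, target_id, payload):
        # The link points one way, from this document to its target.
        if not self.target.has_id(target_id):
            raise KeyError(target_id)
        self.annotations[own_id] = (target_id, payload)


hub = HubDocument("hub", MappingProxyType({"p1s1": "First sentence."}))
segmentation = AnnotationDocument("segmentation", hub)
segmentation.add("t1", "p1s1", "token boundaries")
tagging = AnnotationDocument("tagging", segmentation)
tagging.add("a1", "t1", "part of speech")

try:
    hub.name = "changed"
except FrozenInstanceError:
    print("hub document is read-only")
```

Nothing in the hub refers to the annotation documents, so further layers can be added or discarded without touching the primary data.</Paragraph>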
  </Section>
  <Section position="6" start_page="13" end_page="16" type="metho">
    <SectionTitle>
5 The CES DTDs
</SectionTitle>
    <Paragraph position="0"> Because the CES is an application of SGML, document structure is defined using a context free grammar in a document type definition (DTD).</Paragraph>
    <Paragraph position="1"> At present, the CES provides three different TEI customizations, each instantiated using the TEI.2 DTD and the appropriate TEI customization files, for use with different documents. For convenience, a version of each of these three TEI instantiations is provided as a stand-alone DTD, together with a means to browse the element tree as a hypertext</Paragraph>
    <Section position="1" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
5.1 The cesDoc DTD
</SectionTitle>
      <Paragraph position="0"> The cesDoc DTD is used to encode primary documents, ranging from texts with gross structural markup only to texts heavily and consistently marked for elements of relevance for corpus-based work. It defines the required structure for marking Level 1 conformant documents down to the paragraph level. It also defines additional elements at the sub-paragraph level which may appear, but are not required, in a Level 1 encoding, and which are used in Level 2 and Level 3 encodings.</Paragraph>
      <Paragraph position="1"> There are five main categories of sub-paragraph elements:  There have been two main defining forces behind the choice of linguistic elements: (1) the needs of corpus-annotation tools, such as morpho-syntactic taggers, whose performance can often be improved by pre-identification of elements such as names, addresses, titles, dates, measures, foreign words and phrases, etc.</Paragraph>
      <Paragraph position="2"> (2) the need to identify objects which have intrinsic linguistic interest, or are often useful for the purposes of translation, text alignment, etc., such as abbreviations, names, terms, linguistically distinct words and phrases, etc.</Paragraph>
      <Paragraph position="3"> The CES documentation provides an informal semantics for tags used in the cesDoc DTD, especially sub-paragraph linguistic elements. For example, the CES provides a precise description of the textual phenomena that should be marked with &lt;name&gt; tags (e.g., do not tag laws named after people, etc.). The documentation also includes specifications for the format of such encoding. For example, titles and roles (e.g., &amp;quot;President&amp;quot; in &amp;quot;President Clinton&amp;quot;) should not be included inside the &lt;name&gt; tag, punctuation not a part of the name is not enclosed in the &lt;name&gt; tag (e.g., &amp;quot;President &lt;name type=person&gt; Clinton&lt;/name&gt;,&amp;quot;), etc. In addition, precise rules for handling punctuation in abbreviations, sentences, quotations, as well as apostrophes, etc., are provided, as well as a hierarchical referencing system used to generate distinct identifiers (SGML id's) for structural elements such as chapters, paragraphs, sentences, and words.</Paragraph>
      <Paragraph position="4"> In general, the rules for encoding sub-paragraph elements are driven by two considerations:  (1) Retrieval: it is essential that items marked  with like tags in a document represent the same kind of object. Therefore, while &amp;quot;Clinton&amp;quot; in a phrase such as &amp;quot;President Clinton today said...&amp;quot; is marked as a name, it is not marked as a name in the phrase &amp;quot;the Clinton doctrine&amp;quot;. (2) Processing needs: There is a small class of tags which mark the presence of tokens that have been isolated and classified by the encoder, e.g., abbreviations, names, dates, numbers, terms, etc. For many language processing tools, when such an element is identified in the input stream, it is not desirable to further tokenize the string inside the tag; rather, the string inside the tag can be regarded as a single token (possibly with the type indicated by the tag name). For example, in some languages it may be possible for lexical lookup routines and morpho-syntactic taggers to assume that an element with the tag &lt;name&gt; is a single token with the grammatical category PROPER NOUN. Therefore, adjectival forms in English (e.g., &amp;quot;Estonian&amp;quot;) are not marked as names; generally, for any language, only nouns or noun phrases are marked as names.</Paragraph>
      <Paragraph position="5"> Similarly, for language processing purposes &amp;quot;Big Brother&amp;quot; can be regarded as a single token instead of two distinct tokens; if marked with a &lt;name&gt; tag, processing software may opt to avoid further tokenization of the marked entity. Based on this possibility, punctuation that is not a part of the token is not included inside the tag; in English, possessives are marked by placing the &amp;quot;'s&amp;quot; outside the tag, etc. The CES recommends that linguistic annotation be encoded in a separate SGML document with its own DTD, which is linked to the primary data. However, for some applications it is still desirable to retain morpho-syntactic annotation in the same SGML document as the primary data.</Paragraph>
      <Paragraph position="6"> Therefore, the CES provides means to accomplish this in-file tagging. To implement it, a pre-defined module containing all the required definitions for the morpho-syntactic information is brought in at the beginning of the document.</Paragraph>
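      <Paragraph> The single-token treatment of pre-identified elements described in this section can be illustrated with a small Python sketch. It is not CES software: the function name tokenize, the whitespace-based splitting, and the set ATOMIC_TAGS (an illustrative subset of tag names) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Tags whose content is kept as one token; an illustrative subset only.
ATOMIC_TAGS = {"name", "abbr", "date", "num", "term"}


def tokenize(element):
    """Return (token, type) pairs for the text inside element."""
    tokens = [(word, "word") for word in (element.text or "").split()]
    for child in element:
        if child.tag in ATOMIC_TAGS:
            # Do not split inside the tag: the whole string is one token.
            content = " ".join("".join(child.itertext()).split())
            tokens.append((content, child.tag))
        else:
            tokens.extend(tokenize(child))
        # Text following the child (its tail) is tokenized normally.
        tokens.extend((word, "word") for word in (child.tail or "").split())
    return tokens


sentence = ET.Element("s")
sentence.text = "They feared "
name = ET.SubElement(sentence, "name")
name.text = "Big Brother"
name.tail = "'s gaze"
print(tokenize(sentence))
```

Because the possessive marker sits outside the name element, it surfaces as a separate token and the name itself can be passed to lexical lookup unchanged.</Paragraph>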
    </Section>
    <Section position="2" start_page="14" end_page="15" type="sub_section">
      <SectionTitle>
5.2 The cesAna DTD
</SectionTitle>
      <Paragraph position="0"> The cesAna DTD is used for segmentation and grammatical annotation, including: * sentence boundary markup * tokens, each of which consists of the following: * the orthographic form of the token as it appears in the corpus * grammatical annotation, comprising one or more sets of the following: * the base form (lemma) * a morpho-syntactic specification * a corpus tag. Allowing more than one possible set of grammatical annotation enables representing data for which lexical lookup or some other morpho-syntactic analysis has been performed, but which has not been disambiguated. When disambiguation has been accomplished, an optional element can be included containing the disambiguated form.</Paragraph>
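      <Paragraph> The token structure listed above can be pictured as a simple record. The Python sketch below is a rough analogue, not the cesAna DTD itself: the class and field names are chosen for this illustration, and the tag values DET and PRON are hypothetical placeholders rather than tags from any CES-endorsed tagset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Analysis:
    base: str  # base form (lemma)
    msd: str   # morpho-syntactic specification
    ctag: str  # corpus tag


@dataclass
class Token:
    orth: str  # orthographic form as it appears in the corpus
    analyses: List[Analysis] = field(default_factory=list)
    disamb: Optional[str] = None  # filled in once disambiguation is done

    def is_ambiguous(self):
        # Ambiguous when several analyses remain and none has been chosen.
        return self.disamb is None and len(self.analyses) not in (0, 1)


token = Token(
    orth="le",
    analyses=[
        Analysis(base="le", msd="determiner", ctag="DET"),
        Analysis(base="le", msd="pronoun", ctag="PRON"),
    ],
)
print(token.is_ambiguous())
token.disamb = "DET"  # all candidate analyses are kept alongside the choice
print(token.is_ambiguous())
```

Keeping the full list of analyses after disambiguation reflects the option, noted above, of recording the disambiguated form in an additional element rather than discarding the alternatives.</Paragraph>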
      <Paragraph position="1"> The structure of the DTD constituents is based on the overall principle that one or more &amp;quot;chunks&amp;quot; of a text may be included in the annotation document. These chunks may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. For example, it is likely that within any text, only the paragraph content will undergo morpho-syntactic analysis, and titles, footnotes, captions, long quotations, etc. will be omitted or analyzed separately.</Paragraph>
      <Paragraph position="2"> The following example, which shows the annotation for the first word (&amp;quot;le&amp;quot; in French) of a primary data document stored in a file called &amp;quot;MyText1&amp;quot;, shows the use of many of the options provided in the cesAna DTD. This set of annotation data could be the final result after tokenization, segmentation, lexical lookup or morpho-syntactic analysis, and part of speech disambiguation. All the original options for morpho-syntactic class are retained here, and the disambiguated tag is provided in the &lt;disamb&gt; element.</Paragraph>
    </Section>
    <Section position="3" start_page="15" end_page="16" type="sub_section">
      <SectionTitle>
5.3 The cesAlign DTD
</SectionTitle>
      <Paragraph position="0"> The cesAlign DTD defines the annotation document containing alignment information for parallel texts. It consists entirely of links between the documents that have been aligned.</Paragraph>
      <Paragraph position="1"> Alignment may be between primary data documents or between annotation documents containing segmentation information for the aligned units (paragraphs, sentences, tokens etc.). Alignment may be between two or more such documents, which are identified in the header of the alignment document.</Paragraph>
      <Paragraph position="2"> The most common situation in aligning parallel translations is to align data that comprises the content of an entire SGML element, such as an &lt;s&gt;, &lt;par&gt;, or &lt;tok&gt; element. Especially when the aligned data is not in the SGML original document, it is likely that the elements to be associated will have id attributes by which they can be referenced in the alignment document, in order to specify the elements to be aligned or &amp;quot;linked&amp;quot;. Note that when the SGML ID and IDref mechanism is used to point from one element to another in the same SGML document, the SGML parser will validate the references to ensure that every IDREF points to a valid ID. In the CES, all alignment documents are separate from the documents that are being aligned, and therefore this validation of IDrefs by the SGML parser is lost. However, other software may be used to validate cross-document references, if necessary.</Paragraph>
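      <Paragraph> The cross-document check mentioned above, which the SGML parser cannot perform, might be done by a short script along the following lines. This Python sketch assumes that the identifiers declared in each document and the identifiers used in each link have already been extracted; the function name dangling_references and the data layout are hypothetical.

```python
def dangling_references(links, ids_by_document):
    """Return (document, id) pairs that point at no known element.

    links: iterable of dicts mapping a document name to an element id.
    ids_by_document: dict mapping a document name to the set of ids it declares.
    """
    missing = []
    for link in links:
        for document, element_id in link.items():
            if element_id not in ids_by_document.get(document, set()):
                missing.append((document, element_id))
    return missing


ids_by_document = {
    "doc1": {"p1s1", "p1s2"},
    "doc2": {"p1s1"},
}
links = [
    {"doc1": "p1s1", "doc2": "p1s1"},
    {"doc1": "p1s2", "doc2": "p1s2"},  # doc2 declares no p1s2
]
print(dangling_references(links, ids_by_document))
```

An empty result means every link end resolves to a declared identifier, which is the guarantee the parser would otherwise give for references within a single document.</Paragraph>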
      <Paragraph position="3"> The CES provides a simple means to point to SGML elements in other SGML documents by referring to IDs or any other unique identifying attribute on those elements, using the xtargets attribute on the &lt;link&gt; element. Here is a simple example: DOC1: &lt;s id=p1s1&gt;According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products.&lt;/s&gt; &lt;s id=p1s2&gt;Cola drink manufacturers in particular achieved above-average growth rates. &lt;/s&gt;  When the data to be linked does not include IDs on relevant elements (or for some reason it is not desired to use IDrefs for alignment), or when the data to be linked is not the entire content of an SGML element, it is necessary to reference locations in the documents using the CES notation, which consists of a combination of ESIS tree location and character offset.</Paragraph>
      <Paragraph position="4"> Conclusion By far the greatest need for the development of linguistic corpora is to ensure their usability and reusability in integrated platforms. This demands (at least): * the development and use of consistent and coherent encoding formats for data representation, as well as standardized schemes for annotation of linguistic information; * the development of reusable, integrated systems and tool architectures for language processing and analysis, including the corresponding development of a data architecture to best suit research needs.</Paragraph>
      <Paragraph position="5"> It is imperative that these activities be undertaken in collaboration. For example, an encoding format that maximizes processability and retrievability must be devised in view of the capabilities and architecture of the tools that will handle them; similarly, reusable tool design must be informed by full knowledge of the nature and representation of linguistic information, desired processes, etc.</Paragraph>
      <Paragraph position="6"> The development of the CES is an attempt to achieve this kind of integration between the development of encoding schemes and corpus processing and use. Very little study has been made to date of the relation between encoding conventions and the demands of processing and retrieval, despite the fact that with the development of digital libraries and web-based document delivery, consideration of these relationships is critical. The CES is in some sense an experiment to develop a principled basis for further work on this topic; it is in no way intended to be the complete and final answer to the problem. Rather, the CES is being developed from the bottom up, by starting with a relatively minimal set of encoding conventions and successively incorporating feedback to enlarge the standard as needed by the language processing community, and as processing and retrieval needs become better understood. Testing of the current CES specifications and feedback are both invited and encouraged, as well as input and suggestions concerning the treatment of other areas of corpus encoding.</Paragraph>
    </Section>
  </Section>
</Paper>