<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2716"> <Title>Layering and Merging Linguistic Annotations</Title> <Section position="2" start_page="0" end_page="4" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The American National Corpus (ANC) project recently issued its second release, consisting of approximately 22 million words of data representing a variety of genres of both written and spoken data. The corpus is annotated with several layers of automatically produced linguistic information, including sentence and token boundaries, part of speech using two different POS tagsets (a version of the Penn tagset and the Biber tagset), and noun chunks and verb chunks. ANC primary documents are plain-text (UTF-16) documents and are treated as &quot;read-only&quot; resources. All annotations are represented in stand-off XML documents referencing spans in the primary data or other annotation documents, using the XCES (http://www.xces.org) implementation of the specifications of ISO TC37 SC4's Linguistic Annotation Framework (LAF) (Ide and Romary, 2004). Because few systems that enable search and access of the corpus currently support stand-off markup, the project has developed a parser that generates ANC data with annotations in-line, in a variety of output formats.</Paragraph> <Paragraph position="1"> This demonstration will show the &quot;life-cycle&quot; of an ANC document, from acquisition of a document in any of a variety of formats (MS Word, PDF, HTML, etc.) through annotation and final representation in the stand-off format. The ANC tool for merging annotations of the user's choice with the primary data to produce a single document with in-line annotations will also be demonstrated.</Paragraph> </Section> </Paper>