<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2708"> <Title>Prague Czech-English Dependency Treebank: Any Hopes for a Common Annotation Scheme?</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The Prague Czech-English Dependency Treebank (PCEDT) is a project to create a syntactically annotated Czech-English parallel corpus, motivated by research in machine translation. Parallel data are needed for designing, training, and evaluating both statistical and rule-based machine translation systems.</Paragraph> <Paragraph position="1"> Czech is a language with a relatively high degree of word-order freedom, and its sentences contain syntactic phenomena, such as discontinuous constituents (non-projective constructions), that cannot be handled straightforwardly by the phrase-structure annotation scheme of the Penn Treebank (Marcus et al., 1993; Linguistic Data Consortium, 1999). We therefore decided to adopt for the PCEDT the dependency-based annotation scheme of the Prague Dependency Treebank (PDT; Linguistic Data Consortium, 2001). The PDT is annotated on three layers: the morphological layer (lowest), the analytical layer (middle), which provides surface syntactic annotation, and the tectogrammatical layer (highest), which captures linguistic meaning. Dependency trees, which represent sentence structure as centered on the verb and its valency, are used for the analytical and tectogrammatical layers, as proposed by Functional Generative Description (Sgall et al., 1986).</Paragraph> <Paragraph position="2"> In Section 2, we describe the process of translating the Penn Treebank into Czech. Section 3 sketches the general procedure for transforming the phrase topology of the Penn Treebank into dependency structure and describes the specific conversions into analytical and tectogrammatical representations. 
Section 4 describes the automatic parsing of Czech into analytical representation and its automatic conversion into tectogrammatical representation. Section 5 briefly discusses some annotation problems from the point of view of the mutual compatibility of annotation schemes. Section 6 gives an overview of additional resources included in the PCEDT.</Paragraph> </Section> </Paper>