<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2705">
  <Title>Multi-dimensional Annotation and Alignment in an English-German Translation Corpus</Title>
  <Section position="2" start_page="0" end_page="35" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In translation studies, the question of how translated texts differ systematically from original texts has been an issue for quite some time, with a surge of research in the last ten or so years. Example-based contrastive analyses of small numbers of source texts and their translations had previously described characteristic features of translated texts, but without larger-scale empirical testing. Blum-Kulka (1986), for instance, formulates the hypothesis that explicitation is a characteristic phenomenon of translated versus original texts. Her hypothesis rests on linguistic evidence from individual sample texts showing that translators explicitate optional cohesive markers in the target text that are not realised in the source text. In general, explicitation covers all features that make implicit information in the source text clearer and thus explicit in the translation (cf. Steiner 2005).</Paragraph>
    <Paragraph position="1"> Building on example-based work like Blum-Kulka's, Baker put forward the notion of translation universals (cf. Baker 1996), which can be analysed in corpora of translated texts, regardless of the source language, in comparison to original texts in the target language. Olohan and Baker (2000) therefore analyse explicitation in English translations, concentrating on the frequency of the optional "that" versus the zero-connector in combination with the two verbs "say" and "tell". While extensive enough for statistical interpretation, corpus-driven research like Olohan and Baker's is limited in its validity to the selected strings.</Paragraph>
    <Paragraph position="2"> More generally speaking, there is a gap between the abstract research object and the low-level features used as indicators. This gap can be reduced by operationalising notions like explicitation into syntactic and semantic categories, which can be annotated and aligned in a corpus.</Paragraph>
    <Paragraph position="3"> Intelligent queries then produce linguistic evidence with more explanatory power than low-level data obtained from raw corpora. The results are not restricted to the queried strings but extend to more complex units sharing the syntactic and/or semantic properties obtained by querying the annotation.</Paragraph>
    <Paragraph position="4"> This methodology serves as a basis for the CroCo project, in which the assumed translation property of explicitation is investigated for the language pair English-German. The empirical evidence for the investigation consists of a corpus of English originals and their German translations, as well as German originals and their English translations. Both translation directions are represented in eight registers. Biber's calculations, i.e. 10 texts per register with a length of at least 1,000 words, serve as an orientation for the size of the sub-corpora (cf. Biber 1993). Altogether, the CroCo Corpus comprises one million words. Additionally, reference corpora are included for German and English. The reference corpora are register-neutral, including 2,000-word samples from 17 registers (see Neumann &amp; Hansen-Schirra 2005 for more details on the CroCo corpus design).</Paragraph>
    <Paragraph position="5"> The CroCo Corpus is tokenised and annotated for part-of-speech, morphology, phrasal categories and grammatical functions. Furthermore, the following (annotation) units are aligned: words, grammatical functions, clauses and sentences.</Paragraph>
    <Paragraph position="6"> The annotation and alignment steps are described in section 2.</Paragraph>
    <Paragraph position="7"> Each annotation and alignment layer is stored separately in a multi-layer stand-off XML representation format. In order to empirically investigate the parallel corpus (e.g. to find evidence for explicitation in translations), XQuery is used for posing linguistic queries. The query process itself works on each layer separately, but can also be applied across different annotation and alignment layers. It is described in more detail in section 3. This way, parallel text segments and/or parallel annotation units can be extracted and compared for translations and originals in German and English.</Paragraph>
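    <Paragraph position="8"> As an illustration only, the idea of querying across separately stored annotation and alignment layers can be sketched in Python. The layer layout, the element and attribute names, the STTS-style tags and the toy sentence pair are all assumptions invented for this sketch; they are not the CroCo format or corpus data, and Python's xml.etree module stands in for XQuery. <![CDATA[

```python
import xml.etree.ElementTree as ET

# Hypothetical target-language token layer (toy German sentence, not corpus data).
DE_TOKENS = """<tokens lang="de">
  <tok id="d1">Sie</tok><tok id="d2">sagte</tok><tok id="d3">dass</tok>
  <tok id="d4">er</tok><tok id="d5">ging</tok>
</tokens>"""

# Hypothetical stand-off part-of-speech layer: it refers to tokens by id only.
# The STTS-style tag values are an assumption made for this sketch.
DE_POS = """<pos lang="de">
  <tag ref="d1" value="PPER"/><tag ref="d2" value="VVFIN"/>
  <tag ref="d3" value="KOUS"/><tag ref="d4" value="PPER"/>
  <tag ref="d5" value="VVFIN"/>
</pos>"""

# Hypothetical stand-off word alignment layer. The src ids point into an
# English token layer ("She said he left") that is omitted here for brevity.
ALIGNMENT = """<alignment>
  <link src="e1" tgt="d1"/><link src="e2" tgt="d2"/>
  <link src="e3" tgt="d4"/><link src="e4" tgt="d5"/>
</alignment>"""


def unaligned_target_tokens(tokens_xml, pos_xml, align_xml, pos_value):
    """Return (id, text) pairs of target tokens that carry the given
    part-of-speech tag and have no link in the alignment layer."""
    text_by_id = {t.get("id"): t.text for t in ET.fromstring(tokens_xml).iter("tok")}
    pos_by_id = {p.get("ref"): p.get("value") for p in ET.fromstring(pos_xml).iter("tag")}
    aligned = {link.get("tgt") for link in ET.fromstring(align_xml).iter("link")}
    return [
        (tok_id, text)
        for tok_id, text in text_by_id.items()
        if pos_by_id.get(tok_id) == pos_value and tok_id not in aligned
    ]


# Subordinating conjunctions in the translation without a source counterpart.
candidates = unaligned_target_tokens(DE_TOKENS, DE_POS, ALIGNMENT, "KOUS")
print(candidates)
```

]]>Run as a script, the sketch prints [('d3', 'dass')], the one subordinating conjunction in the toy target sentence that has no aligned source token. Hits of this kind would only be candidates for closer inspection, not evidence of explicitation in themselves.</Paragraph>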
  </Section>
</Paper>