<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1050">
  <Title>Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages</Title>
  <Section position="2" start_page="0" end_page="315" type="metho">
    <SectionTitle>
1 The Multext-East corpora
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="315" type="sub_section">
      <SectionTitle>
1.1 Encoding format
</SectionTitle>
      <Paragraph position="0"> Based on the principle that its corpus encoding format should be standardized and homogeneous, both for interchange and for facilitating open-ended retrieval tasks, Multext-East adopted the Corpus Encoding Standard (CES; see footnote 3) (Ide, 1998), which has been developed to be optimally suited for use in language engineering and corpus-based work. The CES is an application of SGML (ISO 8879, Standard Generalized Markup Language) and is based on the TEI Guidelines for Electronic Text Encoding and Interchange.</Paragraph>
      <Paragraph position="1"> In addition to providing encoding conventions for elements relevant to corpus-based work, the CES provides a data architecture for linguistic corpora and their annotations. Each corpus component, comprising a single text and its annotations, is organized as a hyper-document, with the various levels of annotation stored in separate SGML documents, each with its own Document Type Definition (DTD). Low-density (i.e., above the token level) annotation is expressed indirectly in terms of inter-document links. Markup for each type of annotation (e.g., part of speech, alignment, etc.) is described by a separate DTD specifically tailored to that information.</Paragraph>
    </Section>
    <Section position="2" start_page="315" end_page="315" type="sub_section">
      <SectionTitle>
1.2 The parallel corpus
</SectionTitle>
      <Paragraph position="0"> The Multext-East parallel corpus consists of seven versions of George Orwell's Nineteen Eighty-Four: the original English text and translations into the six project languages. There are three versions of each text in the parallel corpus, corresponding to different levels of annotation: a cesDoc encoding (SGML markup up to the sub-paragraph level, including markup for sentence boundaries); and a cesAna encoding, containing word-level morphosyntactic markup together with links to each sentence (and in some versions, to each word) in the cesDoc version. A fourth document, the cesAlign document, is associated with each of the non-English versions; it contains links between the sentences of that version's cesDoc encoding and those of the English version, thus providing a parallel alignment at the sentence level. The cesAna versions, which are the most linguistically</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="315" end_page="98426" type="metho">
    <SectionTitle>
3 The CES was developed in a joint effort of the
</SectionTitle>
    <Paragraph position="0"> European projects Multext (LRE) and EAGLES (in particular, the EAGLES Text Representation subgroup), together with the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation).</Paragraph>
    <Paragraph position="1"> informative, are marked up as shown below for the English phrase &quot;smell of bugs&quot;: &lt;tok type=WORD from='Oen.1.6.15.1\62'&gt; [...] &lt;ctag&gt;NNS&lt;/ctag&gt;&lt;/lex&gt;&lt;/tok&gt; In this example, the position of each token in the parallel corpus is given in the from attribute, whose value specifies the hierarchical position of the token within the text (here, the token &quot;smell&quot; appears in part 1, chapter 6, paragraph 15, sentence 1, byte offset 62). All possible morphosyntactic interpretations of the token are given in the &lt;lex&gt; field, consisting of the base form, a morphosyntactic description (see Section 2), and an associated corpus tag. The &lt;disamb&gt; field contains the interpretation that has been identified as valid within the respective context; within this tag, the &lt;ctag&gt; element provides the corresponding corpus tag (see Section 2; footnote 4). Footnote 4: In the Czech and Slovene versions, &lt;ctag&gt; is omitted because its contents are identical to the &lt;msd&gt; tag contents.</Paragraph>
    <Paragraph position="2"> The disambiguation of each language version in the parallel corpus was accomplished using automatic POS tagging algorithms, and the results were then partially or entirely hand-validated.</Paragraph>
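    <Paragraph> As an illustration of the hierarchical addressing described above, the following minimal Python sketch splits a from value such as Oen.1.6.15.1\62 into its components. The class name TokenPosition, the function name parse_from, the field names, and the assumption that a value always has the shape id.part.chapter.paragraph.sentence followed by a backslash and a byte offset are illustrative choices made here; they are not taken from the CES documentation.
<![CDATA[
```python
from typing import NamedTuple


class TokenPosition(NamedTuple):
    """Hypothetical container for the parts of a `from` attribute value."""
    text_id: str
    part: int
    chapter: int
    paragraph: int
    sentence: int
    offset: int


def parse_from(value: str) -> TokenPosition:
    """Split a value shaped like 'Oen.1.6.15.1\\62' into its components.

    Assumption (not confirmed by the CES documentation): the value is
    'id.part.chapter.paragraph.sentence', a backslash, then a byte offset.
    """
    location, _, offset = value.strip().partition("\\")
    text_id, *levels = location.split(".")
    if len(levels) != 4 or not offset:
        raise ValueError(f"unexpected from value: {value!r}")
    part, chapter, paragraph, sentence = (int(level) for level in levels)
    return TokenPosition(text_id, part, chapter, paragraph, sentence, int(offset))


position = parse_from("Oen.1.6.15.1\\62")
print(position)
```
]]>
    Keeping the components in a named tuple makes it straightforward to sort tokens by position or to group them by sentence.</Paragraph>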
    <Paragraph position="3"> Table 1 provides the main characteristics per language of this corpus. In this table:  The texts from the corpora were segmented using the corpus annotation toolset developed within the Multext project, augmented by language-specific resources developed by Multext-East. The Multext segmenter is a language-independent and configurable tokenizer whose output includes token, paragraph and sentence boundary markers.</Paragraph>
    <Paragraph position="4"> Punctuation, lexical items, numbers, and various alphanumeric sequences (such as dates and hours) are annotated with tags defined in a hierarchical, class-structured tagset. The language-specific behavior of the segmenter is enabled by its engine-driven design, in which all language-specific information is provided as data. Within Multext-East, resource data, including rules describing the form of sentence boundaries, word splitting (cliticized forms decomposition), word compounding, quotations, numbers, dates, punctuation, capitalization, abbreviations etc., was developed for the six project languages.</Paragraph>
    <Paragraph position="5"> Once the input text was tokenized, a dictionary look-up procedure was used to assign each lexical token all its possible morphosyntactic descriptors (MSDs). The ambiguously MSD-annotated texts were then hand-disambiguated (entirely for some languages and partially for the others). This time-consuming and error-prone process was sped up significantly by a special XEMACS mode, developed within the project, which is aware of the morphosyntactic descriptors' significance and allows for natural language expansion of the linear encoding of the MSDs. The ambiguously MSD-annotated texts and the corresponding disambiguated texts provided the basis for building the cesAna encoded version of the multilingual parallel corpus.</Paragraph>
    <Paragraph position="6"> The corpus also contains six pair-wise alignments, one between each of the six project languages and English. The alignments were produced by three different automatic aligners (Multext-aligner, &quot;vanilla-aligner&quot;, Silfide-aligner), with accuracy ranging from 75% to 90%, and then hand-validated. Table 2 shows the distribution of sentence alignments for each pair of languages.</Paragraph>
    <Section position="1" start_page="316" end_page="98426" type="sub_section">
      <SectionTitle>
1.3 Multilingual comparable corpus
</SectionTitle>
      <Paragraph position="0"> Multext-East also produced a multilingual comparable corpus, including two subsets of at least 100,000 words each for each of the six project languages. The texts include fiction, comprising a single novel or excerpts from several novels, and newspaper data.</Paragraph>
      <Paragraph position="1"> The data is comparable across the six languages, in terms of the number and size of texts. The entire multilingual comparable corpus was prepared in CES format manually or using ad hoc tools.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="98426" end_page="98426" type="metho">
    <SectionTitle>
2 Morpho-lexical resources
</SectionTitle>
    <Paragraph position="0"> Multext-East, in collaboration with EAGLES, evaluated, adapted and extended the EAGLES morphosyntactic specifications (rule format, lexical specifications, corpus tagset, etc.) to cover the six Multext-East languages (Erjavec and Monachini, 1997). Accommodating the different language families represented among the Multext-East languages demanded substantial assessment and modification of the pre-existing specifications, which were originally developed for western European languages only.</Paragraph>
    <Paragraph position="1"> For corpus morpho-lexical processing purposes, the Multext-East consortium developed language-specific wordform dictionaries, which, for all languages except Estonian and Hungarian, contain the full inflectional paradigm for at least the lemmas appearing in the corpus. Each dictionary entry has the following structure: wordform [TAB] lemma [TAB] MSD [TAB] where wordform represents an inflected form of the lemma, characterised by a combination of feature values encoded by a Morphosyntactic Description (MSD). The Multext-East lexicons and MSDs are fully described in Tufiş, Ide, and Erjavec (1998).</Paragraph>
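    <Paragraph> The entry format, and the dictionary look-up that assigns each token all of its possible MSDs, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the sample entries, and the MSD strings are hypothetical placeholders rather than entries from the Multext-East lexicons, and anything after the third field is simply ignored.
<![CDATA[
```python
from collections import defaultdict

# Hypothetical sample lines in the wordform TAB lemma TAB MSD layout.
# The MSD strings are placeholders, not real Multext-East descriptors.
SAMPLE_LEXICON = (
    "smell\tsmell\tMSD_NOUN\t\n"
    "smell\tsmell\tMSD_VERB\t\n"
    "bugs\tbug\tMSD_NOUN_PL\t\n"
)


def load_lexicon(text):
    """Map each wordform to the list of its (lemma, MSD) interpretations."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"malformed entry: {line!r}")
        wordform, lemma, msd = fields[:3]
        lexicon[wordform].append((lemma, msd))
    return dict(lexicon)


def look_up(lexicon, token):
    """Return every interpretation of a token; unknown tokens give []."""
    return lexicon.get(token, [])


lexicon = load_lexicon(SAMPLE_LEXICON)
print(look_up(lexicon, "smell"))
```
]]>
    A token with more than one interpretation in the returned list is the kind of ambiguously annotated item that then has to be disambiguated.</Paragraph>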
    <Paragraph position="2"> A general overview of the lexicons is shown in Table 3, where one column gives the number of dictionary entries, that is, triplets &lt;wordform lemma MSD&gt;. The Wordforms column gives the number of distinct wordforms appearing in the lexicon, irrespective of their lemma and MSD. The Lemma column gives the number of distinct lemmas in the lexicon, eliminating duplications that appear due to lemma homography. The difference between the Lemma and &quot;=&quot; fields provides an estimate of the number of homographic lemmas. The MSD field gives the total number of distinct MSDs used in the encoding of the lexicon stock.</Paragraph>
    <Paragraph position="3"> The last two columns in Table 3 (AMB_POS and AMB_MSD) provide information about the number of ambiguity classification clusters. An ambiguity classification cluster provides the number of ways a homographic wordform can be classified. AMB_POS (&quot;part of speech ambiguity&quot;) and AMB_MSD (&quot;MSD ambiguity&quot;) provide the classification based on the part of speech and the MSD, respectively. The number of ambiguity classes (based either on POS or on MSD) is a key figure in estimating the space needed to construct a statistical language model (such as an HMM) useful for morphosyntactic disambiguation. This number was a key factor in the tagset design.</Paragraph>
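    <Paragraph> One possible reading of the notion of ambiguity classes is sketched below in Python: each wordform is mapped to the set of labels it can carry, and the distinct sets with more than one member are collected. The sample triplets, the placeholder MSD strings, the function name, and the assumption that the first character of an MSD encodes the part of speech are illustrative choices made here; the text does not state exactly how the figures in Table 3 were computed.
<![CDATA[
```python
from collections import defaultdict

# Hypothetical (wordform, lemma, MSD) triplets; the MSD strings are placeholders.
ENTRIES = [
    ("smell", "smell", "Nx1"),
    ("smell", "smell", "Vx1"),
    ("smells", "smell", "Nx2"),
    ("smells", "smell", "Vx2"),
    ("walk", "walk", "Nx1"),
    ("walk", "walk", "Vx1"),
    ("bugs", "bug", "Nx2"),
]


def ambiguity_classes(entries, key=lambda msd: msd):
    """Return the distinct sets of labels shared by ambiguous wordforms.

    `key` maps an MSD to the label used for classification: the identity
    for MSD-based classes, or a part-of-speech extractor for POS-based ones.
    """
    labels_by_wordform = defaultdict(set)
    for wordform, _lemma, msd in entries:
        labels_by_wordform[wordform].add(key(msd))
    return {
        frozenset(labels)
        for labels in labels_by_wordform.values()
        if len(labels) > 1  # keep only genuinely ambiguous wordforms
    }


amb_msd = ambiguity_classes(ENTRIES)
# Assumption: the first character of an MSD encodes the part of speech.
amb_pos = ambiguity_classes(ENTRIES, key=lambda msd: msd[0])
print(len(amb_msd), len(amb_pos))
```
]]>
    Because several MSD-based classes can collapse onto one POS-based class, the POS count can never exceed the MSD count for the same lexicon.</Paragraph>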
    <Paragraph position="4"> For several of the project languages and for English, a set of corpus tags has also been developed which are appropriate for use with stochastic disambiguators. Where corpus tags have been developed, mapping rules from MSDs to corpus tags (n-to-1 mapping) are also provided as a resource.</Paragraph>
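    <Paragraph> The n-to-1 character of such a mapping can be sketched as a plain lookup table in Python. The MSD strings, the corpus tags, and the function name below are hypothetical placeholders; the actual mapping rules belong to the project resources and are not reproduced here.
<![CDATA[
```python
# Hypothetical n-to-1 table: several MSDs collapse onto one corpus tag.
MSD_TO_CTAG = {
    "Nx1a": "NOUN_SG",
    "Nx1b": "NOUN_SG",
    "Nx2a": "NOUN_PL",
    "Vx1a": "VERB",
    "Vx1b": "VERB",
}


def to_corpus_tag(msd):
    """Map an MSD to its corpus tag; fail loudly on unmapped MSDs."""
    try:
        return MSD_TO_CTAG[msd]
    except KeyError:
        raise KeyError(f"no corpus tag defined for MSD {msd!r}") from None


tags = [to_corpus_tag(msd) for msd in ("Nx1a", "Nx1b", "Vx1b")]
print(tags)
```
]]>
    Since the mapping is many-to-one, it cannot be inverted: a corpus tag alone does not identify the MSD it came from.</Paragraph>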
    <Paragraph position="5"> The multilingual resources (lexicons, rules, corpora) developed in Multext-East are among the most comprehensive resources currently available for most of the project languages. In addition to resource development, the work carried out in Multext-East has contributed significantly to defining general mechanisms for lexical specification, and it has provided a test of</Paragraph>
  </Section>
</Paper>