<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1045">
  <Title>Encoding a Parallel Corpus for Automatic Terminology Extraction</Title>
  <Section position="3" start_page="0" end_page="275" type="metho">
    <SectionTitle>
2 The CATEx Project
</SectionTitle>
    <Paragraph position="0"> Due to the equal status of the Italian and German languages in South Tyrol, legal and administrative documents have to be written in both languages. A prerequisite for high-quality translations is a consistent and comprehensive bilingual terminology, which also forms the basis for an independent German legal language which reflects the Italian legislation. The first systematic effort in this direction was initiated a few years ago at the European Academy Bolzano/Bozen with the goal of compiling an Italian/German legal and administrative terminology for South Tyrol.</Paragraph>
    <Paragraph position="1"> The CATEx (Computer Assisted Terminology Extraction) project emerged from the need to support and improve, both qualitatively and quantitatively, the manual acquisition of terminological data. Thus, the main objective of CATEx is the development of a computational framework for (semi-)automatic terminology acquisition, which consists of four modules: a parallel text corpus, term-extraction programs, a term bank linked to the text corpus, and a user interface for browsing the corpus and the term bank.</Paragraph>
    <SectionTitle>
3 Building a Parallel Text Corpus
</SectionTitle>
    <Paragraph position="2"> Building the text corpus comprises the following tasks: corpus design, preprocessing, encoding primary data, and encoding linguistic information.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Corpus Design and Preprocessing
</SectionTitle>
      <Paragraph position="0"> Corpus design selects the collection of texts to be included in the corpus. An important criterion is that the texts represent a realistic model of the language to be studied (Bowker, 1996). In its current form, our corpus contains only one type of text, namely the bilingual version of Italian laws such as the Civil Code. A particular feature of our corpus, which contains both German and Italian translations, is the structural equivalence of the original text and its translation down to the sentence level, i.e. each sentence in the original text has a corresponding one in the translation.</Paragraph>
      <Paragraph position="1"> The corpus is one of the largest special language corpora. It contains approximately 5 million words, with 35,898 different Italian word forms and 66,934 different German word forms.</Paragraph>
      <Paragraph position="2"> In the preprocessing phase we correct errors (mainly OCR errors) in the raw text material and produce a unified electronic version in such a way as to simplify the programs for subsequent annotation.</Paragraph>
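      <Paragraph position="3"> The structural equivalence described above suggests that sentence alignment can be reduced to pairing sentences by their position within each structural unit. The following Python sketch illustrates this idea under stated assumptions: the function name align_by_position, the input format (one list of sentences per article) and the sample sentences are hypothetical and are not taken from the CATEx implementation.

```python
def align_by_position(italian_articles, german_articles):
    """Pair sentences by position, assuming both texts share one structure.

    Each argument is a list of articles; each article is a list of sentences.
    This input format is an assumption made for illustration only.
    """
    if len(italian_articles) != len(german_articles):
        raise ValueError("texts differ in number of articles")
    pairs = []
    numbered = enumerate(zip(italian_articles, german_articles), start=1)
    for number, (it_sentences, de_sentences) in numbered:
        # Structural equivalence means every sentence has exactly one partner.
        if len(it_sentences) != len(de_sentences):
            raise ValueError(f"article {number}: sentence counts differ")
        pairs.extend(zip(it_sentences, de_sentences))
    return pairs


# Hypothetical placeholder sentences, not actual corpus text.
italian = [["Frase uno.", "Frase due."], ["Frase tre."]]
german = [["Satz eins.", "Satz zwei."], ["Satz drei."]]
for it_sentence, de_sentence in align_by_position(italian, german):
    print(it_sentence, "|", de_sentence)
```

Raising an error on a count mismatch, rather than silently truncating, makes remaining preprocessing errors visible before word alignment is attempted.</Paragraph>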
    </Section>
    <Section position="2" start_page="0" end_page="275" type="sub_section">
      <SectionTitle>
3.2 Encoding Primary Data and Linguistic Annotation
</SectionTitle>
      <Paragraph position="0"> Corpus encoding successively enriches the raw text material with explicitly encoded information. We apply the Corpus Encoding Standard (CES), which is an application of SGML and provides guidelines for encoding corpora that are used in language engineering applications (Ide et al., 1996). CES distinguishes between primary data (raw text material) and linguistic annotation (information resulting from linguistic analyses of the raw texts). Primary data encoding covers the markup of relevant objects in the raw text material. It comprises documentation information (bibliographic information, etc.) and structural information (sections, lists, footnotes, references, etc.). These pieces of information are required to automatically extract the source of terms, e.g. &quot;Codice Civile, art. 12&quot;. Structural information also helps in browsing the corpus; this is important in our case, since the corpus will be linked to the terminological database.</Paragraph>
      <Paragraph position="1"> Encoding linguistic annotation enriches the primary data with information which results from linguistic analyses of these data. We consider the segmentation of texts into sentences and words, the assignment/disambiguation of lemmas and part-of-speech (POS) tags, and word alignment.</Paragraph>
      <Paragraph position="2"> Due to the structural equivalence of our parallel texts, we can easily build a perfectly sentence-aligned corpus, which is useful for word alignment. The above-mentioned linguistic information is required for term extraction, which is mainly inspired by the work of Dagan and Church (1997).</Paragraph>
      <Paragraph position="3"> The monolingual recognition of terms is based on POS patterns which characterize valid terms, and the recognition of translation equivalents is based on bilingual word alignment. Lemmas abstract from singular/plural variations, which is useful for both alignment and term recognition.</Paragraph>
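      <Paragraph position="4"> Monolingual term recognition by POS patterns can be sketched as a regular expression applied to the sequence of POS tags, with candidates reported as lemma sequences so that singular and plural variants collapse into one entry. The Python sketch below is a minimal illustration only: the single-letter tagset, the pattern (zero or more adjectives followed by one or more nouns), the function name term_candidates and the sample sentence are all assumptions, not the patterns or tagset used in CATEx.

```python
import re

# Hypothetical single-letter tagset: A = adjective, N = noun, O = other.
# The pattern is an illustrative assumption, not the CATEx pattern set.
TERM_PATTERN = re.compile(r"A*N+")


def term_candidates(tagged_tokens):
    """Return lemma sequences whose POS tag sequence matches TERM_PATTERN.

    tagged_tokens is a list of (word_form, lemma, pos) triples in which
    every pos is exactly one character, so that character offsets in the
    joined tag string equal token indices.
    """
    tag_string = "".join(pos for _, _, pos in tagged_tokens)
    if len(tag_string) != len(tagged_tokens):
        raise ValueError("every POS tag must be a single character")
    candidates = []
    for match in TERM_PATTERN.finditer(tag_string):
        span = tagged_tokens[match.start():match.end()]
        # Use lemmas rather than word forms to abstract from number variation.
        candidates.append(" ".join(lemma for _, lemma, _ in span))
    return candidates


# Hypothetical tagged sentence: "Die gesetzlichen Vertreter schließen Verträge"
sentence = [
    ("Die", "die", "O"),
    ("gesetzlichen", "gesetzlich", "A"),
    ("Vertreter", "Vertreter", "N"),
    ("schließen", "schließen", "O"),
    ("Verträge", "Vertrag", "N"),
]
print(term_candidates(sentence))
```

Encoding each tag as one character keeps the mapping from regular-expression match offsets back to tokens trivial; a richer tagset would need an explicit mapping to such symbols first.</Paragraph>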
    </Section>
  </Section>
</Paper>