<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0308">
  <Title>TREQ-AL: A word alignment system with limited language resources</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 The preliminary data processing
</SectionTitle>
    <Paragraph position="0"> The TREQ system requires sentence-aligned parallel text, tokenized, tagged and lemmatized. The first problem we had with the training and test data concerned tokenization. The training data contained several occurrences of glued words (probably due to a problem in the text export of the initial data files) plus an unprintable character (hexadecimal code A0) that generated several tagging errors because of the guesser's imperfect performance (about 70% accurate).</Paragraph>
    <Paragraph position="1"> To remedy these inconveniences we wrote a script that automatically split the glued words and eliminated the unprintable characters occurring in the training data. The set of splitting rules, learnt from the training data, was posted on the site of the shared task. The rule set is likely to be incomplete (some glued words might have survived in the training data) and might also produce wrong splittings in some cases (e.g. turnover always being split into turn over).</Paragraph>
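A minimal sketch of such a cleaning script in Python; the rule entries and function names are ours, illustrating the approach rather than reproducing the posted rule set:

```python
# Hypothetical excerpt of the splitting rules learnt from the training data;
# the posted rule set was larger, and these entries are illustrative only.
SPLIT_RULES = {
    "turnover": "turn over",   # over-splits the legitimate noun "turnover"
    "ofthe": "of the",
}

def clean_token(token: str) -> str:
    # Drop the unprintable A0 character (a non-breaking space) and apply a
    # splitting rule when one matches the whole token.
    token = token.replace("\xa0", "")
    return SPLIT_RULES.get(token.lower(), token)

def clean_line(line: str) -> str:
    return " ".join(clean_token(t) for t in line.split())
```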
    <Paragraph position="2"> The text tokenization, as considered by the evaluation protocol, was the simplest possible one, with white spaces and punctuation marks taken as separators.</Paragraph>
    <Paragraph position="3"> The hyphen ('-') was always considered a separator and consequently always taken to be a token by itself.</Paragraph>
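This protocol tokenization can be sketched with a single regular expression (the function name is ours):

```python
import re

def protocol_tokenize(text: str) -> list[str]:
    # Whitespace separates tokens and every punctuation mark, including
    # the hyphen, becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)
```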
    <Paragraph position="4"> However, in Romanian, the hyphen is more frequently used as an elision marker (as in &amp;quot;intr-o&amp;quot;= &amp;quot;intru o&amp;quot;/in a), a clitics separator (as in &amp;quot;da-mi-l&amp;quot;=&amp;quot;da -mi -l&amp;quot;=&amp;quot;da mie el&amp;quot;/give to me it/him) or as a compound marker (as in &amp;quot;terchea-berchea&amp;quot; /(approx.) loafer) than as a separator. In such cases the hyphen cannot be considered a token.</Paragraph>
    <Paragraph position="5"> A similar problem appeared in English with respect to the special quote character, which was dealt with in three different ways: it was sometimes split off as a distinct token (we'll = we + ' + ll), sometimes adjoined to the string (a contracted positive form or a genitive marker) immediately following it (I'm = I + 'm, you've = you + 've, man's = man + 's, etc.), and systematically left untouched in the negative contracted forms (couldn't, wasn't, etc.).</Paragraph>
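One consistent apostrophe policy (clitics split off and kept attached to the apostrophe, negative contractions kept whole) can be sketched as below; note that the shared-task data itself mixed several conventions, and the function name is ours:

```python
import re

def tokenize_apostrophe(word: str) -> list[str]:
    # Keep negative contractions whole (couldn't, wasn't, don't).
    if re.fullmatch(r"[A-Za-z]+n't", word):
        return [word]
    # Split off clitic endings, keeping them attached to the apostrophe
    # (I'm -> I + 'm, man's -> man + 's).
    m = re.fullmatch(r"([A-Za-z]+)('(?:m|ve|s|ll|re|d))", word)
    if m:
        return [m.group(1), m.group(2)]
    return [word]
```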
    <Paragraph position="6"> Since our processing tools (especially the tokeniser) were built with a different segmentation strategy in mind, we generated the alignments based on our own tokenization and, at the end, &amp;quot;re-tokenised&amp;quot; the text according to the test data model, consequently re-indexing all the linking pairs.</Paragraph>
    <Paragraph position="7"> For tagging the Romanian side of the training bitext we used the tiered-tagging approach (Tufis, 1999), but we had to construct a new language model, since our standard model was created from texts containing diacritics. As the Romanian training data did not contain diacritical characters, this was by no means a trivial task in the short period of time at our disposal (it actually took most of the training time). The lack of diacritics in the training and test data induced spurious ambiguities that degraded the tagging accuracy by at least 1%; that is, we estimate that on normal Romanian text (containing the diacritical characters) the performance of our system would have been better. The English training data was tagged by Eric Gaussier, warmly acknowledged here. As the tagsets used for the two languages in the parallel training corpus were quite different, we defined a tagset mapping and translated the tagging of the English part into a tagging closer to the Romanian one. This mapping introduced some ambiguities that were resolved by hand. Based on the training data (both the Romanian and the English texts), tagged with similar tagsets, we built the language models used for the test data alignment.</Paragraph>
    <Paragraph position="8"> POS-preserving translation equivalence is too restrictive a condition for the present task, so we defined a meta-tagset, common to both languages, that accounts for frequent POS alternations. For instance, the verb, noun and adjective tags in both languages were prefixed with a common symbol, given that verb-adjective, noun-verb, noun-adjective and the other combinations are typical of Romanian-English translation equivalents that do not preserve POS.</Paragraph>
    <Paragraph position="9"> With these prefixes, the initial algorithm for extracting POS-preserving translation equivalents could be used without any further modification. Using the tag prefixes seems to be a good idea not only for legitimate POS-alternating translations, but also for overcoming some typical tagging errors, such as confusions between participles and adjectives. In both languages, this is by far the most frequent tagging error made by our tagger.</Paragraph>
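The prefixing idea can be sketched as follows; the tag initials and the prefix symbol are assumptions for illustration, not the actual tagsets used:

```python
# Assumed tag initials: V = verb, N = noun, A = adjective. Prefixing the
# open-class tags with a shared symbol lets noun<->verb, verb<->adjective
# etc. pairs still match during tag-constrained equivalence extraction.
OPEN_CLASS_INITIALS = {"V", "N", "A"}

def meta_tag(tag: str) -> str:
    return "#" + tag if tag and tag[0] in OPEN_CLASS_INITIALS else tag
```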
    <Paragraph position="10"> The last preprocessing phase is encoding the corpus in an XCES-Align-ana format, as used in the MULTEXT-EAST corpus (see http://nl.ijs.si/ME/V2/), which is the standard input for the TREQ translation-equivalents extraction program. Since TREQ is described extensively elsewhere, we will not go into further detail, except to say that the resulting translation dictionary extracted from the training data contains 49,283 entries (lemma forms). The filtering of the translation-equivalent candidates (Tufis and Barbu, 2002) was based on the log-likelihood and cognate scores, with threshold values set to 15 and 0.43 respectively. We roughly estimated the accuracy of this dictionary against the aligned gold standard: precision is about 85% and recall about 78% (note that the dictionary is evaluated in terms of lemma entries, and that non-matching meta-category links are excluded).</Paragraph>
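A sketch of the threshold filter; whether the two score tests combine conjunctively or disjunctively is not stated above, so the conjunctive reading below is an assumption, and all names are ours:

```python
def filter_candidates(candidates, ll_threshold=15.0, cog_threshold=0.43):
    # candidates: (src_lemma, tgt_lemma, log_likelihood, cognate_score)
    # tuples. This sketch assumes a pair must pass both thresholds
    # (15 for log-likelihood, 0.43 for the cognate score) to be kept.
    return [(s, t) for s, t, ll, cog in candidates
            if ll >= ll_threshold and cog >= cog_threshold]
```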
  </Section>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 The TREQ-AL linking program
</SectionTitle>
    <Paragraph position="0"> This program takes as input the dictionary created by TREQ and the parallel text to be word-aligned. The alignment procedure is greedy and considers each translation unit independently of the other translation units in the parallel corpus. It has four steps:
1. left-to-right pre-alignment
2. right-to-left adjustment of the pre-alignment
3. determining alignment zones and filtering suspicious links out
4. word alignment inside the alignment zones</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 The left-to-right pre-alignment
</SectionTitle>
      <Paragraph position="0"> For each sentence-alignment unit, this step scans the words from first to last in the source-language part (Romanian). The current word is initially linked to all the words in the target-language part (English) of the sentence-alignment unit that the translation dictionary lists as its potential translations. If no translations are identified in the target part for the source word, the control advances to the next source word. The cognate score and the relative distance are the decision criteria for choosing among the possible links. When consecutive words in the source part are associated with consecutive or nearly consecutive words in the target part, they are taken to form an &amp;quot;alignment chain&amp;quot;, and, out of the possible links, those corresponding to the densest grouping of words in each language are selected. High cognate scores within an alignment chain reinforce the alignment.</Paragraph>
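A much-simplified sketch of this greedy pass, reduced to 1-1 links; the distance-based tie-break stands in for the chain/density criterion, and all names are ours:

```python
def prealign(src_tokens, tgt_tokens, dictionary):
    # `dictionary` maps a source word to a set of known translations.
    # Among the candidate targets we keep the one closest to the position
    # following the last link -- a crude stand-in for the chain/density
    # criterion; the real step may also leave 1-to-many links in place.
    links, last_tgt = [], -1
    for i, src in enumerate(src_tokens):
        candidates = [j for j, tgt in enumerate(tgt_tokens)
                      if tgt in dictionary.get(src, ())]
        if candidates:
            j = min(candidates, key=lambda c: abs(c - (last_tgt + 1)))
            links.append((i, j))
            last_tgt = j
    return links
```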
      <Paragraph position="1"> One should note that at the end of this step it is possible to have 1-to-many association links if multiple translations of one or more source words are found in the target part of the current translation unit (and, obviously, they satisfy the selection criteria).</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 The right-to-left adjustment of the pre-alignment
</SectionTitle>
      <Paragraph position="0"> This step tries to correct pre-alignment errors (when possible) and makes a 1-1 choice for the 1-m links generated before. The alignment chains found in the previous step are given the highest priority in alignment disambiguation: if a word in the source language has several alignment possibilities, the one that belongs to an alignment chain is always selected. Then, if one of the competing alignments has a cognate score higher than the others, it is the preferred one (this heuristic is particularly useful when several proper names occur in the same translation unit). Finally, the relative position of the words in the competing links is taken into account so as to minimize the distance to the surrounding already-aligned words.</Paragraph>
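The priority order can be sketched as below (names are ours; the final positional tie-break is left to the caller):

```python
def choose_link(candidates, chains, cognate):
    # Disambiguate 1-m links for one source word: prefer a candidate that
    # belongs to an alignment chain, otherwise the one with the highest
    # cognate score. `candidates` is a list of (src, tgt) index pairs,
    # `chains` a set of such pairs, `cognate` a scoring function.
    in_chain = [c for c in candidates if c in chains]
    if in_chain:
        return in_chain[0]
    return max(candidates, key=cognate)
```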
      <Paragraph position="1"> The first two phases result in a 1-1 word mapping.</Paragraph>
      <Paragraph position="2"> The next two steps use general linguistic knowledge to try to align the words that remained unaligned after the previous steps (either for lack of translation equivalents or because the alignment criteria were not met).</Paragraph>
      <Paragraph position="3"> This could result in n-m word alignments, but also in unlinking two previously linked words since a wrong translation pair existing in the extracted dictionary might license a wrong link.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.3 Alignment zones and filtering suspicious links out
</SectionTitle>
      <Paragraph position="0"> An alignment zone (in our approach) is a piece of text that begins with a conjunction, a preposition, or a punctuation mark and ends with the token preceding the next conjunction, preposition, punctuation mark, or the end of the sentence. A source-language alignment zone is mapped to one or more target-language alignment zones via the links assigned in the previous steps (based on the translation equivalents). Note that the mapping of alignment zones is not symmetric. An alignment zone that contains no link is called a virgin zone.</Paragraph>
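Zone segmentation can be sketched as follows, assuming tags whose initial is C (conjunction) or S (preposition) and a PUNCT tag for punctuation marks; these tag conventions are ours:

```python
def split_zones(tags):
    # Segment a tagged sentence into alignment zones: each zone starts at
    # a conjunction, preposition, or punctuation mark and runs up to the
    # token before the next such trigger. Zones are inclusive
    # (start, end) index pairs.
    zones, start = [], 0
    for i in range(1, len(tags)):
        if tags[i] == "PUNCT" or tags[i][0] in ("C", "S"):
            zones.append((start, i - 1))
            start = i
    if tags:
        zones.append((start, len(tags) - 1))
    return zones
```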
      <Paragraph position="1"> In most cases the words in the source alignment zone (the starting zone) are linked to words in the target alignment zone(s) (the ending zone(s)). Links with either end outside the alignment zones are suspicious and are deleted. This filtering proved to be almost 100% correct when the outlier resides in a zone non-adjacent to the starting or ending zones. Its failures were in most cases due to wrong use of punctuation in one part or the other of the translation unit (such as an omitted comma, or a comma between subject and predicate).</Paragraph>
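A rough sketch of the outlier filter; the voting scheme below is a simplification of the zone-adjacency test described above, not the exact TREQ-AL rule, and all names are ours:

```python
from collections import defaultdict

def zone_of(idx, zones):
    # zones are inclusive (start, end) token-index pairs
    for z, (a, b) in enumerate(zones):
        if a <= idx <= b:
            return z
    return -1

def filter_suspicious(links, src_zones, tgt_zones):
    # Count how often each (source zone, target zone) pairing is used by
    # the links; a link whose zone pairing is supported by no other link,
    # while its source zone also maps elsewhere, is treated as an outlier
    # and dropped.
    votes = defaultdict(lambda: defaultdict(int))
    for s, t in links:
        votes[zone_of(s, src_zones)][zone_of(t, tgt_zones)] += 1
    kept = []
    for s, t in links:
        sz, tz = zone_of(s, src_zones), zone_of(t, tgt_zones)
        if votes[sz][tz] > 1 or len(votes[sz]) == 1:
            kept.append((s, t))
    return kept
```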
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.4 The word-alignment inside the alignment zones
</SectionTitle>
      <Paragraph position="0"> For each unlinked word in the starting zone, the algorithm looks for a word of the same category (not meta-category) in the ending zone(s). If such a mapping is not possible, the algorithm tries to link the source word to a target word of the same meta-category, resulting in a cross-POS alignment. The possible meta-category mappings are specified by the user in an external mapping file. Any word in the source or target language that is not assigned a link after the four processing steps described above is automatically assigned a null link.</Paragraph>
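A two-pass sketch of this matching (same tag first, then same meta-category as given by a user-supplied mapping); the data structures and names are ours:

```python
def link_by_category(src, tgt, meta_map):
    # src/tgt: (word, tag) lists of the still-unlinked words of a zone
    # pair; meta_map maps a tag to its meta-category. First pass links
    # same-tag words; the second pass allows cross-POS links through the
    # meta-category mapping.
    links, used = [], set()
    for same_tag in (True, False):
        for i, (_, stag) in enumerate(src):
            if any(l[0] == i for l in links):
                continue  # already linked in the first pass
            for j, (_, ttag) in enumerate(tgt):
                if j in used:
                    continue
                if same_tag:
                    ok = stag == ttag
                else:
                    ok = stag in meta_map and meta_map[stag] == meta_map.get(ttag)
                if ok:
                    links.append((i, j))
                    used.add(j)
                    break
    return links
```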
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Post-processing
</SectionTitle>
    <Paragraph position="0"> As mentioned in Section 2, our tokenization was different from the tokenization in the training and test data. To comply with the evaluation protocol, we had to re-tokenize the aligned text and re-compute the indexes of the links. Re-tokenizing the text meant splitting compounds and contracted future forms and gluing back together the previously split negative contracted forms (do+n't=don't). Although the re-tokenization was a post-processing phase, transparent to the task itself, it caused some links for the negative contracted forms to be missed. In our linking, the English &amp;quot;n't&amp;quot; was always linked to the Romanian negation, and the English auxiliary/modal plus the main verb were linked to the Romanian translation equivalent found for the main verb. Some multi-word expressions recognized by our tokenizer as single tokens, such as dates (25 Ianuarie, 2001), compound prepositions (de la, pina la), conjunctions (pentru ca, de cind, pina cind) or adverbs (de jur imprejur, in fata), as well as hyphen-separated nominal compounds (mass-media, prim-ministru), were split and their positions re-indexed; the initial single link of a split compound was replaced with the set obtained by adding one link for each constituent of the compound to the target English word.</Paragraph>
    <Paragraph position="1"> If the English word was also a compound, the number of links generated for one aligned multi-word expression was N*M, where N is the number of words in the source compound and M the number of words in the target compound.</Paragraph>
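The link expansion for aligned compounds is a simple cross-product over token positions (function name ours):

```python
def compound_links(src_positions, tgt_positions):
    # Replace the single compound-to-compound link with one link per
    # constituent pair: N source tokens x M target tokens = N*M links.
    return [(s, t) for s in src_positions for t in tgt_positions]
```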
  </Section>
</Paper>