<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1051">
  <Title>AUTOMATIC ALIGNMENT IN PARALLEL CORPORA</Title>
  <Section position="5" start_page="335" end_page="335" type="evalu">
    <SectionTitle>
EVALUATION
</SectionTitle>
    <Paragraph position="0"> We are developing and testing the method on Greek-English sentence pairs from the CELEX corpus (the computerised documentation system on European Community Law).</Paragraph>
    <Paragraph position="1"> Training was performed on 40 Articles of the CELEX corpus accounting for 30000 words.</Paragraph>
    <Paragraph position="2"> We have tested this algorithm on a randomly selected corpus of the same text type of about 3200 sentences. Due to the sparseness of acs (associated only with content words) in our training data, we reconstruct (1) by using four variables. For inflective languages like Greek, morphological information associated with word forms plays a crucial role in assigning a single category. Moreover, by counting instances of acs in the training corpus, we observed that words that can, for example, be either a noun or a verb are (due to the lack of the second person singular in the corpus) exclusively nouns. Hence: $Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + \varepsilon$ (2) where $x_1$ represents verbs; $x_2$ stands for nouns, unknown words, vernou (verb or noun) and nouadj (noun or adjective); $x_3$ for adjectives and veradj (verb or adjective); and $x_4$ for adverbs and advadj (adverb or adjective). $\sigma^2$ was estimated at 3.25 on our training sample, while the regression coefficients were: $b_0 = 0.2848$, $b_1 = 1.1075$, $b_2 = 0.9474$, $b_3 = 0.8584$, $b_4 = 0.7579$. An accuracy approaching a 100% success rate was recorded. Results are shown in Table 1. It is remarkable that no lexical constraints or anchor points are needed to improve the performance. Additionally, the same model and parameters can be used to cope with the infra-sentence alignment.</Paragraph>
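    As a rough illustration of model (2), the following Python sketch evaluates the fitted linear predictor with the coefficients reported above. It assumes that x1..x4 are per-sentence counts of the four category groups and that Y is the quantity the model predicts from them; neither is spelled out in this section. The names predict and standardized_residual are hypothetical helpers, not part of the original system, and the residual scaling by the estimated variance is shown only as one plausible use of that estimate.

```python
import math

# Values reported in the text for model (2).
B0 = 0.2848                                # intercept b0
COEFFS = (1.1075, 0.9474, 0.8584, 0.7579)  # b1..b4
SIGMA2 = 3.25                              # estimated error variance


def predict(counts):
    """Return b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4.

    counts is assumed (not stated in the source) to hold the per-sentence
    counts of the four category groups: verbs; nouns and related tags;
    adjectives and veradj; adverbs and advadj.
    """
    if len(counts) != len(COEFFS):
        raise ValueError("expected four counts: x1, x2, x3, x4")
    return B0 + sum(b * x for b, x in zip(COEFFS, counts))


def standardized_residual(observed, counts):
    """Hypothetical helper: (observed - predicted) / sigma."""
    return (observed - predict(counts)) / math.sqrt(SIGMA2)


if __name__ == "__main__":
    example = (2, 5, 3, 1)  # made-up counts, for illustration only
    print(round(predict(example), 4))
    print(round(standardized_residual(12, example), 4))
```

    With all four counts at zero the predictor reduces to the intercept, which is a quick sanity check on the coefficient ordering.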
    <Paragraph position="3"> In order to align all the CELEX texts, we intend to prepare the material (text handling, POS tagging for different language pairs and different tag sets, etc.) so that we will be able to evaluate the method on a more reliable basis. We also hope to test the method's efficiency at the phrase level, given the necessary bilingual information about phrase delimiters. It will be shown there that reusing previous information facilitates tuning and the resolution of inconsistencies between various delimiters.</Paragraph>
  </Section>
</Paper>