<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0819">
  <Title>Aligning words in English-Hindi parallel corpora</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper describes a word alignment system developed as part of the shared task on word alignment for languages with scarce resources at the ACL 2005 workshop on &quot;Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond&quot;. Participants in the shared task were provided with common sets of training data consisting of English-Inuktitut, Romanian-English, and English-Hindi parallel texts, and participating teams could choose to evaluate their systems on one, two, or all three language pairs.</Paragraph>
    <Paragraph position="1"> Our system aligns English-Hindi parallel data at the word level. The word alignment algorithm described here is based on a hybrid multi-feature approach: it groups Hindi words locally within a Hindi sentence and uses dictionary lookup (DL) as the main method of aligning words, supplemented by other methods such as Transliteration Similarity (TS), Expected English Words (EEW), and Nearest Aligned Neighbors (NAN). We used the supplied training data to derive rules for local word grouping in Hindi sentences and to find Named Entities (NE) and cognates using our TS approach. In the following sections we briefly describe our approach.</Paragraph>
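    <Paragraph position="2"> To make the idea of dictionary lookup with a similarity fallback concrete, the following is a minimal sketch and not the system's actual implementation. It assumes a hypothetical toy dictionary (TOY_DICTIONARY), romanized Hindi tokens, and a generic string-similarity ratio from Python's difflib standing in for a real transliteration score; the function names and the 0.5 threshold are illustrative assumptions. Local word grouping, EEW, and NAN are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical toy English-to-Hindi dictionary (romanized Hindi); not from the paper.
TOY_DICTIONARY = {
    "went": {"gaya"},
    "to": {"ko"},
}


def similarity(a, b):
    """Surface similarity in [0, 1]; an assumed stand-in for a transliteration score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(english, hindi, threshold=0.5):
    """Return sorted (english_index, hindi_index) links.

    Pass 1 links words through dictionary lookup. Pass 2 links the
    remaining words whose romanized forms look alike (names, cognates).
    """
    links = []
    used_hindi = set()

    # Pass 1: dictionary lookup (DL) is tried first for every English word.
    for i, e_word in enumerate(english):
        candidates = TOY_DICTIONARY.get(e_word.lower(), set())
        for j, h_word in enumerate(hindi):
            if j not in used_hindi and h_word in candidates:
                links.append((i, j))
                used_hindi.add(j)
                break

    aligned_english = {i for i, _ in links}

    # Pass 2: similarity fallback for English words that are still unaligned.
    for i, e_word in enumerate(english):
        if i in aligned_english:
            continue
        best_j, best_score = None, threshold
        for j, h_word in enumerate(hindi):
            if j in used_hindi:
                continue
            score = similarity(e_word, h_word)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            links.append((i, best_j))
            used_hindi.add(best_j)

    return sorted(links)


if __name__ == "__main__":
    print(align(["Ram", "went", "to", "Delhi"], ["raam", "dilli", "gaya"]))
```

    In this sketch, dictionary links are made first and take priority, so the similarity pass only considers words that the dictionary left unaligned. Words with no dictionary entry and no sufficiently similar counterpart remain unaligned.</Paragraph>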
  </Section>
</Paper>