<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0320"> <Title>Aligning and Using an English-Inuktitut Parallel Corpus</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We present an aligned parallel corpus of Inuktitut and English from the Nunavut Hansards. The sentence-level alignment and the word correspondences follow techniques described in the literature, with augmentations suggested by the specific properties of this language pair.</Paragraph> <Paragraph position="1"> The lack of lexical resources for Inuktitut, the unrelatedness of the two languages, the fact that the languages use different scripts, and the richness of the morphology in Inuktitut have guided our choice of technique. Sentences have been aligned using the length-based dynamic programming approach of Gale and Church (1993), enhanced with a small number of lexical and non-alphabetic anchors. Word correspondences have been identified with the goal of finding an extensive, high-quality candidate glossary for English and Inuktitut words. Crucially, the algorithm considers not only full word correspondences, as most approaches do, but also multiple substring correspondences, resulting in far greater coverage.</Paragraph> </Section> </Paper>
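<!--
The length-based dynamic programming alignment mentioned in the introduction can be illustrated with a minimal sketch. It is not the authors' implementation: it leaves out the lexical and non-alphabetic anchors entirely, and the names bead_cost and align are hypothetical. The constants are those reported by Gale and Church (1993) for European language pairs, with the two-way categories split evenly. Using them here is an assumption, and they would likely need re-estimation for English and Inuktitut.

```python
import math

# Assumed parameters (Gale and Church 1993, European-language data).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the target length per source character

# Prior probability of each bead type: (source sentences, target sentences).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.00495,
    (0, 1): 0.00495,
    (2, 1): 0.0445,
    (1, 2): 0.0445,
    (2, 2): 0.011,
}


def bead_cost(src_len, tgt_len, prior):
    """Negative log probability of pairing two text spans of the given lengths."""
    mean = (src_len + tgt_len / C) / 2.0
    if mean == 0:
        p = 1.0
    else:
        delta = (tgt_len - C * src_len) / math.sqrt(S2 * mean)
        # Two-tailed probability of a deviation at least this large
        # under a standard normal distribution.
        p = math.erfc(abs(delta) / math.sqrt(2.0))
    # Clamp to avoid log(0) when the lengths are wildly mismatched.
    return -math.log(max(prior * p, 1e-300))


def align(src_lens, tgt_lens):
    """Return the cheapest sequence of beads as (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: every predecessor of (i, j) is visited before (i, j).
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace the best path back from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Called with sentence lengths in characters, for example align([20, 30, 40], [52, 41]), the sketch groups the first two source sentences with the first target sentence and pairs the remaining two one to one. An anchor-enhanced variant of the kind the introduction describes would additionally constrain or reward paths that pass through known anchor points. How the authors do this is not stated in this section.
-->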