File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-0320_metho.xml
Size: 18,850 bytes
Last Modified: 2025-10-06 14:08:19
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0320"> <Title>Aligning and Using an English-Inuktitut Parallel Corpus</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 An English-Inuktitut Corpus </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The Parallel Texts </SectionTitle> <Paragraph position="0"> The corpus of parallel texts we present consists of 3,432,212 words of English and 1,586,423 words of Inuktitut from the Nunavut Hansards. These Hansards are available to the public in electronic form in both English and Inuktitut (www.assembly.nu.ca). The Legislative Assembly of the newly created territory of Nunavut began sitting on April 1, 1999. Our corpus represents 155 days of transcribed proceedings of the Nunavut Legislative Assembly from that first session through to November 1, 2002, which was part way through the sixth session of the assembly.</Paragraph> <Paragraph position="1"> We gather and process these 155 documents in various ways described in the rest of this paper and make available a sentence-aligned version of the parallel texts (www.InuktitutComputing.ca/NunavutHansards). Like the French-English Canadian Hansards of parliamentary proceedings, this corpus represents a valuable resource for Machine Translation research and corpus research as well as for the development of language processing tools for Inuktitut. The work reported here takes some first steps toward these ends, and it is hoped that others will find ways to expand on this work. One reason that the Canadian Hansards, a large parallel corpus of English-French, are particularly useful for research is that they are comparatively noise free as parallel text collections go (Simard and Plamondon, 1996). This should be true of the Nunavut Hansard collection as well. 
The Canadian Hansard is transcribed in both languages, so what was said in English is transcribed in English and then translated into French, and vice versa. For the Nunavut Hansard, in contrast, a complete English version of the proceedings is prepared and then translated into Inuktitut, even when the original proceedings were spoken in Inuktitut.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The Inuktitut Language </SectionTitle> <Paragraph position="0"> Inuktitut is the language of the Inuit living in northeastern Canada, that is, Nunavut (Keewatin and Baffin Island), Nunavik and Labrador. It includes six closely related spoken dialects: Kivalliq, Aivilik, North Baffin, South Baffin, Arctic Quebec (Nunavik), and Labrador.</Paragraph> <Paragraph position="1"> Inuktitut is a highly agglutinative language. Noun and verb roots occur with two main types of suffixes and there are many instantiations of these suffixes. The semantic suffixes modify the meaning of the root (over 250 of these in North Baffin dialect) and the grammatical suffixes indicate features like agreement and mood (approximately 700 verbal endings and over 300 nominal endings in North Baffin dialect).</Paragraph> <Paragraph position="2"> A single word in Inuktitut is often translated with multiple English words, sometimes corresponding to a full English clause. For example, the Inuktitut word transliterated as qaisaaliniaqquunngikkaluaqpuq corresponds to these eight English words: 'Actually he will probably not come early today'. 
The verbal root is qai 'come'; the semantic suffixes are -saali-, -niaq-, -qquu-, -nngit- and -galuaq-, meaning 'soon', 'a little later today or tomorrow', 'probability', 'negation', and 'actuality' respectively; and finally the grammatical suffix -puq expresses the 3rd person singular of the indicative mood. This frequently occurring one-to-many correspondence represents a challenge for word correspondence. The opposite challenging situation, namely instances of many-to-one correspondences, also arises for this language pair but less frequently. The latter is therefore not addressed in this paper.</Paragraph> <Paragraph position="3"> Yet another challenge is the morphophonological complexity of Inuktitut as reflected in the orthography, which has two components. First, the sheer number of possible suffixes mentioned above is problematic. Second, the shape of these suffixes is variable. That is, there are significant orthographic changes to the individual morphemes when they occur in the context of a word. 
This type of variability can be seen in the above example at the interface of -nngit- and -galuaq-, which together become -nngikkaluaq-.</Paragraph> <Paragraph position="4"> Finally, it is important to note that Inuktitut has a syllabic script for which there is a standard Romanization.</Paragraph> <Paragraph position="5"> To give an idea of how the scripts compare, our corpus of parallel texts consists of 20,124,587 characters of English and 13,457,581 characters in Inuktitut syllabics as compared to 21,305,295 characters of Inuktitut in Roman script.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Sentence Alignment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Sentence Alignment Approach </SectionTitle> <Paragraph position="0"> The algorithm used to align English-Inuktitut sentences is an extension of that presented in Gale and Church (1993).</Paragraph> <Paragraph position="1"> It does not identify crossing alignments where the sentence order within paragraphs in the parallel texts differs.</Paragraph> <Paragraph position="2"> Sentence alignments typically involve one English sentence matching one Inuktitut sentence (a 1-to-1 bead), but may also involve 2-to-1, 1-to-2, 0-to-1, 1-to-0 and 2-to-2 sentence matching patterns, or beads. Using such a length-based approach where the length of sentences is measured in characters is appropriate for our language pair since the basic assumption generally holds. Namely, longer English sentences typically correspond to longer Inuktitut sentences as measured in characters.</Paragraph> <Paragraph position="3"> One problem with the approach, as pointed out by Macklovitch and Hannan (1998), is that from the point where a paragraph is misaligned, it is difficult to ensure proper alignment for the remainder of the paragraph. We observed this effect in our alignment. 
We also observed that the large number of small paragraphs with almost identical length caused problems for the algorithm.</Paragraph> <Paragraph position="4"> Many alignment approaches have addressed such problems by making use of additional linguistic clues specific to the languages to be aligned. For our language pair, it was not feasible to use most of these. For example, some alignment techniques make good use of cognates (Simard and Plamondon, 1996). The assumption is that words in the two languages that share the first few letters are usually translations of each other. English and Inuktitut, however, are too distantly related to have many cognates. Even the translation of a proper name does not usually result in a cognate for our language pair, since the translation between scripts induces a phonetic translation rather than a character-preserving translation of the name, as these pairs illustrate: Peter, Piita; Canada, Kanata; McLean, Makalain.</Paragraph> <Paragraph position="5"> Following a suggestion in Gale and Church (1993), the alignment was aided by the use of additional anchors that were available for the language pair. These anchors consisted of non-alphabetic sequences (such as 9:00, 42-1(1) and 1999) and 8 reliable word correspondences that occurred frequently in the corpus, including words beginning with character sequences such as speaker/uqaqti and motion/pigiqati.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Steps in Sentence Alignment </SectionTitle> <Paragraph position="0"> Preprocessing: Preprocessing the Inuktitut and the English raised separate issues. For English, the main issue was ensuring that illegal or unusual characters were mapped to other characters to simplify later processing.</Paragraph> <Paragraph position="1"> For Inuktitut the main issue was the array of encodings used for the syllabic script. 
Inuktitut syllabics can be represented using a 7-bit encoding called ProSyl, which is in many cases extended to an 8-bit encoding, Tunngavik.</Paragraph> <Paragraph position="2"> Each syllabic character can be encoded in multiple ways that need to be mapped into a uniform scheme, such as Unicode. Each separate file was converted to HTML using a commercial product, LogicTran r2net. Then, the Perl package HTML::TreeBuilder was used to purge the text of anomalies and set up the correct mappings. The output of this initial preprocessing step was a collection of HTML files in pure Unicode UTF-8.</Paragraph> <Paragraph position="3"> Boundary Identification: The next step was to identify the paragraph and sentence boundaries for the Inuktitut and English texts. Sentences were split at periods, question marks, colons and semi-colons except where the following character was a lower case letter or a number.</Paragraph> <Paragraph position="4"> This rule produced some errors but was quite accurate in general. Paragraph boundaries were inserted where such logical breaks occurred as signaled in the HTML and generally correspond to natural breaks in the original document. Using HTML indicators contributed to the number of very short paragraphs, especially toward the beginning of each document. As mentioned in Section 3.1, these short paragraphs were problematic for the alignment algorithm. The collection consists of 348,619 sentences in 112,346 paragraphs in English and 352,486 sentences in 118,733 paragraphs in Inuktitut. After this step, document, paragraph and sentence boundaries were available to use as hard and soft boundaries for the Gale and Church algorithm.</Paragraph> <Paragraph position="5"> Syllabic Script Conversion: The word correspondence phase required a Roman script representation of the Inuktitut texts. 
The conversion from Unicode syllabics to Roman characters was performed at this stage in the sentence alignment process using the standard ICI conversion method.</Paragraph> <Paragraph position="6"> Anchors: The occurrences of the lexical anchors mentioned above were found and used with a dynamic programming search to find the path with the largest number of alignments. This algorithm was written in Perl and required about two hours to process the whole corpus. All alignments that occurred in the first two sentences of each paragraph were marked as hard boundaries for the Gale and Church (1993) program as provided in their paper.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Sentence Alignment Evaluation </SectionTitle> <Paragraph position="0"> Three representative days of Hansard (1999/04/01, 2001/02/21 and 2002/10/29) were selected and manually aligned at the sentence level as a gold standard. Precision and recall were then measured as suggested in Isabelle and Simard (1996).</Paragraph> <Paragraph position="1"> Results: The number of sentence alignments in the gold standard was 3424. The number automatically aligned by our method was 3459. The number of those automatic alignments that were correct as measured against the gold standard was 3161. This represents a precision of 91.4% (3161 of 3459) and a recall of 92.3% (3161 of 3424). For comparison, the Gale and Church (1993) program, which did not make use of additional anchors, had poorer results over our corpus. Their one-pass approach, which ignores paragraph boundaries, had a precision of 66.7% and a recall of 71.5%. 
Their two-pass approach, which aligns paragraphs in one pass and then aligns sentences in a second pass, had a precision of 85.6% and a recall of 87.0%.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Word Correspondence </SectionTitle> <Paragraph position="0"> Having built a sentence-aligned parallel corpus, we next attempted to use that corpus. Our goal was to extract as many reliable word associations as possible to aid in developing a morphological analyzer and in expanding Inuktitut dictionaries. The output of this glossary discovery phase is a list of suggested pairings that a human can consider for inclusion in a dictionary. Inuktitut dictionaries often disagree because of spelling and dialectal differences. In addition, many contemporary words are not in the existing dictionaries. The parallel corpus presented here can be used to augment the dictionaries with current words, thereby providing an important tool for students, translators, and others.</Paragraph> <Paragraph position="1"> In our approach, a glossary is populated with pairs of words that are consistent translations of each other. For many language pairs, considering whole-word to whole-word correspondences for inclusion in a glossary would yield good results. However, because Inuktitut is agglutinative, the method must discover pairs of an English word and the corresponding root of the Inuktitut word, or the corresponding Inuktitut suffix, or sometimes the whole Inuktitut word. In other words, it is essential to consider substrings of words for good coverage for a language pair like ours.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Substring Correspondence Method </SectionTitle> <Paragraph position="0"> Searching for substring correspondences is reduced to a counting exercise. 
For any pair of substrings, one needs to know how many parallel regions contained the pair, how many regions in one language contained the first, how many regions in the other language contained the second, and how many regions there are in total. For example, the English word 'today' and the Inuktitut word 'ullumi' occur together in 2092 parallel regions. The word 'today' appears in a total of 3065 English regions, and 'ullumi' appears in 2702 Inuktitut regions. Altogether, there are 332,154 aligned regions. It is fairly certain that these two words should be a glossary pair because each usually occurs as a translation of the other.</Paragraph> <Paragraph position="1"> The PMI Measure: We measure the degree of association between any two substrings, one in English and one in Inuktitut, using Pointwise Mutual Information (PMI). PMI measures the amount of information that each substring conveys about the occurrence of the other. We recognize that PMI is badly behaved when the counts are near 1. To protect against that problem, we compute the 99.99999% confidence intervals around the PMI (Lin, 1999), and use the lower bound as a measure of association. This lower bound rises as the PMI rises or as the amount of data increases. Many measures of association would likely work as well as the lower confidence bound on PMI. We used that bound as a metric in this study for three reasons. First, that metric led to better performance than Chi-squared on this data. Second, it addressed the problem of low-frequency events. Third, it makes the correct judgment on Gale and Church's well-known chambre-communes problem (Gale and Church, 1991).</Paragraph> <Paragraph position="2"> The decision to include pairs of substrings in the glossary proceeds as follows. 
Include the highest PMI scoring pairs if neither member of the pair has yet been included.</Paragraph> <Paragraph position="3"> If two pairs are tied, check whether the Inuktitut members of the pairs are in a substring relation. If they are, then add the pair with the longer substring to the glossary; if not, then add neither pair.</Paragraph> <Paragraph position="4"> Many previous efforts have used a similar methodology but were only able to focus on word to word correspondences (Gale and Church, 1991). Here, the English words can correspond to any substring in any Inuktitut word in the aligned region. This means that statistics have to be maintained for many possible pairs. Under our approach, we maintain all these statistics for all English words, all Inuktitut words as well as substrings with length of between one and 10 Roman characters, and all co-occurrences that have frequency greater than three.</Paragraph> <Paragraph position="5"> This approach thereby addresses the challenge of Inuktitut roots and multiple semantic suffixes corresponding to individual English words. It also addresses the challenge of orthographic variation at morpheme boundaries to some degree since it will truncate morphemes appropriately in many cases.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Glossary Evaluation </SectionTitle> <Paragraph position="0"> This method suggested 4362 word-substring pairs for inclusion in a glossary. This represents a 72.3% coverage of English word occurrences in the corpus (omitting words of fewer than 3 characters). One hundred of these word-substring pairs were chosen at random and judged for accuracy using two existing dictionaries and a partial suffix list. 
An Inuktitut substring was said to match an English word exactly if the Inuktitut root plus all the suffixes carried the same meaning as the English word and conveyed the same grammatical features (e.g., grammatical number and case). The correspondence was said to be good if the Inuktitut root plus the left-most lexical suffixes conveyed the same meaning as the English word. In those cases, the Inuktitut word conveyed additional semantic or grammatical information.</Paragraph> <Paragraph position="1"> About half of the exact matches were uninflected proper nouns. A typical example of the other exact matches is the pair inuup and person's. In this pair, inu- means person and -up is the singular genitive case. A typical example of a good match is the pair pigiaqtitara and deal. In this pair, pigiaqti- means deal and -tara conveys first person singular subject and third person singular object, as in &quot;I deal with him&quot;.</Paragraph> <Paragraph position="2"> Of the 100 pairs, 43 were deemed exact matches and 44 were deemed good matches. The remaining 13 were incorrect. Taken together, 87% of the pairs in the sample were useful to include in a glossary. This level of performance is expected to improve as we introduce morphological analysis to both the Inuktitut and English words.</Paragraph> </Section> </Section> </Paper>
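<!--
The counts reported in Sections 3.3 and 4.1 can be checked with a short calculation. The sketch below is illustrative only. It computes plain pointwise mutual information, assuming base-2 logarithms (the paper does not state a base), and it does not implement the lower confidence bound of Lin (1999) that the method actually uses for ranking. The function names pmi and precision_recall are hypothetical and do not come from the paper.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Plain PMI in bits for two items counted over aligned regions.

    Assumption: probabilities are simple relative frequencies over
    the total number of aligned regions.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def precision_recall(correct, proposed, gold):
    """Precision = correct / proposed; recall = correct / gold."""
    return correct / proposed, correct / gold


if __name__ == "__main__":
    # 'today' / 'ullumi' counts from Section 4.1.
    print(f"PMI(today, ullumi) = {pmi(2092, 3065, 2702, 332154):.2f} bits")

    # Sentence alignment counts from Section 3.3.
    p, r = precision_recall(correct=3161, proposed=3459, gold=3424)
    print(f"precision = {p:.1%}, recall = {r:.1%}")
```

With the reported counts, the plain PMI of the pair comes out at roughly 6.4 bits, and the precision and recall reproduce the 91.4% and 92.3% given in Section 3.3.
-->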