<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0811">
  <Title>Models for Inuktitut-English Word Alignment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Special Characteristics of the
Inuktitut-English Alignment Problem
</SectionTitle>
    <Paragraph position="0"> Guided by the discussion of Inuktitut in Mallon (1999), we examined the Nunavut Hansards training and hand-labeled trial data sets in order to identify special challenges and exploitable characteristics of the Inuktitut-English word alignment problem. We were able to identify three: (1) the importance of sublexical Inuktitut units; (2) 1-to-N Inuktitut-to-English alignment cardinality; (3) monotonicity of alignments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Types and Tokens
</SectionTitle>
      <Paragraph position="0"> Inuktitut has an extremely productive agglutinative morphology, and an orthographic word may combine very many individual morphemes. As a result, in Inuktitut-English bitext we observe Inuktitut sentences with many fewer word tokens than the corresponding English sentences; the ratio of English to Inuktitut tokens in the training corpus is 1.85 (see footnote 1). This suggests the importance of looking below the Inuktitut word level when computing lexical translation probabilities (or alignment affinities).</Paragraph>
      <Paragraph position="1"> To reinforce the point, consider that the ratio of training corpus types to tokens is 0.007 for English, and 0.194 for Inuktitut. In developing a customized word alignment solution for Inuktitut-English, a major goal was to handle the huge number of Inuktitut word types seen only once in the training corpus (337798 compared to 8792 for English), without demanding the development of a morphological analyzer.</Paragraph>
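Vocabulary statistics of this kind are straightforward to compute; the following is an illustrative sketch (the helper names are hypothetical, not the authors' code):

```python
from collections import Counter

def type_token_ratio(tokens):
    """Ratio of distinct word types to word tokens."""
    return len(set(tokens)) / len(tokens)

def singleton_types(tokens):
    """Word types seen exactly once -- the hard cases for alignment."""
    counts = Counter(tokens)
    return [w for w, c in counts.items() if c == 1]
```

A highly agglutinative language like Inuktitut pushes `type_token_ratio` up (0.194 vs. 0.007 for English here) and fills `singleton_types` with forms that a word-level aligner can never estimate reliably.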
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Alignment
</SectionTitle>
      <Paragraph position="0"> Considering English words in English sentence order, 4.7% of their alignments to Inuktitut were found to be retrograde; that is, involving a decrease in Inuktitut word position with respect to the previous English word's aligned Inuktitut position. Since this method of counting retrograde alignments would assign a low count to mass movements of large contiguous chunks, we also measured the number of inverted alignments over all pairs of English word positions. That is, the sum Σ_e Σ_{a=1}^{|e|-1} Σ_{b=a+1}^{|e|} Σ_{i1 ∈ I(e,a)} Σ_{i2 ∈ I(e,b)} [1 if i1 &gt; i2] was computed over all Inuktitut alignment sets I(e, x), for e the English sentence and x the English word position. Dividing this sum by the obvious denominator (replacing [1 if i1 &gt; i2] with 1 in the sum) yielded a value of 1.6% inverted alignments.</Paragraph>
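The inverted-pair statistic can be sketched directly from its definition (a hypothetical helper, not the paper's code):

```python
from itertools import combinations

def inverted_fraction(alignment_sets):
    """alignment_sets[x] is the set of Inuktitut positions aligned to
    English position x (0-based), for one sentence pair.
    Returns (inverted pair count, total pair count)."""
    inverted = total = 0
    # Compare every ordered pair of English positions a < b.
    for A, B in combinations(alignment_sets, 2):
        for i1 in A:
            for i2 in B:
                total += 1
                if i1 > i2:       # Inuktitut order disagrees with English order
                    inverted += 1
    return inverted, total
```

A fully monotone sentence pair yields zero inversions; summing both counts over the corpus and dividing gives the 1.6% figure reported above.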
      <Paragraph position="1"> Table 1 shows a histogram of alignment cardinalities for both English and Inuktitut. Ninety-four percent of English word tokens, and ninety-nine percent of those having a non-null alignment, align to exactly one Inuktitut word. In development of a specialized word aligner for this language pair (Section 3), we made use of the observed reliability of these two properties, monotonicity and 1-to-N cardinality.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="79" type="metho">
    <SectionTitle>
3 Alignment by Weighted Finite-State
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="79" type="sub_section">
      <SectionTitle>
Transducer Composition
</SectionTitle>
      <Paragraph position="0"> We designed a specialized alignment system to handle the above-mentioned special characteristics of Inuktitut-English alignment. (Footnote 1: Though this ratio increases to 2.21 when considering only longer sentences (20 or more English words), ignoring common short, formulaic sentence pairs such as ( Hudson Bay ) ( sanikiluaq ).)</Paragraph>
      <Paragraph position="1"> [Table 1 caption fragment: ... alignment, computed over the trial data.]</Paragraph>
      <Paragraph position="2"> Our weighted finite-state transducer (WFST) alignment model, illustrated in Figure 1, structurally enforces monotonicity and 1-to-N cardinality, and exploits sublexical information by incorporating association scores between English words and Inuktitut word substrings, based on co-occurrence in aligned sentences.</Paragraph>
      <Paragraph position="3"> For each English word, an association score was computed not only with each Inuktitut word, but also with each Inuktitut character string of length ranging from 2 to 10 characters. This is similar to the technique described in Martin et al. (2003) as part of their construction of a bilingual glossary from English-Inuktitut bitext. However, our goal is different, and we keep all the English-Inuktitut associations rather than selecting only the best ones using a greedy method, as they do. Additionally, before extracting all substrings from each Inuktitut word, we added a special boundary character to the word's beginning and end (e.g., makkuttut with markers affixed at both ends), in order to exploit any preferences for word-initial or word-final placement.</Paragraph>
      <Paragraph position="4"> The heuristic association score chosen was p(word_e | word_i) * p(word_i | word_e), computed over all the aligned sentence pairs. We have in the past observed this to be a useful indicator of word association, and it has the nice property of being in the range (0, 1].</Paragraph>
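A minimal sketch of the substring inventory and this co-occurrence-based score (the function names and the "#" boundary marker are assumptions, not the paper's implementation):

```python
from collections import Counter

def substrings(word, lo=2, hi=10, boundary="#"):
    """All substrings of length lo..hi of the boundary-marked word."""
    w = boundary + word + boundary
    return {w[i:i + n] for n in range(lo, hi + 1) for i in range(len(w) - n + 1)}

def association_scores(bitext):
    """score(e, s) = p(e|s) * p(s|e), estimated from sentence-level
    co-occurrence counts over aligned sentence pairs."""
    e_count, s_count, joint = Counter(), Counter(), Counter()
    for eng_sent, inu_sent in bitext:
        eng = set(eng_sent)
        subs = set().union(*(substrings(w) for w in inu_sent))
        e_count.update(eng)
        s_count.update(subs)
        joint.update((e, s) for e in eng for s in subs)
    return {(e, s): (c / s_count[s]) * (c / e_count[e])
            for (e, s), c in joint.items()}
```

Because both conditional estimates lie in (0, 1] and equal 1 only when the pair always co-occurs, the product behaves as the paper describes: a bidirectional association measure bounded by (0, 1].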
      <Paragraph position="5"> The WFST aligner is a composition of four transducers. The structure of the entire WFST composition enforces monotonicity, Inuktitut-to-English 1-to-N cardinality, and Inuktitut word fertilities ranging between 1 and 7. This model was implemented using the AT&amp;T finite-state toolkit (Mohri et al., 1997). In Figure 1, [1] is a linear transducer mapping each English position in a particular English test sentence to the word at that position. It is constructed so as to force each English word to participate in exactly one alignment. [2] is a single-state transducer mapping English words to Inuktitut substrings (or full words), with weights derived from the association scores. [3] is a transducer mapping Inuktitut substrings (and full words) to their positions in the Inuktitut test sentence. Its construction allows a single Inuktitut position to correspond to multiple English positions, while enforcing monotonicity. [4] is a transducer regulating the allowed fertility values of Inuktitut words; each Inuktitut word is permitted a fertility of between 1 and 7. The fertility values are assigned probabilities corresponding to observed relative frequencies in the trial data, and are not conditioned on the identity of the Inuktitut word.</Paragraph>
      <Paragraph position="7"> [Figure 1 caption: The WFST aligner, instantiated for an example sentence from the development (trial) data ("in regards to elders and youth i want to make general comments" / "pijjutigillugu innatuqait amma makkuttut uqausiqakainnarumajunga"). To save space, only a representative portion of each machine is drawn. Transition weights are costs in the tropical (min,+) semiring, derived from negative logs of probabilities and association scores. Nonzero costs are indicated in parentheses.]</Paragraph>
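The alignment space this composition defines (Inuktitut words consumed in order, each English word linked exactly once, per-word fertility 1-7) can be illustrated with an equivalent Viterbi-style dynamic program; this is a stand-in sketch under those constraints, not the authors' WFST implementation:

```python
def monotone_align(score, n_inu, max_fert=7):
    """score[e][i]: additive score for linking English word e to Inuktitut
    word i. Returns, per English word, the Inuktitut position it links to.
    Inuktitut words are consumed left to right, each with fertility
    1..max_fert, mirroring the constraints of transducers [1], [3], [4]."""
    NEG = float("-inf")
    n_eng = len(score)
    prev = {(0, 1): score[0][0]}          # state: (inuktitut pos, fertility used)
    back = [dict() for _ in range(n_eng)]
    for e in range(1, n_eng):
        cur = {}
        for (i, f), v in prev.items():
            moves = []
            if f < max_fert:
                moves.append((i, f + 1))   # link another English word here
            if i + 1 < n_inu:
                moves.append((i + 1, 1))   # advance monotonically
            for s in moves:
                val = v + score[e][s[0]]
                if val > cur.get(s, NEG):
                    cur[s] = val
                    back[e][s] = (i, f)
        prev = cur
    # Only alignments that consumed all Inuktitut words are valid.
    finals = {s: v for s, v in prev.items() if s[0] == n_inu - 1}
    state = max(finals, key=finals.get)
    links = [0] * n_eng
    for e in range(n_eng - 1, 0, -1):
        links[e] = state[0]
        state = back[e][state]
    return links
```

In the real system the scores come from the association transducer [2] and fertility costs from [4]; here a single score table stands in for both.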
    </Section>
  </Section>
  <Section position="5" start_page="79" end_page="79" type="metho">
    <SectionTitle>
4 English-Inuktitut Transliteration
</SectionTitle>
    <Paragraph position="0"> Although in this corpus English and Inuktitut are both written in Roman characters, English names are significantly transformed when rendered in Inuktitut text. Consider the following English/Inuktitut pairs from the training corpus: Chartrand/saaturaan, Chretien/kurittian, and the set of training-corpus-attested Inuktitut renderings of Williams, Campbell, and McLean shown in Table 2(A) (which does not include variations containing the common -mut lexeme, meaning "to [a person]" (Mallon, 1999)).</Paragraph>
    <Paragraph position="1"> Clearly, not only does the English-to-Inuktitut transformation radically change the name string, it does so in a nondeterministic way which appears to be influenced not only by the phonological preferences of Inuktitut but also by differing pronunciations of the name in question, and possibly by differing conventions of translators (note, for example, maklain versus mikliin for McLean).</Paragraph>
    <Paragraph position="2"> We trained a probabilistic finite-state transducer (FST) to identify English-Inuktitut transliterated pairs in aligned sentences. Training string pairs were acquired from the training bitext in the following manner. Whenever single instances of corresponding honorifics were found in a sentence pair (these included correspondences such as mista/mistu), the immediately following capitalized English words (up to 2) were extracted, and the same number of Inuktitut words were extracted, to be used as training pairs. [Table 2(B) caption fragment: ... (underlined) orthographic characters (and character sequences). Where top substitutions for English characters are shown, none equal or better were omitted.]</Paragraph>
    <Paragraph position="3"> Thus, given the appearance in aligned sentences of "Mr. Quirke" and "mista kuak", the training pair (Quirke, kuak) would be extracted. Common distractions such as "Mr Speaker" were filtered out. In order to focus on the native English name problem (Inuktitut name rendering into English is much less noisy), the English extractions were required to have appeared in a large, news-corpus-derived English wordlist. This procedure resulted in a conservative, high-quality list of 434 unique name pairs. The probabilistic FST model we selected was that of a memoryless (single-state) transducer representing a joint distribution over character substitutions, English insertions, and Inuktitut insertions. This model is identical to that presented in Ristad and Yianilos (1997). Prior to training, common English digraphs (e.g., "th" and "sh") were mapped to unique single characters, as were doubled consonants. Inuktitut "ng" and common two-vowel sequences were also mapped to unique single characters, to elicit higher-quality results from the memoryless transduction model employed. Some results of the transducer training are displayed in Table 2(B). Probabilistic FST weight training was accomplished using the Dyna modeling language and DynaMITE parameter optimization toolkit (Eisner et al., 2004). The transliteration modeling described here differs from previous transliteration work such as Stalls and Knight (1998) in that there is no explicit modeling of pronunciation, only a direct transduction between written forms.</Paragraph>
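The digraph and doubled-consonant packing step might be sketched as follows (the replacement symbols are invented for illustration; the paper does not specify them):

```python
import re

# Hypothetical single-character stand-ins for multi-character units.
ENGLISH_UNITS = {"th": "θ", "sh": "ʃ"}
INUKTITUT_UNITS = {"ng": "ŋ", "aa": "ā", "ii": "ī", "uu": "ū"}

def pack(word, units, fold_doubles=False):
    """Replace listed digraphs (and optionally doubled consonants) with
    single characters, so a memoryless transducer learns each unit as
    one substitution rather than two independent ones."""
    for digraph, sym in units.items():
        word = word.replace(digraph, sym)
    if fold_doubles:
        # e.g. "ll" -> "L": a doubled consonant becomes one symbol
        word = re.sub(r"([b-df-hj-np-tv-z])\1",
                      lambda m: m.group(1).upper(), word)
    return word
```

With this packing, a single-state joint model in the style of Ristad and Yianilos (1997) can score, say, th→t or kk→k as one event instead of forcing two context-free character decisions.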
    <Paragraph position="4"> In applying transliteration to trial/test data, the following criteria were used to select English words for transliteration: (1) the word is capitalized; (2) the word is not in the exclusion list (footnote 4). If the top-ranked transliteration of the English word is present in the Inuktitut sentence, all occurrences of that transliteration in the sentence are marked as aligned to the English word.</Paragraph>
    <Paragraph position="5"> We have yet to evaluate English-Inuktitut transliteration in isolation on a large test set. However, accuracy on the workshop trial data was 4/4 hypotheses correct, and on test data 2/6 correct. Of the 4 incorrect test hypotheses, 2 were mistakes in identifying the correct transliteration, and 2 resulted from attempting to transliterate an English word, such as Councillors, which should not be transliterated. Even with a relatively low accuracy, the transliteration model, which is used only as an individual voter in combination systems, is unlikely to vote for the incorrect choice of another system. Its purpose under system combination is to push a good alignment link hypothesis up to the required vote threshold.</Paragraph>
  </Section>
  <Section position="6" start_page="79" end_page="81" type="metho">
    <SectionTitle>
5 IBM Model 4 Alignments
</SectionTitle>
    <Paragraph position="0"> As a baseline and contributor to our combination systems, we ran GIZA++ (Och and Ney, 2000) to produce alignments based on IBM Model 4. The IBM alignment models are asymmetric, requiring that one language be identified as the "e" language, whose words are allowed many links each, and the other as the "f" language, whose words are allowed at most one link each.</Paragraph>
    <Paragraph position="1"> Although the observed alignment cardinalities naturally suggest identifying Inuktitut as the "e" language and English as the "f" language, we ran both directions for completeness. As a crude first attempt to capture sublexical correspondences in the absence of a method for morpheme segmentation, we developed a rough syllable segmenter (spending approximately 2 person-hours), ran GIZA++ to produce alignments treating the syllables as words, and chose, for each English word, the Inuktitut word or words the largest number of whose syllables were linked to it.</Paragraph>
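The projection from syllable-level GIZA++ links back to word-level links described above might look like this (the data layout is a hypothetical illustration):

```python
from collections import Counter

def project_links(syl_links, syl_to_word):
    """syl_links: {english_pos: set of linked Inuktitut syllable ids}.
    syl_to_word: maps each syllable id to its Inuktitut word position.
    For each English word, keep the Inuktitut word(s) owning the largest
    number of linked syllables (ties keep all top-scoring words)."""
    out = {}
    for e, syls in syl_links.items():
        votes = Counter(syl_to_word[s] for s in syls)
        if votes:
            top = max(votes.values())
            out[e] = {w for w, c in votes.items() if c == top}
    return out
```

This recovers word-level hypotheses from the syllable run without ever segmenting morphemes explicitly.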
    <Paragraph position="2"> In the nomenclature of our results tables, giza++ syllabized refers to the latter system, giza++ E(1)-I(N) represents GIZA++ run with English as the e language, and giza++ E(N)-I(1) sets English as the f language.</Paragraph>
    <Section position="1" start_page="79" end_page="81" type="sub_section">
      <SectionTitle>
6 System Performance and Combination
Methods
</SectionTitle>
      <Paragraph position="0"> We observed the 4 main systems (3 GIZA++ variants and WFST) to have significantly different performance profiles in terms of precision and recall. Consistently, WFST won out on F-measure, while giza++ syllabized attained a better alignment error rate (AER). Refer to Table 3 for details of performance on trial and test data. (Footnote 4: 2000 randomly selected English training sentences were examined; words such as Clerk, Federation, and Fisheries, which are frequently capitalized but should not be transliterated, were put into the exclusion list. In addition, any word with frequency &gt; 50 in the training corpus was excluded, on the rationale that common-enough words would have well-estimated translation probabilities already. 50 may seem like a high threshold until one considers the high variability of the transliteration process as demonstrated in Table 2(A).)</Paragraph>
      <Paragraph position="1"> [Table 3 caption: The precision, recall, and F-measure cited are the unlabeled version ("probable", in the nomenclature of this shared task). The gold-standard truth for the trial data contained 710 alignments; the test gold standard included 1972 alignments. The column |H|/|T| lists the ratio of hypothesis set size to truth set size for each system.]</Paragraph>
      <Paragraph position="3"> We investigated a number of system combination methods, three of which were finally selected for use in submitted systems. There were two basic methods of combination: per-link voting and per-English-word voting (footnote 6). In per-link voting, an alignment link is included if it is proposed by at least a certain number of the participating individual systems. In per-English-word voting, the best outgoing link is chosen for each English word (the link supported by the greatest number of individual systems); any ties are broken using the WFST system's choice. A high-recall variant of per-English-word voting was included, in which ties at vote-count 1 (indicating a low-confidence decision) are not broken; rather, all systems' choices are submitted as hypotheses.</Paragraph>
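The two voting schemes can be sketched as set operations over per-system link sets (a simplified illustration; ties not covered by the described WFST tiebreak are resolved arbitrarily here):

```python
from collections import Counter

def per_link_vote(systems, min_votes=2):
    """Keep every link proposed by at least min_votes systems."""
    votes = Counter(link for s in systems for link in s)
    return {link for link, c in votes.items() if c >= min_votes}

def per_english_word_vote(systems, tiebreak):
    """For each English word, keep its most-supported outgoing link,
    breaking ties with the tiebreak system's choice when possible."""
    votes = Counter(link for s in systems for link in s)
    out = set()
    for e in {link[0] for s in systems for link in s}:
        cands = [l for l in votes if l[0] == e]
        top = max(votes[l] for l in cands)
        best = [l for l in cands if votes[l] == top]
        if len(best) > 1:
            preferred = [l for l in best if l in tiebreak]
            best = preferred or best[:1]   # arbitrary if tiebreaker abstains
        out.add(best[0])
    return out
```

Raising `min_votes` trades recall for precision, which is exactly the knob the submitted AER Emphasis and F/AER Emphasis systems turn.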
      <Paragraph position="4"> The transliteration model described in Section 4 was included as a voter in each combination system, though it made few hypotheses (6 on the test data). Composition of the submitted systems was as follows: F/AER Empha6Combination methods we elected not to submit included voting with trained weights and various stacked classi ers. The reasoning was that with such a small development data set 25 sentences it was unsafe to put faith in any but the simplest of classi er combination schemes.</Paragraph>
      <Paragraph position="5"> sis - per-link voting with decision criterion &gt;= 2 votes, over all 5 described systems (WFST, 3 GIZA++ variants, transliteration). AER Emphasis (I) per-link voting, &gt;= 2 votes, over all systems except giza++ E(N)-I(1).</Paragraph>
      <Paragraph position="6"> AER Emphasis (II) per-link voting, &gt;= 3 votes, over all systems. F Emphasis per-English-word voting, over all systems, using WFST as tiebreaker. Recall Emphasis per-English-word voting, over all systems, high-recall variant.</Paragraph>
      <Paragraph position="7"> We elected to submit these systems because each is tailored to a distinct evaluation criterion (as suggested by the naming convention). Experiments on trial data convinced us that minimizing AER and maximizing F-measure in a single system would be difficult: minimizing AER required such high-precision results that the tradeoff in recall greatly lowered F-measure. It is interesting to note that system combination provides a convenient means for adjusting alignment precision and recall to suit the requirements of the problem or evaluation standard at hand.</Paragraph>
    </Section>
  </Section>
</Paper>