<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0810">
  <Title>NUKTI: English-Inuktitut Word Alignment System Description</Title>
  <Section position="3" start_page="0" end_page="75" type="intro">
    <SectionTitle>
2 JAPA: Word Alignment as a Sentence
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="75" type="sub_section">
      <SectionTitle>
Alignment Task
</SectionTitle>
      <Paragraph position="0"> To help participants tune their systems, the organizers made available a set of 25 sentence pairs in which the words had been manually aligned.</Paragraph>
      <Paragraph position="1"> A quick inspection of this material reveals that, in most cases, the alignments are monotonic and involve cepts of n adjacent English words aligned to a single Inuktitut word.</Paragraph>
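      <Paragraph> The two properties noted above can be checked mechanically on a set of manual links. The following is a minimal sketch, assuming links are given as (English position, Inuktitut position) pairs; the function names and the toy links are illustrative assumptions and are not taken from the shared-task data or tools.

```python
from collections import defaultdict


def is_monotonic(links):
    """True if target positions never decrease as source positions increase."""
    targets = [t for _, t in sorted(links)]
    return all(b >= a for a, b in zip(targets, targets[1:]))


def cepts_by_target(links):
    """Group source positions by the single target position they link to."""
    groups = defaultdict(list)
    for s, t in sorted(links):
        groups[t].append(s)
    return dict(groups)


# Toy links: English words 1 and 2 map to Inuktitut word 1, word 3 to word 2.
toy_links = [(1, 1), (2, 1), (3, 2)]
print(is_monotonic(toy_links))     # True
print(cepts_by_target(toy_links))  # {1: [1, 2], 2: [3]}
```

Here each dictionary value is a cept of adjacent English positions attached to one Inuktitut position.</Paragraph>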
      <Paragraph position="2"> Many sentence alignment techniques rely strongly on the monotonic nature of the underlying alignment. We therefore conducted a first experiment using an in-house sentence alignment program called JAPA, which we developed within the framework of the Arcade evaluation campaign (Langlais et al., 1998). The implementation details of this aligner can be found in (Langlais, 1997); in brief, JAPA aligns pairs of sentences by first roughly aligning their words (making use of either cognate-like tokens or a specified bilingual dictionary). A second pass aligns the sentences in a way similar to the algorithm described by Gale and Church (1993), but with the search space constrained to stay close to the one delimited by the word alignment. (Footnote 1: In our case, the score we seek to globally maximize by dynamic programming takes into account not only the length criterion described in (Gale and Church, 1993) but also a cognate-based one similar to (Simard et al., 1992).) This technique proved to be among the most accurate of those tested during the Arcade exercise. To adapt JAPA to our needs, we did only two things. First, we considered single sentences as documents, and tokens as sentences (we define a token as a sequence of characters delimited by white space).</Paragraph>
      <Paragraph position="3"> patterns observed on the development set. A total of 24 different patterns have been observed.</Paragraph>
      <Paragraph position="4"> Second, since in its default setting JAPA only considers n-m sentence-alignment patterns with n, m ∈ [0,2], we provided it with a new pattern distribution that we computed from the development corpus (see Table 1). It is interesting to note that although English and Inuktitut have very different word systems, the length ratio (in characters) of the two sides of the TRAIN corpus is 1.05.</Paragraph>
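      <Paragraph> To make the adaptation concrete, the sketch below treats tokens as the units to align and runs a monotone dynamic program over n-m patterns whose priors are estimated from observed pattern counts. It is not JAPA: the function names, the toy length penalty, and the example tokens are all assumptions made for illustration, the cognate criterion and the first-pass search-space constraint are omitted, and the sketch minimizes a cost where the description above speaks of maximizing a score.

```python
import math
from collections import Counter


def pattern_costs(observed_patterns):
    """Map each observed (n, m) pattern to -log(relative frequency)."""
    counts = Counter(observed_patterns)
    total = sum(counts.values())
    return {p: -math.log(c / total) for p, c in counts.items()}


def length_cost(src_group, tgt_group):
    """Toy length penalty (assumption): relative gap in character counts."""
    s = sum(len(tok) for tok in src_group)
    t = sum(len(tok) for tok in tgt_group)
    return abs(s - t) / max(1, s + t)


def align_tokens(src, tgt, costs):
    """Monotone DP over tokens; returns a list of (src_group, tgt_group)."""
    inf = float("inf")
    best = [[inf] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == inf:
                continue
            for (n, m), prior in costs.items():
                # Skip the empty pattern and patterns that run off the end.
                if n + m == 0 or i + n > len(src) or j + m > len(tgt):
                    continue
                step = prior + length_cost(src[i:i + n], tgt[j:j + m])
                score = best[i][j] + step
                if best[i + n][j + m] > score:
                    best[i + n][j + m] = score
                    back[i + n][j + m] = (n, m)
    if best[len(src)][len(tgt)] == inf:
        raise ValueError("no alignment reachable with the given patterns")
    # Follow back-pointers from the end to recover the pattern sequence.
    i, j, groups = len(src), len(tgt), []
    while i or j:
        n, m = back[i][j]
        groups.append((src[i - n:i], tgt[j - m:j]))
        i, j = i - n, j - m
    return groups[::-1]


# Placeholder tokens only; the target side is NOT real Inuktitut.
costs = pattern_costs([(1, 1), (2, 1)])
print(align_tokens(["in", "the", "house"], ["inthe", "house"], costs))
# [(['in', 'the'], ['inthe']), (['house'], ['house'])]
```

Replacing the default pattern set amounts to passing a different list of observed patterns to pattern_costs; patterns never seen in the development data are simply unavailable to the search.</Paragraph>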
      <Paragraph position="5"> Each pair of documents (sentences) was then aligned separately with JAPA. The 1-n and n-1 alignments identified by JAPA were output without further processing. Since the word alignment format of the shared task does not account directly for n-m alignments (n,m &gt; 1), we generated the Cartesian product of the two sets of words for all the n-m alignments produced by JAPA.</Paragraph>
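      <Paragraph> The expansion of an n-m alignment into word-level links can be sketched as follows. This is a minimal illustration, assuming word positions are plain integers; the function name expand_nm is hypothetical and the positions are made up.

```python
from itertools import product


def expand_nm(src_positions, tgt_positions):
    """Return every (source, target) link for one n-m alignment."""
    return list(product(src_positions, tgt_positions))


# A 2-2 alignment between source words 3, 4 and target words 2, 3.
links = expand_nm([3, 4], [2, 3])
print(links)  # [(3, 2), (3, 3), (4, 2), (4, 3)]
```

An n-m alignment always expands to n times m links, so every such alignment contributes links whether or not each individual word pair is actually correct.</Paragraph>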
      <Paragraph position="6"> The performance of this approach is reported in Table 2. Clearly, the precision is poor. This is partly explained by the Cartesian product we resorted to whenever JAPA produced n-m alignments. Section 4 describes a way of improving upon this scenario.</Paragraph>
    </Section>
  </Section>
</Paper>