<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0808">
  <Title>A hybrid approach to align sentences and words in English-Hindi parallel corpora</Title>
  <Section position="4" start_page="0" end_page="58" type="metho">
    <SectionTitle>
2 Sentence Alignment
</SectionTitle>
    <Paragraph position="0"> Sentence alignment techniques vary from simple character-length or word-length techniques to more sophisticated techniques which involve lexical constraints and correlations or even cognates (Wu 2000). Examples of such alignment techniques are Brown et al. (1991), Kay and Roscheisen (1993), Warwick et al. (1989), and the &quot;align&quot; programme by Gale and Church (1993).</Paragraph>
    <Section position="1" start_page="0" end_page="57" type="sub_section">
      <SectionTitle>
2.1 Length-based methods
</SectionTitle>
      <Paragraph position="0"> Length-based approaches are computationally cheaper, while lexical methods are more resource-hungry. Brown et al. (1991) and Gale and Church (1993) are amongst the most cited works in text alignment. Purely length-based techniques have no concern with word identity or meaning and as such are considered knowledge-poor approaches.</Paragraph>
      <Paragraph position="1"> The method used by Brown et al. (1991) measures sentence length in number of words. Their approach is based on matching sentences with the nearest length. Gale and Church (1993) used a similar algorithm, but measured sentence length in number of characters. Their method performed well on the Union Bank of Switzerland (UBS) corpus, giving a 2% error rate for 1:1 alignment.</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
2.2 Lexical methods
</SectionTitle>
      <Paragraph position="0"> Moving towards knowledge-rich methods, lexical information can be vital in cases where a string with the same length appears in two languages.</Paragraph>
      <Paragraph position="1"> Kay and Roscheisen (1993) tried lexical methods for sentence alignment. In their algorithm, they consider the most reliable pair of source and target sentences, i.e. those that contain many possible lexical correspondences. They achieved 96% coverage on Scientific American articles after four passes of the algorithm.</Paragraph>
      <Paragraph position="2"> Other examples of lexical methods are Warwick et al. (1989), Mayers et al. (1998), Chen (1993) and Haruno and Yamazaki (1996).</Paragraph>
      <Paragraph position="3"> Warwick et al. (1989) calculate the probability of word pairings on the basis of the frequency of the source word and the number of possible translations appearing in target segments. They suggest using a bilingual dictionary to build word pairs. Mayers et al. (1998) propose a method that is based on a machine-readable dictionary. Since bilingual dictionaries contain base forms, they pre-process the text to find the base form of each word. They tried this method in an English-Japanese alignment system and reported an accuracy of about 89.5% for 1-to-1 and 42.9% for 2-to-1 sentence alignments. Chen (1993) constructs a simple word-to-word translation model and then takes the alignment that maximizes the likelihood of generating the corpus given the translation model. Haruno and Yamazaki (1996) use a POS tagger for the source and target languages and an online dictionary to find matching word pairs. They point out that although dictionaries cannot capture context-dependent keywords in the corpus, they can be very useful for obtaining information about words that appear only once in the corpus. Lexical methods for sentence alignment may also yield a partial word alignment. Given that lexical methods can be computationally expensive, our idea was to try a simple length-based approach similar to that of Brown et al. (1991) for sentence alignment and then use lexical methods to align words within aligned sentences.</Paragraph>
    </Section>
    <Section position="3" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
2.3 Algorithm
</SectionTitle>
      <Paragraph position="0"> We use English-Hindi parallel data from the EMILLE corpus for our experiments. EMILLE is a 63-million-word electronic corpus of South Asian languages, especially those spoken as minority languages in the UK. It has around 120,000 words of parallel data in each of English, Hindi, Urdu,</Paragraph>
      <Paragraph position="2"> Examining the data, we observe that it is possible to align one English sentence with one or more Hindi sentences or vice-versa. In the method described below, sentence length is calculated in number of words. We define our task as that of learning rules that characterise the relationship between the lengths of two sentences in parallel texts. We used 60 manually aligned paragraphs from the EMILLE corpus, each with an average of 3 sentences, as a dataset for our learning task.</Paragraph>
      <Paragraph position="3"> Initially we derived minimum and maximum length differences in percentages for each of the one-to-one, one-to-two and one-to-three parallel sentence pairs. Later we used these values as input to our algorithm to learn new rules that maximize the probability of aligning sentences.</Paragraph>
      <Paragraph position="4"> Learning: Let T = [1:1, 1:2, 1:3, 2:1, 3:1] be the set of possible alignment types between the English and Hindi sentences. For each alignment type t ∈ T, the minimum and maximum length differences in number of words, normalized to percentages, are denoted min_t and max_t. For each alignment type t ∈ T, a constant parameter d_t, where d_t ∈ [min_t, min_t + 0.01, min_t + 0.02, ..., max_t], was learned using the algorithm described in Figure 2.1.</Paragraph>
      <Paragraph position="5"> d_t is a value that describes the length relationship between the sentences of a pair of type t. For example, given a pair of one Hindi and two English sentences and a value d_t, where t = 1:2, it is possible to check whether these sentences can be aligned with each other. Suppose a given pair of parallel sentences consists of h_i (the Hindi sentence at the ith position) and e_j and e_j+1 (the English sentences at the jth and (j+1)th positions), and let |h_i|, |e_j| and |e_j+1| be the lengths of the Hindi and English sentences. h_i, e_j and e_j+1 are said to have a 1:2 alignment if |h_i| - (|e_j| + |e_j+1|) &lt; 0.17 * |h_i|, i.e. the difference between the length of the Hindi sentence and the combined length of the two consecutive English sentences is less than (d_t = 0.17 for t = 1:2) times the length of the Hindi sentence. Table 2.1 lists rules for the different possible alignments. Before we decide on the final alignment, we check each possibility of one Hindi sentence being aligned with one, two or three consecutive English sentences and vice versa. We use rules H1 and H2 to check the possibility of one Hindi sentence being aligned with two or three consecutive English sentences. Similarly, rules E1 and E2 are used to check the possibility of one English sentence being aligned with two or three consecutive Hindi sentences. If none of the rules H1, H2, E1 and E2 returns true, we consider the default alignment (1-To-1) between the English and Hindi sentences. We give preference to the higher alignment over the possible lower alignments, i.e. given possible 1-To-2 and 1-To-3 alignment mappings, we choose the 1-To-3 mapping.</Paragraph>
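      <Paragraph> The rule check described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, only the 1:2 threshold (0.17) comes from the text, the 1:3 threshold is a placeholder, and taking the absolute value of the length difference is an assumption of this sketch.
<![CDATA[
```python
from typing import Dict, List

# Hypothetical thresholds keyed by the number of grouped sentences.
# Only the 1:2 value (0.17) appears in the text; the 1:3 value is a placeholder.
THRESHOLDS: Dict[int, float] = {2: 0.17, 3: 0.17}


def fits(single_len: int, group_lens: List[int], d: float) -> bool:
    """Check whether the length difference is below d times the single sentence's length.

    Using the absolute difference is an assumption of this sketch.
    """
    return abs(single_len - sum(group_lens)) < d * single_len


def alignment_type(single_len: int, following_lens: List[int]) -> int:
    """Return how many consecutive sentences (3, 2 or 1) to align with one sentence."""
    for n in (3, 2):  # the higher alignment is preferred over the lower one
        group = following_lens[:n]
        if len(group) == n and fits(single_len, group, THRESHOLDS[n]):
            return n
    return 1  # default 1-To-1 alignment when no rule fires
```
]]>
      For instance, a 30-word Hindi sentence followed by English sentences of 10, 9 and 10 words passes the three-sentence check, so the sketch returns 3 without ever trying the two-sentence rule.</Paragraph>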
      <Paragraph position="6"> We tested our algorithm on parallel texts with a total of 3441 English-Hindi sentence pairs and obtained an accuracy of 99.09%, i.e. 3410 pairs were correctly aligned.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="58" end_page="62" type="metho">
    <SectionTitle>
3 Word Alignment
</SectionTitle>
    <Paragraph position="0"> Extending sentence alignment to word alignment is a process of locating corresponding word pairs in two languages. In some cases, a word is not translated, or is translated by several words. A word can also be part of an expression that is translated as a whole (Manning &amp; Schutze, 2003). We present a hybrid method for many-to-many word alignment. Hindi is a partially free word order language: the order of word groups in a Hindi sentence is not fixed, but the order of words within groups is fixed (Ray et al., 2003). According to Ray et al. (2003), fixed order word group extraction is essential for decreasing the load on the free word order parser. The word alignment algorithm takes as input a pair of aligned sentences and groups words in the sentences of both languages. We have also observed a few facts about the Hindi language. For example, there are no articles in Hindi (Bal Anand, 2001), so English articles are aligned to null.</Paragraph>
    <Section position="1" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
3.1 Local word grouping
</SectionTitle>
      <Paragraph position="0"> A separate group is created for each token in the English text. Every English word has one property associated with it: the lemma of the word. This is necessary because a dictionary lookup approach is at the heart of our word alignment algorithm.</Paragraph>
      <Paragraph position="1"> Verbs are used in different inflected forms in different sentences. For a verb, it is common not to find all inflected forms listed in a dictionary, i.e. most dictionaries contain verbs only in their base forms. Therefore we use a morphological analyzer to find the lemma of each English word.</Paragraph>
      <Paragraph position="2"> Word groups in Hindi are created using two resources: a Hindi gazetteer list that contains a large set of named entities (NE) and a rule file that contains more than 250 rules. The gazetteer list is available as part of the Hindi Gazetteer Processing Resource in GATE (Maynard et al., 2003). Each rule in the rule file contains the following information:  1. Hindi Regular Expression (RE) for a word or phrase. This must match one or more words in the Hindi sentence.</Paragraph>
      <Paragraph position="3"> 2. Group name or a part-of-speech category.</Paragraph>
      <Paragraph position="4"> 3. Expected English word(s) (EEW) that this Hindi word group may align to.</Paragraph>
      <Paragraph position="5"> 4. Expected Number of English words (NW) that the Hindi group may align to.</Paragraph>
      <Paragraph position="6"> 5. In case a group of one or more English  words aligns with a group of one or more Hindi words, information about the key words (KW) in both groups. Key words must match each other in order to align English-Hindi groups.</Paragraph>
      <Paragraph position="7"> 6. A rule to convert the Hindi word into its base form (BF).</Paragraph>
      <Paragraph position="8"> Rules in the rule file identify verbs, postpositions, noun phrases and also sets of words whose translation is expected to occur in the same order as the English words in the English sentence. The local word grouping algorithm considers one rule at a time and tries to match its regular expression in the Hindi sentence. If the expression is matched, a separate group is created for each pattern found. When a Hindi group is created, based on its pattern type, one of the following categories is assigned to that group:  i) &quot;rhaa&quot;, &quot;rhe&quot; and &quot;rhii&quot; are used to indicate the progressive tense. They can be seen as analogous to the English (-ing) ending. ii) &quot;te&quot;, &quot;taa&quot; and &quot;tii&quot; are used as verb endings to indicate the habitual tense. They must agree with subject number and gender. iii) &quot;the&quot; is a past tense conjugation of the verb &quot;honaa&quot;. In the first rule, if we find the word &quot;baavn&quot; (bavan) in Hindi, we mark it as a &quot;Number&quot; and search for an English string of two words that is equal to the expected string &quot;fifty two&quot;. In the second rule, we locate a string where the second word is &quot;rhaa&quot; (raha). &quot;1&quot; in the fifth column specifies that the first word is the key word. We use the dictionary to locate the word in the English sentence that matches the key word. If the English word is located, we align &quot;(.)+ rhaa&quot; with the English word found. In the third rule, if we find a Hindi string of two words where the first word ends with &quot;te&quot; (te) and the second word is &quot;the&quot; (the), we group them as a verb.
As specified in the sixth column, we replace the characters &quot;te&quot; with &quot;naa&quot; (na) to convert the first word into its base form (e.g. &quot;gaate&quot; (gaate) into &quot;gaanaa&quot; (gaana)). In the fourth rule, we align &quot;X ke ilye&quot; with &quot;For X&quot;, where &quot;For&quot; = &quot;ke ilye&quot;. As specified in the fifth column, we align the first word in Hindi with the second word in English. In the final example, we group two words that are identical to each other, for example &quot;alg alg&quot; (alag alag), which means &quot;different&quot; in English. Such bigrams are used to stress the importance of a word or activity in a sentence.</Paragraph>
      <Paragraph position="9"> For example, in rules 3 and 4, if a word ends with taa, te or tii and is followed by (PH), it is assumed that the word is a verb. The formula for finding the lemma of any Hindi verb is: infinitive = root verb + &quot;naa&quot;. Sometimes it is possible to predict the corresponding English translation. For example, for the postposition &quot;ke saamne&quot;, one is likely to find the preposition &quot;in front of&quot; in the English sentence. We store this information as the expected English word(s) in Hindi Word Groups (HWGs) and search for it in the English sentence.</Paragraph>
      <Paragraph position="10"> In the case of rules 4 and 5, though the HWG contains more than one word, only one is the actual verb (key word) that is expected to be available in a dictionary. We specify the index of this key word in the HWG, so that only the word at the specified index is compared with the key word in the English word group. If they match, the full HWG is aligned to the word in the English sentence.</Paragraph>
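      <Paragraph> To make the rule format more concrete, the following Python sketch applies one rule of the kind described above (a two-word verb group whose first word ends in &quot;te&quot; and whose second word is &quot;the&quot;). It works on romanized strings for readability, and the Rule class, its field names and the apply_rule function are hypothetical illustrations rather than the format of the actual rule file.
<![CDATA[
```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical in-memory form of one grouping rule."""
    pattern: str                              # regular expression (RE) over the sentence
    category: str                             # group name or part-of-speech category
    key_index: int                            # 1-based index of the key word (KW)
    suffix: Optional[Tuple[str, str]] = None  # (ending, replacement) giving the base form (BF)


# Two-word group: first word ends in "te", second word is "the".
HABITUAL_PAST = Rule(pattern=r"\b(\w+te) the\b", category="verb",
                     key_index=1, suffix=("te", "naa"))


def apply_rule(rule: Rule, sentence: str) -> List[Tuple[str, str, str]]:
    """Return (group text, category, base form of key word) for every match."""
    groups = []
    for match in re.finditer(rule.pattern, sentence):
        words = match.group(0).split()
        key = words[rule.key_index - 1]
        if rule.suffix and key.endswith(rule.suffix[0]):
            # Replace the inflectional ending to obtain the dictionary form.
            key = key[: -len(rule.suffix[0])] + rule.suffix[1]
        groups.append((match.group(0), rule.category, key))
    return groups
```
]]>
      Applied to the romanized string &quot;ve gaate the&quot;, the sketch yields one verb group, &quot;gaate the&quot;, whose key word is reduced to the base form &quot;gaanaa&quot; for dictionary lookup.</Paragraph>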
    </Section>
    <Section position="2" start_page="60" end_page="60" type="sub_section">
      <SectionTitle>
3.2 Alignment Algorithm
</SectionTitle>
      <Paragraph position="0"> After applying the local word grouping rules to the Hindi sentence(s), we use four methods, chosen according to the categories of the HWGs, to process and align HWGs with their respective English Word Groups.</Paragraph>
      <Paragraph position="1">  1. Dictionary lookup approach (DL) 2. Transliteration similarity approach (TS) 3. Expected English words approach (EEW) 4. Nearest aligned neighbour approach.  Whilst verbs and other groups are processed with the DL approach, HWGs with categories such as proper nouns, city, job-title, location and country are processed with the TS approach. HWGs such as number, day-unit, date-unit, month-unit, auxiliary, pronoun and postpositions, where the expected English words are specified, are processed with the EEW approach. Sometimes a combination of DL and TS is also used to identify the proper alignment. At the end, the nearest aligned neighbour approach is used to align the HWGs that remain unaligned.</Paragraph>
    </Section>
    <Section position="3" start_page="60" end_page="61" type="sub_section">
      <SectionTitle>
Dictionary Lookup
</SectionTitle>
      <Paragraph position="0"> The corpus we used in our experiments is encoded in Unicode and therefore the word matching process requires dictionary entries to be in Unicode encoding. The only English-Hindi dictionary we found is called &quot;shabdakoSha&quot; and is freely available from (WWW2).</Paragraph>
      <Paragraph position="1"> In this dictionary, the ITRANS transliteration system is followed, i.e. Hindi entries are not written in the Devanagari script, but in the Roman script. This dictionary has around 15,000 English words, each with an average of 4 relevant Hindi words. Following ITRANS conventions, a parser was developed to convert all these entries into Unicode. Given a set of English and Hindi words, the algorithm presented in Figure 3.1 is executed to search for the best translation among the English words. [Figure 3.2: Nearest Aligned Neighbours Approach]</Paragraph>
    </Section>
    <Section position="4" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
Transliteration Similarity
</SectionTitle>
      <Paragraph position="0"> A transliteration system maintains a consistent correspondence between the alphabets of two languages, irrespective of sound (Manning &amp; Schutze, 2003). Given two words, each from a different language, we define &quot;transliteration similarity&quot; as the measure of likeness between them. This could exist due to the word in one language being inherited or adopted by the other language, or because the word is a proper noun.</Paragraph>
      <Paragraph position="1"> Named entities such as city, job-title, location, country and proper nouns, all recognized by the local word grouping algorithm, are compared using a transliteration similarity approach. This likeness is measured using a table that lists letter correspondences between the alphabets of the two languages. For English and Hindi, such a table can be constructed, for example: A → a, B → b, Bh → bh, Ch → c, D → d, Dh → dh and so on.</Paragraph>
      <Paragraph position="2"> A bidirectional mapping is established between each character in the English and Hindi alphabets.</Paragraph>
      <Paragraph position="3"> When DL is not able to find a specific English word in the dictionary, this approach is used to find the transliteration similarity between the unaligned words. Sometimes, because the words in a Hindi sentence are not spelled correctly, none of the Hindi words appearing in the sentence match the words returned when DL queries the dictionary. We use a dynamic programming algorithm, &quot;edit-distance&quot;, to calculate the similarity between these words (WWW3). According to WWW3, &quot;The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: change a letter, insert a letter or delete a letter.&quot; The lower the distance, the greater the similarity. From our experiments on 100 proper noun pairs, we found that if the similarity is greater than 75%, the words can be reliably aligned with each other. We consider the pair with the highest similarity. E.g.: Aswani → asvaanii. Here we remove vowels in both strings, except those that appear at the start of words. After the removal of vowels from the English and Hindi texts, the resulting text would be: Aswn → asvn. The Hindi text is then converted into English text using the transliteration table: Aswn → Aswn. The two texts are then compared using the &quot;edit-distance&quot; algorithm.</Paragraph>
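      <Paragraph> The comparison steps above (vowel removal, table lookup, edit distance) can be sketched in Python as follows. Several details are assumptions made for illustration only: the Hindi word is given in romanized form, the correspondence table contains a single hypothetical entry, and the similarity score is taken to be one minus the edit distance divided by the longer string's length, since the text does not state how the percentage is computed.
<![CDATA[
```python
VOWELS = set("aeiou")

# Hypothetical one-entry excerpt of a letter-correspondence table
# (romanized Hindi letter -> English letter).
TABLE = {"v": "w"}


def strip_vowels(word: str) -> str:
    """Lower-case the word and drop every vowel except one at the start."""
    w = word.lower()
    return w[:1] + "".join(c for c in w[1:] if c not in VOWELS)


def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-letter changes, insertions or deletions."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # delete a letter
                           cur[j - 1] + 1,           # insert a letter
                           prev[j - 1] + (a != b)))  # change a letter
        prev = cur
    return prev[-1]


def similarity(english: str, hindi_roman: str) -> float:
    """Assumed score: 1 - distance / length of the longer reduced string."""
    e = strip_vowels(english)
    h = "".join(TABLE.get(c, c) for c in strip_vowels(hindi_roman))
    longest = max(len(e), len(h)) or 1
    return 1 - edit_distance(e, h) / longest
```
]]>
      With these assumptions, the pair from the example reduces to &quot;aswn&quot; on both sides, so the distance is zero and the score is above the 75% threshold reported in the text.</Paragraph>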
      <Paragraph position="4"> Expected English word(s): For HWGs which are categorised as numbers, job-titles or postpositions, it is possible to specify the expected English word or words that can be found in the parallel English text. The algorithm retrieves the expected English word(s) from the HWGs and tries to locate them in the English sentence. This approach can be useful to locate one or more English words that align with one or more Hindi words. For example, the number &quot;byaails&quot;, whose equivalent translation in English is &quot;forty two&quot;, has two words in English, and the postposition &quot;ke saamne&quot;, whose equivalent translation in English is &quot;in front of&quot;, has three words in English. These are examples of many-to-many word alignment.</Paragraph>
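      <Paragraph> Locating the expected English word(s) amounts to searching for a contiguous word sequence in the English sentence. The Python sketch below is one simple way to do this; the function name is hypothetical, and case-insensitive exact matching is an assumption rather than something the text specifies.
<![CDATA[
```python
from typing import List


def find_expected(expected: str, english_tokens: List[str]) -> List[int]:
    """Return the token positions of the first occurrence of the expected words."""
    target = [w.lower() for w in expected.split()]
    tokens = [w.lower() for w in english_tokens]
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return list(range(i, i + n))
    return []  # expected words not present in this sentence
```
]]>
      The returned positions identify all English tokens that should be aligned with the Hindi group, which is how a two-word postposition can map to a three-word English preposition.</Paragraph>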
      <Paragraph position="5">  At the end of the first three stages of the word alignment process, many words remain unaligned.</Paragraph>
      <Paragraph position="6"> Here we introduce a new approach, called the &quot;Nearest Aligned Neighbours approach&quot;. In certain cases, words in English-Hindi phrases follow a similar order. The Nearest Aligned Neighbours approach works on this principle and aligns one or more Hindi words with one of the English words. The local word grouping algorithm, explained in Section 3.1, groups such phrases and tags them as &quot;group&quot;. Considering one HWG at a time, we find the nearest Hindi word that is already aligned with one or more English word(s). We assume that the words in English-Hindi phrases follow a similar order and align the remaining words in that group accordingly. An example of alignment using the Nearest Aligned Neighbours approach is given in Figure 3.2. Word H4 is already aligned with E5, and H3, H5, H6 and H7 are yet to be aligned. The local word grouping algorithm has tagged the sequence H4, H5, H6 and H7 as a single group.</Paragraph>
      <Paragraph position="7"> At the same time, H6 and H7 are also grouped as a single group. The algorithm searches for the aligned Hindi word, which, in this case, is H4 and aligns H5 with E6 and the group of H6 and H7 with E7.</Paragraph>
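      <Paragraph> The example above can be reproduced with a short Python sketch. It is a simplified reading of the approach, not the authors' algorithm: the function name and data layout are hypothetical, each unit of a group is a tuple of Hindi word positions, and the sketch does not check whether the proposed English position exists or is already taken.
<![CDATA[
```python
from typing import Dict, List, Tuple

Unit = Tuple[int, ...]  # positions of the Hindi words forming one unit


def align_by_neighbour(units: List[Unit], aligned: Dict[int, int]) -> Dict[Unit, int]:
    """Propose an English position for every unaligned unit of a group.

    Each unaligned unit is offset from its nearest aligned unit by the same
    number of steps on the English side, assuming both sides share word order.
    """
    # Map the index of each already aligned unit to its English position.
    anchors = {p: aligned[h] for p, unit in enumerate(units)
               for h in unit if h in aligned}
    if not anchors:
        return {}
    proposals = {}
    for p, unit in enumerate(units):
        if p in anchors:
            continue
        nearest = min(anchors, key=lambda a: abs(a - p))
        proposals[unit] = anchors[nearest] + (p - nearest)
    return proposals
```
]]>
      Calling the function with the units (4,), (5,) and (6, 7) and the existing alignment of H4 to E5 proposes E6 for H5 and E7 for the group of H6 and H7, matching the outcome described for Figure 3.2.</Paragraph>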
    </Section>
  </Section>
</Paper>