<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0819">
  <Title>Aligning words in English-Hindi parallel corpora</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Training Data
</SectionTitle>
    <Paragraph position="0"> The training data set was composed of approximately 3441 English-Hindi parallel sentence pairs drawn from the EMILLE (Enabling Minority Language Engineering) corpus (Baker et al., 2004). The data was pre-tokenized. For the English data, a token was a sequence of characters matching any of &quot;Dr.&quot;, &quot;Mr.&quot;, &quot;Hon.&quot;, &quot;Mrs.&quot;, &quot;Ms.&quot;, &quot;etc.&quot;, &quot;i.e.&quot;, &quot;e.g.&quot;, &quot;[a-zA-Z0-9]+&quot;, a word ending with an apostrophe, or any special character except the currency symbols £ and $.</Paragraph>
    <Paragraph position="1"> Similarly, for the Hindi data, a token was a sequence of characters delimited by spaces on both ends, or any special character except the currency symbols £ and $.</Paragraph>
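    <Paragraph position="2"> The English token definition above can be approximated with a single regular expression. The following is a minimal sketch, not the tokenizer used to prepare the data: the function name tokenize_english is hypothetical, "words ending with apostrophe" is modelled as an alphanumeric run with an optional trailing apostrophe, and the currency symbols are simply skipped, which is an assumption about how they were handled.

```python
import re

# Abbreviations listed in the token definition; they must be tried first
# so that "Dr." is not split into "Dr" and ".".
ABBREVIATIONS = ["Dr.", "Mr.", "Hon.", "Mrs.", "Ms.", "etc.", "i.e.", "e.g."]

TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)
    + r"|[a-zA-Z0-9]+'?"        # alphanumeric run, optional trailing apostrophe
    + r"|[^\sa-zA-Z0-9£$]"      # any other single character except £ and $
)


def tokenize_english(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)


print(tokenize_english("Dr. Smith paid $5, i.e. little."))
```

    Listing the abbreviations before the generic alphanumeric pattern matters because regular expression alternation takes the first alternative that matches at a given position.</Paragraph>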
  </Section>
  <Section position="5" start_page="0" end_page="117" type="metho">
    <SectionTitle>
3 Word Alignment
</SectionTitle>
    <Paragraph position="0"> Given a pair of parallel sentences, the task of word alignment can be described as finding one-to-one, one-to-many, and many-to-many correspondences between the words of source and target sentences. It becomes more complicated when aligning phrases of one language with the corresponding words or phrases in the target language. Some words may have no translation in the target language; such words are aligned to null.</Paragraph>
    <Paragraph position="1"> The algorithm presented in this paper is a blend of various methods. We assign each word of a Hindi sentence to one of four categories and use a different technique to deal with each of them. The categories are: 1) NEs and cognates; 2) Hindi words for which it is possible to predict the corresponding English words; 3) Hindi words that match regular expression patterns specified in a rule file (explained in section 3.3); and 4) words that do not fit in any of the above categories. In the following sections we explain the methods used to deal with words from each of these categories.</Paragraph>
    <Section position="1" start_page="115" end_page="115" type="sub_section">
      <SectionTitle>
3.1 Named Entities and Cognates
</SectionTitle>
      <Paragraph position="0"> According to WWW1, the Named Entity Task is the process of annotating expressions in the text that are &quot;unique identifiers&quot; of entities (e.g. Organization, Person, Location etc.). &quot;Mr. Niraj Aswani&quot;, &quot;United Kingdom&quot;, and &quot;Microsoft&quot; are examples of named entities (NEs). In most text processing systems, this task is achieved by using local pattern-matching techniques, e.g. matching a word with an initial capital letter, or a title followed by two adjacent words that are capitalized or entirely in upper case. We use a Hindi gazetteer list that contains a large set of NEs. This gazetteer list is distributed as part of the Hindi Gazetteer processing resource in GATE (Maynard et al., 2003). It contains NEs such as person names, locations and organizations, as well as other entities such as time units (months, dates) and number expressions.</Paragraph>
      <Paragraph position="1"> Cognates can be defined as two words that share a common etymology and are therefore similar or identical in form. In most cases they are pronounced in the same way or with only a minor change. For example, &quot;Bungalow&quot; in English is derived from the Hindi word &quot;bNglaa&quot;, which means a house in the Bengali style (WWW2). We use our transliteration similarity (TS) method to locate such words. Section 3.2 describes the TS approach.</Paragraph>
    </Section>
    <Section position="2" start_page="115" end_page="116" type="sub_section">
      <SectionTitle>
3.2 Transliteration Similarity
</SectionTitle>
      <Paragraph position="0"> For the English and Hindi alphabets, it is possible to construct a table of correspondences between the letters of the two alphabets. This table is generated based on the various sounds that each letter can produce. For example, the letter &quot;c&quot; can be mapped to two letters in Hindi, &quot;k&quot; and &quot;s&quot;. The mapping is not restricted to one-to-one correspondences but also includes many-to-many correspondences: a sequence of two or more characters can map to a single character or to a sequence of two or more characters.</Paragraph>
      <Paragraph position="1"> For example, &quot;tio&quot; and &quot;sh&quot; in English correspond to the character &quot;sh&quot; in Hindi.</Paragraph>
      <Paragraph position="2"> Prior to executing our word alignment algorithm, we use the transliteration similarity (TS) approach to build a table of NEs and cognates. We consider one pair of parallel sentences at a time and, for each word in the Hindi sentence, generate candidate English words using our TS table. We found that it is more accurate to eliminate vowels from the words before comparing them, except for vowels that appear at the start of a word. We use a dynamic programming algorithm called &quot;edit-distance&quot; to measure the similarity between these words (WWW3). We calculate the similarity measure for each word in the Hindi sentence by comparing it with every word of the English sentence. This yields an m x n matrix, where m and n are the number of words in the Hindi and English sentences respectively, containing a similarity measure for each Hindi word paired with each word of the parallel English sentence. From our experiments comparing more than 100 NE and cognate pairs, we found that word pairs should be considered valid matches only if the similarity is greater than 75%. Therefore, among the pairs whose similarity is greater than 75%, we keep only the pair with the highest similarity. The following example shows how TS is used to compare a pair of English-Hindi words. Consider the pair &quot;aswani → a / svaanii&quot; and the TS table entries A → a, S → s, SS → s, V → v, W → v and N → n. We remove vowels from both words, giving &quot;aswn → asvn&quot;, and then convert the Hindi word into possible English words. This gives four different combinations: &quot;asvn&quot;, &quot;assvn&quot;, &quot;aswn&quot; and &quot;asswn&quot;. These words are then compared with the actual English word &quot;aswn&quot;. Since we are able to locate at least one word with similarity greater than 75%, we consider &quot;aswani → a / svaanii&quot; an NE. Once the list of NEs and cognates is ready, we move to the next step, local word grouping. All words in the Hindi sentences that appear either in the gazetteer list or in the list derived using the TS approach are aligned using the TS approach.</Paragraph>
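      <Paragraph position="3"> The worked example above can be traced with a short sketch. This is an illustration under stated assumptions rather than the original implementation: the table TS_TABLE is a hypothetical romanized fragment containing only the entries from the example, the function names are invented for illustration, and normalizing the edit distance by the length of the longer word is an assumption, since the text does not say how the distance is turned into a percentage.

```python
from itertools import product

# Hypothetical fragment of the TS table, keyed by romanized Hindi letter.
TS_TABLE = {"a": ["a"], "s": ["s", "ss"], "v": ["v", "w"], "n": ["n"]}
VOWELS = set("aeiou")


def strip_vowels(word):
    """Drop vowels everywhere except at the start of the word."""
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def candidates(hindi_word):
    """Expand a romanized Hindi word into possible English spellings."""
    options = [TS_TABLE.get(c, [c]) for c in hindi_word]
    return ["".join(parts) for parts in product(*options)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def ts_similarity(english, hindi):
    """Best similarity between the English word and any Hindi candidate."""
    e = strip_vowels(english.lower())
    h = strip_vowels(hindi.lower())
    return max(similarity(e, c) for c in candidates(h))


print(sorted(candidates("asvn")))
print(ts_similarity("aswani", "asvn") > 0.75)
```

      The expansion reproduces the four combinations listed in the example, and because one of them equals "aswn" the pair clears the 75% threshold.</Paragraph>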
    </Section>
    <Section position="3" start_page="116" end_page="116" type="sub_section">
      <SectionTitle>
3.3 Local Word Grouping
</SectionTitle>
      <Paragraph position="0"> Hindi is a partially free word order language (i.e. the order of the words in a Hindi sentence is not fixed, but the order of words within a group/phrase is fixed).</Paragraph>
      <Paragraph position="1"> Unlike English, where verbs are used in different inflected forms to indicate different tenses, Hindi uses one or two extra words after the verb to indicate the tense. Therefore, if the English verb is not in its base form, it needs to be aligned with one or more words in the parallel Hindi sentence. Sometimes a phrase is aligned with another phrase. For example, &quot;customer benefits&quot; aligns with &quot;nullaahk ke phaayde&quot;. In this example the first word &quot;customer&quot; aligns with the first word &quot;nullaahk&quot; and the second word &quot;benefits&quot; aligns with the third word &quot;phaayde&quot;. Considering &quot;customer benefits&quot; and &quot;nullaahk ke phaayde&quot; as phrases to be aligned with each other, &quot;ke&quot; is the word that indicates the relation between the two words &quot;nullaahk&quot; and &quot;phaayde&quot;, so that the phrase literally means &quot;benefits of customer&quot; in English. The words in such a phrase need to be grouped together in order to align them correctly. In the case of certain prepositions, pronouns and auxiliaries, it is possible to predict the respective Hindi postpositions, pronouns and other words. We derived a set of more than 250 rules to group such patterns by consulting the provided training data and other grammar resources such as Bal Anand (2001). The rule file contains the following information for each rule:  1) Hindi Regular Expression for a word or phrase. This must match one or more words in the Hindi sentence.</Paragraph>
      <Paragraph position="2"> 2) Group name or a part-of-speech category.</Paragraph>
      <Paragraph position="3"> 3) Expected English word(s) that this Hindi word group may align to.</Paragraph>
      <Paragraph position="4"> 4) In case a group of one or more English words aligns with a group of one or more Hindi words, information about the key words in both groups. Key words must match each other in order to align English-Hindi groups.</Paragraph>
      <Paragraph position="5"> 5) A rule to convert a Hindi word into its base form.</Paragraph>
      <Paragraph position="6"> We list some of the derived rules below: 1) Group a sequence of [X + Postposition], where X can be any category in the above list except postposition or verb. For example: &quot;For X&quot; = &quot;X ke ilye&quot;, where &quot;For&quot; = &quot;ke ilye&quot;. 2) Root Verb + (rhaa, rhii or rhe) + (PH). This pattern marks the present continuous tense. We use &quot;PH&quot; as an abbreviation for the present/past tense conjugations of the verb &quot;honaa&quot;: nullN, hnull, hai, ho, etc. 3) Group two words that are identical to each other. For example: &quot;alg alg&quot;, which means &quot;different&quot; in English. Such bi-grams are common in Hindi and are used to stress the importance of a word/activity in a sentence.</Paragraph>
      <Paragraph position="7"> Once the words are grouped in a Hindi sentence, we identify those word groups which do not fit in any of the transliteration similarity (TS) and expected English words (EEW) categories. Such words are then aligned using the dictionary lookup (DL) approach.</Paragraph>
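      <Paragraph position="8"> As an illustration of the simplest of the rules listed above, rule 3 (grouping two identical adjacent words), the following sketch groups reduplicated bi-grams in a tokenized sentence. The function name group_reduplicated and the second example word are hypothetical, and the actual rule file expresses such patterns as regular expressions rather than as code.

```python
def group_reduplicated(words):
    """Group two identical adjacent words; every other word stays alone."""
    groups = []
    i = 0
    while len(words) > i:
        if len(words) > i + 1 and words[i] == words[i + 1]:
            groups.append([words[i], words[i + 1]])
            i += 2  # skip both members of the bi-gram
        else:
            groups.append([words[i]])
            i += 1
    return groups


print(group_reduplicated(["alg", "alg", "log"]))
```

      Each resulting group can then be treated as a single unit by the later alignment steps.</Paragraph>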
    </Section>
    <Section position="4" start_page="116" end_page="117" type="sub_section">
      <SectionTitle>
3.3 Dictionary lookup
</SectionTitle>
      <Paragraph position="0"> Since most dictionaries list verbs in their base forms, we use a morphological analyzer to convert verbs to their base forms. The English-Hindi dictionary is obtained from (WWW4). The dictionary returns, on average, two to four Hindi words for a particular English word. The lemma of a Hindi verb is formed as: infinitive = root verb + &quot;naa&quot;. Since in most cases our dictionary contains Hindi verbs in their infinitive forms, we remove the suffix &quot;naa&quot; from the end of the dictionary word before comparing it with the unaligned words. Due to minor spelling mistakes, the word returned from the dictionary may not exactly match any of the words in the Hindi sentence. In this case, we use the edit-distance algorithm to measure the similarity between the two words. If the similarity is greater than 75%, we consider them similar. We use the expected English words (EEW) approach for the words that remain unaligned after the dictionary lookup (DL) approach.</Paragraph>
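      <Paragraph position="1"> The lookup step described above can be sketched as follows. This is a minimal sketch under assumptions, not the original code: the function names and the romanized example words are hypothetical, the similarity is again assumed to be one minus the edit distance divided by the length of the longer word, and an exact match is treated as similarity 1.0.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def strip_infinitive(word):
    """Remove the infinitive suffix "naa" so the root can be compared."""
    return word[:-3] if word.endswith("naa") and len(word) > 3 else word


def match_dictionary_entry(entries, unaligned, threshold=0.75):
    """Return the unaligned Hindi word most similar to any dictionary
    entry, provided the similarity is greater than the threshold."""
    best_word, best_score = None, threshold
    for entry in entries:
        root = strip_infinitive(entry)
        for word in unaligned:
            score = similarity(root, word)
            if score > best_score:
                best_word, best_score = word, score
    return best_word


print(match_dictionary_entry(["khaanaa"], ["rhaa", "khaa"]))
print(match_dictionary_entry(["sarkaar"], ["sarkar", "ne"]))
```

      Starting the best score at the threshold keeps the comparison strict, so a pair at exactly 75% is rejected, in line with the "greater than 75%" condition in the text.</Paragraph>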
    </Section>
    <Section position="5" start_page="117" end_page="117" type="sub_section">
      <SectionTitle>
3.4 Expected English words
</SectionTitle>
      <Paragraph position="0"> Candidates for the EEW approach are the Hindi word groups (HWG) that are created by our Hindi local word grouping algorithm (explained in section 3.3). The HWGs such as postpositions, number expressions, month-units, day-units etc.</Paragraph>
      <Paragraph position="1"> are aligned using the EEW approach. For example, for the Hindi word &quot;baavn&quot; in a Hindi sentence, which means &quot;fifty two&quot; in English, the algorithm tries to locate &quot;fifty two&quot; in the parallel English sentence and aligns them if found. For the remaining unaligned Hindi words we use the nearest aligned neighbors (NAN) approach.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="117" end_page="117" type="metho">
    <SectionTitle>
3.5 Nearest Aligned Neighbors
</SectionTitle>
    <Paragraph position="0"> In certain cases, words in English-Hindi phrases follow a similar order. The nearest aligned neighbors (NAN) approach works on this principle and aligns one or more words with one of the English words. Considering one HWG at a time, we find the nearest Hindi word that is already aligned with one or more English words. Aligning the phrase &quot;customer benefits&quot; with &quot;nullaahk ke phaayde&quot; (the example explained in section 3.3) is an example of the NAN approach. Similarly, consider the phrase &quot;tougher controls&quot;: for its equivalent Hindi phrase &quot;aidhk inyNnullnn&quot;, the dictionary returns the correct pair &quot;controls → inyNnullnn&quot; but fails to locate &quot;tougher → aidhk&quot;. To align the word &quot;tougher&quot;, NAN searches for the nearest aligned word, which in this case is &quot;controls&quot;. Since &quot;controls&quot; is already aligned with &quot;inyNnullnn&quot;, the NAN method aligns &quot;tougher&quot; with the nearest unaligned word, &quot;aidhk&quot;.</Paragraph>
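    <Paragraph position="1"> One possible reading of the NAN step is sketched below, working on word positions rather than on the words themselves. Several details are assumptions that the text does not specify: the function name, the representation of the alignment as a dictionary from Hindi index to English index, the tie-breaking towards the first aligned neighbour, and the fallback of attaching a Hindi word to its neighbour's English word when no unaligned English word remains.

```python
def nearest_aligned_neighbors(n_hindi, n_english, alignment):
    """Extend alignment (Hindi index mapped to English index) so that
    every Hindi position is aligned, using its nearest aligned neighbour."""
    result = dict(alignment)
    if not alignment:
        return result  # nothing to anchor on
    for h in range(n_hindi):
        if h in alignment:
            continue
        # Nearest Hindi word that was aligned by an earlier step.
        anchor = min(alignment, key=lambda i: abs(i - h))
        target = alignment[anchor]
        used = set(result.values())
        free = [e for e in range(n_english) if e not in used]
        if free:
            # Prefer the unaligned English word closest to the anchor's word.
            result[h] = min(free, key=lambda e: abs(e - target))
        else:
            # Assumed fallback: share the anchor's English word.
            result[h] = target
    return result


# "tougher controls": only the second Hindi word is aligned by the dictionary.
print(nearest_aligned_neighbors(2, 2, {1: 1}))
# "customer benefits": the middle Hindi word "ke" is left unaligned.
print(nearest_aligned_neighbors(3, 2, {0: 0, 2: 1}))
```

    In the first call the unaligned Hindi word at position 0 is paired with the only free English position, mirroring the "tougher" example; in the second call no English word is free, so the middle word is attached to its neighbour's English word.</Paragraph>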
  </Section>
</Paper>