<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1631">
  <Title>Capturing Out-of-Vocabulary Words in Arabic Text</Title>
  <Section position="5" start_page="0" end_page="258" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> FN /D0CXD2C1CZD7/ (Linux). This process often results in different Arabic spellings for the same word.</Paragraph>
    <Paragraph position="1"> Current Arabic Information Retrieval (AIR) systems do not handle the problem of retrieving the different versions of the same foreign  word (Abdelali et al., 2004), and instead typically retrieve only the documents containing the same spelling of the word as used in the query.</Paragraph>
    <Paragraph position="2"> One solution to this problem has been used in cross-lingual information retrieval, where OOV words in the query are transliterated into their possible equivalents. Transliterating terms in English queries into multiple Arabic equivalents using an English-Arabic dictionary has been shown to have a positive impact on retrieval results (Abduljaleel and Larkey, 2003). However, we are aware of no work on handling OOV terms in Arabic queries.</Paragraph>
    <Paragraph position="3"> For this, proper identification of foreign words is essential. Otherwise, query expansion for such words is not likely to be effective: many Arabic words could be wrongly expanded, resulting in long queries with many false transliterations of Arabic words. Furthermore, proper identification of foreign words would be helpful because such words could then be treated differently using techniques such as approximate string matching (Zobel and Dart, 1995).</Paragraph>
    <Paragraph position="4"> In this paper, we examine possible techniques to identify foreign words in Arabic text. In the following sections we categorise and define foreign words in Arabic, and follow in section 2 with a discussion of possible different approaches that can identify them in Arabic text. In section 3 we present an initial evaluation of these approaches, and describe improvements in section 4 that we then explore in a second experiment in section 5.</Paragraph>
    <Paragraph position="5"> We discuss results in section 6 and finally conclude our work in section 7.</Paragraph>
    <Section position="1" start_page="258" end_page="258" type="sub_section">
      <SectionTitle>
1.1 Foreign words in Arabic
</SectionTitle>
      <Paragraph position="0"> Arabic has many foreign words, with varying levels of assimilation into the language. Words borrowed from other languages usually have different style in writing and construction, and Arabic linguists have drawn up rules to identify them. For example, any root Arabic word that has four or more characters should have one or more of the &amp;quot;Dalaga&amp;quot; letters ( A9ES, C8, FS, A9G8, FK, C0</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="258" end_page="259" type="metho">
    <SectionTitle>
BA
). Those that
</SectionTitle>
    <Paragraph position="0"> have no such letters are considered foreign (Al-Shanti, 1996). However, while such rules could be useful for linguistic purposes, they have limited application in Information Retrieval (IR); based on rules, many foreign words that have long been absorbed into the language and are spelled consistently would be considered to be OOV. From the IR perspective, foreign words can be split into two</Paragraph>
    <Section position="1" start_page="258" end_page="259" type="sub_section">
      <SectionTitle>
Milosevic
</SectionTitle>
      <Paragraph position="0"> general categories: translated and transliterated.</Paragraph>
      <Paragraph position="1"> Translated: These are foreign words that are modified or remodelled to conform with Arabic word paradigms; they are well assimilated into Arabic, and are sometimes referred to as Arabicised words (Aljlayl and Frieder, 2002). This process includes changes in the structure of the borrowed word, including segmental and vowel changes, and the addition, deletion, and modification of stress patterns (Al-Qinal, 2002). This category of foreign words usually has a single spelling version that is used consistently. Examples include words such as A9G8BTAGC2DMAT</Paragraph>
      <Paragraph position="3"> Transliterated: Words in this category are transliterated into Arabic by replacing phonemes with their nearest Arabic equivalents. Although Arabic has a broad sound system that contains most phonemes used in other languages, not all phonemes have Arabic equivalents. In practice, such phonemes may be represented in different ways by different persons, resulting in several spelling versions for the same foreign word. For example, we observed 28 transliterated versions for the name of the former Serbian leader (Milosevic) in the TREC 2002 Arabic collection; these are shown in Table 1.</Paragraph>
      <Paragraph position="4"> Transliteration has become more common than translation due to the need for instant access to new foreign terms. It can take considerable time for a new foreign term to be included in reference  dictionaries. However, users often need to immediately use a particular term, and cannot wait until a standard form of the word is created; news agencies form an important category of such users. This transliteration process often results in multiple spellings in common usage.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="259" end_page="259" type="metho">
    <SectionTitle>
1.2 Related work
</SectionTitle>
    <Paragraph position="0"> In the context of information retrieval, most work on foreign words in Arabic has been based on transliteration, and carried out under machine translation and cross-lingual information retrieval (CLIR) tasks, where English queries are used to search for Arabic documents, or vice versa. This often involves the use of a bilingual dictionary to translate queries and transliterate OOV words into their equivalents in Arabic.</Paragraph>
    <Paragraph position="1"> Expanding a foreign word to its possible variants in a query has been shown to increase the precision of search results (Abduljaleel and Larkey, 2003). However, OOV words in the query are easily recognised based on English rules and an English-Arabic dictionary: capitalised words are marked as nouns, and the remaining words are translated using the dictionary. Words not found in the dictionary are marked as OOV and are transliterated into probable Arabic forms. In contrast, we aim to identify foreign words as a within Arabic text which is made difficult by the absence of such easily perceptible difference.</Paragraph>
    <Paragraph position="2"> Stalls and Knight (1998) describe research to determine the original foreign word from its Arabic version; this is known as back transliteration. However, rather than using automatic methods to identify foreign words, they used a list of 2 800 names to test the accuracy of the back transliteration algorithm. Of these, only 900 names were successfully transliterated to their source names. While this approach can be used to identify transliterated foreign words, its effectiveness is not known on normal Arabic words as only names were used to test the algorithm.</Paragraph>
    <Paragraph position="3"> Jeong et al. (1999) used statistical differences in syllable unigram and bigram patterns between pure Korean words and foreign words to identify foreign words in Korean documents.</Paragraph>
    <Paragraph position="4"> This approach was later enhanced by Kang and Choi (2002) to incorporate word segmentation.</Paragraph>
    <Paragraph position="5"> A related area is language identification, where statistics derived from a language model are used to automatically identify languages (Dunning, 1994). Using N-gram counting produces good accuracy for long strings with 50 or more characters, and moderately well with 10-character-long strings. It is unclear how well this approach would work on individual words with five characters on average.</Paragraph>
  </Section>
  <Section position="8" start_page="259" end_page="261" type="metho">
    <SectionTitle>
2 Identifying foreign words
</SectionTitle>
    <Paragraph position="0"> We categorise three general approaches for recognising foreign words in Arabic text: Arabic lexicon OOV words can be easily captured by checking whether they exist in an Arabic lexicon. However, the lexicon is unlikely to include all Arabic words, while at the same time it could contain some foreign words. Moreover, this approach will identify misspelled Arabic words as foreign.</Paragraph>
    <Paragraph position="1"> Arabic patterns system Arabic uses a pattern system to derive words from their roots. Roots are three, four or sometimes five letters long. The reference pattern ABFLA6ABEQA6ABA9EV (/CUCPG5CPD0CP/ = to do) is often used to represent three-letter root words. For example, the word ABAHC1A6 ABCYA6ABC3  to ABFLA7.</Paragraph>
    <Paragraph position="2"> Many stems can be generated from this root using standard patterns. For instance, ALFLER</Paragraph>
    <Paragraph position="4"> ferent patterns that respectively represent the active participle, and present tense verb from the pattern ABFLA6ABEQA6ABA9EV. By placing the appropriate core letters and adding additional letters in each pattern, we can generate words such as ALAHC1CZ</Paragraph>
    <Paragraph position="6"> tively. New words can further accept prefixes and suffixes.</Paragraph>
    <Paragraph position="7"> We can recognise whether a word is an Arabic or foreign word by reversing the process and testing the different patterns. If, after all possible affixes have been removed, the remaining stem matches an Arabic pattern, the word is likely to be an Arabic word. For example, to check whether the word ALAHC1CZ  sition -- and conclude that it is therefore an Arabic word. Note that we must perform this determination without relying on diacritics. This approach is not perfect, as general Arabic text does not include explicit diacritics; if parts of a foreign word match a pattern, it will be marked as being Arabic. Similarly, misspelled words may be classified as foreign words if no matching pattern is found.</Paragraph>
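    <Paragraph> The reverse-pattern test can be sketched as follows; the affix lists and pattern templates below are tiny illustrative stand-ins for the Khoja stemmer's full tables ("C" marks a root-consonant slot, other characters are literal letters). </Paragraph>

```python
# Sketch of the pattern approach: strip common affixes, then accept the
# word as Arabic if the remaining stem fits a pattern template.
# The affix and pattern lists are illustrative placeholders.

PREFIXES = ["\u0627\u0644", "\u0648"]      # e.g. definite article "al-", conjunction "wa-"
SUFFIXES = ["\u0629", "\u0648\u0646"]      # e.g. Taa Marbuta, plural "-uun"
PATTERNS = [
    "CCC",            # bare triliteral root
    "C\u0627CC",      # active-participle-like template (C + Alef + C + C)
    "\u064ACCC",      # present-tense-like template (Yaa + root letters)
]

def strip_affixes(word: str) -> str:
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[: -len(s)]
    return word

def matches_pattern(stem: str, pattern: str) -> bool:
    return len(stem) == len(pattern) and all(
        p == "C" or p == ch for ch, p in zip(stem, pattern)
    )

def looks_arabic(word: str) -> bool:
    stem = strip_affixes(word)
    if len(stem) <= 3:                     # short stems are treated as Arabic
        return True
    return any(matches_pattern(stem, p) for p in PATTERNS)
```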
    <Paragraph position="8"> N-gram approach: Transliterated foreign words exhibit construction patterns that are often different from Arabic patterns. By counting the N-grams of a sample of foreign words, a profile can be constructed to identify similar words. This approach has been used in language identification, although it is reported to have only moderate effectiveness in identifying short strings (Cavnar and Trenkle, 1994; Dunning, 1994).</Paragraph>
    <Section position="1" start_page="260" end_page="261" type="sub_section">
      <SectionTitle>
2.1 Resources
</SectionTitle>
      <Paragraph position="0"> For the lexicon approach, we used three lexicons: the Khoja root lexicon (Khoja and Garside, 1999), the Buckwalter Lexicon (Buckwalter, 2002), and the Microsoft office 2003 lexicon (Microsoft Corporation, 2002).</Paragraph>
      <Paragraph position="1"> The Khoja stemmer has an associated compressed language dictionary that contains well-known roots. The stemmer strips prefixes and suffixes and matches the remaining stem with a list of Arabic patterns. If a match is found, the root is extracted and checked against the dictionary of root words. If no entry is found, the word is considered to be a non-Arabic word. We call this the Khoja Lexicon Approach (KLA).</Paragraph>
      <Paragraph position="2"> The Buckwalter morphological analyser is a lexicon that uses three tables and an algorithm to check possible affixes. The algorithm checks a word and analyses its possible prefixes and suffixes to determine possible segmentation for an Arabic word. If the algorithm fails to find any possible segmentation, the word is considered not found in the lexicon. We name this approach the Buckwalter Lexicon Approach (BLA).</Paragraph>
      <Paragraph position="3"> The Microsoft office lexicon is the one used in the Microsoft Office 2003 spell-checker. We test whether an Arabic word is found in this lexicon, and classify those that are not in the lexicon to be foreign words. We call this approach the Office  stemmer to implement the KPA approach To use Arabic patterns, we modified the Khoja stemmer to check whether there is a match between a word and a list of patterns after stemming without further checking against the root dictionary. If there is no match, the word is considered a foreign word. This approach is similar to the approach used by Taghva et al. (2005). We adopted the patterns of the Khoja stemmer and added 37 patterns compiled from Arabic grammar books, these are shown in Table 2. We call these approaches the Khoja Pattern Approach (KPA), and Modified Khoja Pattern Approach (MKP) respectively. A word is also considered to be an Arabic word if the remaining stem has three or fewer letters. null We evaluate the effectiveness of the n-gram method in two ways. First, we extend the n-gram text categorisation method presented by Cavnar and Trenkle (1994). The method uses language profiles where, for each language, all n-grams that occur in a training corpus are sorted in order of decreasing frequency of occurrence, for n ranging from 1 to 5. To classify a text t, we build its n-gram frequency profile, and compute the distance between each n-gram in the text and in each language profile lj. The total distance is computed by summing up all differences between the position of the n-gram in the text profile and the position of the same n-gram in the language profile:</Paragraph>
      <Paragraph position="5"> where Dj is the total distance between a text t with Ni n-grams, and a language profile lj with Nj ngrams; and rank is the position of the n-gram in the frequency-sorted list of all n-grams for either the text or language profile.</Paragraph>
      <Paragraph position="6"> In our work, we build two language profiles, one  for native Arabic words and another for foreign words. We compare the n-grams in each word in our list against these two profiles. If the total distance between the word and the foreign words profile is smaller than the total distance between the word and the Arabic words profile, then it is classified as a foreign word. As the two language profiles are not in same size, we compute the relative position of each n-gram by dividing its position in the list by the number of the n-grams in the language profile. We call this approach the n-gram approach (NGR).</Paragraph>
      <Paragraph position="7"> We also tried a simpler approach based on the construction of two trigram models: one from Arabic words, and another from foreign words.</Paragraph>
      <Paragraph position="8"> The probability that a string is a foreign word is determined by comparing the frequency of its tri-grams with each language model. A word is considered foreign if the sum of the relative frequency of its trigrams in the foreign words profile is higher than the sum of the relative frequency of its tri-grams in the Arabic words profile. We call this approach trigram (TRG).</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="261" end_page="261" type="metho">
    <SectionTitle>
3 Training Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we describe how we formed a development data set using Arabic text from the Web, and how we evaluated and improved techniques for identification of foreign words.</Paragraph>
    <Section position="1" start_page="261" end_page="261" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> To form our development data set, we crawled the Arabic web sites of the Al-Jazeera news channel1, the Al-Anwar2 and El-Akhbar3 newspapers. A list of 285 482 Arabic words was extracted. After removing Arabic stop words such as pronouns and prepositions, the list had 246 281 Arabic words with 25 492 unique words.</Paragraph>
      <Paragraph position="1"> In the absence of diacritics, we decided to remove words with three or fewer characters, as these words could be interpreted as being either Arabic or foreign in different situations. For example, the word GY</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="261" end_page="262" type="metho">
    <SectionTitle>
AA
BZ
BA
</SectionTitle>
    <Paragraph position="0"> (/CQCX/) could be interpreted as the Arabic word meaning &amp;quot;in me&amp;quot;, or the English letter B. After this step, 24 218 unique words remained. null We examined these words and categorised each of them either as Arabic word (AW), or a translit- null erated foreign word (FW). We also had to classify some terms as misspelled Arabic word (MW). We used the Microsoft Office spell-checker as a first-pass filter to identify misspelled words, and then manually inspected each word to identify any that were actually correct; the spell-checker fails to recognise some Arabic words, especially those with some complex affixes. The list also had some local Arabic dialect spellings that we chose to classify as misspelled.</Paragraph>
    <Paragraph position="1"> The final list had three categories: 22 295 correct Arabic words, 1 218 foreign words and 705 misspelled words.</Paragraph>
    <Paragraph position="2"> To build language models for the trigram approaches (NRG and TRG), we used the TREC 2001 Arabic collection (Gey and Oard, 2001). We manually selected 3 046 foreign words out of the OOV words extracted from the collection using the Microsoft office spell-checker. These foreign words are transliterated foreign words. We built the Arabic language model using 100 000 words extracted from the TREC collection using the same spell-checker. However, we excluded any word that could be a proper noun, to avoid involving foreign words. We used an algorithm to exclude any word that does not accept the suffix haa (GHA7), as transliterated proper nouns can not accept Arabic affixes.</Paragraph>
    <Section position="1" start_page="261" end_page="262" type="sub_section">
      <SectionTitle>
3.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We measure the accuracy of each approach by examining the number of foreign words correctly identified, and the number of incorrect classifications. The precision of each approach is calculated as the ratio of correctly identified foreign words to the total number of words identified as foreign The latter could be correct or misspelled Arabic words identified as foreign plus the actual foreign words identified. The recall is calculated as the ratio of correctly identified foreign words to the number of words marked manually as foreign. Although there is generally a compromise between precision and recall, we consider precision to be more important, since incorrectly classifying Arabic words as foreign would be more likely to harm general retrieval performance. The left-hand side of Table 3 shows the results of our experiments.</Paragraph>
      <Paragraph position="1"> We have included the MW results to illustrate the effects of misspelled words on each approach The results show that the n-gram approach (NGR) has the highest precision, while the  lexicon-based OLA approach gives the highest recall. The pattern approaches (KPA) and (MKP) perform well compared to the combination of patterns and the root lexicon (KLA), although the latter produces higher recall. There is a slight improvement in precision when adding more patterns, but recall is sightly reduced. The KLA approach produces the poorest precision, but has better recall rate than the NGR approach.</Paragraph>
      <Paragraph position="2"> The results show that many Arabic native words are mistakenly caught in the foreign words net.</Paragraph>
      <Paragraph position="3"> Our intention is to handle foreign words differently from Arabic native words. Our approach is based on normalising the different forms of the same foreign word to one form at the index level rather than expanding the foreign word to its possible variants at the query level. Retrieval precision will be negatively affected by incorrect classification of native and foreign words. Consequently, we consider that keeping the proportion of false positives -- correct Arabic words identified as foreign (precision) -- low to be more important than correctly identifying a higher number of foreign words (recall).</Paragraph>
      <Paragraph position="4"> Some of the Arabic words categorised as foreign are in fact misspelled; we believe that these have limited effect on retrieval precision, and there is limited value in identifying such words in a query unless the retrieval system incorporates a correction process.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="262" end_page="262" type="metho">
    <SectionTitle>
4 Enhanced rules
</SectionTitle>
    <Paragraph position="0"> To reduce the false identification rate of foreign words, we analysed the lists of foreign words, correct Arabic words identified as foreign, and Arabic misspelled words identified as foreign. We noticed that some Arabic characters rarely exist in transliterated foreign words, and used these to separate Arabic words -- correctly or incorrectly spelled Letter count letter count letter count</Paragraph>
  </Section>
  <Section position="12" start_page="262" end_page="262" type="metho">
    <SectionTitle>
GW
AA
</SectionTitle>
    <Paragraph position="0"> of 3 046 foreign words - from true foreign words. Table 4 shows the count of each character in the sample of 3 046 foreign words; foreign words tend to have vowels inserted between consonants to maintain the CVCV paradigm. We also noticed that most of transliterated foreign words do not start with the definite article A5FNBS, or end with the Taa Marbuta AGGHA7. Foreign words also rarely end with two Arabic suffixes.</Paragraph>
    <Paragraph position="1"> We also noticed that lexicon based approaches fail to recognise some correct Arabic words for the following reasons: * Words with the letter BS (Alef) with or without the diacritics Hamza (ADBS, BS</Paragraph>
  </Section>
  <Section position="13" start_page="262" end_page="263" type="metho">
    <SectionTitle>
AD
</SectionTitle>
    <Paragraph position="0"> ), or the diacritic Madda (AEBS) are not recognised as correct in many cases. Many words are also categorised incorrectly if the Hamza is wrongly placed above or below the initial Alef or the Madda is absent. In modern Arabic text, the Alef often appears without the Hamza diacritic and  particular suffixes. For example, words that have the object suffix, such as the suffix BTGIA7 in BTGIF6</Paragraph>
    <Paragraph position="2"> it to you).</Paragraph>
    <Paragraph position="3"> * Some Arabic words are compound words, written attached to each other most of the time. For example, compound nouns such as  of two words that are individually identified as being correct, are flagged as incorrect when combined.</Paragraph>
    <Paragraph position="4"> * Some common typographical shortcuts result in words being written without white space between them. Where a character that always terminates a word (for example AGGG ) is found in the apparent middle of a word, it is clear that this problem has occurred.</Paragraph>
    <Paragraph position="5"> From these observations, we constructed the following rules. Whenever one of the following conditions is met, a word is not classified as for- null CG, FKBSGO, FKBSBS, and when split into two parts at the first character of any sequence, the first part is three characters or longer, and the second part is four characters or longer.</Paragraph>
    <Paragraph position="6"> The right-hand side of Table 3 shows the improvements achieved using these rules. It can be seen that they have a large positive impact. Overall, OLA performs the best, with precision at 69% and recall at 71%. Figure 1 shows the precision obtained before and after applying these rules. Improvement is consistent across all approaches, with an increase in precision between 10% and 25%.</Paragraph>
  </Section>
  <Section position="14" start_page="263" end_page="263" type="metho">
    <SectionTitle>
5 Verification Experiments
</SectionTitle>
    <Paragraph position="0"> To verify our results, we used another data set of similar size to the first to verify our approach.</Paragraph>
    <Paragraph position="1"> We collected a list of 23 466 unique words from the Dar-al-Hayat newspaper4. Words, and classified and marked words in the same way as for the first data set (described in Section 3.1). We determined this new set to comprise 22 800 Arabic words (AW), 536 Foreign words (FW), and 130 Misspelled words (MW). Table 5 shows the initial results and improvements using the enhanced rules obtained by each approach using this data set.</Paragraph>
    <Paragraph position="2"> The results on this unseen data are relatively consistent with the previous experiment, but precision in this sample is expectedly lower.</Paragraph>
  </Section>
class="xml-element"></Paper>