<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0705">
  <Title>Modifying a Natural Language Processing System for European Languages to Treat Arabic in Information Processing and Information Retrieval Applications</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The LIMA natural language processor
</SectionTitle>
    <Paragraph position="0"> Our NLP system (Besancon et al., 2003), called LIMA2, was built using a traditional architecture involving separate modules for  1. Morphological analysis: a. Tokenization (separating the input stream into a graph of words).</Paragraph>
    <Paragraph position="1"> b. Simple word lookup (search for words in a full form lexicon).</Paragraph>
    <Paragraph position="2"> c. Orthographical alternative lookup  (looking for differently accented forms, alternative hyphenisation, concatenated words, abbreviation recognition), which might alter the original non-cyclic word graph by adding alternative paths.</Paragraph>
    <Paragraph position="3"> d. Idiomatic expressions recognizer (detecting and considering them as single words in the word graph).</Paragraph>
    <Paragraph position="4">  e. Unknown word analysis.</Paragraph>
    <Paragraph position="5"> 2. Part-of-Speech and Syntactic analysis: a. After the morphological analysis,  which has augmented the original graph with as many nodes as there</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 LIMA stands for the LIC2M Multilingual Analyzer.
</SectionTitle>
    <Paragraph position="0"> are interpretations for the tokens, part-of-speech analysis using language models from a hand-tagged corpus reduces the number of possible readings of the input.</Paragraph>
    <Paragraph position="1">  b. Named entity recognizer.</Paragraph>
    <Paragraph position="2"> c. Recognition of nominal and verbal chains in the graph.</Paragraph>
    <Paragraph position="3"> d. Dependency relation extraction. 3. Information retrieval application: a. Subgraph indexing.</Paragraph>
    <Paragraph position="4"> b. Query reformulation (monolingual  reformulation for paraphrases and synonymy; multilingual for cross language information retrieval). c. Retrieval scoring comparing partial matches on subgraphs and entities. Our LIMA NLP system (Besancon et al., 2003) was first implemented for English, French, German and Spanish, with all data coded in UTF8. When we extended the system to Arabic, we found that a number of modifications had to be introduced. We detail these modifications in the next sections.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Changes specific to Semitic languages
</SectionTitle>
    <Paragraph position="0"> Two new problems posed by Arabic (and common to most Semitic languages) that forced us to alter our NLP system are the problem of incomplete vowelization of printed texts3 and the problem of agglutinative clitics. We discuss how these new problems influenced our lexical resources and language processing steps.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Lexical Resources
</SectionTitle>
      <Paragraph position="0"> The first task for introducing a new language is to create the lexical resources for this language. Since Arabic presents agglutination of articles, prepositions and conjunctions at the beginning of words as well as pronouns at the end of words, and these phenomena were not treated in our existing European languages4, we had to decide how this feature would be handled in the lexicon.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Since the headwords of our monolingual and cross-lingual
</SectionTitle>
    <Paragraph position="0"> reference dictionaries for Arabic possess voweled entries, we hope to attain greater precision by treating this problem. An alternative but noisy approach (Larkey et al. 2002) is to reduce to unvoweled text throughout the NLP application.</Paragraph>
    <Paragraph position="1"> Solutions to this problem have been proposed, ranging from generation and storage of all agglutinated word forms (Debili and Zouari, 1985) to the compilation of valid sequences of proclitics, words and enclitics into finite-state machines (Beesley, 1996). Our system had already addressed the problem of compounds for German in the following way: if an input word is not present in the dictionary, a compound-searching module returns all complete sequences of dictionary words (a list of possible compound joining &quot;Fugenmorpheme&quot; is passed to this module) as valid decompositions of the input word. Though theoretically this method could be used to treat Arabic clitics, we decided against using this existing process for two reasons:  1. Contrary to German, in which any noun  may theoretically be the first element of a compound, Arabic clitics belong to a small closed set of articles, conjunctions, prepositions and pronouns. Allowing any word to appear at the beginning or end of an agglutinated word would generate unnecessary noise.</Paragraph>
    <Paragraph position="2">  2. Storing all words with all possible clitics would multiply the size of the lexicon in proportion to the number of legal combinations. We decided that this would take up too much space, though others have adopted this approach as mentioned above.</Paragraph>
    <Paragraph position="3"> We decided to create three lexicons: two additional (small) lists of proclitic and enclitic combinations, and one large lexicon of full form5 voweled words (with no clitics). The creation of the large lexicon from a set of lemmas using classic conjugation rules did not require any modification of the existing dictionary building and compilation component. Since our NLP system already possessed a mechanism for mapping unaccented words to accented entries, we decided to use this existing mechanism for later matching of voweled and unvoweled versions of Arabic words in applications. Thus the only changes for lexical resources involve adding two small clitic lexicons. 4 Spanish, of course, possesses enclitic pronouns for some verb forms, but these were not adequately treated until the solution for Arabic was implemented in our system. 5 Our dictionary making process generates all full form versions of non-compound and unagglutinated words. These are then compiled into a finite-state automaton. Every node corresponding to a full word is flagged, and an index corresponding to the automaton path points to the lexical data for that word.</Paragraph>
    <Paragraph position="4">  Going back to the NLP processing steps listed in section 2, we now discuss new processing changes needed for treating Arabic. Tokenization (1a) and simple word lookup (1b) of the tokenized strings in the dictionary were unchanged as LIMA was coded for UTF8. If the word was not found, the existing orthographical alternative lookup (1c) was also used without change (except for the addition of the language-specific correspondence table between accented and unaccented characters) in order to find lexical entries for unvoweled or partially voweled words. Using this existing mechanism for treating the vowelization problem does not allow us to exploit partial vowelization, as we explain in a later section.</Paragraph>
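    <Paragraph> The lookup of unvoweled or partially voweled tokens against voweled entries can be illustrated with a minimal sketch. It is not the LIMA implementation: the function names, the two sample entries and the choice to strip every diacritic in the range U+064B to U+0652 are assumptions made for illustration. It also shows why partial vowelization is not exploited: a partially voweled token retrieves every entry that shares its consonant skeleton, including entries whose vowels conflict with the ones written in the token.

```python
# Hypothetical sketch: match unvoweled or partially voweled tokens to voweled entries.
# Code points U+064B..U+0652 are the Arabic tanwin, short vowels, shadda and sukun.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}

KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # voweled entry "kataba"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"          # voweled entry "kutub"


def strip_vowels(word):
    """Remove every diacritic, leaving the bare consonant skeleton."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_index(voweled_entries):
    """Map each unvoweled skeleton to the voweled entries that share it."""
    index = {}
    for entry in voweled_entries:
        index.setdefault(strip_vowels(entry), []).append(entry)
    return index


def lookup(token, index):
    """Return all voweled entries whose skeleton matches the token's skeleton."""
    return index.get(strip_vowels(token), [])


index = build_index([KATABA, KUTUB])
unvoweled = "\u0643\u062A\u0628"                 # no vowels written
partially_voweled = "\u0643\u064E\u062A\u0628"   # only the first vowel written
print(lookup(unvoweled, index) == [KATABA, KUTUB])
print(lookup(partially_voweled, index) == [KATABA, KUTUB])
```

Both lookups return the same two candidates, even though the vowel written in the second token is compatible with only one of them.</Paragraph>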
    <Paragraph position="5"> At this point in the processing, a word that contains clitics will not have been found in the dictionary since we had decided not to include word forms including clitics. We introduced, here, a new processing step for Arabic: a clitic stemmer. This stemmer uses the following linguistic resources: * The full form dictionary, containing for each word form its possible part-of-speech tags and linguistic features (gender, number, etc.). We currently have 5.4 million entries in this dictionary6.</Paragraph>
    <Paragraph position="6"> * The proclitic dictionary and the enclitic dictionary, having the same structure as the full form dictionary, with voweled and unvoweled versions of each valid combination of clitics. There are 77 and 65 entries respectively in these dictionaries.</Paragraph>
    <Paragraph position="7"> The clitic stemmer proceeds as follows on tokens unrecognized after step 1c:  * Several vowel form normalizations are performed (a an u un i in are removed, ' afii52400 are replaced by and final w' y' y or @ are replaced by w ~ ~ or afii57447).</Paragraph>
    <Paragraph position="8"> * All clitic possibilities are computed by using the proclitic and enclitic dictionaries. 6 If we generated all forms including appended clitics, we would generate an estimated 60 billion forms (Attia, 1999).</Paragraph>
    <Paragraph position="9"> * A radical, computed by removing these clitics, is checked against the full form lexicon. If it does not exist in the full form  lexicon, re-write rules (such as those described in Darwish (2002)) are applied, and the altered form is checked against the full form dictionary. For example, consider the token afii62822hafii62829hw and the included clitics (w, afii62822h): the computed radical afii62829h does not exist in the full form lexicon, but after applying one of the dozen re-write rules, the modified radical ~afii62829h is found in the dictionary and the input token is segmented into root and clitics as: afii62822hafii62829hw = w + ~afii62829h + afii62822h . * The compatibility of the morpho-syntactic tags of the three components (proclitic, radical, enclitic) is then checked. Only valid segmentations are kept and added into the word graph. Table 1 gives some examples of segmentations7 of words in the sentence afii62764afii62832afii62811afii62782afii62803afii62817 afii62764afii62832afii62818afii62777afii62780afii62817 @rzw tafii62780a' afii62760afii62827afii62762afii62823afii62760afii62771 afii62825afii62820</Paragraph>
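    <Paragraph> The steps of the clitic stemmer described above can be sketched as follows. This is a toy illustration under stated assumptions, not the LIMA code: the tiny dictionaries, the Buckwalter-style Latin transliteration, the single re-write rule (final t restored to p, standing for ta marbuta) and the tag sets used for the compatibility check are all hypothetical, and the vowel normalization step is omitted.

```python
# Hypothetical toy resources; each clitic lists the radical tags it may attach to.
PROCLITICS = {"": {"noun", "verb"}, "w": {"noun", "verb"},
              "Al": {"noun"}, "wAl": {"noun"}}
ENCLITICS = {"": {"noun", "verb"}, "h": {"noun", "verb"}}
LEXICON = {"ktAb": {"noun"}, "mdrsp": {"noun"}, "ktb": {"verb"}}
REWRITE_RULES = [("t", "p")]  # assumed rule: restore a final ta marbuta


def radical_forms(radical):
    """Return the radical itself if it is known, else its re-written variants."""
    if radical in LEXICON:
        return [radical]
    return [radical[: len(radical) - len(old)] + new
            for old, new in REWRITE_RULES if radical.endswith(old)]


def segment(token):
    """Return every (proclitic, radical, enclitic, tags) split with compatible tags."""
    results = []
    for pro, pro_tags in PROCLITICS.items():
        if not token.startswith(pro):
            continue
        rest = token[len(pro):]
        for enc, enc_tags in ENCLITICS.items():
            if not rest.endswith(enc) or (pro == "" and enc == ""):
                continue  # the split must remove at least one clitic
            radical = rest[: len(rest) - len(enc)]
            for form in radical_forms(radical):
                tags = LEXICON.get(form)
                if tags is None:
                    continue
                # Keep the split only if proclitic, radical and enclitic agree.
                compatible = tags.intersection(pro_tags, enc_tags)
                if compatible:
                    results.append((pro, form, enc, sorted(compatible)))
    return results


print(segment("wktAbh"))   # conjunction + noun + pronoun
print(segment("mdrsth"))   # radical found only after the re-write rule
print(segment("Alktb"))    # article + verb-only radical is rejected
```

In this sketch the last call returns an empty list because the article is declared compatible only with nouns, which mirrors the tag compatibility check on the three components.</Paragraph>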
    <Paragraph position="11"> Producing this new clitic stemmer for Arabic allowed us to correctly treat a similar (but previously ignored) phenomenon in Spanish in which verb forms can possess pronominal enclitics. For example, the imperative form of &quot;give to me&quot; is written as &quot;dame&quot;, which corresponds to the radical &quot;da&quot; followed by the enclitic &quot;me&quot;. Once we implemented this clitic stemmer for Arabic, we created an enclitic dictionary for Spanish and then successfully used the same stemmer for this European language. 7 For example, the agglutinated word afii62764afii62832afii62818afii62777afii62780afii62817 has two segmentations but only the segmentation: afii62764afii62832afii62818afii62777afii62780afii62817 = l + afii62764afii62832afii62818afii62777d will remain after POS tagging in step 2a.</Paragraph>
    <Paragraph position="12"> At this point, the treatment resumes as with European languages. The detection of idiomatic8 expressions (step 1d) is performed after clitic separation using rules associated with trigger words for each expression. Once a trigger is found, its left and right lexical contexts in the rule are then tested. The trigger must be an entry in the full form lexicon, but can be represented as either a surface form or a lemma form combined with its morpho-syntactic tag. Here we came across another problem specific to Semitic languages. Since Arabic lexicon entries are voweled and since input texts may be partially voweled or unvoweled, we are forced to use only lemma forms to describe Arabic idiomatic expression rules with the existing mechanism, or else enter all the possible partial vowelizations for each word in an idiomatic expression. Since, at this point after step 1c, each recognized word is represented with all its possible voweled lemmas in the analysis graph, we developed 482 contiguous idiomatic voweled expression rules. For example, one of the developed rules recognizes in the text afii62833afii62823afii62760afii62769afii62817 nafii62829afii62823afii62760a (January) as a whole and tags the expression as being a month.</Paragraph>
    <Paragraph position="13"> After idiomatic expression recognition, any nodes not yet recognized are assigned (in step 1e) default linguistic values based on features recognized during tokenization (e.g. presence of uppercase or numbers or special characters). Nothing was changed for this step of default value assignment in order to treat Arabic, but since Semitic languages do not have the capitalization clues that English and French have for recognizing proper nouns, and since Arabic proper names can often be decomposed into simple words (much like Chinese names), the current implementation of this step with our current lexical resources poses some problems.</Paragraph>
    <Paragraph position="14"> For example, consider the following sentence: afii62825afii62785afii62823afii62760ydafii62829afii62805 rwafii62780y afii62828afii62818afii62832afii62820zw afii62833afii62785afii62818afii62788afii62766afii62817 afii62819afii62832afii62772afii62785afii62766afii62817afii62760 afii62819afii62809afii62766afii62775y drafii62760afii62762afii62820afii62840 afii62816afii62823afii62782afii62808 afii62764afii62782afii62809afii62817 afii62828arafii62760afii62788y Frank Lampard celebrates the score by Chelsea and his team mate Eidur Gudjohnsen shares his elation. 8 An idiom in our system is a (possibly non-contiguous) sequence of known words that act as a single unit, for example, &quot;made up&quot; in &quot;He made up the story on the spot.&quot; Once an idiomatic expression is recognized, the individual word nodes are joined into one node in the word graph.</Paragraph>
    <Paragraph position="15"> The name afii62816afii62823afii62782afii62808 (Frank) is identified as such because it is found in the lexicon; the name drafii62760afii62762afii62820afii62840 (Lampard) is not in the lexicon and is incorrectly stemmed as afii62840 +drafii62760afii62762afii62820 (plural of the noun dafii62782afii62762afii62820 (grater)); the name rwafii62780y (Eidur) is incorrectly tagged as a verb; and afii62825afii62785afii62823afii62760ydafii62829afii62805 (Gudjohnsen), which is not in the dictionary and for which the clitic stemmer does not produce any solutions, receives the default tags adjective, noun, proper noun and verb, to be decided by the part-of-speech tagger.</Paragraph>
    <Paragraph position="16"> To improve this performance, we plan to enrich the Arabic lexicon with more proper names, using either name recognition (Maloney and Niv, 1998) or a back translation approach after name recognition in English texts (Al-Onaizan and Knight, 2002).</Paragraph>
    <Paragraph position="17"> Processing Steps: Part-of-speech analysis For the succeeding steps involving part-of-speech tagging, named entity recognition, division into nominal and verbal chains, and dependency extraction, no changes were necessary for treating Arabic.</Paragraph>
    <Paragraph position="18"> After morphological analysis, as input to step 2a, part-of-speech tagging, we have the same type of word graph for Arabic text as for European text: each node is annotated with the surface form, a lemma and a part-of-speech in the graph. If a word is ambiguous, then more than one node appears in the graph for that word. Our part-of-speech tagging involves using a language model (bigrams and trigrams of grammatical tags) derived from hand-tagged text to eliminate unattested or rare subpaths in the graph of words representing a sentence. For Arabic, we created a hand-tagged corpus, and were then able to exploit the existing mechanism.</Paragraph>
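    <Paragraph> As a rough illustration of how a tag language model can eliminate unattested paths, the sketch below keeps only the tag sequences whose trigrams all occur in a set of attested trigrams. It is a deliberate simplification and not the tagger used in LIMA: it ignores bigrams and frequencies, enumerates paths exhaustively instead of working on a graph, and the tag names, the START and END padding symbols and the sample trigram set are assumptions.

```python
from itertools import product


def attested_paths(candidates, trigrams):
    """Keep the tag sequences in which every tag trigram is attested.

    candidates: one list of possible tags per token of the sentence.
    trigrams:   set of tag triples observed in a hand-tagged corpus.
    """
    kept = []
    for path in product(*candidates):
        padded = ("START", "START") + path + ("END",)
        grams = zip(padded, padded[1:], padded[2:])
        if all(gram in trigrams for gram in grams):
            kept.append(path)
    return kept


# Hypothetical trigram set standing in for counts from a hand-tagged corpus.
TRIGRAMS = {
    ("START", "START", "det"),
    ("START", "det", "noun"),
    ("det", "noun", "verb"),
    ("noun", "verb", "END"),
}

# The second token is ambiguous between noun and verb.
sentence = [["det"], ["noun", "verb"], ["verb"]]
print(attested_paths(sentence, TRIGRAMS))
```

Only the reading in which the ambiguous token is a noun survives, because the sequence with two verbs contains a trigram that is absent from the sample set.</Paragraph>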
    <Paragraph position="19"> One space problem that has arisen in applying the existing processing designed for European languages comes from the problem of vowelization.</Paragraph>
    <Paragraph position="20"> With our previous European languages, it was extremely rare to have more than one possible lemmatization for a given pair: (surface form, grammatical part-of-speech tag)9. But in Arabic this can be very common, since an unvoweled string can correspond to many different words, some with the same part-of-speech but different lemmas. The effect of this previously unseen type of ambiguity on our data structures was to greatly increase the word graph size before and after part-of-speech tagging. 9 One example from French is the pair (etaient, finite-verb) that can correspond to the two lemmas: etre and etayer.</Paragraph>
    <Paragraph position="21"> Since each combination of (surface-form, part-of-speech-tag, lemma) gives rise to a new node, the graph becomes larger, increasing the number of paths that all processing steps must explore. The solution to this for Arabic and other Semitic languages is simple, though we have not yet implemented it. We plan to modify our internal data structure so that each node will correspond to the surface form, a part-of-speech tag, and a set of lemmas: (surface-form, part-of-speech-tag, {lemmas}). The inclusion of a set of possible lemmas, rather than just one lemma, in a node will greatly reduce the number of nodes in the graph and reduce processing time.</Paragraph>
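    <Paragraph> The planned node structure can be sketched in a few lines. This is only an illustration of the idea, since the text states that the change had not yet been implemented; the function name and the tuple representation of nodes are assumptions, and the sample data reuses the French pair (etaient, finite-verb) with its two lemmas.

```python
from collections import defaultdict


def merge_lemma_nodes(nodes):
    """Collapse (surface, tag, lemma) nodes into (surface, tag, {lemmas}) nodes."""
    merged = defaultdict(set)
    for surface, tag, lemma in nodes:
        merged[(surface, tag)].add(lemma)
    # One node per (surface, tag) pair, carrying the whole set of lemmas.
    return [(surface, tag, frozenset(lemmas))
            for (surface, tag), lemmas in merged.items()]


nodes = [
    ("etaient", "finite-verb", "etre"),
    ("etaient", "finite-verb", "etayer"),
]
merged = merge_lemma_nodes(nodes)
print(len(nodes), len(merged))
print(merged)
```

Two nodes become one, and the same reduction applies to every ambiguous unvoweled form in a sentence, so the number of paths to explore shrinks multiplicatively.</Paragraph>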
    <Paragraph position="22"> The next step in our NLP system, after part-of-speech tagging, is named entity recognition (Abuleil and Evans, 2004) using name triggers (e.g., President, lake, corporation, etc.). Beyond the problem mentioned above of distinguishing possible proper nouns, here we had an additional problem since our recognizer extracted the entity in its surface form. Since in Arabic, as in other Semitic languages, the input text is usually only partially voweled, this gave rise to many different forms (corresponding to different surface forms) for the same entity. This minor problem was solved by storing the fully voweled forms of the entities (for applications such as information retrieval, as shown below) rather than the surface form.</Paragraph>
    <Paragraph position="23"> After named entity recognition, our methods of verbal and nominal chain recognition and dependency extraction did not require any modifications for Arabic. But since the sentence graphs, as mentioned above, are currently large, we have restricted the chains recognized to simple noun and verb chunks (Abney, 1991) rather than the more complex chains (Marsh, 1984) we recognize for European languages. Likewise, the only dependency relations that we extract for the moment are relations between nominal elements. We expect that the reduction in sentence graph size once lemmas are all collected in the same word node will allow us to treat more complex dependency relations.</Paragraph>
    <Paragraph position="24"> 4 Integration in a CLIR application The results of the NLP steps produce, for all languages we treat, a set of normalized lemmas, a set of named entities and a set of nominal compounds (as well as other dependency relations for some languages). These results can be used for any natural language processing application. For example, we have integrated LIMA as a front-end for a cross language information retrieval system. The inclusion of our Arabic language results into the information retrieval system did not necessitate any modifications to this system.</Paragraph>
    <Paragraph position="25"> This information retrieval (IR) application involves three linguistic steps, as shown in section 2.</Paragraph>
    <Paragraph position="26"> First, in step 3a, subgraphs (compounds and their components) of the original sentence graph are stored. For example, the NLP analysis will recognize an English phrase such as &quot;management of water resources&quot; as a compound that the IR system will index. This phrase and its sub-elements are normalized and indexed (as well as simple words) in the following head-first normalized forms:</Paragraph>
    <Paragraph position="28"> Parallel head-first structures are created for different languages, for example, the French &quot;gestion des ressources en eau&quot; generates:  When a question is posed to our cross language IR (CLIR) system it undergoes the same NLP treatment as in steps 1a to 3a. Then the query is reformulated using synonym dictionaries and translation dictionaries in step 3b. For Arabic, we have not yet acquired any monolingual synonym dictionaries, but we have purchased and modified cross-lingual transfer dictionaries between Arabic and English, Arabic and French, and Arabic and Spanish10. When a compound is found in a query, it is normalized and its sub elements are extracted as shown above. Using the reformulation dictionaries, variant versions of the compound are generated (monolingual, then cross-lingual versions) and attested variants are retained as synonyms to the original compound11 (Besancon et al., 2003). To integrate the Arabic version into our CLIR system, no modifications were necessary beyond acquiring and formatting the cross language reformulation dictionaries. 10 Linden and Piitulainen (2004) propose a method for extracting monolingual synonym lists from bilingual resources.</Paragraph>
    <Paragraph position="29"> The final step (3c) in our CLIR system involves ranking relevant documents. Contrary to a bag-of-words system, which uses only term frequency in queries and documents, our system (Besancon et al., 2003) returns documents in ranked weighted classes12 whose weightings involve the presence of named entities, the completeness of the syntactic subgraphs matched, and the database frequencies of the words and sub-graphs matched.</Paragraph>
    <Paragraph position="30"> Example An online version of our cross language retrieval system involving our Arabic processing is available at a third party site: http://alma.oieau.fr. This base contains 50 non-parallel documents about sustainable development for each of the following languages: English, Spanish, French and Arabic. The user can enter a query in natural language and specify the language to be used. In the example of Figure 1, the user entered the query &quot; afii57447afii62760afii62832afii62821afii62817 drafii62829afii62820 @rd&quot; and selected Arabic as the language of the query.</Paragraph>
    <Paragraph position="31"> Relevant documents are grouped into classes characterized by the same set of concepts (i.e., reformulated subgraphs) as the query contains. Figure 2 shows some classes corresponding to the query &quot; afii57447afii62760afii62832afii62821afii62817 drafii62829afii62820 @rd&quot;. The query term @rd_ drafii62829afii62820_afii57447afii62760afii62832afii62820 is a term composed of three words: afii57447afii62760afii62832afii62820, drafii62829afii62820 and @rd. This compound, its derived variants and their sub elements are reformulated into English, French, and Spanish and submitted to indexed versions of documents in each of these languages (as well as against Arabic documents). 11 This technique will only work with translations which have at least one subelement that has a parallel between languages, but this is often the case for technical terms. 12 This return to a mixed Boolean approach is found in current research on Question Answering systems (Tellex et al., 2003). Our CLIR system resembles such systems, which return the passage in which the answer is found, since we highlight the most significant passages of each retrieved document.</Paragraph>
    <Paragraph position="32"> The highest ranking classes (as seen in Figure 2 for this example) match the following elements:  Terms of the query or the expansion of these terms which are found in the retrieved documents are highlighted as illustrated in Figures 2 and 3.</Paragraph>
  </Section>
</Paper>