<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0705"> <Title>Modifying a Natural Language Processing System for European Languages to Treat Arabic in Information Processing and Information Retrieval Applications</Title> <Section position="7" start_page="0" end_page="0" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> We have presented an overview of our natural language processing system and its use in a CLIR setting. This article describes the changes that we had to implement to extend this system, which was initially built for European languages, to the Semitic language Arabic. Every new language poses new problems for NLP systems, but treating a language from a new language family can severely test the original design. We found that the major problems in dealing with a Semitic language were handling partially voweled or unvoweled text (two different problems) and handling clitics. To treat clitics, we introduced two new lexicons and added a clitic stemming step at an appropriate place in our morphological analysis.</Paragraph> <Paragraph position="1"> To treat vowelization, we simply used existing methods for dealing with unaccented text, but this solution is not totally satisfactory for two reasons: we do not adequately exploit partially voweled text, and our data structures are not efficient for associating many different lemmas (differing only in vowelization) with a single surface form. We are currently working on both of these aspects in order to improve our treatment of Arabic. Nevertheless, the changes described here that were needed to add Arabic were not very extensive, and we were able to integrate Arabic language treatment into a cross-language information retrieval platform with one man-year of work after having created the lexicon and training corpus. 
A version of our CLIR system is available online and is illustrated in this article. We plan to evaluate the performance of the CLIR system more fully in the coming year using the TREC 2001 and TREC 2002 data.</Paragraph> </Section> </Paper>