File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/w06-1631_evalu.xml
Size: 2,120 bytes
Last Modified: 2025-10-06 13:59:52
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1631"> <Title>Capturing Out-of-Vocabulary Words in Arabic Text</Title> <Section position="15" start_page="263" end_page="264" type="evalu"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> We have seen that foreign words are not easily recognised in Arabic text, and that a large number of Arabic words are affected when we try to exclude foreign words.</Paragraph> <Paragraph position="1"> We found the lexicon approach to be the best at identifying foreign words. However, current lexicons are relatively small, and the variety of Arabic inflection makes it very difficult to include all correct word forms. Furthermore, current lexicons include many foreign words; for example, when using the OLA approach, 1 017 foreign words out of 1 218 are OOV, indicating that about 200 foreign words are present in that lexicon. The pattern approach is more efficient, but the lack of diacritics in general written Arabic makes it very difficult to match a pattern precisely with the right word, resulting in many foreign words being incorrectly identified as Arabic. When we passed the list of all 3 046 manually judged foreign words to the pattern approach, 2 017 of them were correctly judged as foreign, and about one third (1 029) were incorrectly judged to be Arabic. The n-gram method produced reasonable precision compared to the lexicon-based methods. In contrast, TRG had the worst results. This could be due to the limited size of the training corpus. However, we expect that improvements to this approach will remain limited because many Arabic and foreign words share the same trigrams. It is clear that all the approaches improve dramatically when the enhancement rules are applied. The improvement for NGR was smaller than for the other approaches, because some of the rules are implicitly applied within the n-gram approach. 
The lack of diacritics also makes it very difficult to distinguish between certain foreign and Arabic words. For example, without diacritics, the word A9G9AU</Paragraph> </Section> </Paper>
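
The proportions quoted in the discussion can be checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are taken from the text above, while the helper name `share` and the variable names are hypothetical and not part of the paper.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# OLA lexicon: 1 017 of the 1 218 foreign words are OOV (counts from the text).
ola_total, ola_oov = 1218, 1017
ola_in_lexicon = ola_total - ola_oov  # foreign words found in the lexicon

# Pattern approach applied to the 3 046 manually judged foreign words.
pat_total, pat_foreign = 3046, 2017
pat_arabic = pat_total - pat_foreign  # foreign words misjudged as Arabic

print("OLA: foreign words present in lexicon:", ola_in_lexicon)
print("OLA: share of foreign words that are OOV (%):", share(ola_oov, ola_total))
print("Pattern: correctly judged foreign (%):", share(pat_foreign, pat_total))
print("Pattern: misjudged as Arabic (%):", share(pat_arabic, pat_total))
```

The differences reproduce the figures in the text: 201 foreign words (the "about 200") are in the OLA lexicon, and 1 029 of the 3 046 words, roughly one third, are misjudged by the pattern approach.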