<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1631">
  <Title>Capturing Out-of-Vocabulary Words in Arabic Text</Title>
  <Section position="16" start_page="264" end_page="264" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> Identifying foreign words in Arabic text is an important problem for cross-lingual information retrieval, since commonly used techniques such as stemming should not be applied indiscriminately to all words in a collection.</Paragraph>
    <Paragraph position="1"> We have presented three approaches for identifying foreign words in Arabic text: lexicons, patterns, and n-grams. Our results show that the lexicon approach outperforms the other two, and we have described improvements that minimise the false identification of foreign words. These improvements result in higher precision, but have a small negative impact on recall. Overall, the results are still relatively low for practical applications, and more work is needed to deal with this problem. As foreign words are characterised by having different versions, an algorithm that collapses those versions to one form could be useful in identifying foreign words. We are presently exploring algorithms to normalise foreign words in Arabic text. This will allow us to identify normalised forms for foreign words and use a single consistent version for indexing and retrieval.</Paragraph>
  </Section>
</Paper>