<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1050"> <Title>Unsupervised Learning of Arabic Stemming using a Parallel Corpus</Title> <Section position="6" start_page="0" end_page="0" type="concl"> <SectionTitle> 4 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer using a small sentence-aligned parallel corpus in which the English part has been stemmed. No parallel text is needed to use the stemmer. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. The approach is applicable to any language that needs affix removal; for Arabic, our approach results in 87.5% agreement with a proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 93-96% of the performance of the state-of-the-art, language-specific stemmer mentioned above.</Paragraph> <Paragraph position="1"> We can speculate that, because of the statistical nature of the unsupervised stemmer, it tends to focus on the same kind of meaning units that are significant for IR, whether or not they are linguistically correct. This could explain why the gap between GOLD and UNSUP narrows under task-based evaluation, which is a desirable effect when the stemmer is to be used for IR tasks.</Paragraph> <Paragraph position="2"> We plan to experiment with different languages and translation model alternatives, and to extend task-based evaluation to other tasks such as machine translation and cross-lingual topic detection and tracking.</Paragraph> </Section> </Paper>