<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0704">
  <Title>Examining the Effect of Improved Context Sensitive Morphology on Arabic Information Retrieval</Title>
  <Section position="3" start_page="5" end_page="25" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Due to the morphological complexity of the Arabic language, much research has focused on the effect of morphology on Arabic Information Retrieval (IR). The goal of morphology in IR is to conflate words of similar or related meanings. Several early studies suggested that indexing Arabic text using roots significantly increases retrieval effectiveness over the use of words or stems [1, 3, 11]. However, all the studies used small test collections of only hundreds of documents and the morphology in many of the studies was done manually.</Paragraph>
    <Paragraph position="1"> Performing morphological analysis for Arabic IR using existing Arabic morphological analyzers, most of which use finite state transducers [4, 12, 13], is problematic for two reasons. First, they were designed to produce as many analyses as possible without indicating which analysis is most likely. This property of the analyzers complicates retrieval, because it introduces ambiguity in the indexing phase as well as the search phase of retrieval. Second, the use of finite state transducers inherently limits coverage, which is the number of words that the analyzer can analyze, to the cases programmed into the transducers.</Paragraph>
    <Paragraph position="2"> Darwish attempted to solve this problem by developing a statistical morphological analyzer for Arabic called Sebawai that attempts to rank possible analyses to pick the most likely one [7]. He concluded that even with ranked analysis, morphological analysis did not yield statistically significant improvement over words in IR. A later study by Aljlayl et al. on a large Arabic collection of 383,872 documents suggested that lightly stemmed words, from which only common prefixes and suffixes are stripped, were perhaps better index terms for Arabic [2]. Similar studies by Darwish [8] and Larkey [14] also suggested that light stemming is indeed superior to morphological analysis in the context of IR.</Paragraph>
    <Paragraph position="3"> However, the shortcomings of morphology might be attributed to issues of coverage and correctness. Concerning coverage, analyzers typically fail to analyze Arabized or transliterated words, which may have prefixes and suffixes attached to them and are typically valuable in IR. As for correctness, the presence (or absence) of a prefix or suffix may significantly alter the analysis of a word. For example, the word &quot;Alksyr&quot; is unambiguously analyzed to the root &quot;ksr&quot; and stem &quot;ksyr.&quot; However, removing the prefix &quot;Al&quot; introduces an additional analysis, namely to the root &quot;syr&quot; and the stem &quot;syr.&quot; Perhaps such ambiguity can be reduced by using the context in which the word is mentioned. For example, for the word &quot;ksyr&quot; in the sentence &quot;sAr ksyr&quot; (and he walked like), the letter &quot;k&quot; is likely to be a prefix. The problem of coverage is practically eliminated by light stemming. However, light stemming yields greater consistency without regard to correctness. Although consistency is more important for IR applications than linguistic correctness, perhaps improved correctness would naturally yield greater consistency. Lee et al. [15] adopted a trigram language model (LM) trained on a portion of the manually segmented LDC Arabic Treebank in developing an Arabic morphology system, which attempts to improve coverage and linguistic correctness over existing statistical analyzers such as Sebawai. The analyzer of Lee et al. will henceforth be referred to as the IBM-LM analyzer. The IBM-LM analyzer combines the trigram LM (to analyze a word within its context in the sentence) with a prefix-suffix filter (to eliminate illegal prefix-suffix combinations, hence improving correctness) and unsupervised stem acquisition (to improve coverage).</Paragraph>
    <Paragraph position="4"> Lee et al. report a 2.9% error rate in analysis, compared to the 7.3% error rate reported by Darwish for Sebawai [7].</Paragraph>
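    <Paragraph> The contrast between morphological analysis and light stemming described above can be made concrete with a small sketch. It is a toy illustration under stated assumptions, not the method of Sebawai or the IBM-LM analyzer: the prefix list, the stem lexicon, and the function names candidate_analyses and light_stem are all hypothetical, only stems (not roots) are modeled, and only prefixes (not suffixes) are stripped. The sketch enumerates the prefix-stem splits of a word that survive a legal-prefix filter, and compares them with the single form produced by stripping a common prefix.

```python
# Toy illustration only: the prefix list and stem lexicon below are
# hypothetical, not the resources used by Sebawai or the IBM-LM analyzer.
# Words are written in the same Latin transliteration as the text.
LEGAL_PREFIXES = ("", "Al", "k", "kAl")  # "Alk" is deliberately absent
KNOWN_STEMS = {"ksyr", "syr"}


def candidate_analyses(word):
    """Return every (prefix, stem) split with a legal prefix and a known stem."""
    analyses = []
    for prefix in LEGAL_PREFIXES:
        if word.startswith(prefix):
            stem = word[len(prefix):]
            if stem in KNOWN_STEMS:
                analyses.append((prefix, stem))
    return analyses


def light_stem(word, prefixes=("Al",)):
    """Strip one common prefix, if present, without consulting any lexicon."""
    for prefix in prefixes:
        # Keep at least two characters so short words are not emptied.
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            return word[len(prefix):]
    return word


if __name__ == "__main__":
    for w in ("Alksyr", "ksyr"):
        print(w, candidate_analyses(w), light_stem(w))
```

    Under these assumptions, "Alksyr" has a single surviving split (prefix "Al", stem "ksyr"), while "ksyr" has two (stem "ksyr" with no prefix, or prefix "k" with stem "syr"). The light stemmer maps both words to "ksyr", which is consistent but takes no position on which analysis is correct. Choosing between the two splits of "ksyr" is where sentence context would be needed.</Paragraph>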
    <Paragraph position="5"> This paper evaluates the IBM-LM analyzer in the context of a monolingual Arabic IR application to determine whether in-context morphology leads to improved retrieval effectiveness compared to out-of-context analysis. To determine the effect of improved analysis, particularly the use of in-context morphology, the analyzer is used to produce analyses of words both in isolation (with no context) and in context. Since IBM-LM produces only stems, Sebawai is used to produce the roots corresponding to the stems produced by IBM-LM. Both sets of analyses are compared to Sebawai and light stemming.</Paragraph>
    <Paragraph position="6"> The paper is organized as follows: Section 2 surveys related work; Section 3 describes the IR experimental setup for testing the IBM-LM analyzer; Section 4 presents experimental results; and Section 5 concludes the paper.</Paragraph>
  </Section>
</Paper>