<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0704">
  <Title>Examining the Effect of Improved Context Sensitive Morphology on Arabic Information Retrieval</Title>
  <Section position="5" start_page="26" end_page="26" type="metho">
    <SectionTitle>
3 Experimental Design
</SectionTitle>
    <Paragraph position="0"> IR experiments were done on the LDC LDC2001T55 collection, which was used in the Text REtrieval Conference (TREC) 2002 cross-language track; for brevity, it is referred to here as the TREC collection. The collection contains 383,872 articles from the Agence France-Presse (AFP) Arabic newswire. Fifty topics were developed cooperatively by the LDC and the National Institute of Standards and Technology (NIST), and relevance judgments were developed at the LDC by manually judging a pool of documents obtained by combining the top 100 documents from all runs submitted by the participating teams to TREC's 2002 cross-language track. The number of known relevant documents ranges from 10 to 523, with an average of 118 relevant documents per topic [17]. This is presently the best available large Arabic information retrieval test collection. The TREC topic descriptions include a title field that briefly names the topic, a description field that usually consists of a single sentence, and a narrative field intended to contain any information a human judge would need to accurately assess the relevance of a document [10]. Queries were formed from the TREC topics by combining the title and description fields; this is intended to model the sort of statement a searcher might initially make when asking an intermediary, such as a librarian, for help with a search.</Paragraph>
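    <Paragraph>The title-plus-description query formation described above can be sketched as follows. The SGML-style field markers (&lt;title&gt;, &lt;desc&gt;) follow the general TREC topic layout, but the exact tags and the helper below are illustrative assumptions, not the authors' code.

```python
import re

def build_query(topic_text):
    """Form a query by concatenating the title and description fields
    of a simplified SGML-style TREC topic. Real TREC topic files carry
    additional fields (number, narrative) and formatting; this sketch
    only pulls out the two fields combined in the experiments."""
    title = re.search(r"<title>\s*(.*?)\s*(?=<|$)", topic_text, re.S)
    desc = re.search(r"<desc>\s*(?:Description:\s*)?(.*?)\s*(?=<|$)",
                     topic_text, re.S)
    parts = [m.group(1).strip() for m in (title, desc) if m]
    return " ".join(parts)

topic = ("<top>\n<num> Number: 26\n"
         "<title> Kurdish refugees\n"
         "<desc> Description: Find reports on Kurdish refugees.\n"
         "<narr> A relevant document discusses refugee movements.\n</top>")
query = build_query(topic)
```

The narrative field is deliberately ignored, mirroring the title-plus-description condition used in the experiments.</Paragraph>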
    <Paragraph position="1"> Experiments were performed for the queries with the following index terms (a slightly modified version of Leah Larkey's Light-10 light stemmer [8] was also tried, but it produced results very similar to Al-Stem):</Paragraph>
    <Paragraph position="2"> * cIBM-LMS: stems obtained using the IBM-LM analyzer in context. The entire TREC collection was processed by the analyzer, and the prefixes and suffixes in the segmented output were removed.</Paragraph>
    <Paragraph position="3"> * cIBM-SEB-r: roots obtained by analyzing the in-context stems produced by IBM-LM using Sebawai.</Paragraph>
    <Paragraph position="4"> * IBM-LMS: stems obtained using the IBM-LM analyzer without any contextual information.</Paragraph>
    <Paragraph position="5"> All the unique words in the collection were analyzed one by one, and the prefixes and suffixes in the segmented output were removed.</Paragraph>
    <Paragraph position="6"> * IBM-SEB-r: roots obtained by analyzing the out-of-context stems produced by IBM-LM using Sebawai.</Paragraph>
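    <Paragraph>The affix-stripping step applied to the segmenter output in the conditions above can be sketched as follows. The markers assumed here ('#' terminating a prefix, '+' introducing a suffix) are a common convention for IBM-style Arabic segmenters; they are an assumption of this sketch, not a detail stated in the paper.

```python
def strip_affixes(segmented):
    """Recover a stem from space-delimited segmenter output, assuming
    prefixes are written with a trailing '#' and suffixes with a
    leading '+', e.g. 'w# ktAb +hm' -> 'ktAb'."""
    tokens = segmented.split()
    stem_parts = [t for t in tokens
                  if not t.endswith("#") and not t.startswith("+")]
    return "".join(stem_parts)
```

For the in-context conditions this function would be applied to each word position in the running text; for the out-of-context conditions, to each unique word type exactly once.</Paragraph>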
    <Paragraph position="7"> All retrieval experiments were performed using the Lemur language-modeling toolkit, configured to use Okapi BM-25 term weighting with default parameters, both with and without blind relevance feedback (the top 20 terms from the top 5 retrieved documents were used for feedback). Mean uninterpolated average precision was used as the measure of retrieval effectiveness when comparing alternative indexing terms. To determine whether differences between results were statistically significant, a Wilcoxon signed-rank test, a nonparametric significance test for correlated samples, was used, with p values below 0.05 taken to indicate significance.</Paragraph>
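    <Paragraph>Uninterpolated average precision for a single topic is the mean of precision@k over the ranks k at which relevant documents are retrieved, divided by the total number of known relevant documents; the mean over all fifty topics gives the reported score. A minimal sketch (document identifiers are hypothetical):

```python
def average_precision(ranked, relevant):
    """Uninterpolated average precision for one topic: sum precision@k
    at each rank k holding a relevant document, then divide by the
    total number of known relevant documents (retrieved or not)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(per_topic):
    """Mean over topics of (ranked result list, relevant doc set) pairs."""
    return sum(average_precision(r, rel) for r, rel in per_topic) / len(per_topic)
```

The Wilcoxon signed-rank test mentioned above would then be applied to the paired per-topic average precision scores of two competing runs.</Paragraph>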
  </Section>
</Paper>