<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1050">
  <Title>Unsupervised Learning of Arabic Stemming using a Parallel Corpus</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Unsupervised Training and Testing
</SectionTitle>
      <Paragraph position="0"> For unsupervised training in Step 1, we used a small parallel corpus: 10,000 Arabic-English sentences from the United Nations(UN) corpus, where the English part has been stemmed and the Arabic transliterated. null For unsupervised training in Step 2, we used a larger, Arabic only corpus: 80,000 different sentences in the same dataset.</Paragraph>
      <Paragraph position="1"> The test set consisted of 10,000 different sentences in the UN dataset; this is the testing set used below unless specified.</Paragraph>
      <Paragraph position="2"> We also used a larger corpus ( a year of Agence France Press (AFP) data, 237K sentences) for Step 2 training and testing, in order to gauge the robustness and adaptation capability of the stemmer. Since the UN corpus contains legal proceedings, and the AFP corpus contains news stories, the two can be seen as coming from different domains.</Paragraph>
      <Paragraph position="3">  In this subsection the accuracy is defined as agreement with GOLD. GOLD is a state of the art, proprietary Arabic stemmer built using rules, suffix and prefix lists, and human annotated text, in addition to an unsupervised component. GOLD is an earlier version of the stemmer described in (Lee et al., ). Freely available (but less accurate) Arabic light stemmers are also used in practice.</Paragraph>
      <Paragraph position="4"> When measuring accuracy, all tokens are considered, including those that cannot be stemmed by simple affix removal (irregulars, infixes). Note that our baseline (removing Al and p, leaving everything unchanged) is higher that simply leaving all tokens unchanged.</Paragraph>
      <Paragraph position="5"> For a more relevant task-based evaluation, please refer to Subsection 3.2.</Paragraph>
      <Paragraph position="6">  parallel data can we use? We begin by examining the effect that the size of the parallel corpus has on the results after the first step. Here, we trained our stemmer on three different corpus sizes: 50K, 10K, and 2K sentences. The high baseline is obtained by treating Al and p as affixes. The 2K corpus had acceptable results (if this is all the data available). Using 10K was significantly better; however the improvement obtained when five times as much data (50K) was used was insignificant. Note that different languages might have different corpus size needs. All other results  Although severely handicapped at the beginning, the knowledge-free starting point manages to narrow the performance gap after a few iterations. Knowing the Al+p rule still helps at this stage. However, the performance gap is narrowed further in Step 2 (see figure 8), where the knowledge free starting point benefitted from the monolingual training.</Paragraph>
      <Paragraph position="7">  Figure 8 shows the results obtained when augmenting the stemmer trained in Step 1. Two different monolingual corpora are used: one from the same domain as the test set (80K UN), and one from a different domain/corpus, but three times larger (237K AFP). The larger dataset seems to be more useful in improving the stemmer, even though the domain was different.</Paragraph>
      <Paragraph position="8"> Figure 8: Results after Step 2 (Monolingual Corpus) The baseline and the accuracy after Step 1 are presented for reference.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Test Set
</SectionTitle>
      <Paragraph position="0"> We used an additional test set that consisted of 10K sentences taken from AFP, instead of UN as in previous experiments shown in figure 8 . Its purpose was to test the cross-domain robustness of the stemmer and to further examine the importance of applying the second step to the data needing to be stemmed.</Paragraph>
      <Paragraph position="1"> Figure 9 shows that, even though in Step 1 the stemmer was trained on UN proceedings, the results on the cross-domain (AFP) test set are comparable to those from the same domain (UN, figure 8). However, for this particular test set the baseline was much higher; thus the relative improvement with respect to the baseline is not as high as when the unsupervised training and testing set came from the same collection.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Task-Based Evaluation: Arabic
Information Retrieval
</SectionTitle>
      <Paragraph position="0"> Given a set of Arabic documents and an Arabic query, find a list of documents relevant to the query, and rank them by probability of relevance.</Paragraph>
      <Paragraph position="1"> We used the TREC 2002 documents (several years of AFP data), queries and relevance judgments. The 50 queries have a shorter, &amp;quot;title&amp;quot; component as wel as a longer &amp;quot;description&amp;quot;. We stemmed both the queries and the documents using UNSUP and GOLD respectively. For comparison purposes, we also left the documents and queries unstemmed.</Paragraph>
      <Paragraph position="2"> The UNSUP stemmer was trained with 10K UN sentences in Step 1, and with one year's worth of monolingual AFP data (1995) in Step 2.</Paragraph>
      <Paragraph position="3"> Evaluation metric: The evaluation metric used below is mean average precision (the standard IR metric), which is the mean of average precision scores for each query. The average precision of a single query is the mean of the precision scores after each relevant document retrieved. Note that average precision implicitly includes recall information. Precision is defined as the ratio of relevant documents to total documents retrieved up to that point in the ranking.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Figure 10: Arabic Information Retrieval Results We looked at the effect of different testing conditions on the mean average precision for the 50 queries. In Figure 10, the first set of bars uses the query titles only, the second set adds the description, and the last set restricts the results to one year (1995), using both the title and description. We tested this last condition because the unsupervised stemmer was refined in Step 2 using 1995 documents. The last group of bars shows a higher relative improvement over the unstemmed baseline; however, this last condition is based on a smaller sample of relevance judgements (restricted to one year) and is therefore not as representative of the IR task as the first two testing conditions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>