<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1050">
  <Title>Unsupervised Learning of Arabic Stemming using a Parallel Corpus</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Stemming is the process of normalizing word variations by removing prefixes and suffixes. From an information retrieval point of view, prefixes and suffixes add little or no additional meaning; in most cases, stemming improves both the efficiency and effectiveness of text processing applications such as information retrieval and machine translation. (Footnote: Work done while a summer intern at IBM TJ Watson Research Center.)</Paragraph>
    <Paragraph position="1"> Building a rule-based stemmer for a new, arbitrary language is time-consuming and requires experts with linguistic knowledge of that particular language. Supervised learning also requires large quantities of labeled data in the target language, and quality declines when using completely unsupervised methods. We would like to reach a compromise by using a few inexpensive and readily available resources in conjunction with unsupervised learning. Our goal is to develop a stemmer generator that is relatively language independent (to the extent that the language accepts stemming) and is trainable using a small amount of inexpensive data. This paper presents an unsupervised learning approach to non-English stemming. The stemming model is based on statistical machine translation, and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources.</Paragraph>
    <Paragraph position="2"> A parallel corpus is a collection of sentence pairs with the same meaning but in different languages (e.g. United Nations proceedings, bilingual newspapers, the Bible). Table 1 shows an example that uses the Buckwalter transliteration (Buckwalter, 1999).</Paragraph>
    <Paragraph position="3"> Usually, entire documents are translated by humans, and the sentence pairs are subsequently aligned by automatic means. A small parallel corpus may be available even when native speakers and translators are not, which makes building a stemmer from such a corpus a preferable direction.</Paragraph>
    <Paragraph position="4">  In introducing the report, the representative of Zambia emphasised that her country was undergoing serious and far-reaching changes in the political and economic field.</Paragraph>
    <Paragraph position="5">  We describe our approach towards reaching this goal in section 2. Although we are using resources other than monolingual data, the unsupervised nature of our approach is preserved by the fact that no direct information about non-English stemming is present in the training data.</Paragraph>
    <Paragraph position="6"> Monolingual, unannotated text in the target language is readily available and can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. This optional step is closer to the traditional unsupervised learning paradigm and is described in section 2.4, and its impact on stemmer quality is described in 3.1.4.</Paragraph>
    <Paragraph position="7"> Our approach (denoted by UNSUP in the rest of the paper) is evaluated in section 3.1 by comparing it to a proprietary Arabic stemmer (denoted by GOLD). The latter is a state-of-the-art Arabic stemmer, and was built using rules, suffix and prefix lists, and human-annotated text. GOLD is an earlier version of the stemmer described in (Lee et al., ). The task-based evaluation in section 3.2 compares the two stemmers by using them as a preprocessing step in the TREC Arabic retrieval task. That section also presents the improvement obtained over using unstemmed text.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Arabic details
</SectionTitle>
      <Paragraph position="0"> In this paper, Arabic is the target language, but the approach is applicable to any language that needs affix removal. In Arabic, unlike English, both prefixes and suffixes need to be removed for effective stemming. Although Arabic provides the additional challenge of infixes, we did not tackle them because they often substantially change the meaning. Irregular morphology is also beyond the scope of this paper. As a side note for readers with a linguistic background (Arabic in particular), we do not claim that the resulting stems are units representing the entire paradigm of a lexical item. The main purpose of stemming as seen in this paper is to conflate the token space used in statistical methods in order to improve their effectiveness. The quality of the resulting tokens as perceived by humans is not as important, since the stemmed output is intended for computer consumption.</Paragraph>
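      To make the idea of conflating the token space concrete, the following minimal Python sketch strips at most one prefix and one suffix from each token and compares vocabulary sizes before and after. It is only an illustration under stated assumptions: the affix lists (in Buckwalter transliteration), the minimum stem length, and the names strip_affixes and vocabulary are hypothetical choices made for this sketch. The approach in this paper learns its stemmer from a parallel corpus and does not rely on a hand-written list like this one.

```python
from collections import Counter

# Illustrative affix lists in Buckwalter transliteration. These are
# assumptions made for this sketch, not lists taken from the paper.
PREFIXES = ("wAl", "Al", "w")        # longest prefix listed first
SUFFIXES = ("At", "wn", "yn", "p")
MIN_STEM_LEN = 3                      # hypothetical guard against over-stripping


def strip_affixes(token):
    """Remove at most one listed prefix and at most one listed suffix."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= MIN_STEM_LEN:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            token = token[: -len(suffix)]
            break
    return token


def vocabulary(tokens, stem=False):
    """Count distinct token types, optionally after affix stripping."""
    return Counter(strip_affixes(t) if stem else t for t in tokens)


tokens = ["AlktAb", "wAlktAb", "ktAb", "mktbp", "mktbAt", "wld"]
raw = vocabulary(tokens)
stemmed = vocabulary(tokens, stem=True)
print(len(raw), len(stemmed))  # prints "6 3"
```

      In this toy input, six surface forms collapse to three types. The token "wld" is left unchanged because stripping the prefix "w" would leave fewer than three characters, which shows why some guard against over-stripping is needed even in a simple scheme.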
    </Section>
  </Section>
</Paper>