<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1050"> <Title>Unsupervised Learning of Arabic Stemming using a Parallel Corpus</Title> <Section position="3" start_page="0" end_page="0" type="relat"> <SectionTitle> 1.2 Related Work </SectionTitle> <Paragraph position="0"> The problem of unsupervised stemming or morphology has been studied using several different approaches. For Arabic, good results have been obtained for plural detection (Clark, 2001). (Goldsmith, 2001) used a minimum description length paradigm to build Linguistica, a system for which the reported accuracy for European languages is approximately</Paragraph> <Paragraph position="1"> 83%. Note that the results in this section are not directly comparable to ours, since we are focusing on Arabic.</Paragraph> <Paragraph position="2"> A notable contribution was published by Snover (Snover, 2002), who defines an objective function to be optimized and performs a search for the stemmed configuration that optimizes the function over all stemming possibilities of a given text.</Paragraph> <Paragraph position="3"> Rule-based stemming for Arabic is a problem studied by many researchers; an excellent overview is provided by (Larkey et al., ).</Paragraph> <Paragraph position="4"> Morphology is not limited to prefix and suffix removal; it can also be seen as a mapping from a word to an arbitrary meaning-carrying token. Using an LSI approach, (Schone and Jurafsky, ) obtained 88% accuracy for English. This approach also deals with irregular morphology, which we have not addressed.</Paragraph> <Paragraph position="5"> A parallel corpus has been successfully used before by (Yarowsky et al., 2000) to project part-of-speech tags, named entity tags, and morphology information from one language to the other. For a parallel corpus comparable in size to the one used in our results, the reported accuracy was 93% for French (when the English portion was also available); however, this result only covers 90% of the tokens. 
Accuracy was later improved using suffix trees.</Paragraph> <Paragraph position="6"> (Diab and Resnik, 2002) used a parallel corpus for word sense disambiguation, exploiting the fact that different meanings of the same word tend to be translated into distinct words.</Paragraph> </Section> </Paper>