<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1050">
  <Title>Unsupervised Learning of Arabic Stemming using a Parallel Corpus</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Approach
</SectionTitle>
    <Paragraph position="0"> Our approach is based on the availability of the following three resources: a small parallel corpus an English stemmer an optional unannotated Arabic corpus Our goal is to train an Arabic stemmer using these resources. The resulting stemmer will simply stem Arabic without needing its English equivalent. We divide the training into two logical steps: Step 1: Use the small parallel corpus Step 2: (optional) Use the monolingual corpus The two steps are described in detail in the following subsections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Step 1: Using the Small Parallel Corpus
</SectionTitle>
      <Paragraph position="0"> In Step 1, we are trying to exploit the English stemmer by stemming the English half of the parallel corpus and building a translation model that will establish a correspondence between meaning carrying substrings (the stem) in Arabic and the English stems.</Paragraph>
      <Paragraph position="1"> For our purposes, a translation model is a matrix of translation probabilities p(Arabic stemj English stem) that can be constructed based on the small parallel corpus (see subsection 2.2 for more details). The Arabic portion is stemmed with an initial guess (discussed in subsection 2.1.1) Conceptually, once the translation model is built, we can stem the Arabic portion of the parallel corpus by scoring all possible stems that an Arabic word can have, and choosing the best one. Once the Arabic portion of the parallel corpus is stemmed, we can build a more accurate translation model and repeat the process (see figure 2). However, in practice, instead of using a harsh cutoff and only keeping the best stem, we impose a probability distribution over the candidate stems. The distribution starts out uniform and then converges towards concentrating most of the probability mass in one stem candidate.</Paragraph>
      <Paragraph position="2">  The starting point is an inherent problem for unsupervised learning. We would like our stemmer to give good results starting from a very general initial guess (i.e. random). In our case, the starting point is the initial choice of the stem for each individual word. We distinguish several solutions: No stemming.</Paragraph>
      <Paragraph position="3"> This is not a desirable starting point, since affix probabilities used by our model would be zero.</Paragraph>
      <Paragraph position="4"> Random stemming As mentioned above, this is equivalent to imposing a uniform prior distribution over the candidate stems. This is the most general starting point.</Paragraph>
      <Paragraph position="5"> A simple language specific rule - if available If a simple rule is available, it would provide a better than random starting point, at the cost of reduced generality. For Arabic, this simple rule was to use Al as a prefix and p as a suffix. This rule (or at least the first half) is obvious even to non-native speakers looking at transliterated text. It also constitutes a surprisingly high baseline. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Translation Model
</SectionTitle>
      <Paragraph position="0"> We adapted Model 1 (Brown et al., 1993) to our purposes. Model 1 uses the concept of alignment between two sentences e and f in a parallel corpus; the alignment is defined as an object indicating for each word ei which word fj generated it. To obtain the probability of an foreign sentence f given the English sentence e, Model 1 sums the products of the translation probabilities over all possible alignments: null</Paragraph>
      <Paragraph position="2"> The alignment variable ai controls which English word the foreign word fi is aligned with. t(fje) is simply the translation probability which is refined iteratively using EM. For our purposes, the translation probabilities (in a translation matrix) are the final product of using the parallel corpus to train the translation model.</Paragraph>
      <Paragraph position="3"> To take into account the weight contributed by each stem, the model's iterative phase was adapted to use the sum of the weights of a word in a sentence instead of the count.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Candidate Stem Scoring
</SectionTitle>
      <Paragraph position="0"> As previously mentioned, each word has a list of substrings that are possible stems. We reduced the problem to that of placing two separators inside each Arabic word; the &amp;quot;candidate stems&amp;quot; are simply the substrings inside the separators. While this may seem inefficient, in practice words tend to be short, and one or two letter stems can be disallowed.</Paragraph>
      <Paragraph position="1"> An initial, naive approach when scoring the stem would be to simply look up its translation probability, given the English stem that is most likely to be its translation in the parallel sentence (i.e. the English stem aligned with the Arabic stem candidate).</Paragraph>
      <Paragraph position="2"> Figure 3 presents scoring examples before normalization. null Note that the algorithm to build the translation model is not a &amp;quot;resource&amp;quot; per se, since it is a language-independent algorithm. null English Phrase: the advisory committee  However, this approach has several drawbacks that prevent us from using it on a corpus other than the training corpus. Both of the drawbacks below are brought about by the small size of the parallel corpus: Out-of-vocabulary words: many Arabic stems will not be seen in the small corpus Unreliable translation probabilities for low-frequency stems.</Paragraph>
      <Paragraph position="3"> We can avoid these issues if we adopt an alternate view of stemming a word, by looking at the prefix and the suffix instead. Given the word, the choice of prefix and suffix uniquely determines the stem. Since the number of unique affixes is much smaller by definition, they will not have the two problems above, even when using a small corpus. These probabilities will be considerably more reliable and are a very important part of the information extracted from the parallel corpus. Therefore, the score of a candidate stem should be based on the score of the corresponding prefix and the suffix, in addition to the score of the stem string itself:</Paragraph>
      <Paragraph position="5"> where a = Arabic stem, p = prefix, s=suffix When scoring the prefix and the suffix, we could simply use their probabilities from the previous stemming iteration. However, there is additional information available that can be successfully used to condition and refine these probabilities (such as the length of the word, the part of speech tag if given etc.).</Paragraph>
      <Paragraph position="6"> English Phrase: the advisory committee  We explored several stem scoring models, using different levels of available information. Examples include: Use the stem translation probability alone score = t(aje) where a = Arabic stem, e = corresponding word in the English sentence Also use prefix (p) and suffix (s) conditional probabilities; several examples are given in table 2.</Paragraph>
      <Paragraph position="7">  The first two examples use the joint probability of the prefix and suffix, with a smoothing back-off (the product of the individual probabilities). Scoring models of this form proved to be poor performers from the beginning, and they were abandoned in favor of the last model, which is a fast, good approximation to the third model in Table 2. The last two models successfully solve the problem of the empty prefix and suffix accumulating excessive probability, which would yield to a stemmer that never removed any affixes. The results presented in the rest of the paper use the last scoring model.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Step 2: Using the Unlabeled Monolingual
Data
</SectionTitle>
      <Paragraph position="0"> This optional second step can adapt the trained stemmer to the problem at hand. Here, we are moving away from providing the English equivalent, and we are relying on learned prefix, suffix and (to a lesser degree) stem probabilities. In a new domain or corpus, the second step allows the stemmer to learn new stems and update its statistical profile of the previously seen stems.</Paragraph>
      <Paragraph position="1"> This step can be performed using monolingual Arabic data, with no annotation needed. Even though it is optional, this step is recommended since its sole resource can be the data we would need to stem anyway (see Figure 5).</Paragraph>
      <Paragraph position="2">  Step 1 produced a functional stemming model.</Paragraph>
      <Paragraph position="3"> We can use the corpus statistics gathered in Step 1 to stem the new, monolingual corpus. However, the scoring model needs to be modified, since t(aje) is no longer available. By removing the conditioning, the first/last letter scoring model we used becomes score = p(a) p(sjlast) p(pjfirst) The model can be updated if the stem candidate score/probability distribution is sufficiently skewed, and the monolingual text can be stemmed iteratively using the new model. The model is thus adapted to the particular needs of the new corpus; in practice, convergence is quick (less than 10 iterations).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>