<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0701">
  <Title>Memory-based morphological analysis generation and part-of-speech tagging of Arabic</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Memory-based learning
</SectionTitle>
    <Paragraph position="0"> Memory-based learning, also known as instancebased, example-based, or lazy learning (Aha et al., 1991; Daelemans et al., 1999), extensions of the k-nearest neighbor classifier (Cover and Hart, 1967), is a supervised inductive learning algorithm for learning classification tasks. Memory-based learning treats a set of labeled (pre-classified) training instances as points in a multi-dimensional feature space, and stores them as such in an instance base in memory. Thus, in contrast to most other machine learning algorithms, it performs no abstraction, which allows it to deal with productive but low-frequency exceptions (Daelemans et al., 1999).</Paragraph>
    <Paragraph position="1"> An instance consists of a fixed-length vector of n feature-value pairs, and the classification of that particular feature-value vector. After the instance base is stored, new (test) instances are classified by matching them to all instances in the instance base, and by calculating with each match the distance, given by a distance kernel function. Classification in memory-based learning is performed by the k-NN algorithm that searches for the k 'nearest neighbours' according to the [?](X,Y ) kernel function1.</Paragraph>
    <Paragraph position="2"> The distance function and the classifier can be refined by several kernel plug-ins, such as feature weighting (assigning larger distance to mismatches on important features), and distance weighting (assigning a smaller vote in the classification to more distant nearest neighbors). Details can be found in (Daelemans et al., 2004).</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="4" type="metho">
    <SectionTitle>
3 Morphological analysis
</SectionTitle>
    <Paragraph position="0"> We focus first on morphological analysis . Training on data extracted from the Arabic Treebank, we induce a morphological analysis generator which we control for undergeneralization (recall errors) and overgeneralization (precision errors).</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Data
3.1.1 Arabic Treebank
</SectionTitle>
      <Paragraph position="0"> Our point of departure is the Arabic Treebank 1 (ATB1), version 3.0, distributed by LDC in 2005, more specifically the &amp;quot;after treebank&amp;quot; PoS-tagged data. Unvoweled tokens as they appear in the original news paper are accompanied in the treebank by vocalized versions; all of their morphological analyses are generated by means of Tim Buckwalter's Arabic Morphological Analyzer (Buckwalter, 2002), and the appropriate morphological analysis is singled out. An example is given in Figure 1. The input token (INPUT STRING) is transliterated (LOOK-UP WORD) according to Buckwalter's transliteration system. All possible vocalizations and their morphological analyzes are listed (SOLUTION). The analysis is rule-based, and basically consists of three steps.</Paragraph>
      <Paragraph position="1"> First, all possible segmentations of the input string  SOLUTION 6: (kutubi) [kitAb_1] kutub/NOUN+i/CASE_DEF_GEN (GLOSS): books + [def.gen.] SOLUTION 7: (kutubN) [kitAb_1] kutub/NOUN+N/CASE_INDEF_NOM (GLOSS): books + [indef.nom.] SOLUTION 8: (kutubK) [kitAb_1] kutub/NOUN+K/CASE_INDEF_GEN (GLOSS): books + [indef.gen.] SOLUTION 9: (ktb) [DEFAULT] ktb/NOUN_PROP (GLOSS): NOT_IN_LEXICON SOLUTION 10: (katb) [DEFAULT] ka/PREP+tb/NOUN_PROP (GLOSS): like/such as + NOT_IN_LEXICON  in terms of prefixes (0 to 4 characters long), stems (at least one character), and suffixes (0 to 6 characters long) are generated. Next, dictionary lookup is used to determine if these segments are existing morphological units. Finally, the numbers of analyses is further reduced by checking for the mutual compatibility of prefix+stem, stem+suffix, and prefix+stem in three compatibility tables. The resulting analyses have to a certain extent been manually checked. Most importantly, a star (*) preceding a solution indicates that this is the correct analysis in the given context.</Paragraph>
      <Paragraph position="2">  We grouped the 734 files from the treebank into eleven parts of approximately equal size. Ten parts were used for training and testing our morphological analyzer, while the final part was used as held-out material for testing the morphological analyzer in combination with the PoS tagger (described in Section 4).</Paragraph>
      <Paragraph position="3"> In the corpus the number of analyses per word is not entirely constant, either due to the automatic generation method or to annotator edits. As our initial goal is to predict all possible analyses for a given word, regardless of contextual constraints, we first created a lexicon that maps every word to all analyses encountered and their respective frequencies From the 185,061 tokens in the corpus, we extracted 16,626 unique word types - skipping punctuation tokens - and 129,655 analyses, which amounts to 7.8 analyses per type on average.</Paragraph>
      <Paragraph position="5"> in Figure 1.</Paragraph>
      <Paragraph position="6">  These separate lexicons were created for training and testing material. The lexical entries in a lexicon were converted to instances suitable to memory-based learning of the mapping from words to their analyses (van den Bosch and Daelemans, 1999). Instances consist of a sequence of feature values and a corresponding class, representing a potentially complex morphological operation.</Paragraph>
      <Paragraph position="7"> The features are created by sliding a window over the unvoweled look-up word, resulting in one instance for each character. Using a 5-1-5 window yields 11 features, i.e. the input character in focus, plus the five preceding and five following characters. The equal sign (=) is used as a filler symbol.</Paragraph>
      <Paragraph position="8"> The instance classes represent the morphological analyses. The classes corresponding to a word's characters should enable us to derive all associated analyses. This implies that the classes need to encode several aspects simultaneously: vocalization, morphological segmentation and tagging. The following template describes the format of classes:</Paragraph>
      <Paragraph position="10"> following vowels &amp; tags For example, the classes of the instances in Figure 2 encode the ten solutions for the word ktb in  it allows for a simple derivation of the solution, akin to the way that the pieces of a jigsaw puzzle can be combined. We can exhaustively try all combinations of the subanalyses of the classes, and check if the right side of one subanalysis matches the left side of a subsequent subanalysis. This reconstruction process is illustrated in Figure 3 (only two reconstructions are depicted, corresponding to SOLUTION 1 and SOLUTION 4). For example, the subanalysis ka from the first class in Figure 2 matches the subanalysis ata from the sec-</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Initial Experiments
</SectionTitle>
      <Paragraph position="0"> To test the feasibility of our approach, we first train and test on the full data set. Timbl is used with its default settings (overlap distance function, gain-ratio feature weighting, k = 1). Rather than evaluating on the accuracy of predicting the complex classes, we evaluate on the complete correctness of all reconstructed analyses, in terms of precision, recall, and F-score (van Rijsbergen, 1979). As expected, this results in a near perfect recall (97.5). The precision, however, is much lower (52.5), indicating a substantial amount of analysis overgeneration; almost one in two generated analyses is actually not valid. With an F-score of only 68.1, we are clearly not able to reproduce the training data perfectly.</Paragraph>
      <Paragraph position="1"> Next we split the data in 9 parts for training and 1 part for testing. The k-NN classifier is again used with its default settings. Table 1 shows the results broken down into known and unknown words. As known words can be looked up in the lexicon derived from the training material, the first row presents the results with lookup and the second row without lookup (that is, with prediction). The fact that even with lookup the performance is not perfect shows that the upper bound for this task is not 100%. The reason is that apparantly some words in the test material have received analyses that never occur in the training material and vice versa. For known words without lookup, the recall is still good, but the precision is low. This is consistent with the initial results mentioned above. For unknown words, both recall and precison are much worse, indicating rather poor generalization.</Paragraph>
      <Paragraph position="2"> To sum up, there appear to be problems with both the precision and the recall. The precision is low for known words and even worse for unknown words.</Paragraph>
      <Paragraph position="3">  recall, split into known and unknown words.</Paragraph>
      <Paragraph position="4"> Analysis overgeneration seems to be a side effect of the way we encode and reconstruct the analyses.</Paragraph>
      <Paragraph position="5"> The recall is low for unknown words only. There appear to be at least two reasons for this undergeneration problem. First, if just one of the predicted classes is incorrect (one of the pieces of the jigsaw puzzle is of the wrong shape) then many, or even all of the reconstructions fail. Second, some generalizations cannot be made, because infrequent classes are overshadowed by more frequent ones with the same features. Consider, for example, the instance for the third character (l) of the word jEl:</Paragraph>
      <Paragraph position="7"> Its real class in the test data is: al/VERB_PERFECT+;ol/NOUN+ When the k-NN classifier is looking for its nearest neighbors, it finds three; two with a &amp;quot;verb imperfect&amp;quot; tag, and one with a &amp;quot;noun&amp;quot; tag.</Paragraph>
      <Paragraph position="8"> { al/VERB_IMPERFECT+ 2, ol/NOUN+ 1} Therefore, the class predicted by the classifier is al/VERB IMPERFECT+, because this is the majority class in the NN-set. So, although a part of the correct solution is present in the NN-set, simple majority voting prevents it from surfacing in the output.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Improving recall
</SectionTitle>
      <Paragraph position="0"> In an attempt to address the low recall, we revised our experimental setup to take advantage of the complete NN-set. As before, the k-NN classifier is used,  experiment, split into known and unknown words but rather than relying on the classifier to do the majority voting over the (possibly weighted) classes in the k-NN set and to output a single class, we perform a reconstruction of analyses combining all classes in the k-NN set. To allow for more classes in k-NN's output, we increase k to 3 while keeping the other settings as before. As expected, this approach increases the number of analyses. This, in turn, increases the recall dramatically, up to nearly perfect for known words; see Table 2. However, this gain in recall is at the expense of the precision, which drops dramatically. So, although our revised approach solves the issues above, it introduces massive overgeneration.</Paragraph>
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.4 Improving precision
</SectionTitle>
      <Paragraph position="0"> We try to tackle the overgeneration problem by filtering the analyses in two ways. First, by ranking the analyses and limiting output to the n-best. The ranking mechanism relies on the distribution of the classes in the NN-set. Normally, some classes occur more frequently than others in the NN-set. During the reconstruction of a particular analysis, we sum the frequencies of the classes involved. The resulting score is then used to rank the analyses in decreasing order, which we filter by taking the n-best.</Paragraph>
      <Paragraph position="1"> The second filter employs the fact that only certain sequences of morphological tags are valid. Tag bigrams are already implicit in the way that the classes are constructed, because a class contains the tags preceding and following the input character. However, cooccurrence restrictions on tags may stretch over longer distances; tag trigram information is not available at all. We therefore derive a frequency list of all tag trigrams occurring in the training data. This information is then used to filter analyses containing tag trigrams occurring below a certain frequency threshold in the training data.</Paragraph>
      <Paragraph position="2"> Both filters were optimized on the fold that was used for testing so far, maximizing the overall Fscore. This yieled an n-best value of 40 and tag frequency treshold of 250. Next, we ran a 10-fold cross-validation experiment on all data (except the held out data) using the method described in the previous section in combination with the filters. Average scores of the 10 folds are given in Table 3. In comparison with the initial results, both precision and recall on unknown words has improved, indicating that overgeneration and undergeneration can be midly counteracted.</Paragraph>
    </Section>
    <Section position="5" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Discussion
</SectionTitle>
      <Paragraph position="0"> Admittedly, the performance is not very impressive.</Paragraph>
      <Paragraph position="1"> We have to keep in mind, however, that the task is not an easy one. It includes vowel insertion in ambiguous root forms, which - in contrast to vowel insertion in prefixes and suffixes - is probably irregular and unpredictable, unless the appropriate stem would be known. As far as the evaluation is concerned, we are unsure whether the analyses found in the treebank for a particular word are exhaustive. If not, some of the predictions that are currently counted as precision errors (overgeneration) may in fact be correct alternatives.</Paragraph>
      <Paragraph position="2"> Since instances are generated for each type rather than for each token in the data, the effect of token frequency on classification is lost. For example, instances from frequent tokens are more likely to occur in the k-NN set, and therefore their (partial) analyses will show up more frequently. This is an issue to explore in future work. Depending on the application, it may also make sense to optimize on the correct prediction of unkown words, or on increasing only the recall.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="6" type="metho">
    <SectionTitle>
4 Part-of-speech tagging
</SectionTitle>
    <Paragraph position="0"> We employ MBT, a memory-based tagger-generator and tagger (Daelemans et al., 1996) to produce a part-of-speech (PoS) tagger based on the ATB1 corpus2. We first describe how we prepared the corpus data. We then describe how we generated the tagger (a two-module tagger with a module for known words and one for unknown words), and subsequently we report on the accuracies obtained on test material by the generated tagger. We conclude this  words (left) and their respective PoS tags (right). section by describing the effect of using the output of the morphological analyzer as extra input to the tagger.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Data preparation
</SectionTitle>
      <Paragraph position="0"> While the morphological analyzer attempts to generate all possible analyses for a given unvoweled word, the goal of PoS tagging is to select one of these analyses as the appropriate one given the context, as the annotators of the ATB1 corpus did using the * marker. We developed a PoS tagger that is trained to predict an unvoweled word in context, a concatenation of the PoS tags of its morphemes. Essentially this is the task of the morphological analyzer without segmentation and vocalization. Figure 4 shows part of a sentence where for each word the respective tag is given in the second column. Concatenation is marked by the delimiter +.</Paragraph>
      <Paragraph position="1"> We trained on the full ten folds used in the previous sections, and tested on the eleventh fold. The training set thus contains 150,966 words in 4,601 sentences; the test set contains 15,102 words in 469 sentences. 358 unique tags occur in the corpus. In the test set 947 words occur that do not occur in the training set.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
4.2 Memory-based tagger generator
</SectionTitle>
      <Paragraph position="0"> Memory-based tagging is based on the idea that words occurring in similar contexts will have the same PoS tag. A particular instantiation, MBT, was proposed in (Daelemans et al., 1996). MBT has three modules. First, it has a lexicon module which stores for all words occurring in the provided training corpus their possible PoS tags (tags which occur below a certain threshold, default 5%, are ignored). Second, it generates two distinct taggers; one for known words, and one for unknown words.</Paragraph>
      <Paragraph position="1"> The known-word tagger can obviously benefit from the lexicon, just as a morphological analyzer could. The input on which the known-word tagger bases its prediction for a given focus word consists of the following set of features and parameter settings: (1) The word itself, in a local context of the two preceding words and one subsequent word.</Paragraph>
      <Paragraph position="2"> Only the 200 most frequent words are represented as themselves; other words are reduced to a generic string - cf. (Daelemans et al., 2003) for details. (2) The possible tags of the focus word, plus the possible tags of the next word, and the disambiguated tags of two words to the left (which are available because the tagger operates from the beginning to the end of the sentence). The known-words tagger is based on a k-NN classifier with k = 15, the modified value difference metric (MVDM) distance function, inverse-linear distance weighting, and GR feature weighting. These settings were manually optimized on a held-out validation set (taken from the training data).</Paragraph>
      <Paragraph position="3"> The unknown-word tagger attempts to derive as much information as possible from the surface form of the word, by using its suffix and prefix letters as features. The following set of features and parameters are used: (1) The three prefix characters and the four suffix characters of the focus word (possibly encompassing the whole word); (2) The possible tags of the next word, and the disambiguated tags of two words to the left. The unknown-words tagger is based on a k-NN classifier with k = 19, the modified value difference metric (MVDM) distance function, inverse-linear distance weighting, and GR feature weighting - again, manually tuned on validation material.</Paragraph>
      <Paragraph position="4"> The accuracy of the tagger on the held-out corpus is 91.9% correctly assigned tags. On the 14155 known words in the test set the tagger attains an accuracy of 93.1%; on the 947 unknown words the accuracy is considerably lower: 73.6%.</Paragraph>
      <Paragraph position="5"> 5 Integrating morphological analysis and part-of-speech tagging While morphological analysis and PoS tagging are ends in their own right, the usual function of the two modules in higher-level natural-language processing or text mining systems is that they jointly determine for each word in a text the appropriate single morpho-syntactic analysis. In our setup, this  bound experiment with gold-standard PoS tags; the bottom line represents the experiment with predicted PoS tags.</Paragraph>
      <Paragraph position="6"> amounts to predicting the solution that is preceded by &amp;quot;*&amp;quot; in the original ATB1 data. For this purpose, the PoS tag predicted by MBT, as described in the previous section, serves to select the morphological analysis that is compatible with this tag. We employed the following two rules to implement this: (1) If the input word occurs in the training data, then look up the morphological analyses of the word in the training-based lexicon, and return all morphological analyses with a PoS content matching the tag predicted by the tagger. (2) Otherwise, let the memory-based morphological analyzer produce analyses, and return all analyses with a PoS content matching the predicted tag.</Paragraph>
      <Paragraph position="7"> We first carried out an experiment integrating the output of the morphological analyzer and the PoS tagger, faking perfect tagger predictions, in order to determine the upper bound of this approach. Rather than predicting the PoS tag with MBT, we directly derived the PoS tag from the annotations in the treebank. The upper result line in Table 4 displays the precision and recall scores on the held-out data of identifying the appropriate morphological analysis, i.e. the solution marked by *. Unsurprisingly, the recall on known words is 99.5%, since we are using the gold-standard PoS tag which is guaranteed to be among the training-based lexicon, except for some annotation discrepancies. More interestingly, about one in four analyses of known words matching on PoS tags actually mismatches on vowel or consonant changes, e.g. because it represents a different stem - which is unpredictable by our method.</Paragraph>
      <Paragraph position="8"> About one out of four unknown words has morphological analyses that do not match the gold-standard PoS (a recall of 73.4); at the same time, a considerable amount of overgeneration of analyses accounts for the low amount of analyses that matches (a precision of 30.2).</Paragraph>
      <Paragraph position="9"> Next, the experiment was repeated with predicted PoS tags and morphological analyses. The results are presented in the bottom result line of Table 4.</Paragraph>
      <Paragraph position="10"> The precision and recall of identifying correct analyses of known words degrades as compared to the upper-bounds results due to incorrect PoS tag predictions. On unknown words the combination of heavy overgeneration by the morphological analyzer and the 73.6% accuracy of the tagger leads to a low precision of 23.9 and a fair recall of 59.0. On both known and unknown words the integration of the morphological analyzer and the tagger is able to narrow down the analyses by the analyzer to a subset of matching analyses that in about nine out of ten cases contains the &amp;quot;* SOLUTION&amp;quot; word.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="6" end_page="7" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> The application of machine learning methods to Arabic morphology and PoS tagging appears to be somewhat limited and recent, compared to the vast descriptive and rule-based literature particularly on morphology (Kay, 1987; Beesley, 1990; Kiraz, 1994; Beesley, 1998; Cavalli-Sfora et al., 2000; Soudi, 2002).</Paragraph>
    <Paragraph position="1"> We are not aware of any machine-learning approach to Arabic morphology, but find related issues treated in (Daya et al., 2004), who propose a machine-learning method augmented with linguistic constraints to identifying roots in Hebrew words a related but reverse task to ours. Arabic PoS tagging seems to have attracted some more attention. Freeman (2001) describes initial work in developing a PoS tagger based on transformational error-driven learning (i.e. the Brill tagger), but does not provide performance analyses. Khoja (2001) reports a 90% accurate morpho-syntactic statistical tagger that uses  the Viterbi algorithm to select a maximally-likely part-of-speech tag sequence over a sentence. Diab et al. (2004) describe a part-of-speech tagger based on support vector machines that is trained on tokenized data (clitics are separate tokens), reporting a tagging accuracy of 95.5%.</Paragraph>
  </Section>
class="xml-element"></Paper>