<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0708"> <Title>POS Tagging of Dialectal Arabic: A Minimally Supervised Approach</Title> <Section position="4" start_page="57" end_page="57" type="metho"> <SectionTitle> 3 Baseline Tagger </SectionTitle> <Paragraph position="0"> We use a statistical trigram tagger in the form of a hidden Markov model (HMM). Let w0:M be a sequence of words (w0,w1,... ,wM ) and t0:M be the corresponding sequence of tags. The trigram HMM computes the conditional probability of the word and tag sequence p(w0:M,t0:M) as:</Paragraph> <Paragraph position="2"> The lexical model p(wi|ti) characterizes the distribution of words for a specific tag; the contextual model p(ti|ti[?]1,ti[?]2) is trigram model over the tag sequence. For notational simplicity, in subsequent sections we will denote p(ti|ti[?]1,ti[?]2) as p(ti|hi), where hi represents the tag history.</Paragraph> <Paragraph position="3"> The HMM is trained to maximize the likelihood of the training data. In supervised training, both tag and word sequences are observed, so the maximum likelihood estimate is obtained by relative frequency counting, and, possibly, smoothing. During unsupervised training, the tag sequences are hidden, and the Expectation-Maximization Algorithm is used to iteratively update probabilities based on expected counts. Unsupervised tagging requires a lexicon specifying the set of possible tags for each word. Given a test sentence wprime0:M , the Viterbi algorithm is used to find the tag sequence maximizing the probability of tags given words:</Paragraph> <Paragraph position="5"> are implemented using the Graphical Models Toolkit (GMTK) (Bilmes and Zweig, 2002), which allows training a wide range of probabilistic models with both hidden and observed variables.</Paragraph> <Paragraph position="6"> As a first step, we compare the performance of four different baseline systems on our ECA dev set.</Paragraph> <Paragraph position="7"> First, we trained a supervised tagger on the MSA treebank corpus (System I), in order to gauge how a standard system trained on written Arabic performs on dialectal speech. The second system (System II) is a supervised tagger trained on the manual ECA POS annotations. System III is an unsupervised tagger on the ECA training set. The lexicon for this system is derived from the reference annotations of the training set - thus, the correct tag is not known during training, but the lexicon is constrained by expert information. The difference in accuracy between Systems II and III indicates the loss due to the unsupervised training method. Finally, we trained a system using only the unannotated ECA data and a lexicon generated by applying the MSA analyzer to the training corpus and collecting all resulting tags for each word. In this case, the lexicon is much less constrained; moreover, many words do not receive an output from the stemmer at all. This is the training method with the least amount of supervision and therefore the method we are interested in most.</Paragraph> <Paragraph position="8"> Table 5 shows the accuracies of the four systems on the ECA development set. Also included is a breakdown of accuracy by analyzable (AW), unanalyzable (UW), and out-of-vocabulary (OOV) words.</Paragraph> <Paragraph position="9"> Analyzable words are those for which the stemmer outputs possible analyses; unanalyzable words cannot be processed by the stemmer. The percentage of unanalyzable word tokens in the dev set is 18.8%. 
</Section> <Section position="5" start_page="57" end_page="58" type="metho"> <SectionTitle> 4 System Improvements </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="57" end_page="58" type="sub_section"> <SectionTitle> 4.1 Adding Affix Features </SectionTitle> <Paragraph position="0"> The low accuracy of unanalyzable and OOV words may significantly impact downstream applications.</Paragraph> <Paragraph position="1"> (Footnote: A tagger that labels every word with the single most likely tag (NN) achieves an accuracy of 20%; a tagger that labels each word with the most likely tag among its possible tags achieves an accuracy of 52%.)</Paragraph> <Paragraph position="2"> One common way to improve accuracy is to add word features. In particular, we are interested in features that can be derived automatically from the script form, such as affixes. Affix features are added in a Naive Bayes fashion to the basic HMM model defined in Eq. (1): in addition to the lexical model p(w_i|t_i), we now have prefix and suffix models p(a_i|t_i) and p(b_i|t_i), where a and b are the prefix and suffix variables, respectively. The affixes used are a fixed set of Arabic prefixes and suffixes [rendered in Arabic script in the original; not recoverable from this version]. Affixes are derived for each word by simple substring matching. More elaborate techniques are not used, in keeping with a minimally supervised approach that does not require dialect-specific knowledge.</Paragraph> </Section> <Section position="2" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 4.2 Constraining the Lexicon </SectionTitle> <Paragraph position="0"> The quality of the lexicon has a major impact on unsupervised HMM training. Banko et al. (2004) demonstrated that tagging accuracy improves when the number of possible tags per word in a "noisy lexicon" can be restricted based on corpus frequency. In the current setup, words that are not analyzable by the MSA stemmer are initially assigned all possible tags, with the exception of obviously restricted tags such as the begin- and end-of-sentence tags, NO-TAG, etc. Our goal is to constrain the set of tags for these unanalyzable words. To this end, we cluster both analyzable and unanalyzable words, and reduce the set of possible tags for each unanalyzable word based on its cluster membership. Several different clustering algorithms could in principle be used; here we utilize Brown's clustering algorithm (Brown et al., 1992), which iteratively merges word clusters with high mutual information based on distributional criteria. The tagger lexicon is then derived as follows (a code sketch of steps 2-4 is given after the list): 1. Generate K clusters of words from the data.</Paragraph> <Paragraph position="1"> 2. For each cluster C, calculate \(P(t|C) = \sum_{w \in A \cap C} P(t|w)\,P(w|C)\), where t and w are the tag and word, and A is the set of analyzable words.</Paragraph> <Paragraph position="2"> 3. Determine the cluster's tagset by choosing all tags t' with P(t'|C) above a certain threshold γ.</Paragraph> <Paragraph position="3"> 4. Give all unanalyzable words within this cluster these possible tags.</Paragraph>
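<Paragraph position="4"> The sketch below illustrates steps 2-4 in Python. It assumes P(t|w) is uniform over each analyzable word's stemmer tags and P(w|C) is estimated from corpus frequencies, renormalized over the analyzable words of the cluster; the paper does not specify these estimation details, and all names are hypothetical.

```python
from collections import defaultdict

def constrain_lexicon(clusters, lexicon, word_counts, analyzable, gamma=0.05):
    """Steps 2-4 of the lexicon-constraining procedure (illustrative sketch).

    clusters:    dict word -> cluster id (from Brown clustering, step 1)
    lexicon:     dict word -> set of possible tags
    word_counts: dict word -> corpus frequency
    analyzable:  set of words the MSA stemmer can analyze
    gamma:       tag-probability threshold (e.g. 0.05)
    """
    members = defaultdict(list)
    for w, c in clusters.items():
        members[c].append(w)

    new_lexicon = dict(lexicon)
    for c, words in members.items():
        # Step 2: P(t|C) = sum over analyzable words w in C of P(t|w) P(w|C)
        aw = [w for w in words if w in analyzable]
        total = sum(word_counts[w] for w in aw)
        if total == 0:
            continue                                  # no analyzable evidence
        p_t = defaultdict(float)
        for w in aw:
            p_w_c = word_counts[w] / total            # P(w|C), renormalized
            tags = lexicon[w]
            for t in tags:
                p_t[t] += (1.0 / len(tags)) * p_w_c   # uniform P(t|w) assumed
        # Step 3: keep tags whose cluster probability exceeds gamma
        tagset = {t for t, p in p_t.items() if p > gamma}
        # Step 4: assign this tagset to the cluster's unanalyzable words
        for w in words:
            if w not in analyzable and tagset:
                new_lexicon[w] = tagset
    return new_lexicon
```

</Paragraph>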
<Paragraph position="5"> The number of clusters K and the threshold γ are variables that affect the final tagset for unanalyzable words. Using K = 200 and γ = 0.05, for instance, the number of tags per unanalyzable word reduces to an average of four and ranges from one to eight tags. There is a tradeoff regarding the degree of tagset reduction: choosing fewer tags results in less confusability, but may also remove the correct tag from a word's lexicon entry. We did not optimize K and γ, since in a genuinely minimally supervised setting no annotated development set is available for calculating accuracy. Nevertheless, we have observed that tagset reduction generally leads to improvements over the baseline system with an unconstrained lexicon.</Paragraph> <Paragraph position="6"> The improvements gained from adding affix features to System IV and constraining the lexicon are shown in Table 6. We notice that adding affix features yields improvements in OOV accuracy. The relationship between the constrained lexicon and unanalyzable word accuracy is less straightforward. In this case, the degradation of unanalyzable word accuracy is due to the fact that the constrained lexicon over-restricts the tagsets of some frequent unanalyzable words. However, the constrained lexicon generally improves the overall accuracy and is thus a viable technique. Finally, the combination of affix features and constrained lexicon results in a tagger with 69.83% accuracy, a 7% absolute improvement over the baseline System IV.</Paragraph> </Section> </Section> <Section position="6" start_page="58" end_page="60" type="metho"> <SectionTitle> 5 Cross-Dialectal Data Sharing </SectionTitle> <Paragraph position="0"> Next we examine whether unannotated corpora in other dialects (LCA) can be used to further improve the ECA tagger. The problem of data sparseness for Arabic dialects would be less severe if we were able to exploit the commonalities between similar dialects. In natural language processing, Kim and Khudanpur (2004) have explored techniques for using parallel Chinese/English corpora for language modeling. Parallel corpora have also been used to infer morphological analyzers, POS taggers, and noun phrase bracketers by projection via word alignments (Yarowsky et al., 2001). In (Hana et al., 2004), Czech data is used to develop a morphological analyzer for Russian.</Paragraph> <Paragraph position="1"> In contrast to these works, we do not require parallel/comparable corpora or a bilingual dictionary, which may be difficult to obtain. Our goal is to develop general algorithms for exploiting the commonalities across dialects when developing a tool for one specific dialect. Although dialects can differ very strongly, they are similar in that they exhibit morphological simplifications and a different word order compared to MSA (e.g., SVO rather than VSO order), and close dialects share some vocabulary.</Paragraph> <Paragraph position="2"> Each of the tagger components (i.e., the contextual model p(t_i|h_i), the lexical model p(w_i|t_i), and the affix models p(a_i|t_i)p(b_i|t_i)) can be shared during training. In the following, we present two approaches to sharing data between dialects, each derived from different assumptions about the underlying data generation process.</Paragraph> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 5.1 Contextual Model Interpolation </SectionTitle> <Paragraph position="0"> Contextual model interpolation is a widely used data-sharing technique which assumes that models trained on data from different sources can be "mixed" in order to provide the most appropriate probability distribution for the target data. In our case, we have LCA as the out-of-domain data source and ECA as the in-domain data source, with the former being about four times larger than the latter.</Paragraph> <Paragraph position="1"> If properly combined, the larger amount of out-of-domain data might improve the robustness of the in-domain tagger. We therefore use a linear interpolation of the in-domain and out-of-domain contextual models. The joint probability p(w_{0:M}, t_{0:M}) becomes:</Paragraph> <Paragraph position="2">

\[
p(w_{0:M}, t_{0:M}) = \prod_{i=0}^{M} p_E(w_i \mid t_i) \left[ \lambda\, p_E(t_i \mid h_i) + (1 - \lambda)\, p_L(t_i \mid h_i) \right]
\]

</Paragraph> <Paragraph position="3"> Here λ defines the interpolation weight for the ECA contextual model p_E(t_i|h_i) against the LCA contextual model p_L(t_i|h_i), and p_E(w_i|t_i) is the ECA lexical model. The interpolation weight λ is estimated by maximizing the likelihood of a held-out data set given the combined model. As an extension, we allow the interpolation weight to be a function of the current tag, λ(t_i), since class-dependent interpolation has shown improvements over basic interpolation in applications such as language modeling (Bulyko et al., 2003).</Paragraph>
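<Paragraph position="4"> A single weight of this kind can be estimated on held-out data with the standard EM update for a two-component mixture, as in the following sketch (names illustrative; the paper does not spell out the estimation procedure beyond held-out likelihood maximization):

```python
def estimate_lambda(heldout, p_eca, p_lca, iters=20):
    """EM estimation of the weight in lambda*p_E(t|h) + (1-lambda)*p_L(t|h).

    heldout: list of (tag, history) pairs from held-out ECA data
    p_eca, p_lca: smoothed contextual models p(t | h) for each dialect
    """
    lam = 0.5
    for _ in range(iters):
        post = 0.0
        for t, h in heldout:
            pe = lam * p_eca(t, h)
            pl = (1.0 - lam) * p_lca(t, h)
            post += pe / (pe + pl)   # posterior that the ECA model produced t
        lam = post / len(heldout)    # re-estimate the mixture weight
    return lam
```

For the class-dependent case λ(t_i), the same update can be applied separately to the held-out events grouped by the current tag.</Paragraph>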
</Section> <Section position="2" start_page="59" end_page="60" type="sub_section"> <SectionTitle> 5.2 Joint Training of Contextual Model </SectionTitle> <Paragraph position="0"> As an alternative to model interpolation, we consider training a single model jointly on the two data sets. The underlying assumption of this technique is that tag sequences in LCA and ECA are generated by the same process, whereas the observations (the words) are generated from the tags by two different processes in the two dialects.</Paragraph> <Paragraph position="1"> The HMM for joint training is expressed as:</Paragraph> <Paragraph position="2">

\[
p(w_{0:M}, t_{0:M}) = \prod_{i=0}^{M} p(t_i \mid h_i)\, p_E(w_i \mid t_i)^{a_i}\, p_L(w_i \mid t_i)^{1 - a_i},
\qquad
a_i = \begin{cases} 1 & \text{if word } w_i \text{ is from the ECA corpus} \\ 0 & \text{if word } w_i \text{ is from the LCA corpus} \end{cases}
\]

</Paragraph> <Paragraph position="3"> A single conditional probability table is used for the contextual model, whereas the lexical model switches between two different parameter tables, one for LCA observations and one for ECA observations. During training, the contextual model is trained jointly on both the ECA and LCA data; however, the data is divided into ECA and LCA portions when updating the lexical models. Both the contextual and the lexical models are trained within the same training pass. A graphical model representation of this system is shown in Figure 1. This model can be implemented using the switching-parents functionality (Bilmes, 2000) provided by GMTK.</Paragraph> <Paragraph position="4"> During decoding, the tagger can in principle switch its lexical model to ECA or LCA depending on the input; the system is thus essentially a multi-dialect tagger. In the experiments reported below, however, we test exclusively on ECA, and the LCA lexical model is not used. Due to the larger amount of data available to the contextual model, joint training can be expected to improve performance on the target dialect. The affix models can be trained jointly in a similar fashion.</Paragraph>
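<Paragraph position="5"> The switching structure can be illustrated as follows. For clarity this sketch scores fully observed (word, tag) data; in the paper the tags are hidden and GMTK's EM training supplies the expected counts, so this is only meant to show how the shared contextual table and the two dialect-specific lexical tables interact. All names are hypothetical.

```python
import math

def joint_log_likelihood(tagged_sents, p_ctx, p_lex_eca, p_lex_lca):
    """Log-likelihood under the joint model: one shared contextual table,
    two dialect-specific lexical tables selected by the switch a_i.

    tagged_sents: list of sentences, each a list of (word, tag, is_eca)
    """
    total = 0.0
    for sent in tagged_sents:
        t2 = t1 = "<s>"                                  # boundary tags
        for w, t, is_eca in sent:
            p_lex = p_lex_eca if is_eca else p_lex_lca   # switching parent a_i
            total += math.log(p_ctx(t, t1, t2)) + math.log(p_lex(w, t))
            t2, t1 = t1, t
    return total
```

Maximizing this objective updates p_ctx from all sentences regardless of dialect, while each lexical table only accumulates counts from its own dialect's words.</Paragraph>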
</Section> <Section position="3" start_page="60" end_page="60" type="sub_section"> <SectionTitle> 5.3 Data Sharing Results </SectionTitle> <Paragraph position="0"> The results for data sharing are shown in Table 7.</Paragraph> <Paragraph position="1"> The systems Interpolate-λ and Interpolate-λ(t_i) are taggers built by interpolation and class-dependent interpolation, respectively. For joint training, we present results for two systems: JointTrain(1:4) is trained on the existing ECA and LCA data, which have a 1:4 ratio in terms of corpus size; JointTrain(2:1) weights the ECA data twice, in order to bias the training process more towards the ECA distribution. We also provide results for two more taggers: the first (CombineData) is trained "naively" on the pooled data from both ECA and LCA, without any weighting, interpolation, or changes to the probabilistic model. The second (CombineLex) uses a contextual model trained on ECA only and a lexical model estimated from both ECA and LCA data. The latter was trained in order to assess the potential for improvement due to the reduction in OOV rate on the dev set when adding the LCA data (cf. Table 4).</Paragraph> <Paragraph position="2"> All of the above systems utilize the constrained lexicon, as it consistently gives improvements.</Paragraph> <Paragraph position="3"> Table 7 shows that, as expected, the brute-force combination of the training data is not helpful and degrades performance. CombineLex results in higher accuracy but does not outperform the models in Table 6. The same is true of the taggers using model interpolation. The best performance is obtained by the system using the joint contextual model with separate lexical models and 2:1 weighting of ECA vs. LCA data. Finally, we added word affix information to the best shared-data system, which resulted in an accuracy of 70.88%. In contrast, adding affix features to CombineData achieves only 61.78%, suggesting that the improvement from JointTrain comes from the joint training technique rather than from the simple addition of new data. This result is directly comparable to the best system in Section 4 (last row of Table 6). (Footnote 2: We also experimented with joint training of ECA+MSA. This gave good OOV accuracy, but overall it did not improve over the best system in Section 4. Also, note that all accuracies are calculated by ignoring the scoring of ambiguous words, which have several possible tags as the correct reference. If we score an ambiguous word as correct whenever the hypothesized tag is within this set, the accuracy of the ECA+LCA+affix JointTrain system rises to 77.18%, which is an optimistic upper bound on the total accuracy.)</Paragraph> <Paragraph position="4"> The analysis of tagging errors revealed that the most frequent confusions are between VBD/NNS, [...] labeled as a particle in the reference but most often tagged as a verb, which is also a reasonable tag.</Paragraph> </Section> </Section> </Paper>