<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1073"> <Title>Maximum Entropy Based Restoration of Arabic Diacritics</Title> <Section position="5" start_page="577" end_page="578" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> Diacritic restoration has been receiving increasing attention and has been the focus of several studies. In (El-Sadany and Hashish, 1988), a rule-based method that uses a morphological analyzer for vowelization was proposed. Another rule-based grapheme-to-sound conversion approach appeared in 2003 (El-Imam, 2003). The main drawback of these rule-based methods is that it is difficult to keep the rules up to date and to extend them to other Arabic dialects. Also, new rules are required due to the changing nature of any &quot;living&quot; language.</Paragraph> <Paragraph position="1"> More recently, several studies have used alternative approaches to the diacritization problem. In (Emam and Fisher, 2004), an example-based hierarchical top-down approach is proposed. First, the training data is searched hierarchically for a matching sentence. If there is a matching sentence, the whole utterance is used.</Paragraph> <Paragraph position="2"> Otherwise, they search for matching phrases, then words, to restore diacritics. If there is no match at all, character n-gram models are used to diacritize each word in the utterance.</Paragraph> <Paragraph position="3"> In (Vergyri and Kirchhoff, 2004), diacritics in conversational Arabic are restored by combining morphological and contextual information with an acoustic signal. Diacritization is treated as an unsupervised tagging problem where each word is tagged as one of the many possible forms provided by Buckwalter's morphological analyzer (Buckwalter, 2002). The Expectation Maximization (EM) algorithm is used to learn the tag sequences.</Paragraph> <Paragraph position="4"> Y. Gal (Gal, 2002) used an HMM-based diacritization approach. This method is a white-space-delimited, word-based approach that restores only vowels (a subset of all diacritics).</Paragraph> <Paragraph position="5"> Most recently, a weighted finite state machine based algorithm was proposed (Nelken and Shieber, 2005). This method employs characters and larger morphological units in addition to words. Among all the previous studies, this one is the most sophisticated in terms of integrating multiple information sources and formulating the problem as a search task within a unified framework. This approach also shows competitive accuracy compared to previous studies. In their algorithm, a character-based generative diacritization scheme is enabled only for words that do not occur in the training data. It is not clearly stated in the paper whether their method predicts the diacritics shadda and sukuun.</Paragraph> <Paragraph position="6"> Even though the methods proposed for diacritic restoration have been maturing and improving over time, they are still limited in terms of coverage and accuracy. In the approach we present in this paper, we propose to restore the most comprehensive list of diacritics used in Arabic text.</Paragraph> <Paragraph position="7"> Our method differs from the previous approaches in the way the diacritization problem is formulated and in the integration of multiple information sources. 
We view the diacritic restoration problem as sequence classification: given a sequence of characters, our goal is to assign a diacritic to each character. Our approach is based on the Maximum Entropy (MaxEnt henceforth) technique (Berger et al., 1996). MaxEnt can be used for sequence classification by converting the activation scores into probabilities (through the soft-max function, for instance) and using the standard dynamic programming search algorithm (also known as Viterbi search). We find in the literature several other approaches to sequence classification, such as (McCallum et al., 2000) and (Lafferty et al., 2001). The conditional random fields method presented in (Lafferty et al., 2001) is essentially a MaxEnt model over the entire sequence: it differs from MaxEnt in that it models the sequence information, whereas MaxEnt makes a decision for each state independently of the other states. The approach presented in (McCallum et al., 2000) combines MaxEnt with hidden Markov models, allowing observations to be presented as arbitrary overlapping features and defining the probability of state sequences given observation sequences.</Paragraph> <Paragraph position="8"> We report in section 7 a comparative study between our approach and the most competitive diacritic restoration method, which uses a finite state machine algorithm (Nelken and Shieber, 2005). The MaxEnt framework was successfully used to combine a diverse collection of information sources and yielded a highly competitive model that achieves a 5.1% DER.</Paragraph> </Section> <Section position="6" start_page="578" end_page="579" type="metho"> <SectionTitle> 4 Automatic Diacritization </SectionTitle> <Paragraph position="0"> The performance of many natural language processing tasks, such as shallow parsing (Zhang et al., 2002) and named entity recognition (Florian et al., 2004), has been shown to depend on integrating many sources of information. Given the stated focus of integrating many feature types, we selected the MaxEnt classifier. MaxEnt has the ability to integrate arbitrary types of information and make a classification decision by aggregating all information available for a given classification.</Paragraph> <Section position="1" start_page="578" end_page="579" type="sub_section"> <SectionTitle> 4.1 Maximum Entropy Classifiers </SectionTitle> <Paragraph position="0"> We formulate the task of restoring diacritics as a classification problem, where we assign to each character in the text a label (i.e., a diacritic). Before formally describing the method, we introduce some notation: let Y = {y_1, ..., y_n} be the set of diacritics to predict or restore, X be the example space, and F = {0, 1}^m be a feature space. Each example x ∈ X has an associated vector of binary features f(x) = (f_1(x), ..., f_m(x)). In a supervised framework, like the one we are considering here, we have access to a set of training examples together with their classifications: {(x_1, y_1), ..., (x_k, y_k)}. The MaxEnt algorithm associates a set of weights (α_ij), i = 1...n, j = 1...m, with the features, which are estimated during the training phase to maximize the likelihood of the data (Berger et al., 1996). Given these weights, the model computes the probability distribution over labels for a particular example x as follows:</Paragraph> <Paragraph position="1"> $p(y_i \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{m} \alpha_{ij}^{f_j(x)}$ (1)</Paragraph> <Paragraph position="2"> where Z(x) is a normalization factor.</Paragraph> 
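The following is a minimal sketch (not the paper's implementation) of this classification rule: given the active binary features of one character and a trained weight table, it returns the probability distribution over diacritics via a soft-max. The label names and the weight table `alpha` are illustrative placeholders, and the weights are assumed to be kept in log space, so summing them corresponds to the product form in Equation (1).

    import math

    def maxent_distribution(active_features, alpha, labels):
        # Sum the (log-domain) weights of the active binary features for each
        # label, then normalize with a soft-max; z is the normalization factor Z(x).
        log_scores = {y: sum(alpha[y][j] for j in active_features) for y in labels}
        m = max(log_scores.values())                   # subtract max for numerical stability
        unnorm = {y: math.exp(s - m) for y, s in log_scores.items()}
        z = sum(unnorm.values())
        return {y: v / z for y, v in unnorm.items()}

    # Toy usage with made-up labels, five features, and hand-set weights.
    labels = ["fatha", "kasra", "damma", "sukuun"]
    alpha = {y: [0.0] * 5 for y in labels}
    alpha["fatha"][2] = 1.3
    alpha["kasra"][2] = -0.4
    print(maxent_distribution({2, 4}, alpha, labels))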
To estimate the optimal α_ij values, we train our MaxEnt model using the sequential conditional generalized iterative scaling (SCGIS) technique (Goodman, 2002). While the MaxEnt method can integrate multiple feature types seamlessly, in certain cases it is known to overestimate its confidence, especially in low-frequency features. To overcome this problem, we use a regularization method based on adding Gaussian priors, as described in (Chen and Rosenfeld, 2000). After computing the class probability distribution, the chosen diacritic is the one with the highest a posteriori probability.</Paragraph> <Paragraph position="3"> The decoding algorithm, described in section 4.2, performs sequence classification through dynamic programming.</Paragraph> </Section> <Section position="2" start_page="579" end_page="579" type="sub_section"> <SectionTitle> 4.2 Search to Restore Diacritics </SectionTitle> <Paragraph position="0"> We are interested in finding the diacritics of all characters in a script or a sentence. These diacritics have strong interdependencies which cannot be properly modeled if the classification is performed independently for each character. We view this problem as sequence classification, in contrast with example-based classification: given a sequence of characters in a sentence x_1 x_2 ... x_L, our goal is to assign diacritics (labels) to each character, resulting in a sequence of diacritics y_1 y_2 ... y_L. We assume that diacritics can be modeled as a limited-order Markov sequence: the diacritic associated with character i depends only on the k previous diacritics, where k is usually equal to 3. Given this assumption, and using the notation x_1^L = x_1 ... x_L, the conditional probability of assigning the diacritic sequence y_1^L to the character sequence x_1^L becomes</Paragraph> <Paragraph position="1"> $p(y_1^L \mid x_1^L) = p(y_1 \mid x_1^L)\, p(y_2 \mid x_1^L, y_1) \cdots p(y_L \mid x_1^L, y_{L-k+1}^{L-1})$</Paragraph> <Paragraph position="2"> and our goal is to find the sequence that maximizes this conditional probability:</Paragraph> <Paragraph position="3"> $\hat{y}_1^L = \arg\max_{y_1^L} p(y_1^L \mid x_1^L)$ (2)</Paragraph> <Paragraph position="4"> While we restricted the conditioning on the classification tag sequence to the previous k diacritics, we do not impose any restrictions on the conditioning on the characters: the probability is computed using the entire character sequence x_1^L.</Paragraph> <Paragraph position="5"> To obtain the sequence in Equation (2), we create a classification tag lattice (also called a trellis), as follows: * Let x_1^L be the input sequence of characters and S = {s_1, ..., s_m} be the set of possible states. Every such state corresponds to the labeling of k successive characters. We find it useful to think of an element s_i as a vector with k elements. We use the notation s_i[j] for the jth element of such a vector (the label associated with the token x_{i-k+j+1}) and s_i[j_1 ... j_2] for the sequence of elements between indices j_1 and j_2.</Paragraph> <Paragraph position="8"> * We conceptually associate every character x_i, i = 1, ..., L, with a copy of S, S_i = {s_i^1, ..., s_i^m}; this set represents all the possible labelings of the characters x_{i-k+1}^{i} at the stage where x_i is examined.</Paragraph> <Paragraph position="9"> * We then create links from the set S_i to S_{i+1}, for all i = 1 ... L-1, with the property that a link exists only between states that agree on their overlapping k-1 labels; the weight of the link from s_i^{j_1} to s_{i+1}^{j_2} is $p(s_{i+1}^{j_2}[k] \mid x_1^L, s_i^{j_1})$. These weights correspond to the probability of a transition from the state s_i^{j_1} to the state s_{i+1}^{j_2}.</Paragraph> <Paragraph position="12"> * For every character x_i, we compute recursively</Paragraph> <Paragraph position="13"> $b_i(s_j) = \max_{s'} \left[ b_{i-1}(s') + \log p(s_j[k] \mid x_1^L, s') \right]$, $g_i(s_j) = \arg\max_{s'} \left[ b_{i-1}(s') + \log p(s_j[k] \mid x_1^L, s') \right]$</Paragraph> <Paragraph position="14"> Intuitively, b_i(s_j) represents the log-probability of the most probable path through the lattice that ends in state s_j after i steps, and g_i(s_j) represents the state just before s_j on that particular path.</Paragraph> <Paragraph position="15"> * Having computed the (b_i)_i values, the algorithm for finding the best path, which corresponds to the solution of Equation (2), is: 1. Identify $\hat{s}_L = \arg\max_{j=1,\dots,m} b_L(s_j)$; 2. For i = L-1, ..., 1, compute $\hat{s}_i = g_{i+1}(\hat{s}_{i+1})$.</Paragraph> <Paragraph position="19"> The complexity of this algorithm is linear in the size of the sentence L but exponential in the size of the Markov dependency, k. To reduce the search space, we use beam search.</Paragraph> </Section> 
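For illustration, here is a minimal beam-search sketch of the decoding just described (a pruned approximation of the exact lattice search, not the paper's code). It assumes a hypothetical function `log_prob(ctx, x, i, y)` returning log p(y_i = y | x_1^L, previous diacritics ctx), which in the real system would be supplied by the MaxEnt model of section 4.1.

    def beam_search_restore(x, labels, log_prob, k=3, beam=50):
        # Each hypothesis is (cumulative log-probability, tuple of diacritics so far).
        hyps = [(0.0, ())]
        for i in range(len(x)):
            expanded = []
            for score, seq in hyps:
                ctx = seq[-k:]                     # the k previously assigned diacritics
                for y in labels:
                    expanded.append((score + log_prob(ctx, x, i, y), seq + (y,)))
            expanded.sort(key=lambda h: h[0], reverse=True)
            hyps = expanded[:beam]                 # prune to the `beam` best partial paths
        best_score, best_seq = max(hyps, key=lambda h: h[0])
        return list(best_seq)

Pruning keeps the search tractable, since the full lattice grows exponentially with the Markov order k.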
<Section position="3" start_page="579" end_page="579" type="sub_section"> <SectionTitle> 4.3 Features Employed </SectionTitle> <Paragraph position="0"> Within the MaxEnt framework, any type of feature can be used, enabling the system designer to experiment with interesting feature types rather than worry about specific feature interactions. In contrast, with a rule-based system, the system designer would have to consider how, for instance, lexically derived information for a particular example interacts with character context information.</Paragraph> <Paragraph position="1"> That is not to say, ultimately, that rule-based systems are in some way inferior to statistical models - they are built using valuable insight which is hard to obtain from a statistical-model-only approach. Instead, we are merely suggesting that the output of such a rule-based system can easily be integrated into the MaxEnt framework as one of the input features, most likely leading to improved performance.</Paragraph> <Paragraph position="2"> Features employed in our system can be divided into three different categories: lexical, segment-based, and part-of-speech tag (POS) features. We also use the two previously assigned diacritics as additional features.</Paragraph> <Paragraph position="3"> In the following, we briefly describe the different categories of features (a schematic sketch of the feature extraction appears after the list): * Lexical Features: we include the character n-gram spanning the current character x_i, both preceding and following it, in a window of 7: {x_{i-3}, ..., x_{i+3}}. We use the current word w_i and its word context in a window of 5 (forward and backward trigram): {w_{i-2}, ..., w_{i+2}}. We specify whether the character of analysis is at the beginning or at the end of a word. We also add joint features between the above sources of information.</Paragraph> <Paragraph position="5"> * Segment-Based Features: Arabic white-space delimited words are composed of zero or more prefixes, followed by a stem and zero or more suffixes. Each prefix, stem, or suffix will be called a segment in this paper. Segments are often the subject of analysis when processing Arabic (Zitouni et al., 2005). Syntactic information such as POS or parse information is usually computed on segments rather than words. As an example, the Arabic white-space delimited word D@K.A fl contains a verb K.A fl, a third-person feminine singular subject-marker H (she), and a pronoun suffix o (them); it is also a complete sentence meaning &quot;she met them.&quot; To separate the Arabic white-space delimited words into segments, we use a segmentation model similar to the one presented by (Lee et al., 2003). 
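To make the feature categories above concrete, the following sketch assembles the kind of string-valued features described in this section for one character position; each name would later map to one binary feature in the MaxEnt model. The window sizes follow the text, but the naming scheme is illustrative, and the segmenter and POS tagger outputs are assumed to be available as precomputed lists.

    def extract_features(chars, words, word_idx, segs, seg_idx, pos_tags,
                         char_idx, prev_diacritics):
        feats = []
        # Lexical features: character window of 7 and word window of 5.
        for off in range(-3, 4):
            j = char_idx + off
            c = chars[j] if 0 <= j < len(chars) else "<pad>"
            feats.append(f"char[{off}]={c}")
        for off in range(-2, 3):
            j = word_idx + off
            w = words[j] if 0 <= j < len(words) else "<pad>"
            feats.append(f"word[{off}]={w}")
        # Segment-based features: segment window of 5 around the current segment.
        for off in range(-2, 3):
            j = seg_idx + off
            s = segs[j] if 0 <= j < len(segs) else "<pad>"
            feats.append(f"seg[{off}]={s}")
        # POS of the current segment, plus a joint feature with the current word.
        feats.append(f"pos={pos_tags[seg_idx]}")
        feats.append(f"pos+word={pos_tags[seg_idx]}|{words[word_idx]}")
        # Up to two previously assigned diacritics (most recent first).
        for n, d in enumerate(reversed(prev_diacritics[-2:]), 1):
            feats.append(f"prev_diac[-{n}]={d}")
        return feats

    # Toy call on a three-character word with hypothetical segment and POS values.
    print(extract_features(list("ktb"), ["ktb"], 0, ["ktb"], 0, ["VERB"], 1, ["a"]))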
The model obtains an accuracy of about 98%. In order to simulate real applications, we only use segments generated by the model rather than true segments. In the diacritization system, we include the current segment a_i and its segment context in a window of 5 (forward and backward trigram): {a_{i-2}, ..., a_{i+2}}. We specify whether the character of analysis is at the beginning or at the end of a segment. We also add joint information with lexical features.</Paragraph> <Paragraph position="7"> * POS Features: we attach to the segment a_i of the current character its POS: POS(a_i). This is combined with joint features that include the lexical and segment-based information. We use a statistical POS tagging system built on Arabic Treebank data with the MaxEnt framework (Ratnaparkhi, 1996). The model has an accuracy of about 96%. We did not want to use the true POS tags because we would not have access to such information in real applications.</Paragraph> </Section> </Section> <Section position="7" start_page="579" end_page="581" type="metho"> <SectionTitle> 5 Data </SectionTitle> <Paragraph position="0"> The diacritization system we present here is trained and evaluated on the LDC's Arabic Treebank of diacritized news stories - Part 3 v1.0: catalog number LDC2004T11 and ISBN 1-58563-298-8.</Paragraph> <Paragraph position="1"> The corpus includes complete vocalization (including case-endings). We introduce here a clearly defined and replicable split of the corpus, so that the reproduction of the results and future investigations can be accurately and correctly established. This corpus includes 600 documents from the An Nahar News Text, with a total of 340,281 words. We split the corpus into two sets: training data and development test (devtest) data. The training data contains approximately 288,000 words, whereas the devtest contains close to 52,000 words. The 90 documents of the devtest data are created by taking the last (in chronological order) 15% of the documents, dating from &quot;20021015 0101&quot; (i.e., October 15, 2002) to &quot;20021215 0045&quot; (i.e., December 15, 2002). The time span of the devtest is intentionally non-overlapping with that of the training set, as this models how the system will perform in the real world.</Paragraph> <Paragraph position="2"> Previously published papers use proprietary corpora or lack a clear description of the training/devtest data split, which makes comparison to other techniques difficult. By clearly reporting the split of the publicly available LDC's Arabic Treebank corpus in this section, we want future comparisons to be correctly established.</Paragraph> </Section> <Section position="8" start_page="581" end_page="582" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> Experiments are reported in terms of word error rate (WER), segment error rate (SER), and diacritization error rate (DER). The DER is the proportion of incorrectly restored diacritics. The WER is the percentage of incorrectly diacritized white-space delimited words: in order to be counted as incorrect, at least one character in the word must have a diacritization error. The SER is similar to the WER but indicates the proportion of incorrectly diacritized segments. A segment can be a prefix, a stem, or a suffix. Segments are often the subject of analysis when processing Arabic (Zitouni et al., 2005). Syntactic information such as POS or parse information is based on segments rather than words. Consequently, it is important to know the SER in cases where the diacritization system may be used to help disambiguate syntactic information.</Paragraph> 
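For clarity, the following is a small sketch (not the evaluation code used in the paper) of how WER, SER, and DER can be computed from aligned reference and hypothesis diacritic sequences, assuming both sides are tokenized identically into words and segments.

    def error_rates(ref_words, hyp_words, ref_segs, hyp_segs):
        # ref_words/hyp_words: lists of words, each word a list of per-character
        # diacritics; ref_segs/hyp_segs: the same text split into segments.
        def unit_errors(ref_units, hyp_units):
            # A unit (word or segment) is wrong if any of its diacritics differ.
            return sum(r != h for r, h in zip(ref_units, hyp_units)), len(ref_units)

        word_err, n_words = unit_errors(ref_words, hyp_words)
        seg_err, n_segs = unit_errors(ref_segs, hyp_segs)
        ref_chars = [d for w in ref_words for d in w]
        hyp_chars = [d for w in hyp_words for d in w]
        diac_err = sum(r != h for r, h in zip(ref_chars, hyp_chars))
        return word_err / n_words, seg_err / n_segs, diac_err / len(ref_chars)

    # Toy example: two words, one character wrong in the second word.
    ref = [["a", "i"], ["u", "o"]]
    hyp = [["a", "i"], ["u", "a"]]
    print(error_rates(ref, hyp, ref, hyp))   # WER=0.5, SER=0.5, DER=0.25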
<Paragraph position="1"> Several modern Arabic scripts contain the consonant doubling &quot;shadda&quot;; it is common for native speakers to write without diacritics except for the shadda. In this case, the role of the diacritization system is to restore the short vowels, the doubled case ending, and the vowel absence &quot;sukuun&quot;. We run two batches of experiments: a first one where documents contain the original shadda, and a second one where documents do not contain any diacritics, including the shadda. The diacritization system proceeds in two steps when it has to predict the shadda: a first step where only the shadda is restored and a second step where the other diacritics (excluding the shadda) are predicted.</Paragraph> <Paragraph position="2"> To assess the performance of the system under different conditions, we consider three cases based on the kind of features employed: 1. a system that has access to lexical features only; 2. a system that has access to lexical and segment-based features; 3. a system that has access to lexical, segment-based, and POS features.</Paragraph> <Paragraph position="3"> The different system types described above use the two previously assigned diacritics as additional features. The DER of the shadda restoration step is equal to 5% when we use lexical features only, 0.4% when we add segment-based information, and 0.3% when we employ lexical, POS, and segment-based features.</Paragraph> <Paragraph position="4"> Table 2 reports experimental results of the diacritization system with different feature sets; columns marked with &quot;True shadda&quot; represent results on documents containing the original consonant doubling &quot;shadda&quot;, while columns marked with &quot;Predicted shadda&quot; represent results where the system restored all diacritics, including the shadda. Using only lexical features, we observe a DER of 8.2% and a WER of 25.1%, which is competitive with a state-of-the-art system evaluated on Arabic Treebank Part 2: in (Nelken and Shieber, 2005), a DER of 12.79% and a WER of 23.61% are reported.</Paragraph> <Paragraph position="5"> The system described in (Nelken and Shieber, 2005) uses lexical, segment-based, and morphological information. Table 2 also shows that, when segment-based information is added to our system, a significant improvement is achieved: 25% for WER (18.8 vs. 25.1), 38% for SER (9.4 vs. 13.0), and 41% for DER (5.8 vs. 8.2). Similar behavior is observed when the documents contain the original shadda. POS features are also helpful in improving the performance of the system: they improve the WER by 4% (18.0 vs. 18.8), the SER by 5% (8.9 vs. 9.4), and the DER by 5% (5.5 vs. 5.8).</Paragraph> <Paragraph position="8"> Case-endings in Arabic documents consist of the diacritic attributed to the last character in a white-space delimited word. Restoring them is the most difficult part of the diacritization of a document. Case-endings are only present in formal or highly literary scripts; only educated speakers of modern standard Arabic master their use. 
Technically, every noun has such an ending, although at the end of a sentence no inflection is pronounced, even in formal speech, because of the rules of 'pause'. For this reason, we conduct another experiment in which case-endings are stripped throughout the training and testing data, without any attempt to restore them.</Paragraph> <Paragraph position="9"> We present in Table 3 the performance of the diacritization system, based on the employed features, when it is trained and evaluated on documents without case-endings; as in Table 2, columns marked with &quot;True shadda&quot; represent results on documents containing the original consonant doubling &quot;shadda&quot;, while columns marked with &quot;Predicted shadda&quot; represent results where the system restored all diacritics, including the shadda. Results clearly show that when case-endings are omitted, the WER declines by 58% (7.2% vs. 17.3%), the SER decreases by 52% (4.0% vs. 8.5%), and the DER is reduced by 56% (2.2% vs. 5.1%). Also, Table 3 shows again that a richer set of features results in better performance; compared to a system using lexical features only, adding POS and segment-based features improves the WER by 38% (7.2% vs. 11.8%), the SER by 39% (4.0% vs. 6.6%), and the DER by 38% (2.2% vs. 3.6%). Similar to the results reported in Table 2, the performance of the system is similar whether the documents contain the original shadda or not. A system like this, trained on documents without case-endings, can be of interest to applications such as speech recognition, where the last state of a word HMM model can be defined to absorb all possible vowels (Afify et al., 2004).</Paragraph> </Section> <Section position="9" start_page="582" end_page="582" type="metho"> <SectionTitle> 7 Comparison to other approaches </SectionTitle> <Paragraph position="0"> As stated in section 3, the most recent and advanced approach to diacritic restoration is the one presented in (Nelken and Shieber, 2005): they report a DER of 12.79% and a WER of 23.61% on the Arabic Treebank corpus using finite state transducers (FST) with a Katz language model (LM) as described in (Chen and Goodman, 1999). Because they did not describe how they split their corpus into training and test sets, we were not able to use the same data for comparison purposes.</Paragraph> <Paragraph position="1"> In this section, we essentially want to duplicate the aforementioned FST result for comparison, using the identical training and testing sets we use for our experiments. We also propose some new variations on the finite state machine modeling technique which improve performance considerably.</Paragraph> <Paragraph position="2"> The algorithm for FST-based vowel restoration could not be simpler: between every pair of characters we insert diacritics if doing so improves the likelihood of the sequence as scored by a statistical n-gram model trained on the training corpus. Thus, in between every pair of characters we propose and score all possible diacritical insertions. Results reported in Table 4 indicate the error rates of diacritic restoration (including the shadda). We show performance using both Kneser-Ney and Katz LMs (Chen and Goodman, 1999) with increasingly large n-grams. It is our opinion that large n-grams effectively duplicate the use of a lexicon.</Paragraph> 
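To make this baseline concrete, here is a rough greedy sketch of the idea (a simplification of the FST search, not the implementation of Nelken and Shieber): after each input character it scores every candidate diacritic insertion, including no insertion, with a character n-gram language model and keeps the best one. The candidate symbols and the scoring function `lm_logprob(history, token)` are hypothetical placeholders; the actual system composes the input with a weighted transducer and scores complete sequences rather than deciding greedily.

    # Candidate insertions after a character; None means "insert no diacritic".
    CANDIDATES = ["a", "i", "u", "o", "~a", "~i", "~u", None]

    def restore_with_ngram(consonants, lm_logprob, order=3):
        # Greedy left-to-right sketch: after emitting each input character, score
        # every candidate insertion with a character n-gram LM and keep the best.
        # The LM vocabulary is assumed to contain a special "<none>" token that
        # stands for "no diacritic at this position".
        out = []
        for c in consonants:
            out.append(c)
            history = tuple(out[-(order - 1):])
            best = max(CANDIDATES,
                       key=lambda d: lm_logprob(history, d if d is not None else "<none>"))
            if best is not None:
                out.append(best)
        return "".join(out)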
<Paragraph position="3"> It is unfortunate but true that, even for a rich resource like the Arabic Treebank, the choice of modeling heuristic and the effects of small sample size are considerable. Using the finite state machine modeling technique, we obtain results similar to those reported in (Nelken and Shieber, 2005): a WER of 23% and a DER of 15%. Better performance is reached with the use of the Kneser-Ney LM.</Paragraph> <Paragraph position="4"> These results still under-perform those obtained by the MaxEnt approach presented in Table 2. When all sources of information are included, the MaxEnt technique outperforms the FST model by 21% (22% vs. 18%) in terms of WER and by 39% (9% vs. 5.5%) in terms of DER.</Paragraph> <Paragraph position="5"> The SERs reported in Table 2 and Table 3 are based on the Arabic segmentation system we use in the MaxEnt approach. Since the FST model does not use such a system, we find it inappropriate to report SER in this section.</Paragraph> <Paragraph position="6"> (Table 4: error rates of diacritic restoration using FST.)</Paragraph> <Paragraph position="7"> We propose in the following an extension to the aforementioned FST model, in which we jointly determine not only the diacritics but also the segmentation into affixes, as described in (Lee et al., 2003). Table 5 gives the performance of the extended FST model with the Kneser-Ney LM, since it produces better results. This should be a much more difficult task, as there are more than twice as many possible insertions. However, the choice of diacritics is related to and dependent upon the choice of segmentation. Thus, we demonstrate that a richer internal representation produces a more powerful model.</Paragraph> </Section> </Paper>