<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1648"> <Title>Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology</Title> <Section position="5" start_page="409" end_page="411" type="metho"> <SectionTitle> 3 Error Correction Methodology </SectionTitle> <Paragraph position="0"> This section describes the character level modeling, the language modeling, and the shallow morphological analysis.</Paragraph> <Section position="1" start_page="409" end_page="411" type="sub_section"> <SectionTitle> 3.1 OCR Character Level Model </SectionTitle> <Paragraph position="0"> A noisy channel model was used to learn how OCR corrupts single characters or character segments, producing a character level confusion model. To train the model, 6,000 OCR-corrupted words were obtained from a modern printing of a medieval religious Arabic book (called "The Provisions of the Return", or "Provisions" for short, by Ibn Al-Qayim). The words were manually corrected, and the corrupted and manually corrected versions were aligned. The Provisions book was scanned at 300x300 dots per inch (dpi), and Sakhr's Automatic Reader was used to OCR the scanned pages. Of the 6,000 words, 4,000 were used for training and the remaining words were set aside for later testing. The Word Error Rate (WER) for the 2,000 testing words was 39%. For all words (in training and testing), the different forms of alef (hamza, alef, alef maad, alef with hamza on top, hamza on wa, alef with hamza on the bottom, and hamza on ya) were normalized to alef, and ya and alef maqsoura were normalized to ya. The characters in the aligned words can subsequently be aligned in two different ways: 1:1 (one-to-one) character alignment, where each character is mapped to no more than one character (Church and Gale, 1991); or m:n alignment, where a character segment of length m is aligned to a character segment of length n (Brill and Moore, 2000). The second method is more general and potentially more accurate, especially for Arabic, where a character can be confused with as many as three or four characters. The following example highlights the difference between the 1:1 and the m:n alignment approaches.</Paragraph> <Paragraph position="1"> Given the training pair (rnacle, made), with ε denoting the null character:
1:1 alignment: r → m, n → ε, a → a, c → d, l → ε, e → e
m:n alignment: rn → m, a → a, cl → d, e → e
For alignment, the Levenshtein dynamic-programming minimum edit distance algorithm was used to produce the 1:1 alignments. The algorithm computes the minimum number of edit operations required to transform one string into another.</Paragraph> <Paragraph position="2"> Given the output alignments of the algorithm, properly aligned characters (such as a → a and e → e) are used as anchors, ε's (null characters) are combined with misaligned adjacent characters to produce m:n alignments, and ε's between correctly aligned characters are counted as deletions or insertions.</Paragraph> <Paragraph position="3"> To formalize the error model, given a clean word χ = #C1 .. Ck .. Cl .. Cn# and the resulting OCR-degraded word d = #D1 .. Dx .. Dy .. Dm#, where Dx .. Dy resulted from Ck .. Cl, ε represents the null character, and # marks word boundaries, the probability estimates for the three edit operations (insertion, deletion, and substitution) are computed from counts over the aligned training pairs. When decoding a corrupted string d composed of the characters D1 .. Dx .. Dy .. Dm, the goal is to find a string χ composed of the characters C1 .. Ck .. Cl .. Cn such that P(d|χ)*P(χ) is maximum. P(χ) is the prior probability of observing χ in text and P(d|χ) is the probability of producing d from χ. P(χ) was computed from a web-mined collection of religious texts by Ibn Taymiya, the main teacher of the medieval author of the "Provisions" book. The collection contains approximately 16 million words, with 278,877 unique surface forms.</Paragraph> <Paragraph position="4"> The segments Dx .. Dy are generated by finding all 2^(n-1) possible segmentations of the word d. For example, given "macle", the possible segmentations are (m,a,c,l,e), (ma,c,l,e), (m,ac,l,e), (mac,l,e), (m,a,cl,e), (ma,cl,e), (m,acl,e), (macl,e), (m,a,c,le), (ma,c,le), (m,ac,le), (mac,le), (m,a,cle), (ma,cle), (m,acle), and (macle).</Paragraph> <Paragraph position="5"> For each possible segmentation, all segment sequences Ck .. Cl known to produce Dx .. Dy are generated. If a sequence of C1 .. Cn segments generates a valid word χ that exists in the web-mined collection, then argmax_χ P(d|χ)*P(χ) is computed; otherwise the sequence is discarded.</Paragraph> <Paragraph position="6"> Possible corrections are subsequently ranked.</Paragraph> <Paragraph position="7"> For all the experiments reported in this paper, the top 10 corrections are generated. Note that the error correction reported in this paper does not assume that a word is correct merely because it exists in the web-mined collection; all words are treated as possibly incorrect.</Paragraph>
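<Paragraph> To make the decoding step concrete, the following is a minimal illustrative sketch (not the authors' implementation): it enumerates the 2^(n-1) segmentations of a degraded word, maps each segment to the clean segments known to produce it through a toy confusion table (which, in the paper, would be estimated from the aligned training pairs), keeps only candidates that exist in the lexicon, and ranks them by P(d|χ)*P(χ). The confusion probabilities, the prior, and the lexicon below are invented for illustration.

from itertools import combinations, product

# Channel model indexed by degraded segment: P(D | C) for each clean segment C
# known to produce D. In the paper these probabilities come from the aligned
# training pairs; the entries here are invented.
channel = {
    "rn": {"m": 0.20},
    "m":  {"m": 0.85},
    "a":  {"a": 0.90},
    "c":  {"c": 0.70, "d": 0.15},
    "cl": {"d": 0.10},
    "l":  {"l": 0.80},
    "e":  {"e": 0.90},
}

# Prior P(x): relative frequency in the web-mined collection (invented here).
prior = {"made": 1e-4, "male": 8e-5}

def segmentations(word):
    """Yield all 2^(n-1) splits of word into contiguous segments."""
    n = len(word)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            bounds = [0, *cuts, n]
            yield [word[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def correct(degraded, lexicon):
    """Rank candidate corrections x of `degraded` by P(degraded | x) * P(x)."""
    scored = {}
    for segs in segmentations(degraded):
        options = [channel.get(s) for s in segs]
        if None in options:
            continue                      # a segment with no known clean source
        for mapping in product(*(o.items() for o in options)):
            word = "".join(c for c, _ in mapping)
            if word not in lexicon:
                continue                  # discard sequences that are not valid words
            p_channel = 1.0
            for _, p in mapping:
                p_channel *= p
            score = p_channel * prior.get(word, 0.0)
            scored[word] = max(scored.get(word, 0.0), score)  # best-scoring segmentation
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# The degraded word "rnacle" is corrected to "made".
print(correct("rnacle", lexicon={"made", "male"}))
</Paragraph>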
<Paragraph position="8"> The effect of two modifications to the m:n character model described above was examined.</Paragraph> <Paragraph position="9"> The first modification made the character model account for the position of letters in a word. The intuition behind this model is that Arabic letters change their shape based on their position in a word, which in turn affects the letters with which they are likely to be confused.</Paragraph> <Paragraph position="10"> Formally, given L denoting the position of the letters at the boundaries of a character segment (start, middle, end, or isolated), the character model conditions the edit probabilities on L as well, i.e. P(Dx .. Dy | Ck .. Cl, L). The second modification gave a small uniform probability to single-character substitutions that are unseen in the training data. This was done in accordance with Lidstone's law to smooth the probabilities. The probability was set to be 100 times smaller than the probability of the smallest observed single-character substitution*.</Paragraph> <Paragraph position="11"> * Other uniform probability estimates were examined on the training data, and the one reported here worked best.</Paragraph> </Section> <Section position="2" start_page="411" end_page="411" type="sub_section"> <SectionTitle> 3.2 Language Modeling </SectionTitle> <Paragraph position="0"> For language modeling, a trigram language model was trained on the same web-mined collection mentioned in the previous subsection, without any kind of morphological processing. As with the text extracted from the "Provisions" book, alef and ya letter normalizations were performed. The language model was built using the SRILM toolkit with Good-Turing smoothing and default backoff.</Paragraph> <Paragraph position="1"> Given a corrupted word sequence Δ = {d1 .. di .. dn} and the candidate sets Χ = {Χ1 .. Χi .. Χn}, where Χi = {χi0 .. χim} are the possible corrections of di (m = 10 for all the experiments reported in this paper), the aim was to find a sequence Ω = {ω1 .. ωi .. ωn}, with each ωi drawn from Χi, that maximizes the language model probability of the sequence.</Paragraph>
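<Paragraph> As a rough illustration of this search, the sketch below reranks per-word candidate lists with a trigram language model using a simple beam search. It is not the authors' implementation: the paper uses an SRILM-trained model, the toy trigram scores below are invented, and this excerpt does not specify whether the channel probabilities are combined with the language model score, so only the language model score is used here.

# Toy trigram scores: log P(w3 | w1, w2). In the paper these come from an SRILM
# model trained on the web-mined collection; the values below are invented.
def trigram_logprob(w1, w2, w3):
    table = {
        ("BOS", "BOS", "in"): -1.0,
        ("BOS", "in", "the"): -0.5,
        ("in", "the", "name"): -0.7,
        ("in", "the", "game"): -2.5,
    }
    return table.get((w1, w2, w3), -6.0)   # crude back-off for unseen trigrams

def rerank(candidate_lists, beam=100):
    """candidate_lists[i] holds the top-m corrections of the i-th corrupted word.
    Returns the candidate sequence with the highest trigram LM score."""
    # Each hypothesis: (log probability, last two words, sequence so far).
    hyps = [(0.0, ("BOS", "BOS"), [])]
    for cands in candidate_lists:
        extended = []
        for logp, (w1, w2), seq in hyps:
            for w in cands:
                extended.append((logp + trigram_logprob(w1, w2, w), (w2, w), seq + [w]))
        extended.sort(key=lambda h: h[0], reverse=True)
        hyps = extended[:beam]             # keep only the best partial hypotheses
    return max(hyps, key=lambda h: h[0])[2]

# Hypothetical top corrections for three corrupted words.
print(rerank([["in", "ln"], ["the", "tha"], ["name", "game", "narne"]]))
# expected output: ['in', 'the', 'name']
</Paragraph>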
</Section> <Section position="3" start_page="411" end_page="411" type="sub_section"> <SectionTitle> 3.3 Language Modeling and Shallow Morphological Analysis </SectionTitle> <Paragraph position="0"> Two paths were pursued to explore the combined effect of language modeling and shallow morphological analysis.</Paragraph> <Paragraph position="1"> In the first, a 6-gram language model was trained on the same web-mined collection after each word in the collection was segmented into its constituent prefix, stem, and suffix (in this order) using a language-model-based stemmer (Lee et al., 2003). For example, "وكتابهم" (wktAbhm) was replaced by "w# ktAb +hm", where # and + mark prefixes and suffixes respectively, distinguishing them from stems (an illustrative sketch of this markup appears after the paper body below). As before, alef and ya letter normalizations were performed and the language model was built using the SRILM toolkit with the same parameters.</Paragraph> <Paragraph position="2"> Formally, the only difference between this model and the previous one is that Χi = {χi0 .. χim} are the {prefix, stem, suffix} tuples of the possible corrections of di (a tuple is treated as a block). Everything else is identical.</Paragraph> <Paragraph position="3"> In the second, a trigram language model was trained on the same collection after the language-model-based stemming was applied to all the tokens in the collection (Lee et al., 2003). The top n generated corrections were subsequently stemmed, and the stems were reranked using the language model. The top resulting stem was compared to the condition in which language modeling was used without morphological analysis (as in the previous subsection) and the top resulting correction was then stemmed. This path was pursued to examine the effect of correction on applications where stems are more useful than words, such as Arabic information retrieval (Darwish et al., 2005; Larkey et al., 2002).</Paragraph> </Section> <Section position="4" start_page="411" end_page="411" type="sub_section"> <SectionTitle> 3.4 Testing the Models </SectionTitle> <Paragraph position="0"> The 1:1 and m:n character mapping models were tested while enabling or disabling character position training (CP), smoothing by assigning small probabilities to unseen single-character substitutions (UP), language modeling (LM), and shallow morphological processing (SM) using the 6-gram model.</Paragraph> <Paragraph position="1"> As mentioned earlier, all models were tested using sentences containing 2,000 words in total.</Paragraph> </Section> </Section> </Paper>
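The sketch below, appended here for reference, makes the "prefix# stem +suffix" markup of Section 3.3 concrete. It is not the stemmer used in the paper (a language-model-based stemmer, Lee et al., 2003); the affix lists and the greedy matching are invented stand-ins.

# Illustrative sketch only: reproducing the "prefix# stem +suffix" markup of
# Section 3.3 with a toy affix lookup (Buckwalter transliteration).
PREFIXES = ("w", "f", "b", "l", "Al")      # toy prefix list, not the paper's stemmer
SUFFIXES = ("hm", "hA", "h", "p", "At")    # toy suffix list

def segment(word):
    """Split word into (prefix, stem, suffix); empty strings when absent."""
    prefix = next((p for p in PREFIXES
                   if word.startswith(p) and len(word) - len(p) >= 2), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES
                   if rest.endswith(s) and len(rest) - len(s) >= 2), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

def markup(word):
    """Render the segmentation as the tokens fed to the 6-gram model."""
    prefix, stem, suffix = segment(word)
    tokens = []
    if prefix:
        tokens.append(prefix + "#")
    tokens.append(stem)
    if suffix:
        tokens.append("+" + suffix)
    return " ".join(tokens)

print(markup("wktAbhm"))   # expected: "w# ktAb +hm"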