<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1606"> <Title>Issues in Arabic Orthography and Morphology Analysis</Title> <Section position="2" start_page="0" end_page="4" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In 2002 the LDC began using output from the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002) in order to perform morphological annotation and POS tagging of Arabic newswire text. From 2002 to 2004 three corpora were analyzed and over half a million Arabic word tokens were annotated and tagged. The tagged AFP, Ummah, and Annahar corpora were published as &quot;Arabic Treebank: Part 1 v 2.0&quot; (Maamouri 2003), &quot;Arabic Treebank: Part 2 v 2.0&quot; (Maamouri 2004), and &quot;Arabic Treebank: Part 3 v 1.0&quot; (Maamouri 2004), respectively, and are available from the LDC website &lt;http://www.ldc.upenn.edu&gt;. The author was responsible for developing and maintaining the Analyzer, which primarily involved filling in the gaps in the lexicon and modifying the POS tag set in order to meet the requirements of treebanking efforts that were performed subsequently at the LDC with the same annotated and POS-tagged newswire data.</Paragraph> <Paragraph position="1"> 2 Lessons from the AFP corpus During the tagging of the AFP data, the first corpus in the series, the Buckwalter Analyzer was equipped to handle basic orthographic variation that often goes unnoticed because it is a common feature of written Arabic (Buckwalter, 1992).</Paragraph> <Paragraph position="2"> This orthographic variation involves the writing (or omission) of hamza above or below alif in stem-initial position, and to a lesser extent, the writing (or omission) of madda on alif, also in stem-initial position. In both cases use of the bare alif without hamza or madda is quite common and goes unnoticed by most readers. 
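The stem-initial variation just described can be pictured as a candidate-generation step before lexicon lookup. The Python fragment below is only a minimal sketch under stated assumptions, not the Analyzer's actual code: it assumes Unicode input (the Analyzer uses its own transliteration scheme), assumes the stem has already been separated from any prefixes, and the function name stem_initial_alif_variants is hypothetical.

```python
# Illustrative sketch only; names and approach are assumptions,
# not taken from the Buckwalter Analyzer itself.
ALIF = "\u0627"              # bare alif
ALIF_HAMZA_ABOVE = "\u0623"  # alif with hamza above
ALIF_HAMZA_BELOW = "\u0625"  # alif with hamza below
ALIF_MADDA = "\u0622"        # alif with madda


def stem_initial_alif_variants(stem):
    """List the spellings to try for a stem that may begin with bare alif.

    The stem as written always comes first; the hamza and madda
    spellings are added only when the stem starts with bare alif.
    """
    if not stem.startswith(ALIF):
        return [stem]
    rest = stem[1:]
    marked = (ALIF_HAMZA_ABOVE, ALIF_HAMZA_BELOW, ALIF_MADDA)
    return [stem] + [initial + rest for initial in marked]
```

Keeping the spelling as written at the head of the list means a lookup that succeeds at face value never needs the alternatives.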
What took the LDC morphology annotation team by surprise was to find that in the AFP data the common omission of hamza in this environment had been extended to stem-medial and stem-final positions as well, as seen in the following words from that corpus: t n a rt .</Paragraph> <Paragraph position="3"> This type of orthographic variation was not attested to the same extent in the two subsequent corpora, Ummah and Annahar, which leads us to conclude that some orthographic practices might be restricted to specific news agencies. It is important to note that most of the native Arabic speakers who annotated the AFP data using the output from the Analyzer did not regard these omissions of hamza on alif in stem-medial and stem-final positions as orthographic errors, and fully expected the Analyzer to provide a solution. 3 Lessons from the Ummah corpus During the tagging of the Ummah data, a different set of orthographic issues arose. Although the Buckwalter Analyzer was equipped to handle so-called &quot;Egyptian&quot; spelling (where word-final ya' is spelled without the two dots, making it identical to alif maqsura), the Ummah corpus presented the LDC annotation team with just the opposite phenomenon: dozens of word-final alif maqsura's spelled with two dots.</Paragraph> <Paragraph position="4"> Whereas some of the affected words were automatically rejected as typographical errors (e.g., y 'y ), others were gladly analyzed at face value (e.g., y ). Unfortunately, this led to numerous false positive analyses: for example was analyzed as 'ali and 'alayya, but not as 'ala. Initially, these words were tagged as typographical errors, but their pervasiveness led the LDC team to reconsider this position, upon which the author was asked to modify the Analyzer algorithm in order to accommodate this typographic anomaly. 
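An accommodation of this kind can be sketched as a small expansion step applied to each token before analysis. The Python fragment below is a hedged illustration, not the Analyzer's implementation: it assumes Unicode input rather than the Analyzer's transliteration, and the function name final_ya_candidates is hypothetical.

```python
# Illustrative sketch only; the function name and the use of Unicode
# code points are assumptions, not the Analyzer's actual code.
YA = "\u064A"            # ya' (written with two dots)
ALIF_MAQSURA = "\u0649"  # alif maqsura (written without dots)


def final_ya_candidates(word):
    """Return every spelling that should be sent to morphological analysis.

    A word ending in ya' is kept as written and also re-read as ending
    in alif maqsura; any other word is returned unchanged.
    """
    if word.endswith(YA):
        return [word, word[:-1] + ALIF_MAQSURA]
    return [word]
```

For a token spelled ayn, lam, ya', the function returns both the ya' spelling and the alif maqsura spelling, so that analyses of either form remain available to the annotators.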
As a result, all words ending in ya' were now re-interpreted as ending in either ya' or alif maqsura, and both forms were analyzed, as seen in the following (abridged) output: The Annahar corpus presented no orthographic surprises, or at least nothing that the LDC annotation team had not seen before. The Annahar data did contain some additional orthographic features that we now identify as being common to all three corpora, as well as corpora outside the set we have annotated at the LDC. (It is not entirely clear whether the &quot;dotted&quot; alif maqsura's in the Ummah corpus were produced by human typists or by an encoding conversion process gone awry. It is possible that the original keyboarding was done on a platform where word-final ya' and alif maqsura are displayed via visually identical &quot;un-dotted&quot; glyphs, so it makes no difference which of the two keys the typist presses on the keyboard: both produce the same visual display, but are stored electronically as two different characters. A key to the transliteration scheme used by the Analyzer can be found at &lt;http://www.ldc.upenn.edu/myl/morph/buckwalter.html&gt;.)</Paragraph> <Paragraph position="5"> The first orthographic feature relates to the somewhat free interchange of stem-initial hamza above alif and hamza below alif. With some lexical items the orthographic variation simply reflects variation in pronunciation: for example, both 'isbaniya (with hamza under alif) and 'asbaniya (with hamza above alif) are well attested. But in cases involving other orthographic pairs, more interpretations are possible. Take, for instance, what we called the &quot;qala 'anna&quot; problem. This problem was identified after numerous encounters with constructions in which qala was followed by 'anna rather than 'inna, and for no apparent linguistic reason. 
Initially this was treated as a typographical error, but again, its pervasiveness forced us to take a different approach.</Paragraph> <Paragraph position="6"> One solution we considered was to modify the Analyzer algorithm so that instances of stem-initial hamza on alif would also be treated as possible instances of hamza under alif, very much in the spirit of the approach we used for dealing with the alif maqsura / ya' free variation cited earlier. However, there is compelling evidence that the orthography of hamza in stem-initial position is a fairly reliable indication of the perceived value of the subsequent short vowel: a or u for hamza above alif, and i for hamza below alif. In other words, there is no free variation. The decision was taken to regard &quot;qala 'anna&quot; constructions as grammatically acceptable in MSA.</Paragraph> </Section> </Paper>