<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1606"> <Title>Issues in Arabic Orthography and Morphology Analysis</Title> <Section position="3" start_page="4" end_page="138" type="metho"> <SectionTitle> 5 Concatenation in Arabic orthography </SectionTitle> <Paragraph position="0"> The second, and more serious, orthographic anomaly we encountered in all three corpora is what we call the problem of Arabic &quot;run-on&quot; words: the free concatenation of words when the immediately preceding word ends with a non-connector letter, such as alif, dal, dhal, ra, za, waw, or ta marbuta.</Paragraph> <Paragraph position="1"> The most frequent &quot;run-on&quot; words in Arabic are combinations of the high-frequency function words la and ma (which end in alif) with following perfect or imperfect verbs, such as la-yazal, ma-yuram, and ma-zala. The la of &quot;absolute negation&quot; concatenates freely with nouns, as in la-budda and la-shakka. It can be argued that these are lexicalized collocations, but their spelling with intervening space is just as frequent as their spelling in concatenated form.</Paragraph> <Paragraph position="2"> Badawi, Carter and Gully regard &quot;qala 'anna&quot; constructions as grammatical but restricted to contexts &quot;where the exact words of the speaker are not used or reported&quot; (Badawi, Carter and Gully 2004, p. 713). This assertion could be investigated in the LDC corpora.</Paragraph> <Paragraph position="3"> Proper name phrases, especially those involving the word 'abd, are also written either separately or in concatenated form. Part of the data annotation process at the LDC involves assigning case endings to tokenized words, but there is currently no mechanism in the Analyzer to assign two case endings (or several pairs of POS tags) to what is being processed as a single word token. 
As a result, the phrase 'abd allah is assigned a single POS tag and case ending when it is written in concatenated form, but two POS tags and two case endings when written with intervening space.</Paragraph> <Paragraph position="4"> The problem of assigning more than one case ending and POS tag to concatenations is more obvious in fully lexicalized concatenations such as khamsumi'atin, sittumi'atin, and sab'umi'atin. When these numbers are written with intervening space, two case endings and two POS tags are assigned by the Analyzer. But when they are written in concatenated form, only one case ending and POS tag is assigned, and the &quot;infixed&quot; case ending of the first token is left undefined. So far we have discussed relatively controlled concatenation, involving mostly high-frequency function words and lexicalized phrases. But concatenation extends beyond that to random combinations of words, the only requirement being that the immediately preceding word end with a non-connector letter. These concatenations are fairly frequent, as attested by their Google scores (see Table 2).</Paragraph> <Paragraph position="5"> It is important to note that these concatenations are not immediately obvious to readers due to the characteristics of proportionally spaced Arabic fonts. Most of the native readers of Arabic at the LDC did not consider concatenations such as these to be typographical errors. Their logic was best expressed in the statement: &quot;I can read the text just fine. Why can't the Morphological Analyzer?&quot; We regard numerals such as khamsumi'atin as &quot;fully lexicalized&quot; concatenations because the first of the two constituent tokens ends in a connector letter. 
In other words, their concatenation is deliberate and not an accident of orthography.</Paragraph> </Section> <Section position="4" start_page="138" end_page="138" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> There are several levels of orthographic variation in Arabic, and each level calls for a specific response to resolve the orthographic anomaly. It is important that the output analysis record which method was used to resolve the anomaly. The methods used for resolving orthographic anomalies range from exact matching of the surface orthography to various strategies of orthography manipulation. Each manipulation strategy carries with it certain assumptions about the text, and these assumptions should be part of the output analysis. For example, an analysis of a word obtained by exact matching in a text known to contain suspicious word-final ya's (that may be alif maqsura's) does not have the same value as an analysis of the same word, using the same exact matching, in a text where word-final ya's and alif maqsura's display normal character distribution frequencies.</Paragraph> <Paragraph position="1"> The problem of run-on words in Arabic calls for a reassessment of current tokenization strategies, including the definition of &quot;word token&quot; itself. It should be assumed that each input string represents one or more potential word tokens, each of which needs to be submitted individually for morphology analysis. For example, the input string fqdtm can be segmented as a single word token, yielding two morphological analyses (faqad-tum and fa-qudtum), or it can be segmented as two word tokens (fqd tm), yielding several possible analysis pairs (faqada / fuqida / faqd / fa-qad + tamma).</Paragraph> <Paragraph position="2"> By &quot;tokenization&quot; we mean the identification of orthographically valid character string units that can be submitted to the Analyzer for analysis. 
The Analyzer itself performs a different kind of &quot;tokenization&quot; by identifying prefixes and suffixes that are bound morphemes but which may be treated as &quot;word tokens&quot; in syntactic analysis.</Paragraph> <Paragraph position="3"> Syntactic analysis would be needed to determine which morphology analysis is most likely the correct one for each tokenization (fqdtm and fqd tm).</Paragraph> </Section> <Section position="5" start_page="138" end_page="138" type="metho"> <SectionTitle> 7 Acknowledgements </SectionTitle> <Paragraph position="0"> Our thanks go to the Arabic annotation team at the LDC, especially the team of native speaker informants who provided the author with daily feedback on the performance of the Morphological Analyzer, particularly in areas that led to a reassessment and better understanding of orthographic variation, tokenization, and functional definitions of typographical errors.</Paragraph> </Section> </Paper>