<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1017"> <Title>Arabic Finite-State Morphological Analysis and Generation</Title> <Section position="3" start_page="89" end_page="89" type="intro"> <SectionTitle> 2 Goals </SectionTitle> <Paragraph position="0"> To be interesting in our applications, the Arabic morphology system had to have the following qualities: 1. It had to deal with real Arabic surface orthography, as represented on-line in standards such as ASMO 449 or the Macintosh Arabic code page (ISO8859-6). While it is possible to devise strict roman transliterations of Arabic orthography that are unambiguously convertible back and forth into real Arabic orthography, most existing romanizations are in fact transcriptions that contain more or less information than the original and so represent different orthographical systems.</Paragraph> <Paragraph position="1"> 2. It had to be able to analyze Arabic words as they appear in real texts. This means that input words may be fully voweled or diacriticized (i.e. supplied with full diacritical markings, a style of writing found only in religious texts, poetry, and writings intended for children and other learners), partially diacriticized, or undiacriticized, which is the normal case. A single system had to handle undiacriticized words and yet be able to take advantage of any diacritics that might be present.</Paragraph> <Paragraph position="2"> 3. To facilitate lookup of words in printed and on-line dictionaries, and for pedagogical purposes, the system had to return the root as an easily distinguished part of the analysis. An easier-to-build but less useful system would simply deal with complete stems rather than roots and patterns.</Paragraph> <Paragraph position="3"> 4. The system had to be large and open-ended, with each root coded to restrict the patterns with which it can in fact co-occur.</Paragraph> <Paragraph position="4"> 5. 
It had to be efficient and accurate, successfully analyzing hundreds or thousands of words per second on commonly available workstations and higher-end PCs.</Paragraph> <Paragraph position="5"> 6. It had to perform efficient and accurate generation of valid surface forms when supplied with the component root and relevant feature tags. Analysis and generation had to be straightforward inverse operations.</Paragraph> <Paragraph position="6"> Forest of Lexicon &quot;Letter Trees&quot;. Trees are connected by &quot;continuation classes&quot;. A letter path through the trees is an abstract word. Rules hand-compiled into FSTs. The intersection of the rules is simulated in code. Rules allow and control the discrepancies between the abstract words in the lexicon and the surface words being</Paragraph> </Section> </Paper>