<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1007">
  <Title>Arabic Morphology Using Only Finite-State Operations</Title>
  <Section position="8" start_page="54" end_page="54" type="concl">
    <SectionTitle>
4 Practical Applications
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
4.1 History of Computing Semitic Stems via Intersection
</SectionTitle>
      <Paragraph position="0"> Classic Two-Level (Koskenniemi, 1983; Karttunen, 1983; Antworth, 1990) and finite-state lexicons (Karttunen, 1993) build underlying strings via concatenation only, but this limitation is not characteristic of the overall theory but only of the computational implementations.</Paragraph>
      <Paragraph position="1"> Kataja and Koskenniemi (1988) were apparently the first to understand that concatenating languages were just a special case; they showed that by generalizing lexicography to allow regular expressions, Semitic (specifically Akkadian) roots and patterns could denote regular languages, and that stems could be computed as the intersection of these regular languages. 3 This principle was borrowed in the ALPNET prototype analyzer for Arabic morphology (Beesley, 1989; Beesley, 1991); but it used an implementation of Two-Level Morphology enhanced with a &amp;quot;detouring&amp;quot; mechanism that simulated the intersection of roots and patterns at runtime. This prototype grew into a large commercial system in 1989 and 1990 (Beesley et al., 1989; Beesley, 1990). In 1989, Lauri Karttunen (personal communication) also proposed and demonstrated in an Interlisp script the intersection of roots, patterns and vocalizations as an alternative to the finite-state solution of (Kay, 1987), which used a four-tape finite-state transducer transducer.</Paragraph>
    </Section>
    <Section position="2" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
4.2 Current Xerox System
</SectionTitle>
      <Paragraph position="0"> The current Xerox morphological analyzer for Arabic is based on dictionaries licensed from ALPNET, but the rules and organization of the system have been extensively rewritten.</Paragraph>
      <Paragraph position="1">  The Arabic morphological analyzer starts out as a dictionary database containing entries for prefixes, suffixes, roots and patterns of Arabic. The database also includes morphotactic codings. Perl scripts extract the pertinent information from this database, reformatting it as lexc files, which are then compiled into a finite-state transducer that we label the &amp;quot;core&amp;quot; lexicon transducer. On top of the core FST, filters are composed to remove the strings that are ill-formed because of discontiguous dependencies. Finite-state rules that intersect roots and patterns are compiled into transducers and composed on the bottom of the core, leaving SKataja (personal communication) wrote comparative two-level grammars of the Neo-Babylonian and Neo-Assyrian dialects of Akkadian. The source dictionaries contained separate sublexicons for roots and patterns; these were intersected via awk scripts into Koskenniemi's TwoL format, which was then compiled.</Paragraph>
      <Paragraph position="2">  linearized lexical strings for the variation rules (also compiled into FSTs) to apply to, as shown in Figure 3. The result of the composition is a single &amp;quot;common&amp;quot; FST, with slightly enhanced fuUy-voweled strings in the lower language.</Paragraph>
      <Paragraph position="3"> For generation purposes, where the user probably wants to see only formally correct fully-roweled strings, the bottom level is trivially cleaned up by yet another layer of composed rules. For recognition purposes, the rules applied to the bottom side include \[ a I i I u I o I - \] (-&gt;) 0 ; which optionally maps the fatha (a), kasra (i), d.amma (u), sukuun (o) and shadda (') to the empty string. The resulting &amp;quot;analysis&amp;quot; transducer recognizes fully-voweled, partially voweled, and the usual unvoweled spellings. Where diacritics are present in the input, the output is correspondingly less ambiguous.</Paragraph>
      <Paragraph position="4">  The current dictionaries contain 4930 roots, each one hand-coded to indicate the subset of patterns with which it legally combines (Buckwalter, 1990). Various combinations of prefixes and suffixes, concatenated to the intersected stems, and filtered by composition, yield over 72,000,000 abstract, fully-voweled words.</Paragraph>
      <Paragraph position="5"> Sixty-six finite-state variation rules map these abstract strings into fully-voweled orthographical strings, and additional rules are then applied to optionally delete short vowels and other diacritics, allowing the system to analyze unvoweled, partially voweled, and fully-roweled orthographical variants of the 72,000,000 abstract words. New entries are added easily to the original le:dcal database.</Paragraph>
      <Paragraph position="6"> A full-scale version of the current system is available for testing on the Internet at ht tp://www.xrce.xerox.com/research/mltt / arabic. A Java interface renders Arabic words in traditional Arabic script, both for input and output.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>