<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1004">
  <Title>A Morphographemic Model for Error Correction in Nonconcatenative Strings</Title>
  <Section position="4" start_page="0" end_page="24" type="metho">
    <SectionTitle>
* Vocalisation
</SectionTitle>
    <Paragraph position="0"> Orthographically, Semitic texts appear in three forms: (i) consonantal texts incorporate no short vowels, only matres lectionis,2 e.g. Arabic (ktb) for /katab/, /kutib/ and /kutub/, but (kaatb) for /kaatab/ and /kaatib/; (ii) partially vocalised texts incorporate some short vowels to resolve ambiguity, e.g. (kutb) for /kutib/ to distinguish it from /katab/; and (iii) vocalised texts incorporate full vocalisation, e.g. (tadahraj) for /tada ay. 1 We have used the CV model to describe pattern morphemes instead of prosodic terms because of its familiarity in the computational linguistics literature. For the use of moraic and affixational models in handling Arabic morphology computationally, see (Kiraz,).</Paragraph>
    <Paragraph position="1"> 2 'Mothers of reading': these are consonantal letters which play the role of long vowels, and are represented in the pattern morpheme by VV (e.g. /aa/, /uu/, /ii/). Matres lectionis cannot be omitted from the orthographic string.</Paragraph>
    <Paragraph position="2">  * Vowel and Diacritic Shifts Semitic languages employ a large number of diacritics to represent inter alia short vowels, doubled letters, and nunation. 3 Most editors allow the user to enter such diacritics above and below letters.</Paragraph>
    <Paragraph position="3"> To speed data entry, the user usually enters the base characters (say a paragraph) and then goes back and enters the diacritics. A common mistake is to place the cursor one extra position to the left when entering diacritics. This results in the vowels being shifted one position, e.g. *(wkatubi) instead of (wakutib).</Paragraph>
    <Paragraph position="4"> * Vocalisms The quality of the perfect and imperfect vowels of the basic forms of Semitic verbs is idiosyncratic. For example, the Syriac root {ktb} takes the perfect vowel a, e.g.</Paragraph>
    <Paragraph position="5"> /ktab/, while the root {nht} takes the vowel e, e.g. /nhet/. It is common among learners to make mistakes such as */kteb/ or */nhat/.</Paragraph>
    <Paragraph position="6"> * Phonetic Syncopation A consonantal segment may be omitted from the phonetic surface form, but maintained in the orthographic surface form. For example, Syriac (mdint~) 'city' is pronounced /mdit~/.</Paragraph>
    <Paragraph position="7"> * Idiosyncrasies The application of a morphographemic rule may be constrained as to which lexical morphemes it may or may not apply to. For example, the glottal stop [ʔ] at the end of a stem may become [w] when followed by the relative adjective morpheme {iyy}, as in Arabic /samaaʔ+iyy/ → /samaawiyy/ 'heavenly', but /hawaaʔ+iyy/ → /hawaaʔiyy/ 'of air'.</Paragraph>
    <Paragraph position="8"> * Morphosyntactic Issues In broken plurals, diminutives and deverbal nouns, the user may enter a morphologically sound, but morphosyntactically ill-formed word. We shall discuss this in more detail in section 4. 4 To the above, one adds language-independent issues in spell checking such as the four Damerau transformations: omission, insertion, transposition and substitution (Damerau, 1964).</Paragraph>
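    <Paragraph> The four Damerau transformations can be illustrated with a small classifier. This is a minimal sketch, not part of the model described in this paper: the function name damerau_operation and the plain ASCII transliterations are assumptions made for illustration only. Given an erroneous string and the intended string, it reports which single transformation, if any, relates them.

```python
def damerau_operation(wrong, right):
    """Name the single Damerau transformation that turns `right` into `wrong`.

    Returns "omission", "insertion", "transposition", "substitution",
    or None when the strings are equal or differ by more than one operation.
    """
    if wrong == right:
        return None
    lw, lr = len(wrong), len(right)
    # Find the first position at which the two strings differ.
    i = 0
    while i != min(lw, lr) and wrong[i] == right[i]:
        i += 1
    if lw == lr:
        # Same length: one letter replaced, or two adjacent letters swapped.
        if wrong[i + 1:] == right[i + 1:]:
            return "substitution"
        if wrong[i:i + 2] == right[i:i + 2][::-1] and wrong[i + 2:] == right[i + 2:]:
            return "transposition"
        return None
    if lw + 1 == lr and wrong[i:] == right[i + 1:]:
        return "omission"      # a letter of `right` is missing from `wrong`
    if lw == lr + 1 and wrong[i + 1:] == right[i:]:
        return "insertion"     # `wrong` contains one extra letter
    return None
```

For instance, the learner error */kteb/ for /ktab/ is classified as a substitution, and a form with the n of a word such as mdinta dropped is classified as an omission.</Paragraph>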
  </Section>
  <Section position="5" start_page="24" end_page="25" type="metho">
    <SectionTitle>
2 A Morphographemic Model
</SectionTitle>
    <Paragraph position="0"> This section presents a morphographemic model which handles error detection in non-linear strings.</Paragraph>
    <Paragraph position="1">  cies, see (Abduh, 1990).</Paragraph>
    <Paragraph position="2"> Subsection 2.1 presents the formalism used, and subsection 2.2 describes the model.</Paragraph>
    <Section position="1" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
2.1 The Formalism
</SectionTitle>
      <Paragraph position="0"> In order to handle the non-linear phenomena of Arabic, our model adopts the two-level formalism presented by (Pulman and Hepple, 1993), with the multi-tape extensions in (Kiraz, 1994). Their formalism appears in (2).</Paragraph>
      <Paragraph position="1"> (2) LLC - LEX - RLC {⇒, ⇔} LSC - SURF - RSC, where LLC = left lexical context, LEX = lexical form, RLC = right lexical context, LSC = left surface context, SURF = surface form, and RSC = right surface context. The special symbol * is a wildcard matching any context, with no length restrictions. The operator ⇔ caters for obligatory rules. A lexical string maps to a surface string iff they can be partitioned into pairs of lexical-surface subsequences, where each pair is licensed by a ⇒ or ⇔ rule, and no partition violates a ⇔ rule. In the multi-tape version, lexical expressions (i.e. LLC, LEX and RLC) are n-tuples of regular expressions of the form (x1, x2, ..., xn): the ith expression refers to symbols on the ith tape; a nil slot is indicated by ε.5 Another extension gives LLC the ability to contain an ellipsis, ... , which indicates the (optional) omission from LLC of tuples, provided that the tuples to the left of ... are the first to appear on the left of LEX.</Paragraph>
      <Paragraph position="2"> In our morphographemic model, we add a similar formalism for expressing error rules (3).</Paragraph>
      <Paragraph position="4"> we allow ε. If the rules were to be compiled into automata, a genuine symbol, e.g. 0, must be used. For the compilation of our formalism into automata, see (Kiraz and Grimley-Evans, 1995).</Paragraph>
      <Paragraph position="5">  The error rules capture the correspondence between the error surface and the correct surface, given the surrounding partition into surface and lexical contexts. They use the multi-tape format and integrate seamlessly into morphological analysis. PLC and PRC above are the left and right contexts of both the lexical and (correct) surface levels. Only the ⇒ operator is used, since an error is never obligatory.</Paragraph>
    </Section>
    <Section position="2" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
2.2 The Model
</SectionTitle>
      <Paragraph position="0"> 2.2.1 Finding the error. Morphological analysis is first called on the assumption that the word is free of errors. If this fails, analysis is attempted again without the 'no error' restriction. The error rules are then considered wherever the ordinary morphological rules fail. If no error rule succeeds, or none leads to a successful partition of the word, analysis backtracks to try the error rules at successively earlier points in the word.</Paragraph>
      <Paragraph position="1"> For simplicity, and because on the whole it is likely that words will contain no more than one error (Damerau, 1964; Pollock and Zamora, 1983), normal 'no error' analysis usually resumes once an error rule succeeds. The exception occurs with a vowel shift error (§3.2.1). If this error rule succeeds, an expectation of further shifted vowels is set up, but no other error rule is allowed in the subsequent partitions. For this reason rules are marked as to whether they can occur more than once.</Paragraph>
      <Paragraph position="2">  Once an error rule is selected, the corrected surface is substituted for the error surface, and normal analysis continues at the same position. The substituted surface may be in the form of a variable, which is then ground by the normal analysis sequence of lexical matching over the lexicon tree.</Paragraph>
      <Paragraph position="3"> In this way only lexical words are considered, as the variable letter can only be instantiated to letters branching out from the current position on the lexicon tree. Normal Prolog backtracking to explore alternative rules/lexical branches applies throughout.</Paragraph>
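      <Paragraph> The control strategy described above can be sketched as follows. This is an illustrative approximation only, written in Python rather than Prolog, and it models a single kind of error rule (letter substitution) over a single-tape toy lexicon. The names build_trie, analyse and check, the end-of-word marker "$" and the three-word lexicon are all assumptions introduced for the example. It shows the two passes (first with the 'no error' restriction, then with at most one error) and how a substituted variable is ground only to letters branching out from the current node of the lexicon tree.

```python
def build_trie(words):
    """Build a lexicon tree as nested dicts; "$" marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def analyse(word, node, errors_left, prefix=""):
    """Yield lexical words matching `word`, allowing `errors_left` substitutions."""
    if not word:
        if "$" in node:
            yield prefix
        return
    head, rest = word[0], word[1:]
    for ch, child in node.items():
        if ch == "$":
            continue
        if ch == head:
            # Ordinary analysis: the surface letter matches the lexicon.
            yield from analyse(rest, child, errors_left, prefix + ch)
        elif errors_left:
            # Error rule: treat the surface letter as a variable, ground it
            # to a letter that branches out from the current trie node.
            yield from analyse(rest, child, errors_left - 1, prefix + ch)

def check(word, trie):
    """First pass assumes no error; the second pass allows one error."""
    exact = list(analyse(word, trie, 0))
    if exact:
        return exact
    return sorted(analyse(word, trie, 1))
```

With the toy lexicon katab, kutib, kaatab, a call such as check("katib", trie) proposes both katab and kutib, since each is one substitution away and both are reachable in the tree.</Paragraph>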
    </Section>
  </Section>
  <Section position="6" start_page="25" end_page="1261614" type="metho">
    <SectionTitle>
3 Error Checking in Arabic
</SectionTitle>
    <Paragraph position="0"> We demonstrate our model on the Arabic verbal stems shown in (4) (McCarthy, 1981). Verbs are classified according to their measure (M): there are 15 triliteral measures and 4 quadriliteral ones.</Paragraph>
    <Paragraph position="1"> Moving horizontally across the table, one notices a change in vowel melody (active {a}, passive {ui}); everything else remains invariant. Moving vertically, a change in canonical pattern occurs; everything else remains invariant.</Paragraph>
    <Paragraph position="2"> Subsection 3.1 presents a simple two-level grammar which describes the above data. Subsection 3.2 presents error checking.</Paragraph>
    <Section position="1" start_page="25" end_page="1261614" type="sub_section">
      <SectionTitle>
3.1 Two-Level Rules
</SectionTitle>
      <Paragraph position="0"> The lexical level maintains three lexical tapes (Kay, 1987; Kiraz, 1994): pattern tape, root tape and vocalism tape; each tape scans a lexical tree. Examples of pattern morphemes are: {c1v1c2v1c3} (M 1), {c1c2v1nc3v2c4} (M Q3). The root morphemes are {ktb} and {dhrj}, and the vocalism morphemes are {a} (active) and {ui} (passive).</Paragraph>
      <Paragraph position="1"> The following two-level grammar handles the above data. Each lexical expression is a triple; lexical expressions with one symbol assume ε on the remaining positions.</Paragraph>
      <Paragraph position="3"> (5) gives three general rules: R0 allows any character on the first lexical tape to surface, e.g. infixes, prefixes and suffixes. R1 states that any P ∈ {c1, c2, c3, c4} on the first (pattern) tape and C on the second (root) tape with no transition on the third (vocalism) tape corresponds to C on the surface tape; this rule sanctions consonants. Similarly, R2 states that any P ∈ {v1, v2} on the pattern tape and V on the vocalism tape with no transition on the root tape corresponds to V on the surface tape; this rule sanctions vowels.</Paragraph>
      <Paragraph position="4">  stem morphemes, e.g. prefixes and suffixes. R4 applies to stem morphemes reading three boundary symbols simultaneously; this marks the end of a stem. Notice that LLC ensures that the right boundary rule is invoked at the right time.</Paragraph>
      <Paragraph position="5"> Before embarking on the rest of the rules, an illustrated example seems in order. The derivation of /dhunrija/ (M Q5, passive), from the three morphemes {c1c2v1nc3v2c4}, {dhrj} and {ui}, and the suffix {a} '3rd person' is illustrated in (7).</Paragraph>
      <Paragraph position="7"> The numbers between the surface tape and the lexical tapes indicate the rules which sanction the moves.</Paragraph>
      <Paragraph position="9"> Resuming the description of the grammar, (8) presents spreading rules. Notice the use of ellipsis to indicate that there can be tuples separating LEX and LLC, as long as the tuples in LLC are the nearest ones to LEX. R5 sanctions the spreading (and gemination) of consonants. R6 sanctions the spreading of the first vowel. Spreading examples appear in (9).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1261614" end_page="1261614" type="metho">
    <SectionTitle>
|k|a|a|t|a|b| ST
</SectionTitle>
    <Paragraph position="0"> The following rules allow for the different possible orthographic vocalisations in Semitic texts:</Paragraph>
    <Paragraph position="2"> vowels in non-stem and stem morphemes, respectively; note that the lexical contexts make sure that long vowels are not deleted. R9 allows the optional deletion of a short vowel which is the cause of spreading. For example, the rules sanction both /katab/ (M 1, active) and /kutib/ (M 1, passive) as interpretations of (ktb), as shown in (10).</Paragraph>
    <Section position="1" start_page="1261614" end_page="1261614" type="sub_section">
      <SectionTitle>
3.2 Error Rules
</SectionTitle>
      <Paragraph position="0"> Below are outlined error rules resulting from peculiarly Semitic problems. Error rules can also be constructed in a similar vein to deal with typographical Damerau errors (which also take care of the issue of  A vowel shift error rule will be tried with a partition on a (short) vowel which is not an expected (lexical) vowel at that position. Short vowels can legitimately be omitted from an orthographic representation, and it is this fact which contributes to the problem of vowel shifts. A vowel is considered shifted if the same vowel has been omitted earlier in the word.</Paragraph>
      <Paragraph position="1"> The rule deletes the vowel from the surface. Hence in the next pass of (normal) analysis, the partition is analysed as a legitimate omission of the expected vowel. This prepares for the next shifted vowel to be treated in exactly the same way as the first. The expectation of this reapplication is allowed for in</Paragraph>
      <Paragraph position="3"> In the rules above, 'X' is the shifted vowel. It is deleted from the surface. The partition contextual tuples consist of [RULE NAME, SURF, LEX]. The LEX element is itself a tuple of [PATTERN, ROOT, VOCALISM]. In E0 the shifted vowel was analysed earlier as an omitted stem vowel (om_stmv), whereas in E1 it was analysed earlier as an omitted spread vowel (om_sprv). The surface/lexical restrictions in the contexts could be written out in more detail, but both rules make use of the fact that those contexts are analysed by other partitions, which check that they meet the conditions for an omitted stem vowel or omitted spread vowel.</Paragraph>
      <Paragraph position="4"> For example, *(dhruji) will be interpreted as (duhrij). The 'E0's on the rule number line indicate where the vowel shift rule was applied to replace an error surface vowel with ε. The error surface vowels are written in italics.</Paragraph>
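      <Paragraph> The effect of the vowel shift rules can be approximated on a single tape. The sketch below is a simplification under stated assumptions: the function name undo_vowel_shift is hypothetical, the lexical form is taken to be a flat string containing only short vowels (long vowels, which may not be omitted, are ignored), and the multi-tape contexts of E0 and E1 are collapsed into one test. A surface vowel that does not match the expected lexical vowel is deleted, provided the same vowel was legitimately omitted earlier in the word.

```python
VOWELS = set("aiu")

def undo_vowel_shift(surface, lexical):
    """Delete shifted vowels from `surface`, checking it against `lexical`.

    Returns (cleaned_surface, number_of_shifted_vowels), or None when the
    surface cannot be reconciled with the lexical form in this way.
    """
    omitted = []   # short vowels legitimately omitted so far
    cleaned = []   # surface letters that survive correction
    shifted = 0
    j = 0          # current position in the lexical form
    for ch in surface:
        # Lexical short vowels missing from the surface are legal omissions.
        while j != len(lexical) and lexical[j] != ch and lexical[j] in VOWELS:
            omitted.append(lexical[j])
            j += 1
        if j != len(lexical) and lexical[j] == ch:
            cleaned.append(ch)          # expected letter
            j += 1
        elif ch in VOWELS and ch in omitted:
            omitted.remove(ch)          # shifted vowel: delete from surface
            shifted += 1
        else:
            return None                 # some other kind of error
    # Any lexical material left over must consist of omissible short vowels.
    if any(c not in VOWELS for c in lexical[j:]):
        return None
    return "".join(cleaned), shifted
```

Applied to the example in the text, *(dhruji) against (duhrij) yields the consonantal string dhrj with two shifted vowels removed, which normal analysis can then accept as a legitimate unvocalised writing of the word.</Paragraph>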
      <Paragraph position="5">  Problems resulting from phonetic syncopation can be treated as accidental omission of a consonant, e.g. *(mdit~) for (mdint~).</Paragraph>
      <Paragraph position="7"> Although the error probably results from a different fault, a deleted long vowel can be treated in the same way as a deleted consonant. With current transcription practice, long vowels are commonly written as two characters - they are possibly better represented as a single, distinct character.</Paragraph>
      <Paragraph position="8">  One type of morphographemic error is that a required consonant substitution is not made before a suffix is appended. For example, /samaaʔ/ 'heaven' + {iyy} 'relative adjective' surfaces as (samaawiyy), where ʔ → w in the given context. A common mistake is to write it as *(samaaʔiyy).</Paragraph>
      <Paragraph position="9"> (16) F_A: P ::~ w where reap = n { *- /glottal_change, w,(Pc,P,~)\] } The 'glottal_change' rule would be a normal morphological spelling change rule, incorporating contextual constraints (e.g. for the morpheme boundary) as necessary.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="1261614" end_page="1261614" type="metho">
    <SectionTitle>
4 Broken Plurals, Diminutive and
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1261614" end_page="1261614" type="sub_section">
      <SectionTitle>
Deverbal Nouns
</SectionTitle>
      <Paragraph position="0"> This section deals with morphosyntactic errors which are independent of the two-level analysis. The data described below was obtained from Daniel Ponsford (personal communication), based on (Wehr, 1971).</Paragraph>
      <Paragraph position="1"> Recall that a Semitic stem consists of a root morpheme and a vocalism morpheme arranged according to a canonical pattern morpheme. As each root does not occur in all vocalisms and patterns, each lexical entry is associated with a feature structure which indicates inter alia the possible patterns and vocalisms for a particular root. Consider the nominal data in (17).</Paragraph>
      <Paragraph position="2">  ble, but do not occur lexically with the cited nouns. A common mistake is to choose the wrong pattern. In such a case, the two-level model succeeds in finding two-level analyses of the word in question, but fails when parsing the word morphosyntactically: at this stage, the parser is passed a root, vocalism and pattern whose feature structures do not unify.</Paragraph>
      <Paragraph position="3"> Usually this feature-clash situation creates the problem of which constituent to give preference to (Langer, 1990). Here the vocalism indicates the inflection (e.g. broken plural) and the preference for a vocalism pattern for that type of inflection belongs to the root. For example, *(kidaa~) would be analysed as root {kd~} with a broken plural vocalism. The pattern type of the vocalism clashes with the broken plural pattern that the root expects. To correct, the morphological analyser is executed in generation mode to generate the broken plural form of {kd~} in the normal way.</Paragraph>
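      <Paragraph> The analyse-then-regenerate step can be sketched as follows. This is a minimal illustration under stated assumptions, not the unification-based parser referred to above: the table PLURAL_PATTERN, the digit notation for root consonants and the function names are all hypothetical, and the feature structures are reduced to a single pattern per root. The two entries reflect the Arabic broken plurals kutub 'books' and kilaab 'dogs'.

```python
# Hypothetical toy lexicon: each root selects one broken-plural pattern.
# Digits in a pattern stand for the first, second and third root consonant.
PLURAL_PATTERN = {"ktb": "1u2u3", "klb": "1i2aa3"}

def generate(root, pattern):
    """Generation mode: interdigitate the root consonants into the pattern."""
    return "".join(root[int(c) - 1] if c.isdigit() else c for c in pattern)

def analyse(word):
    """Analysis mode: return (root, pattern) pairs that yield `word`."""
    patterns = sorted(set(PLURAL_PATTERN.values()))
    return [(root, pattern)
            for pattern in patterns
            for root in sorted(PLURAL_PATTERN)
            if generate(root, pattern) == word]

def correct_plural(word):
    """Return `word` if well-formed, a regenerated plural on a clash, else None."""
    analyses = analyse(word)
    if not analyses:
        return None
    root, pattern = analyses[0]
    expected = PLURAL_PATTERN[root]
    if pattern == expected:
        return word
    # Feature clash: the root's own preference wins, so regenerate the plural.
    return generate(root, expected)
```

Here the ill-formed *kulub analyses as root {klb} in a pattern the root does not select, and generation from the root's own entry produces kilaab.</Paragraph>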
      <Paragraph position="4"> The same procedure can be applied to diminutive and deverbal nouns.</Paragraph>
    </Section>
  </Section>
</Paper>