<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1108">
  <Title>Linearization of Nonlinear Lexical Representations</Title>
  <Section position="3" start_page="0" end_page="57" type="metho">
    <SectionTitle>
2 Problems in Templatic Morphology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="57" type="sub_section">
      <SectionTitle>
2.1 Nonlinearity vs. Linearity
</SectionTitle>
      <Paragraph position="0"> Consider the infamous Arabic stem /katab/ 'to write - PERFECT ACTIVE'. It is derived from the root morpheme {ktb} 'notion of writing', the vocalism morpheme {a} 'PERFECT ACTIVE' and the rather abstract pattern morpheme {CVCVC} 'VERB'. The latter describes the interdigitation of the root and vocalism. Substituting the Cs with the root consonants and the Vs with the vocalism vowels results in the surface form /katab/. This process is illustrated along the lines of (McCarthy, 1981), based on autosegmental phonology (Goldsmith, 1976), as follows:</Paragraph>
      <Paragraph position="2"> Similarly, applying the same process to the root {ṣdq} 'notion of truth' results in the verb /ṣadaq/ 'to speak the truth - PERFECT ACTIVE'.</Paragraph>
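      <Paragraph> The substitution of root consonants for Cs and vocalism vowels for Vs can be sketched in a few lines of Python. This is only an illustration, not the formalism used in the paper: the function name interdigitate is hypothetical, and reusing the last vocalism vowel for every remaining V slot is an assumption made here to model the spreading of {a}.

```python
def interdigitate(pattern, root, vocalism):
    """Fill a CV pattern with root consonants and vocalism vowels."""
    consonants = iter(root)
    vowels = list(vocalism)
    v_index = 0
    surface = []
    for slot in pattern:
        if slot == "C":
            # Each C slot consumes the next root consonant.
            surface.append(next(consonants))
        elif slot == "V":
            # Assumption: when the vocalism has fewer vowels than the
            # pattern has V slots, its last vowel is reused (spreading).
            surface.append(vowels[min(v_index, len(vowels) - 1)])
            v_index += 1
        else:
            raise ValueError("unexpected pattern symbol: " + slot)
    return "".join(surface)


print(interdigitate("CVCVC", "ktb", "a"))  # katab
print(interdigitate("CVCVC", "ṣdq", "a"))  # ṣadaq
```

The same pattern and vocalism serve both roots; only the root argument changes.</Paragraph>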
      <Paragraph position="3"> The stems /katab/ and /ṣadaq/ may be prefixed and suffixed to form other words. Prefixation and suffixation, however, are linear operations in Semitic. In other words, the lexical representation of the prefixes and suffixes does not require multiple tapes. Hence, the prefix {wa} 'and' is applied to the above stems to form /wakatab/ and /waṣadaq/, respectively.</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
2.2 Phonological and Orthographic Rules
</SectionTitle>
      <Paragraph position="0"> Surface-to-lexical mappings must account for phonological and orthographic processes. In fact, for many languages, the phonological and orthographic rules tend to be more numerous than the morphological rules. This is the case in Semitic. For example, the Syriac grammar reported in (Kiraz, 1996) contains 48 rules. Only six rules (a mere 12.5%) 1 are motivated by templatic morphology. The rest are phonological and orthographic.</Paragraph>
      <Paragraph position="1"> Consider the above derivation of /katab/, but for Syriac rather than Arabic (both languages share the same morphemes in this case). Syriac has the Vowel Deletion Rule V → ε / __ CV, where ε is the empty string. The rule states that short vowels in open syllables are deleted. Hence, */katab/ → /ktab/. The rule applies right-to-left; hence, when adding the object pronominal suffix {eh} 'MASCULINE 3RD SINGULAR', the second vowel is deleted: */katabeh/ → /katbeh/.</Paragraph>
      <Paragraph position="2"> Similarly, prefixing the above {wa} morpheme (which is also shared by Syriac and Arabic) results in */wakatab/ → /waktab/ (the first stem vowel is deleted), and */wakatabeh/ → /wkatbeh/ (the prefix vowel and the second stem vowel are deleted).</Paragraph>
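      <Paragraph> The right-to-left behaviour of the Vowel Deletion Rule can be checked with a small Python sketch. It is a plain string procedure, not a finite-state implementation. Two assumptions are made here and are not stated in the text: the vowel inventory is taken to be a, e, i, o, u, and the CV context is tested against the material to the right as it stands after earlier deletions.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the transliteration


def delete_vowels(form):
    """Apply V -> empty / __ CV, scanning the form right-to-left."""
    kept = []  # segments kept so far, stored in reverse order
    for seg in reversed(form):
        # kept[-1] is the segment immediately to the right of seg,
        # kept[-2] the one after that (after earlier deletions).
        in_open_syllable = (
            seg in VOWELS
            and len(kept) >= 2
            and kept[-1] not in VOWELS
            and kept[-2] in VOWELS
        )
        if in_open_syllable:
            continue  # the vowel is deleted
        kept.append(seg)
    return "".join(reversed(kept))


for lexical in ["katab", "katabeh", "wakatab", "wakatabeh"]:
    print(lexical, "->", delete_vowels(lexical))
# katab -> ktab
# katabeh -> katbeh
# wakatab -> waktab
# wakatabeh -> wkatbeh
```

Because the scan starts at the right edge, deleting the second stem vowel of */katabeh/ removes the CV context that would otherwise delete the first one.</Paragraph>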
      <Paragraph position="3"> It is worth noting that such phonological rules do not depend on the nonlinear lexical structure of the stem. They actually apply to the morphologically derived stem. Semitic, then, maintains at least the following strata: lexical-morphological (where the lexical representation is nonlinear) and morphological-surface (where both representations are linear).</Paragraph>
    </Section>
    <Section position="3" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
2.3 Other Linguistic Representations
</SectionTitle>
      <Paragraph position="0"> So far we have looked at two linguistic representations: lexical and surface (~ orthographic). Now consider a text-to-speech system which requires a phonological representation as well.</Paragraph>
      <Paragraph position="1"> In the Arabic example above, the first phoneme of /ṣadaq/ is emphatic (denoted by the sublinear dot). This emphasis is spread at the phonological level, resulting in [ṣạḍaq] ([q] is already an emphatic phoneme). 2 In this case, emphasis can be determined from the surface (~ orthographic) form.</Paragraph>
      <Paragraph position="2"> Footnote 1: Had the grammar been more exhaustive, the percentage would be much lower, since most additions to the rules would be in the domain of phonology/orthography rather than templatic morphology.</Paragraph>
      <Paragraph position="3"> However, this is not always the case. Syriac spirantization requires lexical information, as the following example illustrates. Synchronically speaking, the six plosives [b], [g], [d], [k], [p] and [t] undergo spirantization when in postvocalic position with respect to the lexical form, 3 resulting in [v], [ɣ], [ð], [x], [f] and [θ], respectively. Hence, */katab/ → [kθav], and */wakatab/ → [waxθav] (in both cases the first stem vowel is deleted as described above).</Paragraph>
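      <Paragraph> The dependence of spirantization on the lexical form can be made concrete with a short Python sketch. The plosive-to-spirant table follows the list above; the function name spirantize and the vowel inventory are assumptions made for illustration. The function works on the linear lexical form before vowel deletion, so its output still contains the vowel that the Vowel Deletion Rule later removes.

```python
SPIRANT = {"b": "v", "g": "ɣ", "d": "ð", "k": "x", "p": "f", "t": "θ"}
VOWELS = set("aeiou")  # assumed vowel inventory for the transliteration


def spirantize(lexical):
    """Spirantize every plosive that follows a vowel in the lexical form."""
    out = []
    previous = ""
    for seg in lexical:
        if seg in SPIRANT and previous in VOWELS:
            out.append(SPIRANT[seg])
        else:
            out.append(seg)
        previous = seg  # the context is the lexical, not the surface, segment
    return "".join(out)


print(spirantize("katab"))    # kaθav
print(spirantize("wakatab"))  # waxaθav
```

Removing the first stem vowel from these outputs gives [kθav] and [waxθav]. On the surface form /ktab/ alone, the [t] is no longer postvocalic, which is why the surface form is not sufficient here.</Paragraph>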
    </Section>
  </Section>
  <Section position="4" start_page="57" end_page="361215133" type="metho">
    <SectionTitle>
3 Multi-Tape Grammar
</SectionTitle>
    <Paragraph position="0"> This section provides a grammar for the above data using a multi-tape model and illustrates some of the complexities involved in maintaining multiple lexical tapes throughout. The multi-tape model, originally proposed by (Kay, 1987), is an extension of the commonly used regular rewrite rules. In the multi-tape version, more than one lexical tape is allowed. Here, we shall use the following formalism, which derives from the one reported by (Pulman and Hepple, 1993), to express regular rewrite rules:</Paragraph>
    <Paragraph position="2"> where LLC is the left lexical context, LEX is the lexical form, RLC is the right lexical context, LSC is the left surface context, SURF is the surface form, and RSC is the right surface context. The operators ⇒ and ⇔ indicate optional and obligatory rules, respectively. In the multi-tape version, lexical expressions are n-tuples of regular expressions of the form (x1, x2, ..., xn), with the ith expression referring to symbols on the ith lexical tape. When n = 1, the parentheses can be ignored; hence, (x) and x are equivalent. 4 The grammars presented here assume a lexicon with the morpheme entries presented above. The pattern morpheme is {cvcvc} (in small letters); capitals in rules denote variables drawn from a finite set of symbols.</Paragraph>
    <Paragraph position="3"> Lexical expressions make use of three tapes: pattern, root and vocalism, respectively. Hence, the lexical expression (c,k,ε) denotes a [c] on the first (pattern) tape, a [k] on the second (root) tape and the empty string on the third (vocalism) tape. Prefixation and suffixation, which for the most part fall outside the domain of templatic morphology, are represented as a sequence of segments as in any other language and are placed on the first (pattern) lexical tape. 5</Paragraph>
    <Paragraph position="5"> Grammar 1 (rules R1-R3), where X is any segment, C is a consonant, and * is any context.</Paragraph>
    <Paragraph position="6"> Grammar 2 Grammar for Syriac Vowel Deletion</Paragraph>
    <Paragraph position="8"> (rules R4-R7), where C is a consonant and V is a vowel.</Paragraph>
    <Section position="1" start_page="57" end_page="33121121" type="sub_section">
      <SectionTitle>
3.1 Nonlinearity vs. Linearity
</SectionTitle>
      <Paragraph position="0"> Rules R1 and R2 in Grammar 1 take care of consonants and vowels, respectively. The rules derive the Arabic forms /katab/ and /ṣadaq/. R3 is the default rule for prefixes and suffixes. It simply maps every segment on the first lexical tape to the surface. Grammar 1 derives the forms /wakatab/ and /waṣadaq/ as well. The former is illustrated below.</Paragraph>
      <Paragraph position="1"> [Figure: multi-tape derivation of /wakatab/; surface tape: w a k a t a b.] The numbers between the tapes refer to the rules in the grammar. Note that the prefix shares a tape with the pattern.</Paragraph>
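      <Paragraph> A rough Python sketch of how R1-R3 divide the work when the prefix shares the pattern tape is given here. It is not the rule formalism itself: the function name derive is hypothetical, affix segments are assumed never to be the pattern symbols c or v, and the vocalism vowel is assumed to spread to every v slot. The returned list records which rule licensed each surface segment, in the spirit of the rule numbers shown between the tapes.

```python
def derive(pattern_tape, root, vocalism):
    """Return (surface, rule_trace) for a three-tape lexical input."""
    root_pos = 0
    vowel_pos = 0
    surface = []
    trace = []
    for symbol in pattern_tape:
        if symbol == "c":
            # R1: a pattern c pairs with the next root consonant.
            surface.append(root[root_pos])
            root_pos += 1
            trace.append(1)
        elif symbol == "v":
            # R2: a pattern v pairs with a vocalism vowel
            # (assumption: the last vowel spreads if none is left).
            surface.append(vocalism[min(vowel_pos, len(vocalism) - 1)])
            vowel_pos += 1
            trace.append(2)
        else:
            # R3: any other segment on the first tape (an affix segment)
            # maps to itself on the surface.
            surface.append(symbol)
            trace.append(3)
    return "".join(surface), trace


print(derive("wacvcvc", "ktb", "a"))
# ('wakatab', [3, 3, 1, 2, 1, 2, 1])
```

Only the first tape carries the prefix {wa}; the root and vocalism tapes are untouched until the stem begins.</Paragraph>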
    </Section>
    <Section position="2" start_page="33121121" end_page="361215133" type="sub_section">
      <SectionTitle>
3.2 Phonological and Orthographic Rules
</SectionTitle>
      <Paragraph position="0"> The Syriac vowel deletion rule, V → ε / __ CV, is given in the notation of our formalism in Grammar 2. Note that by virtue of its right-lexical context (cv,C,a), R4 can only apply to the first stem vowel, as illustrated in the derivation of /ktab/ from the underlying */katab/ by the deletion of the first vowel. (Footnote 5: Having the prefixes share a tape with the patterns is a matter of convenience, since the number of segments in a pattern corresponds, more or less, to that on the surface, more so than do the segments of roots and vocalisms.)</Paragraph>
      <Paragraph position="1"> [Figure: multi-tape derivation of /ktab/ from */katab/.] R5 covers forms such as /katbeh/, in which the second stem vowel is deleted by the same phonological phenomenon. The difference here lies in the right-lexical context expression (cV,C,ε), where the suffix vowel appears on the first lexical tape. The derivation is illustrated below: [Figure: derivation of /katbeh/; tapes: Vocalism, Root, Pattern and Affixes (c v c v c e h), Surface (k a t b e h); rule numbers between the tapes: 1 2 1 5 1 3 3.] R4 and R5 fail when the deleted vowel itself appears in the prefix, e.g. {wa} + /katbeh/ → /wkatbeh/. R6 handles this case; here, the right context (cv,C,a) belongs to the nonlinear stem, as shown below: [Figure: derivation of /wkatbeh/; tapes: Vocalism, Root, Pattern and Affixes (w a c v c v c e h), Surface (w k a t b e h).] In addition, R7 deletes prefix vowels when the right context belongs to a (possibly another) linear prefix, e.g., {wa} + {la} + {da} + /katab/ → /waldaktab/ (the [a] of {la} and the first stem vowel are deleted), as illustrated below:</Paragraph>
      <Paragraph position="3"> The above examples clearly illustrate the complexity of maintaining large nonlinear grammars.</Paragraph>
    </Section>
    <Section position="3" start_page="361215133" end_page="361215133" type="sub_section">
      <SectionTitle>
4 Using a Linearized Lexical Representation
</SectionTitle>
      <Paragraph position="0"> This section argues that a better framework for handling Semitic morphology divides the lexical-surface mappings into two separate problems. The first handles the templatic nature of morphology, mapping the multiple lexical representation into a linearized lexical form. This linearized form maintains the same linguistic information as the original lexical representation, and somewhat corresponds to McCarthy's notion of tier conflation (McCarthy, 1986).</Paragraph>
      <Paragraph position="1"> The second takes care of phonological/orthographic/graphemic mappings between the linearized lexical form and the actual surface. The combined machine is mathematically taken as the composition of the two machines representing the two sets of rules. This brings us to the question of composing multi-tape automata.</Paragraph>
    </Section>
    <Section position="4" start_page="361215133" end_page="361215133" type="sub_section">
      <SectionTitle>
4.1 Composition of Multi-Tape Machines
</SectionTitle>
      <Paragraph position="0"> The composition of two binary transducers A and B is straightforward since one tape is taken for input and the other for output. The composition of the two machines is a generalization of the intersection of the same two automata in that each state in the resulting machine is a pair drawn from one state in A and the other from B, and each transition corresponds to a pair of transitions, one from A and the other from B, with compatible labels.</Paragraph>
      <Paragraph position="1"> The composition of multi-tape transducers, however, is ambiguous. Which tapes are input and which are output? Consider the machine which accepts the regular relation 6 a*:b*:b* and a second machine which accepts the regular relation b*:b*:c*.</Paragraph>
      <Paragraph position="2"> The composition of the two machines can be either the machine accepting a*:c* or the machine accepting a*:b*:b*:c*. However, if tapes can be marked as belonging to the domain or range of the transduction, the ambiguity will be resolved.</Paragraph>
      <Paragraph position="3"> Formally, an n-tape finite-state automaton is a 5-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite input alphabet (a set of n-tuples of symbols), δ is a transition function mapping Q × Σ^n to Q, q0 ∈ Q is an initial state, and F ⊆ Q is a set of final states. An n-tape FSA accepts an n-tuple of strings if and only if, starting from the initial state q0, it can scan all the symbols on every tape i, 1 ≤ i ≤ n, and end up in a final state q ∈ F.</Paragraph>
      <Paragraph position="4"> An n-tape finite-state transducer is a 6-tuple M = (Q, Σ, δ, q0, F, d), where Q, Σ, δ, q0 and F are as before and d, 1 ≤ d &lt; n, is the number of domain tapes. The number of range tapes is simply n - d.</Paragraph>
      <Paragraph position="5"> Let A = (Q1, Σ1, δ1, q1, F1, d1) and B = (Q2, Σ2, δ2, q2, F2, d2) be two multi-tape transducers over n1 and n2 tapes, respectively. Further, let s_i denote the symbol on the ith tape. There is a composition of A and B, denoted by C, if and only if</Paragraph>
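      <Paragraph position="7"> if and only if s_{d1+1} = s'_1, ..., s_{n1} = s'_{d2}, where the primed symbols are those on the tapes of B; that is, the symbols on the range tapes of A match those on the domain tapes of B. The resulting machine is a k-tape machine, where k = d1 - d2 + n2.</Paragraph>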
    </Section>
    <Section position="5" start_page="361215133" end_page="361215133" type="sub_section">
      <SectionTitle>
Implementational Note
</SectionTitle>
      <Paragraph position="0"> We found that it is best not to indicate d, the number of domain tapes, in the data structure representing the automata, but to have it as an argument to the composition function. This enables the user to change the value of d per operation if the need arises.</Paragraph>
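      <Paragraph> One way to read this note, together with the definition in Section 4.1, is the following Python sketch, in which the numbers of domain tapes are passed to the composition function rather than stored with the machines. It is a simplified illustration, not the authors' implementation: machines are reduced to sets of (source, label, target) transitions, symbols are consumed synchronously on all tapes with no empty-string handling, initial and final states are omitted, and the check that the range tapes of A equal in number the domain tapes of B is an assumption about the precondition.

```python
def compose(trans_a, d1, trans_b, d2):
    """Compose two multi-tape machines given as sets of transitions.

    Each transition is (source, label, target), where label is a tuple
    holding one symbol per tape. d1 and d2 are the numbers of domain
    tapes of the first and second machine.
    """
    result = set()
    for src_a, label_a, dst_a in trans_a:
        for src_b, label_b, dst_b in trans_b:
            if len(label_a) - d1 != d2:
                raise ValueError("range tapes of A must match domain tapes of B")
            # Compatible labels: A's range symbols equal B's domain symbols.
            if label_a[d1:] == label_b[:d2]:
                # Keep A's domain tapes and B's range tapes,
                # giving d1 - d2 + n2 tapes in the result.
                label = label_a[:d1] + label_b[d2:]
                result.add(((src_a, src_b), label, (dst_a, dst_b)))
    return result


A = {(0, ("a", "b", "b"), 0)}  # one-state machine over tapes a:b:b
B = {(0, ("b", "b", "c"), 0)}  # one-state machine over tapes b:b:c

print(compose(A, 1, B, 2))  # {((0, 0), ('a', 'c'), (0, 0))}
print(compose(A, 2, B, 1))  # {((0, 0), ('a', 'b', 'b', 'c'), (0, 0))}
```

The same two machines yield a two-tape or a four-tape result depending only on the values of d supplied at the call, which mirrors the two readings of the a*:b*:b* and b*:b*:c* example.</Paragraph>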
    </Section>
    <Section position="6" start_page="361215133" end_page="361215133" type="sub_section">
      <SectionTitle>
4.2 A Mixed Grammar
</SectionTitle>
      <Paragraph position="0"> Now we illustrate the advantage of having a linearized lexical form by developing a mixed grammar.</Paragraph>
      <Paragraph position="1"> We make use of two grammars for the data presented above. G1 for templatic nonlinear problems and G2 for linear issues. For the current data, our G1 would be similar to the rules in Grammar 1.</Paragraph>
      <Paragraph position="2"> G2 takes as input the output of G1, i.e., the linearized lexical form such as Syriac */katab/, */waladakatab/, etc. Since R4-R7 in Grammar 2 represent one phonological phenomenon, viz., the deletion of a short vowel in an open syllable, they can be combined into one rule:</Paragraph>
      <Paragraph position="4"> where C is a consonant and V is a vowel. Grammar 3 Grammar for Spirantization, case for</Paragraph>
      <Paragraph position="6"> where V is a vowel. An identity rule (similar to R3) is also required. Applying R8 and the identity rule on the input of  Recall that the rule applies right-to-left. It might not be clear from this example how advantageous this solution is. After all, only three rules were saved. However, note that almost all of the rules in a real grammar do not belong to the templatic morphology domain, but to the linear phonological/orthographic domain. Consider the case of Syriac spirantization mentioned above, viz., [- plosive] → [+ fricative] / V __ Each of the six Syriac plosives requires a set of rules of the form in Grammar 3: R9 applies when the center and context belong to prefixes and suffixes, R10 applies when the center belongs to the stem and the context belongs to a prefix, and R11 applies when the center and context belong to the stem. (Since Syriac stems invariably end in consonants, there is no rule for the case when the center belongs to a suffix and the right context to the stem in this case.) To cover all six plosives, 18 rules are required. If, however, the rules are to apply on the linearized lexical form, each plosive requires only one rule similar to R9 (a total of six rules).</Paragraph>
      <Paragraph position="8"/>
    </Section>
  </Section>
  <Section position="5" start_page="361215133" end_page="361215133" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> Using a linearized form provides a pragmatic solution to the problems discussed above. While the templatic morphology issues are resolved using a multi-tape grammar, the linear-in-nature phonological/graphemic issues are dealt with using a two-tape grammar as in any other Western language. As illustrated with the vowel deletion rule above, this makes the task of the grammar writer easier by far.</Paragraph>
    <Paragraph position="1"> In addition, the size of the intermediate automata is substantially decreased in terms of space complexity.</Paragraph>
    <Paragraph position="2"> There is another advantage of this model if used in a multi-lingual Semitic environment system. We noted above how the derivation of /katab/ in Arabic and Syriac is similar. The only difference is that in the latter a vowel deletion rule takes place. It is then possible to generalize the lexical-to-linearized-form module for more than one Semitic language. At the abstract finite-state level, our solution may have some similarities with the proposal of (Kornai, 1991), which aims at modeling autosegmental phonology by coding nonlinear autosegmental representations as linear strings. Kornai's approach linearizes the lexical nonlinear representation from the outset using a number of coding mechanisms.</Paragraph>
  </Section>
</Paper>