<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2039"> <Title>Finite-State Reduplication in One-Level Prosodic Morphology</Title> <Section position="5" start_page="296" end_page="298" type="metho"> <SectionTitle> 3 Finite-State Methods </SectionTitle> <Paragraph position="0"> The present proposal differs from the state-labelled automata employed in One-Level Phonology by returning to conventional arc-labelled ones, but shares the idea that labels denote sets, which is advantageous for compact automata.</Paragraph> <Section position="1" start_page="296" end_page="297" type="sub_section"> <SectionTitle> 3.1 Enriched Representations </SectionTitle> <Paragraph position="0"> As motivated in §2, an appropriate automaton representation of morphemes that may undergo reduplication should provide generic support for three key operations: (i) copying or repetition of symbols, (ii) truncation or skipping, and (iii) infixation.</Paragraph> <Paragraph position="1"> For copying, the idea is to enrich the FSA representing a morpheme by encoding stepwise repetition locally. For every content arc $i \xrightarrow{c} j$ we add a reverse repeat arc $j \xrightarrow{\mathit{repeat}} i$. Following repeat arcs, we can now move backwards within a string, as we shall see in more detail below.</Paragraph> <Paragraph position="2"> For truncation, a similar local encoding is available: for every content arc $i \xrightarrow{c} j$, add another skip arc $i \xrightarrow{\mathit{skip}} j$. This allows us to move forward while suppressing the spellout of $c$.</Paragraph> <Paragraph position="3"> A generic recipe for infixation ensures that segmental material can be inserted anywhere within an existing morpheme FSA. A possible representational enrichment therefore adds a self loop $i \xrightarrow{\Sigma} i$ labelled with the symbol alphabet $\Sigma$ to every state $i$ of the FSA. 
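The three enrichments can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the arc representation as (source, label, target) triples, the function name enrich, and the single placeholder label SIGMA standing for the whole alphabet are all assumptions made here.

```python
# Hypothetical representation: an epsilon-free FSA is a set of
# (source_state, label, target_state) triples.
REPEAT, SKIP, SIGMA = "repeat", "skip", "SIGMA"

def enrich(arcs, states):
    """Return the arcs plus repeat, skip and self-loop enrichments."""
    enriched = set(arcs)
    for (i, label, j) in arcs:
        enriched.add((j, REPEAT, i))  # move backwards over the content arc
        enriched.add((i, SKIP, j))    # move forwards without spelling out the label
    for i in states:
        enriched.add((i, SIGMA, i))   # self loop that can host infixed material
    return enriched

# The base "wulu" as a linear automaton over states 0..4.
base = {(k, seg, k + 1) for k, seg in enumerate("wulu")}
wulu = enrich(base, range(5))
```

Each content arc contributes exactly one repeat arc and one skip arc, so the 1:1 correspondence between technical arcs and content arcs holds by construction in this sketch.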
Footnote 2: Each of the three enrichments presupposes an epsilon-free automaton in order to be well-behaved.</Paragraph> <Paragraph position="4"> This requirement in particular ensures that technical arcs (skip, repeat) are in 1:1 correspondence with content arcs, which is essential for unambiguous positional movement: e.g. add_skips(a ε b) would ambiguously require 1 or 2 skips to suppress the spellout of b, because it creates a disjunction of the empty string ε with skip. It is perhaps worth emphasizing that there is no special interpretation whatsoever for these technical arcs: the standard automaton semantics is unaffected. As a consequence, skip and repeat will be a visible part of the output in word form generation and must be allowed in the input for parsing as well.</Paragraph> <Paragraph position="5"> Taken together, the three enrichments yield an automaton for Bambara wulu, shown in figure 1.a.</Paragraph> <Paragraph position="6"> While skipping is not necessary for this example, the self loop $4 \xrightarrow{\Sigma} 4$ is: it will host the fixed-melody /o/. The repeat arcs will of course facilitate copying, as we shall see in a moment.</Paragraph> <Paragraph position="8"/> </Section> <Section position="2" start_page="297" end_page="297" type="sub_section"> <SectionTitle> 3.2 Copying as Intersection </SectionTitle> <Paragraph position="0"> Bird and Ellison (1992) came close to discovering a useful device for reduplication when they noted that automaton intersection has at least indexed-grammar power (ibid., p.48). 
They demonstrated their claim by showing that odd-length strings of indefinite length like the one described by the regular expression (abcdefg)+ can be repeated by intersecting them with an automaton accepting only strings of even length: the result is (abcdefgabcdefg)+.</Paragraph> <Paragraph position="1"> Generalizing from their artificial example, let us first make one additional minor enrichment by tagging the edges of the reduplicative portion of a base with synchronization bits :1, while using the opposite value :0 for the interior part (see figure 1.a). This gives us a segment-independent handle on those edges and a regular expression seg:1 seg:0* seg:1 for the whole synchronized portion (seg abbreviates the set of phonological segments).</Paragraph> <Paragraph position="2"> Assuming repeat-enriched bases, a total reduplication morpheme can now be seen as a partial word specification which mentions two synchronized portions separated by an arbitrary-length move backwards: (4) seg:1 seg:0* seg:1 repeat* seg:1 seg:0* seg:1 Moreover, total reduplicative copying is now simply intersection of the base and (4), or, in the Bambara case, a simple variant that adds the /o/ (figure 1.b). Disregarding self loops for the moment, the reader may verify that no expansion of the Kleene-starred repeat that traverses fewer than |base| segments will satisfy the demand for two synchronized portions. 
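The verification left to the reader can be replayed with a small Python sketch. It is an illustration under stated assumptions rather than the paper's implementation: self loops and skip arcs are left out, base labels are (segment, sync_bit) pairs, pattern labels 1 and 0 stand for seg:1 and seg:0, and the names matches and intersect_paths are invented here.

```python
REPEAT = "repeat"

def matches(pattern_label, base_label):
    """A pattern label denotes a set: 'repeat', or any segment with that sync bit."""
    if pattern_label == REPEAT:
        return base_label == REPEAT
    return base_label != REPEAT and base_label[1] == pattern_label

def intersect_paths(base_arcs, base_final, pattern_arcs, pattern_final):
    """Enumerate the label sequences accepted by both automata (product search)."""
    results = []
    stack = [((0, 0), [])]  # both automata start in state 0
    while stack:
        (b, p), path = stack.pop()
        if b == base_final and p == pattern_final:
            results.append(path)
        for (b1, label, b2) in base_arcs:
            if b1 != b:
                continue
            for (p1, plabel, p2) in pattern_arcs:
                if p1 == p and matches(plabel, label):
                    stack.append(((b2, p2), path + [label]))
    return results

# Repeat-enriched base wulu with synchronization bits w:1 u:0 l:0 u:1.
segments = [("w", 1), ("u", 0), ("l", 0), ("u", 1)]
base_arcs = []
for k, seg in enumerate(segments):
    base_arcs.append((k, seg, k + 1))
    base_arcs.append((k + 1, REPEAT, k))

# Pattern (4): seg:1 seg:0* seg:1 repeat* seg:1 seg:0* seg:1
pattern_arcs = [(0, 1, 1), (1, 0, 1), (1, 1, 2), (2, REPEAT, 2),
                (2, 1, 3), (3, 0, 3), (3, 1, 4)]

paths = intersect_paths(base_arcs, 4, pattern_arcs, 4)
surface = "".join(lab[0] for lab in paths[0] if lab != REPEAT)
```

In this toy setting the search terminates because repeat arcs only move backwards and content arcs only move forwards within each pattern state. The only surviving path spells out the base, takes four repeat arcs back to the initial state, and spells out the base again.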
Semai requires another slight variant of (4) which skips the interior of the base in the reduplicant: (5) seg:1 skip* seg:1 repeat* seg:1 seg:0* seg:1 The identification of copying with intersection not only allows for great flexibility in describing the full range of actual reduplicative constructions with regular expressions, it also reuses the central operation for constraint combination that is independently required for one-level morphology and phonology.</Paragraph> <Paragraph position="3"> Any improvement in efficient implementation of intersection therefore has immediate benefits for grammar computation as a whole. In contrast, a hypothetical setup where a dedicated total copy device is sandwiched between finite-state transducers seems much less elegant and may require additional machinery to detect copies during parsing.</Paragraph> <Paragraph position="4"> Note that it is in fact possible to compute reduplication-as-intersection over an entire lexicon of bases (see figure 3 for an example), provided that repeat arcs are added individually to each base. Enriched base FSAs can then be unioned together and undergo further automaton transformations such as determinization or minimization. This restriction is necessary because our finite-state method cannot express token identity as normally required in string repetition. Rather than identifying the same token, it addresses the same string position, using the weaker notion of type identity. Therefore, application of the method is only safe if strings are effectively isolated from one another, which is exactly what per-base enrichment achieves. 
See §3.4 for a suggestion on how to lift the restriction in practice.</Paragraph> </Section> <Section position="3" start_page="297" end_page="298" type="sub_section"> <SectionTitle> 3.3 Resource Consciousness </SectionTitle> <Paragraph position="0"> One pays a certain price for allowing general repetition and infixation: because of its self loops and technical arcs, the automaton of figure 1.a overgenerates wildly. Also, during intersection, self loops can absorb other morphemes in unexpected ways. A possible diagnosis of the underlying defect is that we need to distinguish between producers and consumers of information. In analogy to LFG's constraint vs. constraining equations, information may only be consumed if it has been produced at least once.</Paragraph> <Paragraph position="1"> For automata, let us spend a P/C bit per arc, with P/C=1 for producer arcs and P/C=0 for consumer arcs.</Paragraph> <Paragraph position="2"> In open interpretation mode, then, intersection combines the P/C bits of compatible arcs via logical OR, making producers dominant. It follows that a resource may be multiply consumed, which has obvious advantages for our application, the multiple realization of string symbols. A final step of closed interpretation prunes all consumer-only arcs that survived constraint interaction, in what may be seen as intersection with the universal producer language under logical-AND combination of P/C bits.</Paragraph> <Paragraph position="3"> Using these resource-conscious notions, we can now model both the default absence of material and purely contextual requirements as consumer-type information: unless satisfied by lexical resources that have been explicitly produced, the corresponding arcs will not be part of the result. By convention, producers are displayed in bold. 
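The two interpretation modes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: arcs are (source, label, pc_bit, target) tuples, labels are compared by plain equality instead of set intersection, and the function names intersect_open and close_interpretation are invented here. Pruning of states that become unreachable is omitted.

```python
def intersect_open(arcs_a, arcs_b):
    """Open interpretation: product of compatible arcs, P/C bits combined by OR."""
    product = set()
    for (a1, label_a, pc_a, a2) in arcs_a:
        for (b1, label_b, pc_b, b2) in arcs_b:
            if label_a == label_b:  # simplification: labels as atoms, not sets
                product.add(((a1, b1), label_a, pc_a | pc_b, (a2, b2)))
    return product

def close_interpretation(arcs):
    """Closed interpretation: drop arcs that were never licensed by a producer."""
    return {arc for arc in arcs if arc[2] == 1}

# Consumer self loops at base state 4 could absorb either /o/ or /a/ ...
base_arcs = {(4, "o", 0, 4), (4, "a", 0, 4)}
# ... but the (hypothetical) morpheme only produces /o/; its /a/ arc is a consumer.
morpheme_arcs = {(0, "o", 1, 1), (0, "a", 0, 0)}

open_result = intersect_open(base_arcs, morpheme_arcs)
closed_result = close_interpretation(open_result)
```

After the closed step only the /o/ arc remains, since it is the only one for which at least one of the intersected arcs was a producer.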
Thus, the exact result of figure 1.a $\cap$ 1.b after closed interpretation is: w:1 u:0 l:0 u:1 o repeat 4 repeat* w:1 u:0 l:0 u:1 This expression also illustrates that, for parsing, strings like wuluowulu need to be consumer-self-loop-enriched via a small preprocessing step, because intersection with the grammar would otherwise fail due to unmentioned technical arcs such as repeat. Because our proposal is fully declarative, parsing then reduces to intersecting the enriched parse string with the grammar-and-lexicon automaton (whose construction will itself involve intersection) in closed interpretation mode, followed by a check for nonemptiness of the result. Whereas the original parse string was underspecified for morphological categories, the parse result for a realistic morphology system will, in addition to technical arcs, contain fully specified category arcs in some predefined linearization order, which can be efficiently retrieved if desired.</Paragraph> </Section> <Section position="4" start_page="298" end_page="298" type="sub_section"> <SectionTitle> 3.4 On-demand Algorithms </SectionTitle> <Paragraph position="0"> The above method is particularly attractive if some of its operations can be performed online, since a full-form lexicon of productive reduplications is clearly undesirable, e.g. for Bambara. I therefore briefly consider questions of efficient implementation of these operations.</Paragraph> <Paragraph position="1"> Mohri et al. (1998) identify the existence of a local computation rule as the main precondition (footnote 3) for a lazy implementation of automaton operations, i.e. one where results are only computed when demanded by subsequent operations. Such implementations are very advantageous when large intermediate automata may be constructed but only a small part of them is visited for any particular input. 
They show that such a rule exists for composition $\circ$, hence also for our operation of intersection ($A \cap B = \mathrm{range}(\mathrm{identity}(A) \circ \mathrm{identity}(B))$).</Paragraph> <Paragraph position="2"> Footnote 3: A second condition is that no state is visited that has not been discovered from the start state. It is easy to implement (6) so that this condition is fulfilled as well.</Paragraph> <Paragraph position="3"> Fortunately, the three enrichment steps all have local computation rules as well: (6) a. $q_1 \xrightarrow{c} q_2 \Rightarrow q_2 \xrightarrow{\mathit{repeat}} q_1$ b. $q_1 \xrightarrow{c} q_2 \Rightarrow q_1 \xrightarrow{\mathit{skip}} q_2$ c. $q \Rightarrow q \xrightarrow{\Sigma} q$ The impact of the existence of lazy implementations for enrichment operations is twofold: (a) we can now maintain minimized base lexicons for storage efficiency and add enrichments lazily to the currently pursued string hypothesis only, possibly modulated by exception diacritics that control when enrichment should or should not happen (footnote 4). And (b), laziness suffices to make the proposed reduplication method reasonably time-efficient, despite the larger number of online operations. Actual benchmarks from a pilot implementation are reported elsewhere (Walther, submitted).</Paragraph> </Section> </Section> <Section position="6" start_page="298" end_page="299" type="metho"> <SectionTitle> 4 A Worked Example </SectionTitle> <Paragraph position="0"> In this section I show how to implement the Koasati case from (3) using the FSA Utilities toolbox (van Noord, 1997). FSA Utilities is a Prolog-based finite-state toolkit and extendible regular expression compiler. It is freely available and encourages rapid prototyping.</Paragraph> <Paragraph position="1"> Figure 2 displays the regular expression operators that will be used (italicized operators are modifications or extensions). The grammar will be presented below in a piecewise fashion, with line numbers added for easy reference.</Paragraph> <Paragraph position="2"> Footnote 4: See Walther (submitted) for further details. 
With deterministic automata, the question of how to recover from a wrong string hypothesis during parsing is not an issue.</Paragraph> <Paragraph position="3"> Starting with the definition of stems (line 1), we add the three enrichments to the bare phonological string (2). However, the innermost producer-type string constructed by stringToAutomaton (3) is intersected with phonological constraints (5, 6) that need to see the string only, minus its enrichments. This is akin to lexical rule application.</Paragraph> <Paragraph position="4"> Lines 8-10 capture the V/h alternation that is characteristic of vowel-initial stems under reduplication, with the vocalic alternant constituting the default used in isolated pronunciation. In contrast, the /h/ alternant is concatenated with a consumer-type skip that requires a producer from elsewhere. Lines 12ff. define two example stems.</Paragraph> <Paragraph position="5"> The following constraint (15-18) enriches a prosodically underspecified string with moras $\mu$, abstract units of syllable weight (Hayes, 1995), a prerequisite to locating (20-24) and synchronization-marking (25-31) the first heavy syllable after which the reduplicative infix will be inserted.</Paragraph> <Paragraph position="6"> and after (p:1 i n:1) the infixation site need to be marked. Also, it turns out to be useful to classify base string positions for easy reference in the reduplicative morpheme, which motivates lines 32-34. The main part now is the reduplicative morpheme itself (35), which looks like a mixture of Bambara and Semai: the spellout of the base is followed by iterated repeats (36) to move back to its synchronized initial position (37), which (recall /h/) is required to be consonantal. The rest of the base is skipped before insertion of the fixed-melody part /o(o)/ occurs (38, 42-44). 
Proceeding with the interrupted realization of the base, we identify its beginning as a synchronized syllable onset (~ mora), followed by a right-synchronized string (39-40).</Paragraph> <Paragraph position="8"> Finally, some obvious word_level_constraints need to be defined (45-54), before the central intersection of Stem and punctual-aspect reduplication (57) completes our Koasati fragment:</Paragraph> <Paragraph position="10"> punctualaspect_reduplication).</Paragraph> <Paragraph position="11"> The result of wordform({tahaspin, aklatlin}) is shown in figure 3 ([ and ] are aliases for initial and final position).</Paragraph> <Paragraph position="12"> Space precludes the description of a final automaton operation called Bounded Local Optimization (Walther, 1999) that turns out to be useful here to ban unattested free length variation, as found e.g. in ak-ho(o)-latlin, where the length of o is yet to be determined. Suffice it to say that a parametrization of Bounded Local Optimization would prune the moraic arc $16 \rightarrow 19$ in figure 3 by considering it costlier than the non-moraic arc $16 \rightarrow 18$, thereby eliminating the last source of indeterminacy.</Paragraph> </Section> </Paper>