<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1042"> <Title>Compiling Regular Formalisms with Rule Features into Finite-State Automata</Title> <Section position="3" start_page="0" end_page="330" type="metho"> <SectionTitle> 2 Regular Formalism with Rule Features </SectionTitle> <Paragraph position="0"> This work adopts the following notation for regular formalisms, cf. (Kaplan and Kay, 1994):</Paragraph> <Paragraph position="2"> where $\tau$, $\lambda$ and $\rho$ are n-way regular expressions which describe same-length relations. (An n-way regular expression is a regular expression whose terms are n-tuples of alphabetic symbols or the empty string $\epsilon$. A same-length relation is devoid of $\epsilon$. For clarity, the elements of the n-tuple are separated by colons: e.g., a:b:c* q:r:s describes the 3-relation $\{ (a^m q, b^m r, c^m s) \mid m \geq 0 \}$. Following current terminology, we call the first j elements 'surface' and the remaining elements 'lexical'.)</Paragraph> <Paragraph position="3"> [...] proposed by (Kiraz, 1996). More 'user-friendly' notations which allow mapping expressions of unequal length (e.g., (Grimley-Evans, Kiraz, and Pulman, 1996)) are mathematically equivalent to the above notation after rules are converted into same-length descriptions at some preprocessing stage.</Paragraph> <Paragraph position="5"> The arrows correspond to context restriction (CR), surface coercion (SC) and composite rules, respectively. A compound rule takes the form</Paragraph> <Paragraph position="7"> To accommodate rule features, each rule may be associated with an (n - j)-tuple of feature structures, each of the form $[\mathit{attribute}_1 = \mathit{val}_1, \mathit{attribute}_2 = \mathit{val}_2, \ldots]$ (3) i.e., an unordered set of attribute=val pairs. An attribute is an atomic label. A val can be an atom or a variable drawn from a predefined finite set of possible values. The ith element in the tuple corresponds to the (j + i)th element in rule expressions. 
By way of illustration, consider the simplified grammar in Figure 1 with j = 1.</Paragraph> <Paragraph position="8"> The four elements of the tuples are: surface, pattern, root, and vocalism. R1 and R2 sanction the first and third consonants, respectively. R3 and R4 sanction vowels. R5 is the gemination rule; it is only triggered if the given rule features are satisfied: [cat=verb] for the first lexical element (i.e., the pattern) and [measure=pa&quot;el] for the second element (i.e., the root). The rule also illustrates that $\tau$ can be a sequence of tuples. The derivation of /katteb/ is illustrated below:</Paragraph> <Paragraph position="10"> The numbers between the lexical expressions and the surface expression denote the rules in Figure 1 which sanction the given lexical-surface mappings.</Paragraph> <Paragraph position="11"> Rule features play a role in the semantics of rules: a $\Rightarrow$ states that if the contexts and rule features are satisfied, the rule is triggered; a $\Leftarrow$ states that if the contexts, lexical expressions and rule features are satisfied, then the rule is applied. For example, although R5 is devoid of context expressions, the rule is composite, indicating that if the root measure is pa&quot;el, then gemination must occur and vice versa. Note that in a compound rule, each set of contexts is associated with a feature structure of its own.</Paragraph> <Paragraph position="12"> What is meant by 'rule features are satisfied'? Regular grammars which make use of rule features normally interact with a lexicon. In our model, the lexicon consists of (n - j) sublexica corresponding to the lexical elements in the formalism. Each sub-lexical entry is associated with a feature structure.</Paragraph> <Paragraph position="13"> Rule features are satisfied if they match the feature structures of the lexical entries containing the lexical expressions in $\tau$, respectively. 
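The matching step just described can be sketched as follows. This is an illustration only, not the compilation method of this paper, which builds features into the automata instead of testing them at run time. The function names, the encoding of feature structures as Python dictionaries, the encoding of a lexical value with several alternatives as a set, the empty structure used for the vocalism entry, and the choice that an attribute missing from a lexical entry fails the match are all assumptions of the sketch.

```python
def features_match(rule_fs, entry_fs):
    """Return True if every attribute=value pair of the rule is compatible with the entry."""
    for attribute, value in rule_fs.items():
        if attribute not in entry_fs:
            # Assumption of this sketch: a missing attribute cannot be satisfied.
            return False
        allowed = entry_fs[attribute]
        # A lexical value may list several alternatives, e.g. measure=(p'al,pa"el).
        if not isinstance(allowed, (set, frozenset)):
            allowed = {allowed}
        if value not in allowed:
            return False
    return True


def rule_features_satisfied(rule_tuple, entry_tuple):
    """Match an (n - j)-tuple of rule feature structures element-wise against the
    feature structures of the corresponding lexical entries."""
    if len(rule_tuple) != len(entry_tuple):
        return False
    return all(features_match(r, e) for r, e in zip(rule_tuple, entry_tuple))


# Rule features of the gemination rule R5: pattern, root, vocalism.
r5 = ({"cat": "verb"}, {"measure": 'pa"el'}, {})
# Hypothetical lexical entries for the pattern, the root and the vocalism.
entries = ({"cat": "verb"}, {"measure": {"p'al", 'pa"el'}}, {})

print(rule_features_satisfied(r5, entries))  # prints True
```

An empty rule feature structure matches any entry, which is why the third element of the tuple places no constraint on the vocalism.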
Consider the lexicon in Figure 2 and rule R5 with $\tau$ = t:c2:t:0 t:0:0:0 and the rule features ([cat=verb], [measure=pa&quot;el], []). The lexical entries containing $\tau$ are {c1vc2vc3} and {ktb}, respectively. For the rule to be triggered, [cat=verb] of the rule must match [cat=verb] of the lexical entry {c1vc2vc3}, and [measure=pa&quot;el] of the rule must match [measure=(p'al,pa&quot;el)] of the lexical entry {ktb}.</Paragraph> <Paragraph position="14"> As a second illustration, R6 derives the simple p'al measure, /ktab/. Note that in R5 and R6, 1. the lexical expressions in both rules (ignoring 0s) are equivalent, 2. both rules are composite, and 3. they have different surface expressions in $\tau$.</Paragraph> <Paragraph position="15"> In a traditional rewrite formalism, such rules would contradict each other. However, this is not the case here since R5 and R6 have different rule features. The derivation of this measure is shown below (R7 completes the derivation by deleting the first vowel on the surface): vocalism: 0 a 0 a 0; root: k 0 t 0 b; pattern: c1 v c2 v c3; rules: 1 7 6 3 2; surface: k 0 t a b. Note that in order to remain within finite-state power, both the attributes and the values in feature structures must be atomic. The formalism allows a value to be a variable drawn from a predefined finite set of possible atomic values. In the compilation process, such variables are taken as the disjunction of all possible predefined values.</Paragraph> <Paragraph position="16"> Additionally, this version of rule feature matching does not cater for rules whose $\tau$ spans two lexical forms. 
It is possible, of course, to avoid this limitation by having rule features match the feature structures of both lexical entries in such cases.</Paragraph> </Section> <Section position="4" start_page="330" end_page="331" type="metho"> <SectionTitle> 3 Mathematical Preliminaries </SectionTitle> <Paragraph position="0"> We define here a number of operations which will be used in our compilation process.</Paragraph> <Paragraph position="1"> If an operator Op takes a number of arguments $(a_1, \ldots, a_k)$, the arguments are shown as a subscript, e.g. $\mathrm{Op}_{(a_1, \ldots, a_k)}$; the parentheses are omitted if there is only one argument. When the operator is mentioned without reference to arguments, it appears on its own, e.g. Op.</Paragraph> <Paragraph position="2"> Operations which are defined on tuples of strings can be extended to sets of tuples and relations. For example, if S is a tuple of strings and Op(S) is an operator defined on S, the operator can be extended to a relation R in the following manner</Paragraph> <Paragraph position="4"> Let L be a regular language. $\mathrm{Id}_n(L) = \{ X \mid X \text{ is an n-tuple of the form } (x, \ldots, x),\ x \in L \}$ is the n-way identity of L (Kaplan and Kay, 1994).</Paragraph> <Paragraph position="5"> Definition 3.2 (Insertion) Let R be a regular relation over the alphabet $\Sigma$ and let m be a set of symbols not necessarily in $\Sigma$. $\mathrm{Insert}_m(R)$ inserts the relation $\mathrm{Id}_n(a)$ for all $a \in m$, freely throughout R. $\mathrm{Insert}_m^{-1} \circ \mathrm{Insert}_m(R) = R$ removes all such instances if m is disjoint from $\Sigma$. Remark 3.2 We can define another form of Insert where the elements in m are tuples of symbols as follows: Let R be a regular relation over the alphabet $\Sigma$ and let m be a set of tuples of symbols not necessarily in $\Sigma$. $\mathrm{Insert}_m(R)$ inserts a, for all $a \in m$, freely throughout R.</Paragraph> <Paragraph position="6"> The current algorithm is motivated by the work of (Grimley-Evans, Kiraz, and Pulman, 1996). 
Intuitively, the automaton is built by three approximations as follows:</Paragraph> <Section position="1" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 4.1 Accepting $\tau$s </SectionTitle> <Paragraph position="0"> Let $\mathcal{T}$ be the set of all $\tau$s in a regular grammar, p be an auxiliary boundary symbol (not in the grammar's alphabets) and p' = $\mathrm{Id}_n(p)$. The first approximation is described by</Paragraph> <Paragraph position="2"> Centers accepts the symbol p', followed by zero or more $\tau$s, each (if any) followed by p'. In other words, the machine accepts all centers described by the grammar (each center surrounded by p') irrespective of their contexts.</Paragraph> <Paragraph position="3"> It is implementation dependent as to whether $\mathcal{T}$ includes other correspondences which are not explicitly given in rules (e.g., a set of additional feasible centers).</Paragraph> </Section> <Section position="2" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 4.2 Context Restriction Rules </SectionTitle> <Paragraph position="0"> For a given compound rule, the set of relations in which $\tau$ is invalid is</Paragraph> <Paragraph position="2"> i.e., $\tau$ in any context minus $\tau$ in all valid contexts.</Paragraph> <Paragraph position="3"> However, since in §4.1 above the symbol p appears freely, we need to introduce it in the above expression. The result becomes</Paragraph> <Paragraph position="5"> The above expression is only valid if $\tau$ consists of only one tuple. However, to allow it to be a sequence of such tuples as in R5 in Figure 1, it must be 1. surrounded by p' on both sides, and 2. devoid of p'.</Paragraph> <Paragraph position="6"> The first condition is accomplished by simply placing p' to the left and right of $\tau$. As for the second condition, we use an auxiliary symbol, $\omega$, as a place-holder representing $\tau$, introduce p freely, then substitute $\tau$ in place of $\omega$. 
Formally, let $\omega$ be an auxiliary symbol (not in the grammar's alphabet), and let $\omega' = \mathrm{Id}_n(\omega)$ be a place-holder representing $\tau$. The above expression becomes</Paragraph> <Paragraph position="8"> CR now accepts only the sequences of tuples which appear in contexts in the grammar (but including the partitioning symbols p'); however, it does not enforce surface coercion constraints.</Paragraph> </Section> <Section position="3" start_page="331" end_page="331" type="sub_section"> <SectionTitle> 4.3 Surface Coercion Rules </SectionTitle> <Paragraph position="0"> Let $\tau'$ represent the center of the rule with the correct lexical expressions and the incorrect surface expressions with respect to $\pi^{*}$,</Paragraph> <Paragraph position="2"> The coerce relation for a compound rule can be simply expressed by</Paragraph> <Paragraph position="4"> The two p's surrounding $\tau'$ ensure that coercion applies to at least one center of the rule.</Paragraph> <Paragraph position="5"> For all such expressions, we subtract Coerce from the automaton under construction, yielding</Paragraph> <Paragraph position="7"> SC now accepts all and only the sequences of tuples described by the grammar (but including the partitioning symbols p').</Paragraph> <Paragraph position="8"> It remains only to remove all instances of p from the final machine, and to determinize and minimize it.</Paragraph> <Paragraph position="9"> There are two methods for interpreting transducers. When interpreted as acceptors with n-tuples of symbols on each transition, they can be determinized using standard algorithms (Hopcroft and Ullman, 1979). 
When interpreted as a transduction that maps an input to an output, they cannot always be turned into a deterministic form (see (Mohri, 1994; Roche and Schabes, 1995)).</Paragraph> </Section> </Section> <Section position="5" start_page="331" end_page="333" type="metho"> <SectionTitle> 5 Compilation with Rule Features </SectionTitle> <Paragraph position="0"> This section shows how feature structures which are associated with rules and lexical entries can be incorporated into FSAs.</Paragraph> <Paragraph position="1"> 12 A special case can be added for epenthetic rules.</Paragraph> <Section position="1" start_page="332" end_page="332" type="sub_section"> <SectionTitle> 5.1 Intuitive Description </SectionTitle> <Paragraph position="0"> We shall describe our handling of rule features with a two-level example. Consider the following analysis.</Paragraph> <Paragraph position="1"> Lexical: a b c d $\beta$ e f $\beta$ g h i $\beta$; Rules: 1 2 3 4 5 6 7 5 8 9 10 5; Surface: a b c d 0 e f 0 g h i 0. The lexical expression contains the lexical forms {abcd}, {ef} and {ghi}, separated by a boundary symbol, $\beta$, which designates the end of a lexical entry.</Paragraph> <Paragraph position="2"> The numbers between the tapes represent the rules (in some grammar) which allow the given lexical-surface mappings.</Paragraph> <Paragraph position="3"> Assume that the above lexical forms are associated in the lexicon with the feature structures as in Figure 3. Further, assume that each two-level rule m, $1 \leq m \leq 10$, above is associated with the feature structure $F_m$. Hence, in order for the above two-level analysis to be valid, the feature structures of the rules that apply within a lexical form must match the feature structure of that form: $F_1$, $F_2$, $F_3$ and $F_4$ must match $f_1$; $F_6$ and $F_7$ must match $f_2$; and $F_8$, $F_9$ and $F_{10}$ must match $f_3$.</Paragraph> <Paragraph position="5"> Usually, boundary rules, e.g. 
rule 5 above, are not associated with feature structures, though there is nothing stopping the grammar writer from doing so.</Paragraph> <Paragraph position="6"> To match the feature structures associated with rules and those in the lexicon we proceed as follows.</Paragraph> <Paragraph position="7"> Firstly, we suffix each lexical entry in the lexicon with the boundary symbol, $\beta$, and its feature structure. (For simplicity, we consider a feature structure with instantiated values to be an atomic object of length one which can be a label of a transition in a FSA. How this is done is a matter of implementation.) Hence the above lexical forms become: 'abcd$\beta f_1$', 'ef$\beta f_2$', and 'ghi$\beta f_3$'. Secondly, we incorporate a feature structure of a rule into the rule's right context, $\rho$. For example, if $\rho$ of rule 1 above is b:b c:c, the context becomes b:b c:c $\pi^{*}$ 0:$F_1$ (12) (this simplified version of the expression suffices for the moment). In other words, in order for a:a to be sanctioned, it must be followed by the sequence: 1. b:b c:c, i.e., the original right context; 2. any feasible tuples, $\pi^{*}$; and 3. the rule's feature structure which is deleted on the surface, 0:$F_1$.</Paragraph> <Paragraph position="8"> This will succeed if and only if $F_1$ (of rule 1) and $f_1$ (of the lexical entry) are identical. The above analysis is repeated below with the feature structures incorporated into $\rho$.</Paragraph> <Paragraph position="9"> Lexical: a b c d $\beta$ $f_1$ e f $\beta$ $f_2$ g h i $\beta$ $f_3$; Rules: 1 2 3 4 5; 6 7 5; 8 9 10 5; Surface: a b c d 0 0 e f 0 0 g h i 0 0. As indicated earlier, in order to remain within finite-state power, all values in a feature structure must be instantiated. 
Since the formalism allows values to be variables drawn from a predefined finite set of possible values, variables entered by the user are replaced by a disjunction over all the possible values.</Paragraph> </Section> <Section position="2" start_page="332" end_page="333" type="sub_section"> <SectionTitle> 5.2 Compiling the Lexicon </SectionTitle> <Paragraph position="0"> Our aim is to construct a FSA which accepts any lexical entry from the ith sublexicon on its (j + i)th tape.</Paragraph> <Paragraph position="1"> A lexical entry $\mu$ (e.g., morpheme) which is associated with a feature structure $\phi$ is simply expressed by $\mu\beta\phi$, where $\beta$ is a (morpheme) boundary symbol which is not in the alphabet of the lexicon. The expression of sublexicon i with r entries becomes,</Paragraph> <Paragraph position="3"> We also compute the feasible feature structures of sublexicon i to be</Paragraph> <Paragraph position="5"> and the overall feasible feature structures on all sublexica to be</Paragraph> <Paragraph position="7"> The first element deletes all such features on the surface. For convenience in later expressions, we incorporate features with $\pi$ as follows</Paragraph> <Paragraph position="9"> The overall lexicon can be expressed by</Paragraph> <Paragraph position="11"> 14 To make the lexicon describe equal-length relations, a special symbol, say 0, is inserted throughout.</Paragraph> <Paragraph position="12"> The operator $\times$ creates one large lexicon out of all the sublexica. This lexicon can be substantially reduced by intersecting it with $\mathrm{Project}^{-1}$(~0). 
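The entry encoding used in this section (each entry followed by the boundary symbol and a single feature-structure label, with variables expanded into all their possible values) can be sketched as follows. This is a minimal illustration and not the compilation procedure itself: it produces plain symbol sequences, not automata. The names instantiate and encode_entry, the use of None to mark a variable, the tuple used as the atomic feature label, and the attributes cat and number with their value sets are hypothetical choices made for the example.

```python
from itertools import product

BOUNDARY = "β"  # stands for the (morpheme) boundary symbol; the spelling is an assumption


def instantiate(fs, domains):
    """Expand variable values into every fully instantiated feature structure.

    fs maps attributes to atoms, or to None for a variable (an encoding chosen
    for this sketch); domains maps each attribute to its predefined finite list
    of atomic values. Each result is a tuple of (attribute, value) pairs, used
    here as one atomic label of length one.
    """
    attributes = sorted(fs)
    choices = [[fs[a]] if fs[a] is not None else list(domains[a]) for a in attributes]
    return [tuple(zip(attributes, combo)) for combo in product(*choices)]


def encode_entry(entry, fs, domains):
    """Return one symbol sequence per instantiation: the entry's symbols,
    then the boundary symbol, then the feature-structure label."""
    return [list(entry) + [BOUNDARY, label] for label in instantiate(fs, domains)]


# Hypothetical attribute domains and a hypothetical entry with one variable value.
domains = {"cat": ["noun", "verb"], "number": ["sg", "pl"]}
for sequence in encode_entry("ef", {"cat": "verb", "number": None}, domains):
    print(sequence)
```

The variable on number yields two sequences, one per possible value, which corresponds to taking the disjunction over all predefined values.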
If a two-level grammar is compiled into an automaton, denoted by Gram, and a lexicon is compiled into an automaton, denoted by Lex, the automaton which enforces lexical constraints on the language is expressed by</Paragraph> <Paragraph position="14"> The first component above is a relation which accepts any surface symbol on its first tape and the lexicon on the remaining tapes.</Paragraph> </Section> <Section position="3" start_page="333" end_page="333" type="sub_section"> <SectionTitle> 5.3 Compiling Rules </SectionTitle> <Paragraph position="0"> A compound regular rule with m context-pairs and m rule features takes the form $\tau \; \{\Rightarrow, \Leftarrow, \Leftrightarrow\} \; \lambda^{1} \,\_\_\_\, \rho^{1} ; \lambda^{2} \,\_\_\_\, \rho^{2} ; \ldots ; \lambda^{m} \,\_\_\_\, \rho^{m} \quad [\phi^{1}, \phi^{2}, \ldots, \phi^{m}]$ (19) where $\tau$, $\lambda^{k}$, and $\rho^{k}$, $1 \leq k \leq m$, are as before and $\phi^{k}$ is the tuple of feature structures associated with rule k.</Paragraph> <Paragraph position="1"> The following modifications to the procedure given in section 4 are required.</Paragraph> <Paragraph position="2"> Forgetting contexts for the moment, our basic machine scans sequences of tuples (from $\mathcal{T}$), but requires that any sequence representing a lexical entry be followed by the entry's feature structure (from $\Phi$). This is achieved by modifying eq. 4 as follows:</Paragraph> <Paragraph position="4"> The expression accepts the symbol p', followed by zero or more occurrences of the following: 1. one or more $\tau$, each followed by p', and 2. a feature tuple in $\Phi$ followed by p'. In the second and third phases of the compilation process, we need to incorporate members of $\Phi$ freely throughout the contexts. For each $\lambda^{k}$, we compute the new left context</Paragraph> <Paragraph position="6"> The right context is more complicated. It requires that the first feature structure to appear to the right of $\tau$ is $\phi^{k}$. 
This is achieved by the expression, 7&quot;~ k = Inserto(p k) CI ~'*C/k~r~ (22) The intersection with a'*C/k,'r; ensures that the first feature structure to appear to the right of v is Ck: zero or more feasible tuples, followed by Ck, followed by zero or more feasible tuples or feature structures.</Paragraph> <Paragraph position="7"> Now we are ready to modify the Restrict relation.</Paragraph> <Paragraph position="8"> The first component in eq. 5 becomes A = (; U ~O)*vTr~ (23) The expression allows ~ to appear in the left and right contexts of v; however, at the left of v, the expression (Tr tO ~rC/) puts the restriction that the first tuple at the left end must be in a', not in C/.</Paragraph> <Paragraph position="9"> The second component in eq. 5 simply becomes</Paragraph> <Paragraph position="11"> Hence, Restrict becomes (after replacing v with w' in eq. 23 and eq. 24)</Paragraph> <Paragraph position="13"> In a similar manner, the Coercer relation becomes null</Paragraph> <Paragraph position="15"/> </Section> </Section> class="xml-element"></Paper>