<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1017">
  <Title>Transducers from Rewrite Rules with Backreferences</Title>
  <Section position="4" start_page="126" end_page="130" type="metho">
    <SectionTitle>
2 The Algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="126" end_page="128" type="sub_section">
      <SectionTitle>
2.1 Preliminary Considerations
</SectionTitle>
      <Paragraph position="0"> Before presenting the algorithm proper, we deal with a few preliminary issues. First, we introduce our version of the finite state calculus in §2.1.1. The treatment of special marker symbols is discussed in §2.1.2. Then, in §2.1.3, we discuss various utilities that will be essential for the algorithm. The algorithm is implemented in the FSA Utilities (van Noord, 1997). We use the notation provided by the toolbox throughout this paper. Table 1 lists the relevant regular expression operators. FSA Utilities offers the possibility to define new regular expression operators. For example, consider the definition of the nullary operator vowel as the union of the five vowels: macro(vowel, {a,e,i,o,u}).</Paragraph>
      <Paragraph position="1"> In such macro definitions, Prolog variables can be used in order to define new n-ary regular expression operators in terms of existing operators. For instance, the lenient_composition operator (Karttunen, 1998) is defined by: macro(priority_union(Q,R), {Q, ~domain(Q) o R}).</Paragraph>
      <Paragraph position="2"> macro(lenient_composition(R,C), priority_union(R o C, R)).</Paragraph>
      <Paragraph position="3"> Here, priority_union of two regular expressions Q and R is defined as the union of Q and the composition of the complement of the domain of Q with R. Lenient composition of R and C is defined as the priority union of the composition of R and C (on the one hand) and R (on the other hand).</Paragraph>
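      <Paragraph> The two definitions above can be illustrated outside the toolbox. The following Python sketch is not part of FSA Utilities; it assumes that a relation can be modelled as a finite set of (input, output) string pairs, and the example relations r and c are invented for illustration only.

```python
# Relations are modelled as finite sets of (input, output) string pairs.
# This is an illustrative model only; FSA Utilities operates on transducers.

def compose(r, c):
    """Relational composition: feed every output of r into c."""
    return {(x, z) for (x, y) in r for (y2, z) in c if y == y2}

def domain(r):
    """Set of input strings for which r has at least one output."""
    return {x for (x, _) in r}

def priority_union(q, r):
    """Union of q with the part of r whose inputs lie outside domain(q)."""
    blocked = domain(q)
    return q | {(x, y) for (x, y) in r if x not in blocked}

def lenient_composition(r, c):
    """Apply constraint c where some output survives; otherwise keep r."""
    return priority_union(compose(r, c), r)

# r proposes candidate outputs; c is an identity filter accepting "ac" and "xx".
r = {("a", "ab"), ("a", "ac"), ("d", "db")}
c = {(s, s) for s in ["ac", "xx"]}
result = lenient_composition(r, c)
```

In this toy setting the input a keeps only the candidate that passes the filter, while the input d, for which no candidate passes, keeps its original output. This is the behaviour the prose above describes for lenient composition.</Paragraph>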
      <Paragraph position="4"> Some operators, however, require something more than simple macro expansion for their definition. For example, suppose a user wanted to match n occurrences of some pattern. The FSA Utilities already has the '*' and '+' quantifiers, but any other quantifier of this kind must be defined by the user. For this purpose, the FSA Utilities supplies simple Prolog hooks allowing this general quantifier to be defined as: macro(match_n(N,X),Regex) :- match_n(N,X,Regex).</Paragraph>
      <Paragraph position="6"> For example, match_n(3,a) is equivalent to the ordinary finite state calculus expression [a,a,a].</Paragraph>
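      <Paragraph> The helper predicate behind this hook is not reproduced in the text above. As a rough illustration of the expansion it presumably performs, the following Python sketch (a hypothetical stand-in, not the authors' Prolog code) builds the concatenation as a list, mirroring the bracket notation of the calculus.

```python
def match_n(n, x):
    """Return the concatenation [x, ..., x] containing n copies of x.

    A Python list stands in for the sequence notation of the calculus;
    n is assumed to be a non-negative integer.
    """
    return [x] * n

expanded = match_n(3, "a")
```

The point of the hook is that the expansion is computed at macro-definition time, so the resulting expression contains no trace of the counter n.</Paragraph>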
      <Paragraph position="7"> Finally, regular expression operators can be defined in terms of operations on the underlying automaton. In such cases, Prolog hooks for manipulating states and transitions may be used. This functionality has been used in van Noord and Gerdemann (1999) to provide an implementation of the algorithm in Mohri and Sproat (1996).</Paragraph>
      <Paragraph position="8">  Previous algorithms for compiling rewrite rules into transducers have followed Kaplan and Kay (1994) by introducing special marker symbols (markers) into strings in order to mark off candidate regions for replacement. The assumption is that these markers are outside the resulting transducer's alphabets. But previous algorithms have not ensured that the assumption holds.</Paragraph>
      <Paragraph position="9"> This problem was recognized by Karttunen (1996), whose algorithm starts with a filter transducer which filters out any string containing a marker. This is problematic for two reasons. First, when applied to a string that does happen to contain a marker, the algorithm will simply fail. Second, it leads to logical problems in the interpretation of complementation. Since the complement of a regular expression R is defined as Σ* - R, one needs to know whether the marker symbols are in Σ or not. This has not been clearly addressed in previous literature.</Paragraph>
      <Paragraph position="10"> We have taken a different approach by providing a contextual way of distinguishing markers from non-markers. Every symbol used in the algorithm is replaced by a pair of symbols, where the second member of the pair is a 1 if the first member is a marker and a 0 otherwise (footnote 2). As the first step in the algorithm, 0's are inserted after every symbol in the input string to indicate that initially every symbol is a non-marker. This is defined as: macro(non_markers, [?, []:0]*).</Paragraph>
      <Paragraph position="11"> Similarly, the following macro can be used to insert a 0 after every symbol in an arbitrary expression E.</Paragraph>
      <Paragraph position="12"> Footnote 2: This approach is similar to the idea of laying down tracks as in the compilation of monadic second-order logic into automata (Klarlund, 1997, p. 5). In fact, this technique could possibly be used for a more efficient implementation of our algorithm: instead of adding transitions over 0 and 1, one could represent the alphabet as bit sequences and then add a final 0 bit for any ordinary symbol and a final 1 bit for a marker symbol.</Paragraph>
      <Paragraph position="13"> macro(non_markers(E), range(E o non_markers)).</Paragraph>
      <Paragraph position="14"> Since E is a recognizer, it is first coerced to identity(E). This form of implicit conversion is standard in the finite state calculus.</Paragraph>
      <Paragraph position="15"> Note that 0 and 1 are perfectly ordinary alphabet symbols, which may also be used within a replacement. For example, the sequence [1,0] represents a non-marker use of the symbol 1.</Paragraph>
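      <Paragraph> The pairing scheme can be made concrete with a small sketch. The Python below is only an illustration under simplifying assumptions: strings are lists of symbol names, the flags are the strings "0" and "1", and the function names as_marker and strip_flags are hypothetical.

```python
def non_markers(symbols):
    """Insert the flag "0" after every symbol: a b c becomes a 0 b 0 c 0."""
    encoded = []
    for symbol in symbols:
        encoded.extend([symbol, "0"])
    return encoded

def as_marker(symbol):
    """Encode a marker occurrence: the symbol followed by the flag "1"."""
    return [symbol, "1"]

def strip_flags(encoded):
    """Undo non_markers; reject input that still contains a marker pair."""
    if len(encoded) % 2 != 0:
        raise ValueError("encoded strings consist of symbol/flag pairs")
    symbols, flags = encoded[0::2], encoded[1::2]
    if any(flag != "0" for flag in flags):
        raise ValueError("a marker pair is still present")
    return symbols

# The input may itself contain "1" or the bracket name as ordinary text.
text = ["a", "1", "rb2"]
encoded = non_markers(text)
# A genuine bracket inserted after the first symbol carries the flag "1".
with_bracket = encoded[:2] + as_marker("rb2") + encoded[2:]
```

Because the flag travels with each symbol, an input that happens to contain a bracket name or the digits 0 and 1 is never confused with a marker introduced by the algorithm.</Paragraph>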
      <Paragraph position="16"> Before describing the algorithm, it will be helpful to have at our disposal a few general tools, most of which were described already in Kaplan and Kay (1994). These tools, however, have been modified so that they work with our approach of distinguishing markers from ordinary symbols. So to begin with, we provide macros to describe the alphabet and the alphabet extended with marker symbols: macro(sig, [?,0]).</Paragraph>
      <Paragraph position="17"> macro(xsig, [?,{0,1}]).</Paragraph>
      <Paragraph position="18"> The macro xsig is useful for defining a specialized version of complementation and containment: macro(not(X), xsig* - X).</Paragraph>
      <Paragraph position="19"> macro($$(X), [xsig*, X, xsig*]).</Paragraph>
      <Paragraph position="20"> The algorithm uses four kinds of brackets, so it will be convenient to define macros for each of these brackets, and for a few disjunctions.</Paragraph>
      <Paragraph position="21"> macro(lb1, ['&lt;1',1]).
macro(lb2, ['&lt;2',1]).
macro(rb2, ['2&gt;',1]).
macro(rb1, ['1&gt;',1]).
macro(lb, {lb1,lb2}).
macro(rb, {rb1,rb2}).
macro(b1, {lb1,rb1}).
macro(b2, {lb2,rb2}).
macro(brack, {lb,rb}).</Paragraph>
      <Paragraph position="22">  As in Kaplan &amp; Kay, we define an Intro(S) operator that produces a transducer that freely introduces instances of S into an input string. We extend this idea to create a family of Intro operators. It is often the case that we want to freely introduce marker symbols into a string at any position except the beginning or the end.</Paragraph>
      <Paragraph position="23"> %% Free introduction
macro(intro(S), {xsig-S, [] x S}*).</Paragraph>
      <Paragraph position="24"> %% Introduction, except at begin
macro(xintro(S), {[], [xsig-S, intro(S)]}).
%% Introduction, except at end
macro(introx(S), {[], [intro(S), xsig-S]}).
%% Introduction, except at begin and end
macro(xintrox(S), {[], [xsig-S], [xsig-S, intro(S), xsig-S]}).</Paragraph>
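      <Paragraph> The four Intro variants differ only in whether an introduced symbol may appear at the first or last position of the output. The following Python sketch states this as a check on (input, output) pairs; it is a simplified model in which strings are sequences of plain symbols rather than symbol/flag pairs, and the function name and keyword arguments are hypothetical.

```python
def is_intro_output(inp, out, s, at_begin=True, at_end=True):
    """Check whether out arises from inp by freely inserting symbols from s.

    Defaults model intro(S). at_begin=False models xintro(S), at_end=False
    models introx(S), and both False model xintrox(S).
    """
    if any(sym in s for sym in inp):
        return False  # the input side only reads symbols outside S
    if [sym for sym in out if sym not in s] != list(inp):
        return False  # deleting the introduced symbols must give back inp
    if out and not at_begin and out[0] in s:
        return False  # no introduced symbol at the beginning
    if out and not at_end and out[-1] in s:
        return False  # no introduced symbol at the end
    return True

marker = {"#"}
free = is_intro_output("ab", ["#", "a", "#", "b"], marker)
not_at_begin = is_intro_output("ab", ["#", "a", "b"], marker, at_begin=False)
```

Note that for the restricted variants an empty input admits only the empty output, which corresponds to the [] alternative in the macros above.</Paragraph>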
      <Paragraph position="25"> This family of Intro operators is useful for defining a family of Ignore operators: macro(ign(E1,S), range(E1 o intro(S))).</Paragraph>
      <Paragraph position="26"> macro(xign(E1,S), range(E1 o xintro(S))).</Paragraph>
      <Paragraph position="27"> macro(ignx(E1,S), range(E1 o introx(S))).</Paragraph>
      <Paragraph position="28"> macro(xignx(E1,S), range(E1 o xintrox(S))). In order to create filter transducers to ensure that markers are placed in the correct positions, Kaplan &amp; Kay introduce the operator P-iff-S(L1,L2). A string is described by this expression iff each prefix in L1 is followed by a suffix in L2 and each suffix in L2 is preceded by a prefix in L1. In our approach, this is defined as: macro(if_p_then_s(L1,L2),</Paragraph>
      <Paragraph position="30"> To make the use of p_iff_s more convenient, we introduce a new operator l_iff_r(L,R), which describes strings where every string position is preceded by a string in L just in case it is followed by a string in R:</Paragraph>
      <Paragraph position="32"> Finally, we introduce a new operator if(Condition,Then,Else) for conditionals. This operator is extremely useful, but in order for it to work within the finite state calculus, one needs a convention as to what counts as a boolean true or false for the condition argument. It is possible to define true as the universal language and false as the empty language: macro(true, ? *). macro(false, {}).</Paragraph>
      <Paragraph position="33"> With these definitions, we can use the complement operator as negation, the intersection operator as conjunction and the union operator as disjunction. Arbitrary expressions may be coerced to booleans using the following macro: macro(coerce_to_boolean(E), range(E o (true x true))).</Paragraph>
      <Paragraph position="34"> Here, E should describe a recognizer. E is composed with the universal transducer, which transduces from anything (?*) to anything (?*). Now with this background, we can define the conditional.</Paragraph>
      <Paragraph position="36"/>
    </Section>
    <Section position="2" start_page="128" end_page="130" type="sub_section">
      <SectionTitle>
2.2 Implementation
</SectionTitle>
      <Paragraph position="0"> A rule of the form x → T(x)/λ__ρ will be written as replace(T,Lambda,Rho). Rules of the more general form x1 ... xn → T1(x1) ... Tn(xn)/λ__ρ will be discussed in §3. The algorithm consists of nine steps composed as in figure 1.</Paragraph>
      <Paragraph position="1"> The names of these steps are mostly derived from Karttunen (1995) and Mohri and Sproat (1996) even though the transductions involved are not exactly the same. In particular, the steps derived from Mohri &amp; Sproat (r, f, l1 and l2) will all be defined in terms of the finite state calculus as opposed to Mohri &amp; Sproat's approach of using low-level manipulation of states and transitions (footnote 3). The first step, non_markers, was already defined above. For the second step, we first consider a simple special case. If the empty string is in the language described by Right, then r(Right) should insert an rb2 in every string position. The definition of r(Right) is both simpler and more efficient if this is treated as a special case. To insert a bracket in every possible string position, we use: [[[] x rb2, sig]*, [] x rb2]. If the empty string is not in Right, then we must use intro(rb2) to introduce the marker rb2, followed by l_iff_r to ensure that such markers are immediately followed by a string in Right, or more precisely a string in Right where additional instances of rb2 are freely inserted in any position other than the beginning. This expression is written as:</Paragraph>
      <Paragraph position="3"> l_iff_r(rb2, xign(non_markers(R), rb2)))).</Paragraph>
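      <Paragraph> The net effect of the second step can be sketched procedurally. The Python below is not the finite state construction described above; it assumes that Right is given as a pattern for Python's re module, ignores the 0/1 flags, and uses the hypothetical function name mark_right_contexts with "]" standing in for rb2.

```python
import re

def mark_right_contexts(text, right_pattern, marker="]"):
    """Insert marker at every position that is followed by a string in Right.

    A position i qualifies when right_pattern matches some prefix of
    text[i:]. If Right contains the empty string, every position qualifies.
    """
    right = re.compile(right_pattern)
    pieces = []
    for i in range(len(text) + 1):
        if right.match(text, i):
            pieces.append(marker)
        if i != len(text):
            pieces.append(text[i])
    return "".join(pieces)

marked = mark_right_contexts("abcab", "b")
everywhere = mark_right_contexts("ab", "")
```

The second call illustrates the special case discussed above: when the empty string is in Right, a bracket lands in every string position, including the end of the string.</Paragraph>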
      <Paragraph position="4"> The third step, f(domain(T)), is implemented as:</Paragraph>
      <Paragraph position="5"> [Figure 1: definition of the replace operator. Surviving comment lines:] % ... in previous step.
% perform T's transduction on regions marked off by b1's.</Paragraph>
      <Paragraph position="6"> % ensure that lb1 must be preceded by a string in Left.</Paragraph>
      <Paragraph position="7"> % ensure that lb2 must not occur preceded by a string in Left.</Paragraph>
      <Paragraph position="8"> % remove the auxiliary 0's.</Paragraph>
      <Paragraph position="10"> The lb2 is first introduced and then, using l_iff_r, it is constrained to occur immediately before every instance of (ignoring complexities) Phi followed by an rb2. Phi needs to be marked as normal text using non_markers and then xignx is used to allow freely inserted lb2 and rb2 anywhere except at the beginning and end. The following lb2^ allows an optional lb2, which occurs when the empty string is in Phi.</Paragraph>
      <Paragraph position="11"> The fourth step is a guessing component which (ignoring complexities) looks for sequences of the form lb2 Phi rb2 and converts some of these into lb1 Phi rb1, where the b1 marking indicates that the sequence is a candidate for replacement.</Paragraph>
      <Paragraph position="12"> The complication is that Phi, as always, must be converted to non_markers(Phi) and instances of b2 need to be ignored. Furthermore, between pairs of lb1 and rb1, instances of lb2 are deleted.</Paragraph>
      <Paragraph position="13"> These lb2 markers have done their job and are no longer needed. Putting this all together, the definition is:</Paragraph>
      <Paragraph position="15"> ]*, xsig*]).</Paragraph>
      <Paragraph position="16"> The fifth step filters out non-longest matches produced in the previous step. For example (and simplifying a bit), if Phi is ab*, then a string of the form ... rb1 a b lb1 b ... should be ruled out since there is an instance of Phi (ignoring brackets except at the end) where there is an internal lb1. This is implemented as (footnote 4): The sixth step performs the transduction described by T. This step is straightforwardly implemented, where the main difficulty is getting T to apply to our specially marked string: macro(aux_replace(T), Footnote 4: ... optimized a bit: Since we know that an rb1 must be preceded by Phi, we can write: [ign_(non_markers(Phi),brack), rb1, xsig*]). This may lead to a more constrained (hence smaller) transducer.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="130" end_page="130" type="metho">
    <SectionTitle>
3 Longest Match Capturing
</SectionTitle>
    <Paragraph position="0"> As discussed in §1, the POSIX standard requires that multiple captures follow a longest match strategy. For multiple captures as in (3), one first establishes a longest match for domain(T1) ... domain(Tn).</Paragraph>
    <Paragraph position="1"> Then we ensure that each of domain(Ti) in turn is required to match as long as possible, with each one having priority over its rightward neighbors. To implement this, we define a macro lm_concat(Ts) and use it as: replace(lm_concat(Ts), Left, Right). Ensuring the longest overall match is delegated to the replace macro, so lm_concat(Ts) need only ensure that each individual transducer within Ts gets its proper left-to-right longest matching priority. This problem is mostly solved by the same techniques used to ensure the longest match within the replace macro. The only complication here is that Ts can be of unbounded length, so it is not possible to have a single expression in the finite state calculus that applies to all possible lengths. This means that we need something a little more powerful than mere macro expansion to construct the proper finite state calculus expression. The FSA Utilities provides a Prolog hook for this purpose. The resulting definition of lm_concat is given in figure 2.</Paragraph>
    <Paragraph position="2"> Suppose (as in Friedl (1997)), we want to match the following list of recognizers against the string topological and insert a marker in each boundary position. This reduces to applying:</Paragraph>
    <Paragraph position="4"> This expression transduces the string topological only to the string top#o#logical (footnote 5).</Paragraph>
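    <Paragraph> The left-to-right priority that lm_concat enforces can be illustrated with a small search procedure. The Python below is only a sketch: it is not the transducer construction of figure 2, the function name lm_concat_split is hypothetical, and the list of recognizers is an assumed example (finite sets of strings chosen so that the outcome agrees with the result stated above), since the original list is not reproduced in this text.

```python
def lm_concat_split(recognizers, text):
    """Split text into one piece per recognizer, longest pieces first on the left.

    Each recognizer is a finite set of strings. Earlier recognizers take
    priority: a later piece is only lengthened when this does not shorten
    any piece to its left. Returns None if the whole text cannot be covered.
    """
    if not recognizers:
        return [] if text == "" else None
    first, rest = recognizers[0], recognizers[1:]
    # Try candidates for the first recognizer from longest to shortest.
    for piece in sorted(first, key=len, reverse=True):
        if text.startswith(piece):
            tail = lm_concat_split(rest, text[len(piece):])
            if tail is not None:
                return [piece] + tail
    return None

# Assumed recognizers, loosely in the spirit of the example above.
recognizers = [{"to", "top"}, {"o", "polo"}, {"gical", "logical", "ological"}]
pieces = lm_concat_split(recognizers, "topological")
marked = "#".join(pieces)
```

The split to, polo, gical also covers the whole string, but it is rejected because the first recognizer can match the longer piece top while the remainder is still coverable.</Paragraph>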
  </Section>
  <Section position="6" start_page="130" end_page="132" type="metho">
    <SectionTitle>
4 Conclusions
</SectionTitle>
    <Paragraph position="0"> The algorithm presented here has extended previous algorithms for rewrite rules by adding a limited version of backreferencing. This allows the output of rewriting to be dependent on the form of the strings which are rewritten. This new feature brings techniques used in Perl-like languages into the finite state calculus. Such an integration is needed in practical applications where simple text processing needs to be combined with more sophisticated computational linguistics techniques.</Paragraph>
    <Paragraph position="1"> One particularly interesting example where backreferences are essential is cascaded deterministic (longest match) finite state parsing as described for example in Abney (1996) and various papers in (Roche and Schabes, 1997a).</Paragraph>
    <Paragraph position="2"> Clearly, the standard rewrite rules do not apply in this domain. If NP is an NP recognizer, it would not do to say NP → [NP]/λ__ρ. Nothing would force the string matched by the NP to the left of the arrow to be the same as the string matched by the NP to the right of the arrow.</Paragraph>
    <Paragraph position="3"> One advantage of using our algorithm for finite state parsing is that the left and right contexts may be used to bring in top-down filtering (footnote 6). An often cited advantage of finite state</Paragraph>
    <Paragraph position="4"> Footnote 5: An anonymous reviewer suggested that lm_concat could be implemented in the framework of Karttunen (1996) as: [to|top|o|polo] → ... #; Indeed the resulting transducer from this expression would transduce topological into top#o#logical. But unfortunately this transducer would also transduce polotopogical into polo#top#o#gical, since the notion of left-right ordering is lost in this expression.</Paragraph>
    <Paragraph position="6"> domains(Ts,Domains), concatT(Ts,ConcatTs).</Paragraph>
    <Paragraph position="7"> domains([],[]).</Paragraph>
    <Paragraph position="8"> domains([F|R0],[domain(F)|R]) :- domains(R0,R).
concatT([],[]).</Paragraph>
    <Paragraph position="9"> concatT([T|Ts], [inverse(non_markers) o T, lb1 x []|Rest]) :- concatT(Ts,Rest).
%% macro(mark_boundaries(L),Exp): This is the central component of lm_concat. For our
%% &quot;topological&quot; example we will have:</Paragraph>
    <Paragraph position="11"> aux_greed(L,[],Filters), compose_list(Filters,Composed0,Composed).</Paragraph>
    <Paragraph position="12"> aux_greed([H|T],Front,Filters) :- aux_greed(T,H,Front,Filters,_CurrentFilter).</Paragraph>
    <Paragraph position="13"> aux_greed([],F,_,[],[ign(non_markers(F),lb1)]).</Paragraph>
    <Paragraph position="14">  Proceedings of EACL '99 parsing is robustness. A constituent is found bottom up in an early level in the cascade even if that constituent does not ultimately contribute to an S in a later level of the cascade. While this is undoubtedly an advantage for certain applications, our approach would allow the introduction of some top-down filtering while maintaining the robustness of a bottom-up approach.</Paragraph>
    <Paragraph position="15"> A second advantage for robust finite state parsing is that bracketing could also include the notion of &quot;repair&quot; as in Abney (1990). One might, for example, want to say something like: xy</Paragraph>
  </Section>
</Paper>