<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1017">
  <Title>Transducers from Rewrite Rules with Backreferences</Title>
  <Section position="3" start_page="0" end_page="126" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Context-sensitive rewrite rules have been widely used in several areas of natural language processing. Johnson (1972) showed that such rewrite rules are equivalent to finite-state transducers in the special case that they are not allowed to rewrite their own output. An algorithm for compiling them into transducers was provided by Kaplan and Kay (1994). Improvements and extensions to this algorithm have been provided by Karttunen (1995), Karttunen (1996), Karttunen (1997) and Mohri and Sproat (1996).</Paragraph>
    <Paragraph position="1"> In this paper, the algorithm will be extended to provide a limited form of backreferencing. Backreferencing has been implicit in previous research, such as in the &quot;batch rules&quot; of Kaplan and Kay (1994), bracketing transducers for finite-state parsing (Karttunen, 1996), and the &quot;LocalExtension&quot; operation of Roche and Schabes (1995). The explicit use of backreferencing leads to more elegant and general solutions.</Paragraph>
    <Paragraph position="2"> Backreferencing is widely used in editors, scripting languages and other tools employing regular expressions (Friedl, 1997). For example, Emacs uses the special brackets \( and \) to capture strings along with the notation \n to recall the nth such string. The expression \(a*\)b\1 matches strings of the form a^n b a^n. Unrestricted use of backreferencing can thus introduce non-regular languages. For NLP finite-state calculi (Karttunen et al., 1996; van Noord, 1997) this is unacceptable. The form of backreferences introduced in this paper will therefore be restricted.</Paragraph>
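The Emacs example can be tried out with a minimal Python sketch. The choice of Python is an assumption of this illustration; its `re` module writes capturing groups as `(...)` rather than `\(...\)`. The pattern accepts exactly the strings a^n b a^n, a language that no finite-state automaton recognizes, and the helper name `is_an_b_an` is hypothetical.

```python
import re

# Python spelling of the Emacs pattern \(a*\)b\1: group 1 captures a run of
# a's, and \1 must then repeat exactly the captured string.
AN_B_AN = re.compile(r"(a*)b\1")


def is_an_b_an(s: str) -> bool:
    """Return True if s has the form a^n b a^n for some n >= 0."""
    return AN_B_AN.fullmatch(s) is not None


for candidate in ["b", "aba", "aabaa", "aaba", "abaa"]:
    print(candidate, is_an_b_an(candidate))
```

The first three candidates are accepted and the last two are rejected, because the number of a's on each side of b must agree.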
    <Paragraph position="3"> The central case of an allowable backreference is: x ⇒ T(x) / λ __ ρ (1)</Paragraph>
    <Paragraph position="5"> This says that each string x preceded by λ and followed by ρ is replaced by T(x), where λ and ρ are arbitrary regular expressions, and T is a transducer.¹ This contrasts sharply with the rewriting rules that follow the tradition of Kaplan &amp; Kay: φ ⇒ ψ / λ __ ρ (2)</Paragraph>
    <Paragraph position="7"> In this case, any string from the language φ is replaced by any string independently chosen from the language ψ.</Paragraph>
    <Paragraph position="8"> We also allow multiple (non-permuting) backreferences of the form: x1 x2 ... xn ⇒ T1(x1) T2(x2) ... Tn(xn) / λ __ ρ (3) ¹ The syntax at this point is merely suggestive. As an example, suppose that T_acr transduces phrases into acronyms. Then x ⇒ T_acr(x) / &lt;abbr&gt; __ &lt;/abbr&gt; would transduce &lt;abbr&gt;non-deterministic finite automaton&lt;/abbr&gt; into &lt;abbr&gt;NDFA&lt;/abbr&gt;.</Paragraph>
    <Paragraph position="9"> To compare this with a backreference in Perl, suppose that T_acr is a subroutine that converts phrases into acronyms and that R_acr is a regular expression matching phrases that can be converted into acronyms. Then (ignoring the left context) one can write something like: s/(R_acr)(?=&lt;/ABBR&gt;)/T_acr($1)/ge;. The backreference variable, $1, will be set to whatever string R_acr matches.</Paragraph>
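The Perl substitution above can be approximated in Python with `re.sub` and a callback. This is only a sketch under stated assumptions: `to_acronym` and `R_ACR` are hypothetical stand-ins for T_acr and R_acr, the left context is ignored as in the Perl version, and the right context is checked with a lookahead.

```python
import re

OPEN = "<abbr>"
CLOSE = "</abbr>"

# Hypothetical stand-in for R_acr: two or more words of letters and hyphens.
R_ACR = r"[A-Za-z-]+(?: [A-Za-z-]+)+"


def to_acronym(phrase: str) -> str:
    """Hypothetical stand-in for T_acr: keep the initial of every word part."""
    parts = re.split(r"[ -]+", phrase)
    return "".join(part[0].upper() for part in parts if part)


# Analogue of s/(R_acr)(?=CLOSE)/T_acr($1)/ge: the lookahead plays the role of
# the right context, and group 1 is the backreference handed to the callback.
RULE = re.compile("(" + R_ACR + ")(?=" + re.escape(CLOSE) + ")")


def rewrite(text: str) -> str:
    """Replace every phrase that precedes CLOSE by its acronym."""
    return RULE.sub(lambda m: to_acronym(m.group(1)), text)


print(rewrite("<abbr>non-deterministic finite automaton</abbr>"))
```

The callback receives the captured phrase, so the replacement depends on the matched string itself. This dependence is what distinguishes a backreference rule from a rule whose output is chosen independently of the input.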
    <Paragraph position="11"> Since transducers are closed under concatenation, handling multiple backreferences reduces to the problem of handling a single backreference: x ⇒ (T1 · T2 · ... · Tn)(x) / λ __ ρ (4) A problem arises if we want capturing to follow the POSIX standard requiring a longest-capture strategy. Friedl (1997) (p. 117), for example, discusses matching the regular expression (to|top)(o|polo)?(gical|o?logical) against the word: topological. The desired result is that (once an overall match is established) the first set of parentheses should capture the longest string possible (top); the second set should then match the longest string possible from what's left (o), and so on. Such a left-most longest match concatenation operation is described in §3.</Paragraph>
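To make the difference concrete, the following Python sketch contrasts the first-alternative-wins behaviour of a backtracking matcher (here Python's `re`) with a brute-force version of the longest-capture strategy on Friedl's example. The helper `longest_capture` and the set-based encoding of the three groups are assumptions made for this illustration; it is not the finite-state construction developed in the paper.

```python
import re

WORD = "topological"

# A backtracking matcher tries alternatives in order and keeps the first that
# leads to an overall match, so group 1 captures "to" rather than "top".
first_match = re.fullmatch(r"(to|top)(o|polo)?(gical|o?logical)", WORD)
print(first_match.groups())

# The same three groups written as finite sets of strings; "" encodes an
# optional part that is skipped.
GROUPS = [
    {"to", "top"},
    {"", "o", "polo"},
    {"gical", "logical", "ological"},
]


def longest_capture(word, groups):
    """Return the captures of a full match in which each group, from left to
    right, takes the longest string that still permits an overall match.
    Returns None if the word cannot be matched."""
    if not groups:
        return [] if word == "" else None
    # Try the longest candidates for the current group first.
    for candidate in sorted(groups[0], key=len, reverse=True):
        if word.startswith(candidate):
            rest = longest_capture(word[len(candidate):], groups[1:])
            if rest is not None:
                return [candidate] + rest
    return None


print(longest_capture(WORD, GROUPS))
```

The backtracking matcher captures `to`, `polo`, `gical`, whereas `longest_capture` returns `top`, `o`, `logical`, the result required by the longest-capture strategy.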
    <Paragraph position="12"> In the following section, we initially concentrate on the simple case in (1) and show how (1) may be compiled assuming left-to-right processing along with the overall longest match strategy described by Karttunen (1996).</Paragraph>
    <Paragraph position="13"> The major components of the algorithm are not new, but straightforward modifications of components presented in Karttunen (1996) and Mohri and Sproat (1996). We improve upon existing approaches because we solve a problem concerning the use of special marker symbols (§2.1.2). A further contribution is that all steps are implemented in a freely available system, the FSA Utilities of van Noord (1997) (§2.1.1).</Paragraph>
  </Section>
</Paper>