<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1015">
  <Title>Directed Replacement</Title>
  <Section position="4" start_page="109" end_page="109" type="metho">
    <SectionTitle>
2 Directed Replacement
</SectionTitle>
    <Paragraph position="0"> We define directed replacement by means of a composition of regular relations. As in Kaplan and Kay (1994), Karttunen (1995), and other previous works on related topics, the intermediate levels of the composition introduce auxiliary symbols to express and enforce constraints on the replacement relation. Figure 6 shows the component relations and how they are composed with the input.</Paragraph>
    <Paragraph position="1"> If the four relations on the bottom of Figure 6 are composed in advance, as our compiler does, the application of the replacement to an input string takes place in one step without any intervening levels and with no auxiliary symbols. But to understand the logic, it helps to see where the auxiliary marks would be in the hypothetical intermediate results. Let us consider the case of a b | b | b a | a b a @-&gt; x applying to the string &quot;aba&quot; and see in detail how the mapping implemented by the transducer in Figure 4 is composed from the four component relations. We use three auxiliary symbols, caret (^), left bracket (&lt;) and right bracket (&gt;), assuming here that they do not occur in any input. The first step, shown in Figure 7, composes the input string with a transducer that inserts a caret at the beginning of every substring that belongs to the upper language.</Paragraph>
    <Paragraph position="2"> [Figure 7: ... beginning of a substring that matches &quot;ab&quot;, &quot;b&quot;, &quot;ba&quot;, or &quot;aba&quot;.]</Paragraph>
    <Paragraph position="3"> Note that only one ^ is inserted even if there are several candidate strings starting at the same location.</Paragraph>
    <Paragraph position="4"> In the left-to-right step, we enclose in angle brackets all the substrings starting at a location marked by a caret that are instances of the upper language. The initial caret is replaced by a &lt;, and a closing &gt; is inserted to mark the end of the match. We permit carets to appear freely while matching. No carets are permitted outside the matched substrings and the ignored internal carets are eliminated. In this case, there are four possible outcomes, shown in Figure 8, but only two of them are allowed under the constraint that there can be no carets outside the brackets. In effect, no starting location for a replacement can be skipped over except in the context of another replacement starting further left in the input string. (Roche and Schabes (1995) introduce a similar technique for imposing the left-to-right order on the transduction.) Note that the four alternatives in Figure 8 represent the four factorizations in Figure 2.</Paragraph>
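    <Paragraph> The initial-match and left-to-right steps can be traced on the running example with a few lines of Python. This is only a sketch that simulates the two steps directly on strings; it is not the transducer construction, it assumes a finite upper language of non-empty strings given as a set, and the function names are hypothetical.
<![CDATA[
```python
UPPER = {"ab", "b", "ba", "aba"}  # upper language of the running example


def match_starts(s, upper):
    """Positions in s where at least one upper-language string begins."""
    return {i for i in range(len(s)) if any(s.startswith(u, i) for u in upper)}


def initial_match(s, upper):
    """Insert a single caret before every position that starts a match."""
    starts = match_starts(s, upper)
    return "".join(("^" if i in starts else "") + ch for i, ch in enumerate(s))


def left_to_right(s, upper, i=0):
    """Bracketings in which no match start is left outside the brackets."""
    if i == len(s):
        return [""]
    if i not in match_starts(s, upper):
        # No caret at this position: copy the symbol unchanged.
        return [s[i] + rest for rest in left_to_right(s, upper, i + 1)]
    results = []
    # A caret at this position: some match starting here must be bracketed.
    for u in sorted(upper):
        if s.startswith(u, i):
            for rest in left_to_right(s, upper, i + len(u)):
                results.append("<" + u + ">" + rest)
    return results


print(initial_match("aba", UPPER))
print(left_to_right("aba", UPPER))
```
]]>
    For the input &quot;aba&quot; the sketch places carets before the first and second symbols and keeps only the two bracketings that begin at the first caret.</Paragraph>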
    <Paragraph position="5"> The longest-match constraint is the identity relation on a certain set of strings. It forbids any replacement that starts at the same location as another, longer replacement. In the case at hand, it means that the internal &gt; is disallowed in the context &lt; a b &gt; a. Because &quot;aba&quot; is in the upper language, there is a longer, and therefore preferred, &lt; a b a &gt; alternative at the same starting location (Figure 9).</Paragraph>
  </Section>
  <Section position="5" start_page="109" end_page="111" type="metho">
    <SectionTitle>
ALLOWED NOT ALLOWED
</SectionTitle>
    <Paragraph position="0"> [Figure 9: ... language string with an initial &lt; and a nonfinal &gt; in the middle.]</Paragraph>
    <Paragraph position="1"> In the final replacement step, the bracketed regions of the input string, in the case at hand just &lt; a b a &gt;, are replaced by the strings of the lower language, yielding &quot;x&quot; as the result for our example. Note that the longest-match constraint ignores any internal brackets. For example, the bracketing &lt; a &gt; &lt; a &gt; is not allowed if the upper language contains &quot;aa&quot; as well as &quot;a&quot;. Similarly, the left-to-right constraint ignores any internal carets.</Paragraph>
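    <Paragraph> The longest-match filter and the final replacement can be sketched in the same string-level style. The code below is an illustration under simplifying assumptions (a finite upper language, a single lower string, no brackets in the input); the helper names are hypothetical and do not come from the formal definition.
<![CDATA[
```python
def longest_match_ok(bracketed, upper):
    """True if no bracketed region has a longer match at the same start.

    Internal brackets are ignored when looking for the longer match.
    """
    plain = bracketed.replace("<", "").replace(">", "")
    pos = 0       # current position in the bracket-free string
    start = None  # start position of the most recently opened region
    for ch in bracketed:
        if ch == "<":
            start = pos
        elif ch == ">":
            length = pos - start
            if any(len(u) > length and plain.startswith(u, start) for u in upper):
                return False
        else:
            pos += 1
    return True


def replace_regions(bracketed, lower):
    """Replace every bracketed region by the lower string; copy the rest."""
    out, inside = [], False
    for ch in bracketed:
        if ch == "<":
            inside = True
            out.append(lower)
        elif ch == ">":
            inside = False
        elif not inside:
            out.append(ch)
    return "".join(out)


UPPER = {"ab", "b", "ba", "aba"}
candidates = ["<ab>a", "<aba>"]  # the two left-to-right bracketings of "aba"
survivors = [c for c in candidates if longest_match_ok(c, UPPER)]
print(survivors)
print([replace_regions(c, "x") for c in survivors])
```
]]>
    Only the bracketing that covers the whole of &quot;aba&quot; survives the filter, and replacing its single region yields &quot;x&quot;.</Paragraph>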
    <Paragraph position="2"> As the first step towards a formal definition of UPPER @-&gt; LOWER, it is useful to make the notion of &quot;ignoring internal brackets&quot; more precise. Figure 10 contains the auxiliary definitions. For the details of the formalism (briefly explained in the Appendix), please consult Karttunen (1995) and Kempe and Karttunen (1996). (See footnote 3.)</Paragraph>
    <Paragraph position="4"> [Figure 10: ... final diacritics.]</Paragraph>
    <Paragraph position="5"> The precise definition of the UPPER @-&gt; LOWER relation is given in Figure 11. It is a composition of many auxiliary relations. We label the major components in accordance with the outline in Figure 6. The formulation of the longest-match constraint is based on a suggestion by Ronald M. Kaplan (p.c.).</Paragraph>
    <Paragraph position="6"> The logic of @-&gt; replacement could be encoded in many other ways, for example, by using the three pairs of auxiliary brackets, &lt;i, &gt;i, &lt;c, &gt;c, and &lt;a, &gt;a, introduced in Kaplan and Kay (1994). We take here a more minimalist approach. One reason is that we prefer to think of the simple unconditional (uncontexted) replacement as the basic case, as in Karttunen (1995). Without the additional complexities introduced by contexts, the directionality and length-of-match constraints can be encoded with fewer diacritics. (We believe that the conditional case can also be handled in a simpler way than in Kaplan and Kay (1994).) The number of auxiliary markers is an important consideration for some of the applications discussed below.</Paragraph>
    <Paragraph position="7"> Footnote 3: UPPER' is the same language as UPPER except that carets may appear freely in all nonfinal positions. Similarly, UPPER'' accepts any nonfinal brackets.</Paragraph>
    <Paragraph position="8"> In a phonological or morphological rewrite rule, the center part of the rule is typically very small: a modification, deletion or insertion of a single segment. On the other hand, in our text processing applications, the upper language may involve a large network representing, for example, a lexicon of multiword tokens. Practical experience shows that the presence of many auxiliary diacritics makes it difficult or impossible to compute the left-to-right and longest-match constraints in such cases. The size of intermediate states of the computation becomes a critical issue, while it is irrelevant for simple phonological rules. We will return to this issue in the discussion of tokenizing transducers in Section 4.</Paragraph>
    <Paragraph position="9"> The transducers derived from the definition in Figure 11 have the property that they unambiguously parse the input string into a sequence of substrings that are either copied to the output unchanged or replaced by some other strings. However, they do not fall neatly into any standard class of transducers discussed in the literature (Eilenberg 1974, Schützenberger 1977, Berstel 1979). If the LOWER language consists of a single string, then the relation encoded by the transducer is in Berstel's terms a rational function, and the network is an unambiguous transducer, even though it may contain states with outgoing transitions to two or more destinations for the same input symbol. An unambiguous transducer may also be sequentiable, in which case it can be turned into an equivalent sequential transducer (Mohri, 1994), which can in turn be minimized. A transducer is sequential just in case there are no states with more than one transition for the same input symbol. Roche and Schabes (1995) call such transducers deterministic.</Paragraph>
    <Paragraph position="10"> Our replacement transducers in general are not unambiguous because we allow LOWER to be any regular language. It may well turn out that, in all cases that are of practical interest, the lower language is in fact a singleton, or at least some finite set, but it is not so by definition. Even if the replacement transducer is unambiguous, it may well be unsequentiable if UPPER is an infinite language. For example, the simple transducer for a+ b @-&gt; x in Figure 12 cannot be sequentialized. It has to replace any string of &quot;a&quot;s by &quot;x&quot; or copy it to the output unchanged depending on whether the string eventually terminates at &quot;b&quot;. It is obviously impossible for any finite-state [...]</Paragraph>
    <Paragraph position="11"> [...] Figure 13, a simple parallel replacement of the two auxiliary brackets that mark the selected regions. Because the placement of &lt; and &gt; is strictly controlled, they do not occur anywhere else.</Paragraph>
    <Paragraph position="12"> [...] in Figure 4 is sequentiable because there the choice between a and a:x just depends on the next input symbol.</Paragraph>
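    <Paragraph> The dependence on unbounded right context in the a+ b example can be observed with an ordinary regular-expression substitution. The sketch below uses Python's re module as a stand-in for the transducer; this is an assumption that is safe for this particular pattern, where the leftmost greedy match coincides with the left-to-right, longest match, but it is not how the network is compiled.
<![CDATA[
```python
import re


def replace_a_plus_b(s):
    """Replace each maximal a+ b by x; everything else is copied unchanged."""
    return re.sub(r"a+b", "x", s)


# The output for the first "a" is unknown until the whole run has been read:
# the same run of "a"s is rewritten or copied depending on what follows it.
for run_length in (1, 3, 6):
    run = "a" * run_length
    print(repr(replace_a_plus_b(run + "b")), repr(replace_a_plus_b(run + "c")))
```
]]>
    However long the run of &quot;a&quot;s is, the first output symbol differs between the two inputs, so a device that reads one symbol at a time would have to postpone its output for the entire run.</Paragraph>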
    <Paragraph position="13"> Because none of the classical terms fits exactly, we have chosen a novel term, directed transduction, to describe a relation induced by the definition in Figure 11. It is meant to suggest that the mapping from the input into the output strings is guided by the directionality and length-of-match constraints.</Paragraph>
    <Paragraph position="14"> Depending on the characteristics of the UPPER and LOWER languages, the resulting transducers may be unambiguous and even sequential, but that is not guaranteed in the general case.</Paragraph>
  </Section>
  <Section position="6" start_page="111" end_page="112" type="metho">
    <SectionTitle>
3 Insertion
</SectionTitle>
    <Paragraph position="0"> The effect of the left-to-right and longest-match constraints is to factor any input string uniquely with respect to the upper language of the replace expression, that is, to parse it into a sequence of substrings that either belong or do not belong to the language. Instead of replacing the instances of the upper language in the input by other strings, we can also take advantage of the unique factorization in other ways.</Paragraph>
    <Paragraph position="1"> For example, we may insert a string before and after each substring that is an instance of the language in question simply to mark it as such.</Paragraph>
    <Paragraph position="2"> To implement this idea, we introduce the special symbol ... on the right-hand side of the replacement expression to mark the place around which the insertions are to be made. Thus we allow replacement expressions of the form UPPER @-&gt; PREFIX ... SUFFIX. The corresponding transducer locates the instances of UPPER in the input string under the left-to-right, longest-match regimen just described.</Paragraph>
    <Paragraph position="3"> But instead of replacing the matched strings, the transducer just copies them, inserting the specified prefix and suffix. For the sake of generality, we allow PREFIX and SUFFIX to denote any regular language.</Paragraph>
    <Paragraph position="4"> The definition of UPPER @-&gt; PREFIX ... SUFFIX is just as in Figure 11 except that the Replacement [...]</Paragraph>
    <Paragraph position="6"> With the ... expressions we can construct transducers that mark maximal instances of a regular language. For example, let us assume that noun phrases consist of an optional determiner, (d), any number of adjectives, a*, and one or more nouns, n+.</Paragraph>
    <Paragraph position="7"> The expression (d) a* n+ @-&gt; %[ ... %] compiles into a transducer that inserts brackets around maximal instances of the noun phrase pattern. For example, it maps &quot;dannvaan&quot; into &quot;[dann]v[aan]&quot;, as shown in Figure 14.</Paragraph>
    <Paragraph position="8"> [Figure 14: ... to &quot;dannvaan&quot;.] Although the input string &quot;dannvaan&quot; contains many other instances of the noun phrase pattern, &quot;n&quot;, &quot;an&quot;, &quot;nn&quot;, etc., the left-to-right and longest-match constraints pick out just the two maximal ones. The transducer is displayed in Figure 15. Note that ? here matches symbols, such as v, that are not included in the alphabet of the network.</Paragraph>
    <Paragraph position="9"> Figure 15: (d) a* n+ @-&gt; %[ ... %]. The one path with &quot;dannvaan&quot; on the upper side is: &lt;0 0:[ 7 d 3 a 3 n 4 n 4 0:] 5 v 0 0:[ 7 a 3 a 3 n 4 0:] 5&gt;.</Paragraph>
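    <Paragraph> The effect of the noun-phrase marker can be reproduced with a regular-expression substitution. The following sketch assumes Python's re module; for the pattern (d) a* n+ the greedy match found by re is also the longest one, so the result agrees with the left-to-right, longest-match regimen, although no transducer is built. The function name is hypothetical.
<![CDATA[
```python
import re

NP = re.compile(r"d?a*n+")  # (d) a* n+


def mark_np(s):
    """Copy s, inserting [ and ] around each maximal noun-phrase match."""
    return NP.sub(lambda m: "[" + m.group(0) + "]", s)


print(mark_np("dannvaan"))
```
]]>
    The shorter instances such as &quot;n&quot; or &quot;an&quot; are never marked on their own, because a longer match always starts at or before the same position.</Paragraph>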
  </Section>
  <Section position="7" start_page="112" end_page="113" type="metho">
    <SectionTitle>
4 Applications
</SectionTitle>
    <Paragraph position="0"> The directed replacement operators have many useful applications. We describe some of them. Although the same results could often be achieved by using lex and yacc, sed, awk, perl, and other Unix utilities, there is an advantage in using finite-state transducers for these tasks because they can then be smoothly integrated with other finite-state processes, such as morphological analysis by lexical transducers (Karttunen et al. 1992, Karttunen 1994) and rule-based part-of-speech disambiguation (Chanod and Tapanainen 1995, Roche and Schabes 1995).</Paragraph>
    <Section position="1" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
4.1 Tokenization
</SectionTitle>
      <Paragraph position="0"> A tokenizer is a device that segments an input string into a sequence of tokens. The insertion of end-of-token marks can be accomplished by a finite-state transducer that is compiled from tokenization rules.</Paragraph>
      <Paragraph position="1"> The tokenization rules may be of several types. For example, [WHITE_SPACE+ @-&gt; SPACE] is a normalizing transducer that reduces any sequence of tabs, spaces, and newlines to a single space. [LETTER+ @-&gt; ... END_OF_TOKEN] inserts a special mark, e.g. a newline, at the end of a letter sequence.</Paragraph>
      <Paragraph position="2"> Although a space generally counts as a token boundary, it can also be part of a multiword token, as in expressions like &quot;at least&quot;, &quot;head over heels&quot;, &quot;in spite of&quot;, etc. Thus the rule that introduces the END_OF_TOKEN symbol needs to combine the LETTER+ pattern with a list of multiword tokens which may include spaces, periods and other delimiters. The tokenizer in Figure 16 is composed of three transducers. The first reduces strings of whitespace characters to a single space. The second transducer inserts an END_OF_TOKEN mark after simple words and the listed multiword expressions. The third removes the spaces that are not part of some multiword token. The percent sign here means that the following blank is to be taken literally, that is, parsed as a symbol.</Paragraph>
      <Paragraph position="3"> Without the left-to-right, longest-match constraints, the tokenizing transducer would not produce deterministic output. Note that it must introduce an END_OF_TOKEN mark after a sequence of letters just in case the word is not part of some longer multiword token. This problem is complicated by the fact that the list of multiword tokens may contain overlapping expressions. A tokenizer for French, for example, needs to recognize &quot;de plus&quot; (moreover), &quot;en plus&quot; (more), &quot;en plus de&quot; (in addition to), and &quot;de plus en plus&quot; (more and more) as single tokens. Thus there is a token boundary after &quot;de plus&quot; in de plus on ne le fait plus (moreover one doesn't do it anymore) but not in on le fait de plus en plus (one does it more and more) where &quot;de plus en plus&quot; is a single token.</Paragraph>
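      <Paragraph> The interaction between simple words and overlapping multiword tokens can be illustrated with a small procedural tokenizer. The sketch below is written in Python and is only an approximation of the three-transducer cascade: it normalizes whitespace, then scans from left to right and takes the longest candidate at each position. The multiword list holds just the four French expressions mentioned above, the function name is hypothetical, and characters that start no token (spaces, punctuation) are simply dropped.
<![CDATA[
```python
import re

MULTIWORD = ["de plus", "en plus", "en plus de", "de plus en plus"]
LETTERS = re.compile(r"[^\W\d_]+")  # LETTER+ (any run of letters)


def tokenize(text, multiword=MULTIWORD):
    """Left-to-right, longest-match segmentation into tokens."""
    text = re.sub(r"\s+", " ", text)  # WHITE_SPACE+ reduced to one space
    tokens, i = [], 0
    while i < len(text):
        candidates = [m for m in multiword if text.startswith(m, i)]
        run = LETTERS.match(text, i)
        if run:
            candidates.append(run.group(0))
        if candidates:
            best = max(candidates, key=len)  # longest match wins
            tokens.append(best)
            i += len(best)
        else:
            i += 1  # a space or other symbol that starts no token
    return tokens


# One token per line, using a newline as the end-of-token mark.
print("\n".join(tokenize("de  plus\ton ne le fait plus")))
print("\n".join(tokenize("on le fait de plus en plus")))
```
]]>
      In the first sentence &quot;de plus&quot; is kept together and followed by a boundary; in the second the longer expression &quot;de plus en plus&quot; wins at the same starting position.</Paragraph>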
      <Paragraph position="4"> If the list of multiword tokens contains hundreds of expressions, it may require a lot of time and space to compile the tokenizer even if the final result is not too large. The number of auxiliary symbols used to encode the constraints has a critical effect on the efficiency of that computation. We first observed this phenomenon in the course of building a tokenizer for the British National Corpus according to the specifications of the BNC Users Guide (Leech, 1995), which lists around 300 multiword tokens and 260 foreign phrases. With the current definition of the directed replacement we have now been able to compute similar tokenizers for several other languages (French, Spanish, Italian, Portuguese, Dutch, German).</Paragraph>
    </Section>
    <Section position="2" start_page="112" end_page="113" type="sub_section">
      <SectionTitle>
4.2 Filtering
</SectionTitle>
      <Paragraph position="0"> Some text processing applications involve a preliminary stage in which the input stream is divided into regions that are passed on to the calling process and regions that are ignored. For example, in processing an SGML-coded document, we may wish to delete all the material that appears or does not appear in a region bounded by certain SGML tags, say &lt;A&gt; and &lt;/A&gt;.</Paragraph>
      <Paragraph position="1"> Both types of filters can easily be constructed using the directed replace operator. A negative filter that deletes all the material between the two SGML codes, including the codes themselves, is expressed as in Figure 17.</Paragraph>
      <Paragraph position="2">  A positive filter that excludes everything else can be expressed as in Figure 18.</Paragraph>
      <Paragraph position="3"> The positive filter is composed of two transducers. The first reduces to &lt;A&gt; any string that ends with it and does not contain the &lt;/A&gt; tag. The second transducer does a similar transduction on strings that begin with &lt;/A&gt;. Figure 19 illustrates the effect of the positive filter.</Paragraph>
      <Paragraph position="4"> Input: &lt;B&gt;one&lt;/B&gt;&lt;A&gt;two&lt;/A&gt;&lt;C&gt;three&lt;/C&gt;&lt;A&gt;four&lt;/A&gt;. Output: &lt;A&gt;two&lt;/A&gt;&lt;A&gt;four&lt;/A&gt;. By means of this simple &quot;bottom-up&quot; technique, it is possible to compile finite-state transducers that approximate a context-free parser up to a chosen depth of embedding. Of course, the left-to-right, longest-match regimen implies that some possible analyses are ignored. To produce all possible parses, we may introduce the ... notation to the simple replace expressions in Karttunen (1995).</Paragraph>
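      <Paragraph> The input/output behaviour of the negative and positive filters described in this section can be imitated with Python's re module. This sketch shows the effect only, not the replace expressions of Figures 17 and 18; it assumes that the tagged regions are not nested, and the function names are hypothetical.
<![CDATA[
```python
import re

# A region runs from an opening tag to the nearest closing tag.
REGION = re.compile(r"<A>.*?</A>", re.DOTALL)


def negative_filter(s):
    """Delete every tagged region, including the tags themselves."""
    return REGION.sub("", s)


def positive_filter(s):
    """Keep only the tagged regions and drop everything else."""
    return "".join(REGION.findall(s))


text = "<B>one</B><A>two</A><C>three</C><A>four</A>"
print(positive_filter(text))
print(negative_filter(text))
```
]]>
      The two functions are complementary: every symbol of the input ends up in the output of exactly one of them.</Paragraph>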
    </Section>
  </Section>
  <Section position="8" start_page="113" end_page="113" type="metho">
    <SectionTitle>
5 Extensions
</SectionTitle>
    <Paragraph position="0"> The idea of filtering by finite-state transduction of course does not depend on SGML codes. It can be applied to texts where the interesting and uninteresting regions are defined by any kind of regular pattern.</Paragraph>
    <Section position="1" start_page="113" end_page="113" type="sub_section">
      <SectionTitle>
4.3 Marking
</SectionTitle>
      <Paragraph position="0"> As we observed in Section 3, by using the ... symbol on the lower side of the replacement expression, we can construct transducers that mark instances of a regular language without changing the text in any other way. Such transducers have a wide range of applications. They can be used to locate all kinds of expressions that can be described by a regular pattern, such as proper names, dates, addresses, social security and phone numbers, and the like. Such a marking transducer can be viewed as a deterministic parser for a &quot;local grammar&quot; in the sense of Gross (1989), Roche (1993), Silberztein (1993) and others.</Paragraph>
      <Paragraph position="1"> By composing two or more marking transducers, we can also construct a single transducer that builds nested syntactic structures, up to any desired depth. To make the construction simpler, we can start by defining auxiliary symbols for the basic regular patterns. For example, we may define NP as [(d) a* n+]. With that abbreviatory convention, a composition of a simple NP and VP spotter can be defined as in Figure 20.</Paragraph>
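      <Paragraph> The idea of building nested structure by composing markers can be sketched as two successive substitutions in Python. The NP pattern is the one defined above; the VP pattern used here (one or more v followed by an optional marked NP) is a hypothetical choice made for illustration, as are the bracket labels and function names.
<![CDATA[
```python
import re


def mark_np(s):
    """Mark maximal instances of NP = (d) a* n+."""
    return re.sub(r"d?a*n+", lambda m: "[NP " + m.group(0) + "]", s)


def mark_vp(s):
    """Mark v+ followed by an optional, already marked NP (assumed pattern)."""
    return re.sub(r"v+(?:\[NP [^\]]*\])?", lambda m: "[VP " + m.group(0) + "]", s)


def parse(s):
    """Compose the two markers: NP first, then VP on the NP-marked string."""
    return mark_vp(mark_np(s))


print(mark_np("dannvaan"))
print(parse("dannvaan"))
```
]]>
      Because the second marker sees the brackets inserted by the first, the object noun phrase ends up nested inside the verb phrase.</Paragraph>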
      <Paragraph position="2"> [Figure 21: ... composite transducer to the string &quot;dannvaan&quot;.] The definition of the left-to-right, longest-match replacement can easily be modified for the three other directed replace operators mentioned in Figure 3.</Paragraph>
      <Paragraph position="3"> Another extension, already implemented, is a directed version of parallel replacement (Kempe and Karttunen 1996), which allows any number of replacements to be done simultaneously without interfering with each other. Figure 22 is an example of a directed parallel replacement. It yields a transducer that maps a string of &quot;a&quot;s into a single &quot;b&quot; and a string of &quot;b&quot;s into a single &quot;a&quot;.</Paragraph>
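      <Paragraph> The difference between simultaneous and successive replacement is easy to see on a small input. The Python sketch below assumes the pair of rules a+ @-&gt; b and b+ @-&gt; a; it uses a single regular-expression pass for the parallel case and two passes for the sequential case, and the function names are hypothetical.
<![CDATA[
```python
import re


def parallel_replace(s):
    """Apply a+ -> b and b+ -> a simultaneously in one left-to-right pass."""
    return re.sub(r"a+|b+", lambda m: "b" if m.group(0)[0] == "a" else "a", s)


def sequential_replace(s):
    """Apply the same two rules one after the other, for contrast."""
    return re.sub(r"b+", "a", re.sub(r"a+", "b", s))


print(parallel_replace("aaabb"))
print(sequential_replace("aaabb"))
```
]]>
      In the parallel version the &quot;b&quot; produced by the first rule is never seen by the second, which is what it means for the replacements not to interfere with each other.</Paragraph>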
      <Paragraph position="4">  The definition of directed parallel replacement requires no additions to the techniques already presented. In the near future we also plan to allow directional and length-of-match constraints in the more complicated case of conditional context-constrained replacement.</Paragraph>
    </Section>
  </Section>
</Paper>