String Transformation Learning 
Giorgio Satta 
Dipartimento di Elettronica e Informatica 
Universith di Padova 
via Gradenigo, 6/A 
1-35131 Padova, Italy 
satt a@dei, unipd, it 
John C. Henderson 
Department of Computer Science 
Johns Hopkins University 
Baltimore, MD 21218-2694 
j hndrsn~cs, j hu. edu 
Abstract 
String transformation systems have been 
introduced in (Brill, 1995) and have sev- 
eral applications in natural language pro- 
cessing. In this work we consider the com- 
putational problem of automatically learn- 
ing from a given corpus the set of transfor- 
mations presenting the best evidence. We 
introduce an original data structure and 
efficient algorithms that learn some fam- 
ilies of transformations that are relevant 
for part-of-speech tagging and phonologi- 
cal rule systems. We also show that the 
same learning problem becomes NP-hard 
in cases of an unbounded use of don't care 
symbols in a transformation. 
1 Introduction 
Ordered sequences of rewriting rules are used in 
several applications in natural language process- 
ing, including phonological and morphological sys- 
tems (Kaplan and Kay, 1994), morphological disam- 
biguation, part-of-speech tagging and shallow syn- 
tactic parsing (Brill, 1995), (Karlsson et ah, 1995). 
In (Brill, 1995) a learning paradigm, called error- 
driven learning, has been introduced for automatic 
induction of a specific kind of rewriting rules called 
transformations, and it has been shown that the 
achieved accuracy of the resulting transformation 
systems is competitive with that of existing systems. 
In this work we further elaborate on the error- 
driven learning paradigm. Our main contribution 
is summarized in what follows. We consider some 
families of transformations and design efficient al- 
gorithms for the associated learning problem that 
improve existing methods. Our results are achieved 
by exploiting a data structure originally introduced 
in this work. This allows us to simultaneously repre- 
sent and test the search space of all possible transfor- 
mations. The transformations we investigate make 
use of classes of symbols, in order to generalize regu- 
larities in rule applications. We also show that when 
an unbounded number of these symbol classes are al- 
lowed within a transformation, then the associated 
learning problem becomes NP-hard. 
The notation we use in the remainder of the paper 
is briefly introduced here. ~3 denotes a fixed, finite 
alphabet and e the null string. E* and E+ are the 
set of all strings and all non-null strings over E, re- 
spectively. Let w 6 E*. We denote by Iwl the length 
ofw. Let w = uxv; uis aprefix and v is a suf- 
fix of w; when x is non-null, it is called a factor of 
w. The suffix of w of length i is denoted suffi(w), 
for O < i _< Iwl. Assume that x is non-null, and 
w = uixsuffi(w ) for ~ > 0 different values of i but 
not for ~ + 1, or x is not a factor of w and ~ = 0. 
Then we say that ~ is the statistic of factor z in w. 
2 The learning paradigm 
The learning paradigm we adopt is called error- 
driven learning and has been originally proposed 
in (Brill, 1995) for part of speech tagging applica- 
tions. We briefly introduce here the basic assump- 
tions of the approach. 
A string transformation is a rewriting rule de- 
noted as u -* v, where u and v are strings such that 
\[u\[ = Ivt. This means that ifu appears as a factor of 
some string w, then u should be replaced by v in w. 
The application of the transformation might be con- 
ditioned by the requirement that some additionally 
specified pattern matches some part of the string w 
to be rewritten. 
We now describe how transformations can be au- 
tomatically learned. A pair of strings (w, w') is an 
aligned pair if IT\[ = \]w'\[. When w = uzsuffi(w), 
w' = u'x'suffi(w' ) and Ixl = Ix'l, we say that fac- 
tors x and x' occur at aligned positions within 
(w, w'). A multi-set of aligned pairs is called an 
aligned corpus. Let (w, w ') be an aligned pair and 
let 7- be some transformation of the form u --~ v. 
The positive evidence of v (w.r.t. (w, w')) is the 
number of different positions at which factors u and 
v are aligned within (w, w'). The negative evi- 
dence of r (w.r.t. w, w ~) is the number of different 
positions at which factors u and u are aligned within 
444 
¢ 
a ~ 
( 
C 
$' 
$ 
-a., 
al 
(p 
$ cl 
q~ 
$, 
\[1,2~ 
Figure 1: Trie and suffix tree for string w = accbacac$. Pair \[i, j\] denotes the factor of w starting at position 
i and ending at position j (hence \[1, 2\] denotes ac). 
(w, w'). Intuitively speaking, positive (negative) ev- 
idence is a count of how many times we will do well 
(badly, respectively) when using v on w in trying to 
get w'. The score associated with v is the differ- 
ence between the positive evidence and the negative 
evidence of r. This extends to an aligned corpus 
in the obvious way. We are interested in the set of 
transformations that are associated with the high- 
est score in a given aligned corpus, and will develop 
algorithms to find such a set in the next sections. 
3 Data Structures 
This section introduces two data structures that are 
basic to the development of the algorithms presented 
in this paper. 
3.1 Suffix trees 
We briefly present here a data structure that is 
well known in the text processing literature; the 
reader is referred to (Crochemore and Rytter, 1994) 
and (Apostolico, 1985) for definitions and further 
references. 
Let w be some non-null string. Throughout the 
paper we assume that the rightmost symbol of w is 
an end-marker not found at any other position in 
the string. The suffix tree associated with w is a 
"compressed" trie of all strings suffi(w), 1 <i< Iwl. 
Edges are labeled by factors of w which are encoded 
by means of two natural numbers denoting endpoints 
in the string. An example is reported in Figure 1. 
An implicit node is a node not explicitly repre- 
sented in the suffix tree, that splits the label of some 
edge at a given position. (Each implicit node cor- 
responds to some node in the original trie having 
only one child.) We denote by parent(p) the parent 
node of (implicit) node p and by label(p, q) the la- 
bel of the edge spanning (implicit) nodes p and q. 
Throughout the paper, we take the dominance rela- 
tion between nodes to be reflexive, unless we write 
proper dominance. We also say that implicit node 
q immediately dominates node p if q splits the arc 
between parent(p) and p. Of main interest here are 
the following properties of suffix trees: 
• if node p has children Pl .... , Pd, then d _> 2 and 
strings label(p, pi) differ one from the other at 
the leftmost symbol; 
. all and only the factors of w are represented by 
paths from the root to some (implicit) node; 
• the statistic of factor u of w is the number of 
leaves dominated by the (implicit) node ending 
the path representing u. 
In the remainder of the paper, we sometimes identify 
an (implicit) node of a suffix tree with the factor 
represented by the path from the root to that node. 
The suffix tree and the statistics of all factors of 
w can be constructed/computed in time O(\[w\[), as 
reported in (Weiner, 1973) and (McCreight, 1976). 
McCreight algorithm uses two basic functions to 
scan paths in the suffix tree under construction. 
These functions are briefly introduced here and will 
be exploited in the next subsection. Below, p is a 
node in a tree and u is a non-null string. 
function Slow_scan(p, u): Starting at p, scan u sym- 
bol by symbol. Return the {implicit) node corre- 
sponding to the last matching symbol. 
The next function runs faster than Slow_scan, and 
can be used whenever we already know that u is an 
(implicit) node in the tree (u completely matches 
some path ill the tree). 
function Fast_scan(p, u): Starting at p, scan u by 
iteratively (i) finding the edge between the current 
node and one of its children, that has the same first 
symbol as the suffix of u yet to be scanned, and 
(ii) skipping a prefix of u equal to the length of the 
selected edge label. Return the (implicit) node u. 
445 
\[1,2~ / \"-M9,91 
d/\[8,9\] t::7~~ Na\[~ 
/\[8.9\]/\[6.9\] 
9,9\] 
7(13)~(2) . 
\[3~\] N"~ "91 
Figure 2: Suffix tree aligmnent for strings w = accbacac$, w' = acabacba$ and the identity homomorphism 
h(a) = a, h(b) = b, h(c) = c. Each a-link is denoted by indexing the incident nodes with the same integer 
number; if the incident node is an implicit node, then we add between parentheses the relative position w.r.t. 
the arc label. 
From each node au in the suffix tree, au some factor, 
McCreight's algorithm creates a pointer, called an s- 
link, to node u which necessarily exists in the suffix 
tree. We write q = s-link(p) if there is an s-link from 
ptoq. 
3.2 Suffix tree alignment 
In the next section each transformation will be as- 
sociated with several strings. Given an input text, 
we will compute transformation scores by comput- 
ing statistics of these strings. This can easily be 
done using suffix trees, and by pairing statistics cor- 
responding to the same transformation. The latter 
task can be done using the data structure originally 
introduced here. 
A total function h : E ~ E ~, ~ and E' two alpha- 
bets, is called a (restricted) homomorphism. We 
extend h to a string function in the usual way by 
posing h(¢) = s and h(au) = h(a)h(u), a E E and 
u E E*. Given w,w' E E +, we need to pair each 
factor u of w with factor h(u) possibly occurring in 
w ~. To solve this problem, we construct the suffix 
trees T,T' for w,w', respectively. Then we estab- 
lish an a-link (a pointer) from each node u of T, 
u some factor, to the (implicit) node h(u) of T ~, if 
h(u) exists. Furthermore, if factor ua with a E E is 
an (implicit) node of T such that h(u) but not h(ua) 
are (implicit) nodes of T', we create node u in T (if 
u was an implicit node) and establish an a-link from 
u to (implicit) node h(u) of T'. Note that the to- 
tal number of a-links is O(Iwl). The resulting data 
structure is called here suffix tree aligmnent. An 
example is reported in Figure 2. 
We now specify a method to compute suffix tree 
alignments. In what follows p,p~ are tree nodes and 
u is a non-null string. Crucially, we assume we can 
access the s-links of T and T'. Paths u and v in T 
and T', respectively, are aligned if v = h(u). The 
next two functions are used to move a-links up and 
down two aligned paths. 
function Move_link_down(p,p',u): Starting at. p 
and p', simultaneously scan u and h(u), respectively, 
using function Slow_scan. Stop as soon as a symbol 
is not matched. At each encountered node of T and 
at the (implicit) node of T corresponding to the last 
successful match, create an a-link to the paired (im- 
plicit) node of T'. Return the pair of nodes in the 
lastly created a-link Mong with the length of the suc- 
cessfully matched prefix of u. 
In the next function, we use function Fast_scan in- 
troduced in Section 3.1, but we run it upward the 
tree (with the obvious modifications). 
function Move_link_up(p,p'): Starting at p and p', 
simultaneously scan the paths to the roots of T and 
T', respectively, using function Fast_scan. Stop as 
soon as a node of T is encountered that already ha.s 
an a-link. At each encountered node of T create an 
a-link to the paired (implicit) node of T'. 
We also need a function that "shifts" a-links to a new 
pair of aligned paths. This is done using s-links. The 
next auxiliary function takes care of those (implicit) 
nodes for which the s-link is missing. (This is the 
case for implicit nodes of T ~ and for some nodes of 
T that have been newly created.) We rest on the 
property that the parent node of any such (implicit) 
node always has an s-link, when it differs from the 
root. 
function Up_link_down(p): If s-link(p) is defined 
then return s-link(p). Else, let pl = parent(p). If 
Pl is not the root node, let P2 = s-link(p1) and 
return (implicit) node FasLscan(p2,1abel(pl,p)). 
If Pl is the root node, return (implicit) node 
Fast_scan(p1, sufflZab~l(p~,p )i_ l ( label(pl , p) ). 
function Shifl_link(p,p'): Pl = Up_link_down(p), 
P'I = Up_linLdown(p'). Return (Pl,P~). 
We can now present the algorithm for the con- 
struction of suffix tree alignments. 
Algorithm 1 Let T and T' be the suffix trees for 
strings w and w', respectively: 
( bl~l ' Gl ' d) ,-- Move_link_down(root of T, 
446 
I il ab' I bi I a-linkl 
9 - ac 1, 2 
8 C C -- 
7 e cba 4 
6 ba bac 5 
5 ac aca 6 
4 ca ca 7 
3 a ac -- 
2 c c 8 
1 e $ - 
Figure 3: The table reports the values of 8bi, bi 
and the established a-links at each iteration of Algo- 
rithm 1, when constructing the suffix tree aligmnent 
in Figure 2. To denote a-links we use the same inte- 
ger numbers as in Figure 2. 
root of T', 
for i from \]w I - 1 downto 1 do begin 
(sbi, sb;) ~-- Shift_link(bi+l, b;+l) 
M ove_link_up( sbi , sb~ ) 
( b, , dd) Move_link_do n( ab, , abe, 
s  wl-d(w)) 
d.--d+dd 
end 
end 
In Figure 3 a sample run of Algorithm 1 is schemat- 
ically represented. 
In the next section we use the following properties 
of Algorithm 1: 
• after T and T' have been processed, for every 
node p ofT representing factor u of w, (implicit) 
node a-link(p) of T ~ is defined if and only if 
a-link(p) represents factor h(u) of w'; 
• the algorithm can be executed in time 
O(Iwl + Iw'l). 
The first property above can be proved as follows. 
For 1 < i < Iwl, bi in Algorithm 1 is (the node 
representing) the longest prefix of suffi(w ) such that 
h(bi) is an (implicit) node of T' (is a factor of w'). 
This can be proved by induction on \[w I -i, using the 
definition of Move_link_down and of s-link. We then 
observe that, if u is a node of T, then factor u is a 
prefix of some suffi(w ) and either u dominates bi or 
bi properly dominates u in T. If u dominates bi, then 
• h(u) must be an (implicit) node ofT'. In this case an 
a-link is established from u to h(u) by Move_link_up 
or Move_link_down, depending on whether u dom- 
inates or is dominated by sbi in T. If bi properly 
dominates u, h(u) does not occur in w'. In this 
case, node u is never reached by the algorithm and 
no a-link is established for this node. 
The proof of the linear time result is rather long, 
we only give an outline here. The interesting case 
is the function Shift_link, which is executed Iwl- 1 
times by the algorithm. When executed once on 
nodes p and if, Shift_link uses time 0(1) if s-link(p) 
and s-link(p ~) are both defined. In all other cases, 
it uses an amount of time proportional to the num- 
ber of (implicit) nodes visited by function FasLscan, 
which is called through function Up_link_down. We 
use an amortization technique and charge a constant 
amount of time to the symbols in w and w', for each 
node visited in this way. Consider the execution of 
Shifl_link(bi+l, b~+l) for some i, 1 < i < Iw\[- 1. As- 
sume that, correspondingly, Fast_scan visits nodes 
ul,...,Ud of T in this order, with d __ 1 and each 
uj some factor of w. Then we have that each uj is 
a (proper) prefix of uj+l, and Ud = sbi. For each 
u j, 1 < j _< d- 1, we charge a constant amount of 
time to the symbol in w "corresponding" to the last 
symbol of uj. The visit to Ud, on the other hand, is 
charged to the ith symbol of w. (Note that charging 
the visit to ud to the symbol in w "corresponding" 
to the last symbol of Ud does not work, since in the 
case of sbi ---" bi the same symbol would be charged 
again at the next iteration of the for-cycle.) It is not 
difficult to see that, in this way, each symbol of w is 
charged at most once. A similar argument works for 
visits to nodes of T' by Fast_scan, which are charged 
to symbols of u?. This shows that the time used by 
all executions of Shift_link is 0(Iwl + Iw'l). 
Suffix trees and suffix tree alignments can be gen- 
eralized to finite multi-sets of strings, each string 
ending with the same end-marker not found at any 
other position. In this case each leaf holds a record, 
called count, of the number of times the correspond- 
ing suffix appears in the entire multi-set, which will 
be propagated appropriately when computing factor 
statistic. Most important here, all of the above re- 
sults still hold for these generalizations. In the next 
section, we will deal with the multi-set case. 
4 Transformation learning 
This section deals with the computational problem 
of learning string transformations from an aligned 
corpus. We show that some families of transforma- 
tions can be efficiently learned exploiting the data 
structures of Section 3. We also consider more gen- 
eral kinds of transformations and show that for this 
class the learning problem is NP-hard. 
4.1 Data representation 
We introduce a representation of aligned corpora 
that reduces the problem of computing the pos- 
itive/negative evidence of transformations to the 
problem of computing factor statistics. 
Let (w, w') be an aligned pair, w = al.. "an and 
w'=a'l...a,~; withaiEEforl<i<n, andn>_ 1. 
We define 
w×w' = (al,a~)."(a~,a~). (1) 
447 
Note that w x w ~ is a string over the new al- 
phabet E x E. Let N > 1 and let L = 
{(wl, w~),..., (Wg, W~v)} be an aligned corpus. We 
represent L as a string multi-set over alphabet E x E: 
L× = {w x w' I (w,w') E L}, (2) 
where w x w ~ appears in Lx as many times as (w, w ~) 
appears in L. 
4.2 Learning algorithms 
Let L be an aligned corpus with N aligned pairs over 
a fixed alphabet E, and let n be the length of the 
longest string in a pair in L. We start by considering 
plain transformations of the form 
u --* v, (3) 
where u, v E E +, lul = Ivl, We want to find all in- 
stances of strings u, v E E* such that, in L, u ~ v 
has score greater or equal than the score of any other 
transformation. Existing methods for this problem 
are data-driven. They consider all pairs of factors 
(with lengths bounded by n) occurring at aligned 
positions within some pair in L, and update the 
positive and the negative evidence of the associated 
transformations. They thus consider O(Nn 2) fac- 
tor pairs, where each pair takes time O(n) to be 
read/stored. We conclude that these methods use 
an amount of time O(Nn3). We can improve on 
this by using suffix tree alignments. 
Let Lx be defined as in (2) and let hi : (E x E) 
(E x E) be the homomorphism specified as: 
h~((a,b)) = (a,a). 
Recall that, each suffix of a multi-set of strings is 
represented by a leaf in the associated suffix-tree, 
because of the use of the end-marker, and that each 
leaf stores the count of the occurrences of the corre- 
sponding suffix in the source multi-set. We schemat- 
ically specify our first learning algorithm below. 
Algorithm 2 
Step 1: construct two copies Tx and T x of the suf- 
fix tree associated with L× and align them using hi; 
Step 2: visit trees T× and T~ in post-order, and an- 
notate each node p with the number e(p) computed 
as the sum of the counts at leaves that p dominates; 
Step 3: annotate each node p of T× with the score 
e(p) - e(p'), where p' = a-link(p) if a-link(p) is an 
actual node, p~ is the node immediately dominated 
by a-link(p) if a-link(p) is an implicit node, and 
e(p ~) = 0 if a-link(p) is undefined; make a list of 
the nodes with the highest annotated score. 
Let p be a node of Tx associated with factor u x v. 
Integer e(p) computed at Step 2 is the number of 
times a suffix having u x v as a prefix appears in 
strings in Lx. Thus e(p) is the number of differ- 
ent positions at which factors u and v are aligned 
within Lx and hence the positive evidence of trans- 
formation u --~ v w.r.t. L, as defined in Section 2. 
Similarly, e(#) is the statistic of factor u >< u and 
hence the negative evidence of u --+ v (as well as the 
negative evidence of all transformations having u as 
left-hand side). It follows that Algorithm 2 records, 
at Step 3, the transformations having the highest 
score in L among all transformations represented by 
nodes of Tx. It is not difficult to see that the re- 
maining transformations, denoted by implicit nodes 
of Tx, do not have score greater than the one above. 
The latter transformations with highest score, if any, 
can be easily recovered by visiting the implicit nodes 
that immediately dominate the nodes of Tx recorded 
at Step 3. 
A complexity analysis of Algorithm 2 is straight- 
forward. Step 1 can be executed in time O(Nn), as 
discussed in Section 3. Since the size of Tx and T~< 
is O(Nn)~ all other steps can be easily executed in 
linear time. Hence Algorithm 2 runs in time O(Nn). 
We now turn to a more general kind of transfor- 
mations. In several natural language processing ap- 
plications it is useful to generalize over some trans- 
formations of the form in (3), by using classes of 
symbols in E. Let t > 1 and let C1, •.., Ct be a par- 
tition of E (each Ci ~-O). Consider F = {C1,..., Ct} 
as an alphabet. We say that string al...ad E ~+ 
matches string Ci,...Cid E F + if ak E Cik for 
1 < k < d. We define transformations 1 
u 7 -* v--, (4) 
u, v E E +, lut = Ivt, 7 E F +, and assume the follow- 
ing interpretation. An occurrence of string u must 
be rewritten to v in a text whenever u is followed 
by a substring matching 7. String 7 is called the 
right context of the transformation. The positive 
evidence for such transformation is the number of 
positions at which factors ux and vx ~ are aligned 
within the corpus, for all possible x, x ~ E E + with 
x matching 7. (We do not require x = x', since 
later transformations can change the right context.) 
The negative evidence for the transformation is the 
number of positions at which factors ux and ux ~ are 
aligned within the corpus, x, x ¢ as above. 
We are not aware of any learning method for 
transformations of the form in (4). A naive method 
for this task would consider all factor pairs appear- 
ing at aligned positions in some pair in L. The left 
component of each factor must then be split into 
a string in E + and a string in F +, to represent a 
transformation in the desired form. Overall, there 
are O(Nn 3) possible transformations, and we need 
time O(n) to read/store each transformation. Then 
the method uses an amount of time O(Nn4). Again, 
we can improve on this. We need a representa- 
tion for right context strings. Define homomorphism 
h2 :(E X E)---+ F as 
h~((a,~)) = C, a~C. 
1In generative phonology (4) is usually written as u ---+ 
v / _ 7. Our notation can more easily be generalized, as 
it is needed in some transformation systems. 
448 
(h2 is well defined since r is a partition of E.) Let 
also 
Lr = {h2(w x w') I w x w' e Lx}, 
where h2(w x w') appears in Lr as many times as 
w x w' appears in L x. 
uxv 
/ \ ~q 
Figure 4: At Step 3 of Algorithm 3, triple (q, e, e') 
is inserted in v(p) if the relations depicted above are 
realized, where dashed arrows denote a-links, black 
circles denote nodes, and white circles denote nodes 
that might be implicit. Integer e > 0 is a count of 
the paths from node q downward, having the form 
y x y' with a prefix of y matching 7. Similarly, e ~ is a 
count of the paths from node q~ downward satisfying 
the same matching condition with 7. The matching 
condition is enforced by the fact that the above paths 
have their ending leaf nodes a-linked to a leaf node 
of Tr dominated by node p. 
Below we link a suffix-tree to more than one suffix- 
tree. In the notation of a-links we then use a sub- 
script indicating the suffix tree of the target node, 
in order to distinguish among different linkings. We 
now schematically specify the learning algorithm; 
additional computational details will be provided 
later in the discussion of the complexity. 
Algorithm 3 
Step 1: construct two copies Tx and T~ of the suffix 
tree associated with L× and construct the suffix tree 
Tr associated with Lr; 
Step 2: align Tx with T" using hi and align the 
resulting suffix trees Tx and T~ with Tr using h~; 
Step 3: for each node p of Tr, store a set v(p) 
including all triples (q, e, e') such that (see Figure 4): 
• q is a node of Tx such that a-linkTr(q) properly 
dominates p 
• e > 0 is the sum of the counts at leaves of Tx 
dominated by q that have an a-link to a leaf of 
Tr dominated by p 
• if ql = a_linkT, x (q) is defined, e' is the sum of 
the counts at leaves of T x dominated by q' that 
have an a-link to a leaf of Tr dominated by p; 
otherwise, e ~ = 0; 
Step 4: find all pairs (p,q), p a node of Tr and 
(q, e, e') E v(p), such that e - e ~ is greater than or 
equal to any other el - e~, (ql, el, el) in some r(pl). 
We next show that if pair (p, q) is found at Step 4, 
then q represents a factor u x v, p represents a factor 
h2(u x v)7, and transformation u7 ~ v -- has the 
highest score among all transformations represented 
by nodes of Tx and Tr. Similarly to the case of Al- 
gorithm 2, this is the highest score achieved in L, 
and other transformations with the same score can 
be obtained from some of the implicit nodes imme- 
diately dominating p and q. 
Let p aid q be defined as in Step 3 above. Assume 
that q represents a factor u x v of some string in L× 
and p represents a factor 87 E F* of some string in 
Lr, where \[81 = lul. Since a-linkTr(q) dominates 
p, we must have h2(u x v) = 8. Consider a suffix 
(u x v)(z x x')(y x y') appearing in ~ > 0 strings 
in Lx, such that h2(x x x') = 7. (This means that 
x matches 7, and there are at least ~ positions at 
which u --+ v has been applied with a right-context of 
%) We have that string h2((u x v)(x x x')(y x y')) = 
&Th2(y x y') must be a suffix of some strings in Lr. 
It follows that (u x v)(x x z')(y x y') is a leaf of 
Tx with a count of ~, ~Th2(y x y') is a leaf of Tr, 
and there is an a-link between these two nodes. Leaf 
(u x v)(x x z')(y × y') is dominated by q, and leaf 
&Th2(y x y') is dominated by p. Then, at Step 3, 
integer ~ is added to e. Since no condition has been 
imposed above on string x' and on suffix (y x y'), we 
conclude that the final value ofe must be the positive 
evidence of transformation u7 --+ v --. A similar 
argument shows that the negative evidence of this 
transformation is stored in e'. It then follows that, at 
Step 4, Algorithm 3 finds the transformations with 
the highest score among those represented by nodes 
of Tx and Tr. 
Algorithm 3 can be executed in time O(Nn2). We 
only outline a proof of this property here, by fo- 
cusing on Step 3. To execute this step we visit Tr 
in post order. At leaf node p, we consider the set 
F(p) of all leaves q of Tx such that p = a-linkT× (q), 
and the set F~(p) of all leaves q~ of T~ such that 
p = a-linkTx (q'). For each (implicit) node of T" 
that dominates some node in F~(p) and that is 
the target of some a-link (from some source node 
of Tx), we record the sum of the counts of the 
dominated nodes in Fl(p). This can be done in 
time O(IF'(p)l n). For each node q of Tx dominat- 
ing some node in F(p), we store in v(p) the triple 
(q,e, e'), since a-linkTr(q) necessarily dominates p. 
We let e > 0 be the sum of the counts of the dom- 
inated nodes in F(p), and let e' be the value re- 
trieved from the a-link to T', if any. This takes 
time O(IF(P)l n). When p ranges over the leaves of 
449 
Tr, we have ~-~p IF(p)I = EC, IF'(p)I = O(Nn). We 
then conclude that sets r(p) for all leaves p of Tr 
can be computed in time O(Nn2). At internal node 
p with children Pi, 1 < i _< d, d > 1, we assume 
that sets r(pi)'s have already been computed. As- 
sume that for some i we have (q, ei, e~) E r(pl) and 
a-linkTr(q) does not immediately dominate Pi. If 
' to e, respectively; (q, e, e') E r(p), we add ei, e i e', 
otherwise, we insert (q, el, e{) in r(p). We can then 
compute sets r(p) for all internal nodes p of Tr using 
an amount of time }-'~p Ir(p)t = O(Nn=). 
4.3 General transformations 
We have mentioned that the introduction of classes 
of alphabet symbols allows abstraction over plain 
transformations that is of interest to natural lan- 
guage applications. We generalize here transforma- 
tions in (47 by letting 7 be a string over E U F. More 
precisely, we assume 7 has the form: 
"1 = uo~iul..-u~-i~'a~, (5) 
where u0,ud E ~*, ui E ~+ and ~j E F + for 1 _ 
i_< d-1 and l_<j_< d, and d>_ 1. The notion of 
matching previously defined is now extended in such 
a way that, for a, b E P,, a matches b if a = b. Then 
the interpretation of the resulting transformation is 
the usual one. The parameter d in (5) is called the 
number of alternations of the transformation. We 
have established the following results: 
• transformations with a bounded number of al- 
ternations can be learned in polynomial time; 
• learning transformations with an unbounded 
number of alternations is NP-hard. 
Again, we only give an outline of the proof below. 
The first result is easy to show, by observing that 
in an aligned corpus there are polynomially many 
occurrences of transformations with a bounded num- 
ber of alternations. The second result holds even if 
we restrict ourselves to IEI = 2 and Irl = 1, that is 
if we use a don~t care symbol. Here we introduce 
a decision problem associated with the optimiza- 
tion problem of learning the transformations with 
the highest, score, and outline an NP-completeness 
proof. 
TRANSFORMATION SCORING (TS) 
Instance: (L,K), with L an aligned corpus, K a 
positive integer. 
Question: Is there a transformation that has score 
greater than or equal to K w.r.t. L? 
Membership in NP is easy to establish for TS. To 
show NP-hardness, we consider the CLIQUE de- 
cision problem for undirected, simple, connected 
graphs and transform such a problem to the TS 
problem. (The NP-completeness for .the used restric- 
tion of the CLIQUE problem (Garey and Johnson, 
1979) is easy to establish.) Let (G,K') be an in- 
stance of the CLIQUE problem as above, G = (V, E) 
and K' > 0. Without loss of generality, we assume 
that V = {1,2,...,q}. Let E = {a,b}; we construct 
an instance of the TS problem (L, K} over E as fol- 
lows. For each {i, j} E V with i < j let 
wi,j = ai-lbaJ-i-lba q-j. (6) 
We add to the aligned corpus L: 
1. one instance of pair Pi,j = (awl j, bwi,j) for each 
i < j, {i,j} E E; 
2. q2 instances of pair Pi,j = (awi,j, awi,j) for each 
i,j E Y with i < j and {i,j} ~ E; 
3. q2 instances of pair Pa = (aaa, ban). 
Also, we set K = q2 + (~'). The above instance of 
TS can easily be constructed in polynomial deter- 
ministic time with respect to the length of (G, K'}. 
It is easy to show that when (G, K') is a positive 
instance of the source problem, then the correspond- 
ing instance of TS is satisfied by at least one trans- 
formation. Assume now that there exists a trans- 
formation r having score greater equal than K > 0, 
w.r.t.L. Since the replacement of a with b is the 
only rewriting that appears in pairs of L, r must 
have the form a7 --+ b --. If 7 includes some occur- 
rence of b, then r cannot match Pa and the positive 
evidence of r will not exceed IEI < (3) < K, con- 
trary to our assumption. We then conclude that 7 
has the form (? denotes the don't care symbol): 
aJl-l?aJ~-ji-1 ? ...?.aq'-Ja, 
where V" = {ji,...,Jd} C_ V, d > 0 and q' < q. If 
there exists i, j E V" such that {-i, j} ~ E, then r 
would match some pair Pi,j E L and it would have 
negative evidence smaller or equal than q2. Since 
the positive evidence of r cannot exceed q2 + IEI, 
r would have a score not exceeding IEI < (q) < If, 
contrary to our assumption. Then r matches no pair 
Pij E L and, for each i,j E V", we have {i,j} E E. 
= K' (K') Since K - q2 ( 2 ), at least pairs Pi,j E L are 
matched by r. We therefore conclude that d > K' 
and that V" is a clique in G of size greater equal 
than K'. This concludes our outline of the proof. 
5 Concluding remarks 
With some minor technical changes to function 
Up_link_down, we can align a suffix tree with itself 
(w.r.t. a given homomorphism). In this way we 
improve space performance of Algorithms 2 and 3, 
avoiding the construction of two copies of the same 
suffix tree. Algorithm 3 can trivially be adapted to 
learn transformations in (4) where a left context is 
specified in place of a right context. The algorithm 
can also be used to learn traditional phonological 
rules of the form a --* b / _7, where a,b are sin- 
gle phonemes and "/is a sequence over {C, V}, the 
classes of consonants and vowels. In this case the 
450 
algorithm runs in time O(Nn) (for fixed alphabet). 
We leave it as an open problem whether rules of the 
form in (4) can be learned in linear time. 
We have been concerned with learning the best 
transformations that should be applied at a given 
step. An ordered sequence of transformations can 
be learned by iteratively learning a single transfor- 
mation and by processing the aligned corpus with 
the transformation just learned (Brill, 1995). Dy- 
namic techniques for processing the aligned corpus 
were first proposed in (Ramshaw and Marcus, 1996) 
to re-edit the corpus only where needed. Those au- 
thors report that this is not space efficient if trans- 
formation learning is done by independently test- 
ing all possible transformations in the search space 
(as in (Brill, 1995)). The suffix tree alignment data 
structure allows simultaneous scoring for all trans- 
formations. We can now take advantage of this and 
design dynamical algorithms that re-edit a suffix tree 
alignment only where needed, on the line of a similar 
method for suffix trees in (McCreight, 1976). 
An alternative data structure to suffix trees for 
the representations of string factors, called DAWG, 
has been presented in (Blumer et al., 1985). We 
point out here that, because a DAWG is an acyclic 
graph rather than a tree, straightforward ways of 
defining alignment between two DAWGs results in a 
quadratic number of a-links, making DAWGs much 
less attractive than suffix trees for factor alignment. 
We believe that suffix tree alignments are a very flex- 
ible data structure, and that other transformations 
could be efficiently learned using these structures. 
We do not regard the result in Section 4.3 as a neg- 
ative one, since general transformations specified as 
in (5) seem too powerful for the proposed applica- 
tions in natural language processing, and learning 
might result in corpus overtraining. 
Other than transformation based systems the 
methods presented in this paper can be used for 
learning rules of constraint grammars (Karlsson et 
al., 1995), phonological rule systems as in (Kaplan 
and Kay, 1994), and in general those grammatical 
systems using constraints represented by means of 
rewriting rules. This is the case whenever we can 
encode the alphabet of the corpus in such a way 
that alignment is possible. 
Acknowledgements 
Part of the present research was done while the first 
author was visiting the Center for Language and 
Speech Processing, Johns Hopkins University, Bal- 
timore, MD. The second author is a member of the 
Center for Language and Speech Processing. This 
work was funded in part by NSF grant IRI-9502312. 
The authors are indebted to Eric Brill for technical 
discussions on topics related to this paper. 

References 
Apostolico, A. 1985. The myriad virtues of suf- 
fix trees. In A. Apostolico and Z. Galil, editors, 
Combinatorial Algorithms on Words, volume 12. 
Springer-Verlag, Berlin, Germany, pages 85-96. 
NATO Advanced Science Institutes, Seires F. 
Blumer, A., J. Blumer, D. Haussler, A. Ehrenfeucht, 
M. Chen, and J. Seiferas. 1985. The smallest au- 
tomaton recognizing the subwords of a text. The- 
oretical Computer Science, 40:31-55. 
Brill, E. 1995. Transformation-based error-driven 
learning and natural language processing: A case 
study in part of speech tagging. Computational 
Linguistics. 
Crochemore, M. and W. Rytter. 1994. Text Algo- 
rithms. Oxford University Press, Oxford, UK. 
Garey, M. R. and D. S. Johnson. 1979. Computers 
and Intractability. Freeman and Co., New York, 
NY. 
Kaplan, R. M. and M. Kay. 1994. Regular models 
of phonological rule sistems. Computational Lin- 
guistics, 20(3):331-378. 
Karlsson, F., A. Voutilainen, J. Heikkil~, and 
A. Anttila. 1995. Constraint Grammar. A 
Language Independent System for Parsing Unre- 
stricted Text. Mouton de Gruyter. 
McCreight, E. M. 1976. A space-economical suffix 
tree construction algorithm. Journal of the Asso- 
ciation for Computing Machinery, 23(2):262-272. 
Ramshaw, L. and M. P. Marcus. 1996. Explor- 
ing the nature of transformation-based learning. 
In J. Klavans and P. Resnik, editors, The Bal- 
ancing Act--Combining Symbolic and Statistical 
Approaches to Language. The MIT Press, Cam- 
bridge, MA, pages 135-156. 
Weiner, P. 1973. Linear pattern-matching algo- 
rithms. In Proceedings of the i4th IEEE Annual 
Symposium on Switching and Automata Theory, 
pages 1-11, New York, NY. Institute of Electrical 
and Electronics Engineers. 
