A Formalism for Universal Segmentation of Text 
Julien Quint 
GETA-CLIPS-IMAG, BP 53, F-38041 Grenoble Cedex 9, France 
Xerox Research Centre Europe, 6, chemin de Maupertuis, F-38240 Meylan, France 
e-mail: julien.quint@imag.fr
Abstract 
Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called "universal" as the formalism itself is independent of the language of the documents to process and independent of the levels of segmentation (e.g. words, sentences, paragraphs, morphemes...) considered by the target application. This framework relies on a layered structure representing the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.
Introduction 
Tokenization, or word segmentation, is a fundamental task of almost all NLP systems. In languages that use word separators in their writing, tokenization seems easy: every sequence of characters between two whitespaces or punctuation marks is a word. This works reasonably well, but exceptions are handled in a cumbersome way. On the other hand, there are languages that do not use word separators. A much more complicated processing is needed, closer to morphological analysis or part-of-speech tagging. Tokenizers designed for those languages are generally very tied to a given system and language.
However, the gap becomes smaller when we look at sentence segmentation: a simplistic approach would not be sufficient because of the ambiguity of punctuation signs. And if we consider the segmentation of a document into higher-level units such as paragraphs, sections, and so on, we can notice that language becomes less relevant.
These observations lead to the definition of our formalism for segmentation (not just tokenization) that considers the process independently from the language. By describing a segmentation system formally, a clean distinction can be made between the processing itself and the linguistic data it uses. This entails the ability to develop a truly multilingual system by using a common segmentation engine for the various languages of the system; conversely, one can imagine evaluating several segmentation methods by using the same set of data with different strategies.
Sumo is the name of the proposed formalism, evolving from initial work by (Quint, 1999; Quint, 2000). Some theoretical works from the literature also support this approach: (Guo, 1997) shows that some segmentation techniques can be generalized to any language, regardless of its writing system. The sentence segmenter of (Palmer and Hearst, 1997) and the issues raised by (Habert et al., 1998) prove that even in English or French, segmentation is not so trivial. Lastly, (Aït-Mokhtar, 1997) handles all kinds of presyntactic processing in one step, arguing that there are strong interactions between segmentation and morphology.
1 The Framework for Segmentation 
1.1 Overview 
The framework revolves around the document representation chosen for Sumo, which is a layered structure, each layer being a view of the document at a given level of segmentation. These layers are introduced by the author of the segmentation application as needed and are not imposed by Sumo. The example in section 3.1 uses a two-layer structure (figure 4) corresponding to two levels of segmentation, characters and words. To extend this to a sentence segmenter, a third level for sentences is added.
These levels of segmentation can have a linguistic or structural nature, but "artificial" levels can be introduced as well when needed. It is also interesting to note that several layers can belong to the same level. In the example of section 3.3, the result structure can have an indefinite number of levels, and all levels are of the same kind.
We call item the segmentation unit of a document at a given segmentation level (e.g. items of the word level are words). The document is then represented at every segmentation level in terms of its items; because segmentation is usually ambiguous, item graphs are used to factorize all the possible segmentations. Ambiguity issues are further addressed in section 2.3.
The main processing paradigms of Sumo are identification and transformation. With identification, new item graphs are built by identifying items from a source graph using a segmentation resource. These graphs are then modified by transformation processes. Section 2 gives the details about both identification and transformation.
1.2 Item Graphs
The item graphs are directed acyclic graphs; they are similar to the word graphs of (Amtrup et al., 1997) or the string graphs of (Colmerauer, 1970). They are actually represented by means of finite-state automata (see section 2.1). In order to facilitate their manipulation, two additional properties are enforced: these automata always have a single start state and a single final state, and no dangling arcs (this is verified by pruning the automata after modifications). The examples of section 3 show various item graphs.
An item is an arc in the automaton. An arc is a complex structure containing a label (generally the surface form of the item), named attributes and relations. Attributes are used to hold information on the item, like part of speech tags (see section 3.2). These attributes can also be viewed as annotations in the same sense as the annotation graphs of (Bird et al., 2000).
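To make the structure concrete, here is a minimal sketch of an item graph in Python (an illustration only, not part of Sumo; the names Item, ItemGraph and their fields are hypothetical):

    # Minimal sketch of an item graph: a DAG whose arcs ("items") carry a
    # label and named attributes. All names are illustrative.
    class Item:
        def __init__(self, label, source, dest, **attributes):
            self.label = label            # surface or base form of the item
            self.source = source          # source node
            self.dest = dest              # destination node
            self.attributes = attributes  # e.g. CAT="N", COST=12

    class ItemGraph:
        def __init__(self, start, end):
            self.start = start            # single start node
            self.end = end                # single final node
            self.items = []               # arcs of the automaton

        def add_item(self, label, source, dest, **attributes):
            item = Item(label, source, dest, **attributes)
            self.items.append(item)
            return item

        def outgoing(self, node):
            # items leaving a given node
            return [it for it in self.items if it.source == node]

    # A two-item path for the characters of the French word "du":
    chars = ItemGraph(start=0, end=2)
    chars.add_item("d", 0, 1)
    chars.add_item("u", 1, 2)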
1.3 Relations 
Relations are links between levels. Items from a given graph are linked to items of the graph from which they were identified. We call the first graph the lower graph and the graph that was the source for the identification the upper graph. Relations exist between a path in the upper graph and either a path or a subgraph in the lower graph.

Figure 1 illustrates the first kind of relation, called path relation. This example in French is a relation between the two characters of the word "du", which is really a contraction of "de le".
Figure 1: A path relation 
Figure 2 illustrates the other kind of relation, called subgraph relation. In this example the sentence ABCDEFG (we can imagine that A through G are Chinese characters) is related to several possible segmentations.
Figure 2: A graph relation
The interested reader may refer to (Planas, 1998) for a comparable structure (multiple layers of a document and relations) used in translation memory.
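As an illustration of the path relation of figure 1, the following sketch (reusing the hypothetical ItemGraph class from the previous sketch) links the character-level path "d u" of the upper graph to the word-level path "de le" identified from it:

    # Sketch of a path relation: the source (upper) character path "d u" is
    # related to the identified (lower) word path "de le" of figure 1.
    class PathRelation:
        def __init__(self, upper_items, lower_items):
            self.upper_items = upper_items   # path in the upper (source) graph
            self.lower_items = lower_items   # path in the lower (identified) graph

    chars = ItemGraph(start=0, end=2)
    d = chars.add_item("d", 0, 1)
    u = chars.add_item("u", 1, 2)

    words = ItemGraph(start=0, end=2)
    de = words.add_item("de", 0, 1)
    le = words.add_item("le", 1, 2)

    relation = PathRelation(upper_items=[d, u], lower_items=[de, le])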
2 Processing a Document 
2.1 Description of a Document
The core of the document representation is the 
item graph, which is represented by a finite- 
state automaton. Since regular expressions de- 
fine finite-state automata, they can be used to 
describe an item graph. However, our expressions are extended because the items are more complex than simple symbols; new operators are introduced:
• attributes are introduced by an @ sign;
• path relations are delimited by { and };
• the information concerning a given item is parenthesized using [ and ].
As an example, the relation of figure 1 is described by the following expression:

    [ de le { d u } ]
2.2 Identification 
Identification is the process of identifying new items from a source graph. Using the source graph and a segmentation resource, new items are built to form a new graph. A segmentation resource, or simply resource, describes the vocabulary of the language, by defining a mapping between the source and the target level of segmentation. A resource is represented by a finite-state transducer in Sumo; identification is performed by applying the transducer to the source automaton to produce the target automaton, as in regular finite-state calculus.
Resources can be compiled from regular expressions or identification rules. In the former case, one can use the usual operations of finite-state calculus to compile the resource: union, intersection, composition, etc.¹ A benefit of the use of Sumo structures to represent resources is that new resources can be built easily from the document that is being processed. (Quint, 1999) shows how to extract proper nouns from a text in order to extend the lexicon used by the segmenter to provide more accurate results.
In the latter case, rules are specified as shown in section 3.3. The left hand side of a rule describes a subpath in the source graph, while the right hand side describes the associated subpath in the target graph. A path relation is created between the two sequences of items. In an identification rule, one can introduce variables (for callback), and even calls to transformation functions (see next section). Naturally, these possibilities cannot be expressed by a strict finite-state structure, even with our extended formalism; hence, calculus with the resulting structures is limited.
A special kind of identification is the auto- 
matic segmentation that takes place at the entry 
point of the process. A character graph can be 
created automatically by segmenting an input 
text document, knowing its encoding. This text 
document can be in raw form or XML format. 
Another possibility for input is to use a graph of items that was created previously, either by Sumo, or converted to the format recognized by Sumo.

¹ The semantics of these operations is broadened to accommodate the more complex nature of the items.
2.3 Transformation 
Ambiguity is a central issue when talking about segmentation. The absence or ambiguity of word separators can lead to multiple segmentations, and more than one of them can have a meaning. As (Sproat et al., 1996) testify, several native Chinese speakers do not always agree on one unique tokenization for a given sentence.

Thanks to the use of item graphs, Sumo can handle ambiguity efficiently. Why try to fully disambiguate a tokenization when there is no agreement on a single best solution? Moreover, segmentation is usually just a basic step of processing in an NLP system, and some decisions may need more information than what a segmenter is able to provide. An uninformed choice at this stage can affect the next stages in a negative way. Transformations are a way to modify the item graphs so that the "good" paths (segmentations) can be kept and the "bad" ones discarded. We can also of course provide full disambiguation (see section 3.1 for instance) by means of transformations.
In Sumo transformations are handled by transformation functions that manipulate the objects of the formalism: graphs, nodes, items, paths (a special kind of graph), etc. These functions are written using an imperative language illustrated in section 3.1. A transformation can either be applied directly to a graph or attached to a graph relation. In the latter case, the original graph is not modified, and its transformed counterpart is only accessible through the relation.

Transformation functions make it possible to control the flow of the process, using loops and conditionals. An important implication is that the same resource can be applied iteratively; as shown by (Roche, 1994) this feature makes it possible to implement segmentation models much more powerful than simple regular languages (see section 3.3 for an example). Another consequence is that a Sumo application consists of one big transformation function returning the completed Sumo structure as a result.
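As a small illustration of a transformation (again using the hypothetical classes sketched in section 1.2, not actual Sumo code), the function below takes an item graph and a selection strategy and returns a new graph containing only the selected path, i.e. a fully disambiguated segmentation:

    # Sketch of a transformation: prune an item graph down to one path chosen
    # by a selection strategy (e.g. one of the heuristics of section 3.1).
    def keep_single_path(graph, select_path):
        path = select_path(graph)            # a list of items forming one path
        pruned = ItemGraph(graph.start, graph.end)
        for it in path:
            pruned.add_item(it.label, it.source, it.dest, **it.attributes)
        return pruned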
3 Examples of Use 
3.1 Maximum Tokenization
Some classic heuristics for tokenization are classified by (Guo, 1997) under the collective moniker of maximum tokenization. This section describes how to implement a "maximum tokenizer" that tokenizes raw text documents in a given language and character encoding (e.g. English in ASCII, French in ISO-Latin-1, Chinese in Big5 or GB).
3.1.1 Common set-up
Our tokenizer is built with two levels: the input level is the character level, automatically segmented using the encoding information. The token level is built from these characters, first by an exhaustive identification of the tokens, then by reducing the number of paths to the one considered the best by the Maximum Tokenization heuristic.

The system works in three steps, with complete code shown in figure 3. First, the character level is created by automatic segmentation (lines 1-5, input level being the special graph that is automatically created from a raw file through stdin). The second step is to create the word graph by identifying words from characters using a dictionary. A resource called ABCdic is created from a transducer file (lines 6-8), then the graph words is created by identifying items from the source level characters using the resource ABCdic (lines 9-12). The third step is the disambiguation of the word level by applying a Maximum Tokenization heuristic (line 13).
1  characters: input level {
2    encoding: <ASCII, UTF-8, Big5...>;
3    type: raw;
4    from: stdin;
5  }
6  ABCdic: resource {
7    file: "ABCdic.sumo";
8  }
9  words: graph <- identify {
10   source: characters;
11   resource: ABCdic;
12 }
13 words <- ft(words.start-node);
Figure 3: Maximum Tokenizer in Sumo
Figure 4 illustrates the situation for the input string "ABCDEFG" where A through G are characters and A, AB, B, BC, BCDEF, C, CD, D, DE, E, F, FG and G are words found in the resource ABCdic. The situation shown is after line 12 and before line 13.
Figure 4: Exhaustive tokenization of the string ABCDEFG
We will see in the next three subsections the different heuristics and their implementations in Sumo.
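Before turning to the heuristics, here is a rough Python simulation of the exhaustive identification step of figure 4 (an illustration only: the resource ABCdic is modelled as a plain set of words and the token graph as a mapping from start positions to recognized tokens, standing in for the transducer application of section 2.2):

    # Exhaustive identification: every substring found in the dictionary
    # becomes a token of the lattice, keyed by its start position.
    ABCdic = {"A", "AB", "B", "BC", "BCDEF", "C", "CD", "D",
              "DE", "E", "F", "FG", "G"}

    def identify(text, dictionary):
        lattice = {}
        for start in range(len(text)):
            lattice[start] = [text[start:end]
                              for end in range(start + 1, len(text) + 1)
                              if text[start:end] in dictionary]
        return lattice

    words = identify("ABCDEFG", ABCdic)
    # words[0] == ["A", "AB"], words[1] == ["B", "BC", "BCDEF"], ...

The following sketches of ft, bt and st reuse this identify function and the ABCdic set.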
3.1.2 Forward Maximum Tokenization
Forward Maximum Tokenization consists of scanning the string from left to right and selecting the token of maximum length any time an ambiguity occurs. On the example of figure 4, the resulting tokenization of the input string would be AB/CD/E/FG.
Figure 5 shows a function called ft that builds a path recursively by traversing the token graph, appending the longest item to the path at each node. ft takes a node as input and returns a path (line 1). If the node is final, the empty path is returned (lines 2-3), otherwise the array of items of the node (n.items) is searched and the longest item stored in longest (lines 4-10). The returned path consists of this longest item prepended to the longest path starting from the destination node of this item (line 11).
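A rough Python counterpart of ft, written against the position-keyed lattice of the previous sketch rather than a Sumo item graph (it assumes, as in figure 4, that every position can be continued up to the end of the string):

    # Greedy forward scan: at each position take the longest token and jump
    # to the position where it ends.
    def ft(lattice, start=0):
        if start not in lattice or not lattice[start]:
            return []                           # final node: empty path
        longest = max(lattice[start], key=len)  # longest item at this node
        return [longest] + ft(lattice, start + len(longest))

    print(ft(identify("ABCDEFG", ABCdic)))      # ['AB', 'CD', 'E', 'FG']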
3.1.3 Backward Maximum Tokenization
Backward Maximum Tokenization is the same as Forward Maximum Tokenization except that the string is scanned from right to left instead of left to right. On the example of figure 4, the tokenization of the input string would yield A/BC/DE/FG under Backward Maximum Tokenization.

A function bt can be written. It is very similar to ft, except that it works backward by looking at incoming arcs of the considered node. bt is called on the final state of the graph and
1  function ft (n: node) -> path {
2    if final(n) {
3      return ();
4    } else {
5      longest: item <- n.items[1];
6      foreach it in n.items[2..] {
7        if it.length > longest.length {
8          longest <- it;
9        }
10     }
11     return (longest # ft(longest.dest));
12   }
13 }

Figure 5: The ft function
stops when at the initial node. Another implementation of this function is to apply ft on the reversed graph and then reversing the path obtained.
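Following that remark, a minimal string-level version of bt can simply reverse the input and the dictionary entries, run ft, and restore the original orientation (a sketch reusing the functions defined above):

    # Backward maximum tokenization by reversal.
    def bt(text, dictionary):
        reversed_dic = {w[::-1] for w in dictionary}
        path = ft(identify(text[::-1], reversed_dic))
        return [tok[::-1] for tok in reversed(path)]

    print(bt("ABCDEFG", ABCdic))                # ['A', 'BC', 'DE', 'FG']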
3.1.4 Shortest Tokenization 
Shortest Tokenization is concerned with minimizing the overall number of tokens in the text. On the example of figure 4, the tokenization of the input string would yield A/BCDEF/G under Shortest Tokenization.
Figure 6 shows a function called st that finds the shortest path in the graph. This function is adapted from an algorithm for single-source shortest paths discovery in a DAG given by (Cormen et al., 1990). It calls another function, t_sort, returning a list of the nodes of the graph in topological order. The initializations are done in lines 2-6, the core of the algorithm is in the loop of lines 7-14 that computes the shortest path to every node, storing for each node its "predecessor". Lines 15-20 then build the path, which is returned in line 21.
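A rough Python counterpart of st over the same position-keyed lattice: since positions 0..n are already in topological order, a single left-to-right pass with unit weights finds the segmentation with the fewest tokens:

    # Single-source shortest path in the token lattice (every token costs 1).
    def st(text, lattice):
        INF = float("inf")
        n = len(text)
        d = [0] + [INF] * n            # d[i]: fewest tokens needed to reach i
        p = [None] * (n + 1)           # p[i]: (start, token) predecessor
        for start in range(n):         # positions in topological order
            if d[start] == INF:
                continue
            for tok in lattice.get(start, []):
                end = start + len(tok)
                if d[end] > d[start] + 1:
                    d[end] = d[start] + 1
                    p[end] = (start, tok)
        path, pos = [], n              # rebuild the path backwards from the end
        while pos != 0:
            start, tok = p[pos]
            path.insert(0, tok)
            pos = start
        return path

    print(st("ABCDEFG", identify("ABCDEFG", ABCdic)))   # ['A', 'BCDEF', 'G']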
3.1.5 Combination of Maximum 
Tokenization techniques 
One of the features of Sumo is to allow the com- 
parison of different segmentation strategies us- 
ing the same set of data. As we have just seen,
the three strategies described above can indeed 
be compared efficiently by modifying only part 
of the third step of the processing. Letting the 
system run three times on the same set of input 
documents can then give three different sets of 
results to be compared by the author of the sys- 
tem (against each other and against a reference
tokenization, for instance). 
1  function st (g: graph) -> path {
2    d: list <- ();  // distances
3    p: list <- ();  // predecessors
4    foreach n in (g.nodes) {
5      d[n] = integer.max; if (n == g.start) then { d[n] = 0; }  // "infinite", 0 for the start node
6    }
7    foreach n: node in t_sort(g.nodes) {
8      foreach it in n.items {
9        if (d[it.dest] > d[n] + 1) then {
10         d[it.dest] = d[n] + 1;
11         p[it.dest] = n;
12       }
13     }
14   }
15   n <- g.end;  // end state
16   sp: path <- (n);  // path
17   while (n != g.start) {
18     n = p[n];
19     sp = (n # sp);
20   }
21   return sp;
22 }
Figure 6: The st function
And yet a different set-up for our "maximum tokenizer" would be to select not just the optimal path according to one of the heuristics, but the paths selected by the three of them, as shown in figure 7. Combining the three paths into a graph is performed by changing line 13 in figure 3 to:

    words <- ft(words.start-node) |
             bt(words.end-node) |
             st(words);
Figure 7: Three maximum tokenizations 
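With the sketches above, the combined set-up amounts to merging the three selected paths back into one lattice, the analogue of the union written in place of line 13:

    # Merge several tokenizations into a single lattice keyed by start position.
    def union_of_paths(*paths):
        merged = {}
        for path in paths:
            pos = 0
            for tok in path:
                merged.setdefault(pos, set()).add(tok)
                pos += len(tok)
        return merged

    lattice = identify("ABCDEFG", ABCdic)
    combined = union_of_paths(ft(lattice),
                              bt("ABCDEFG", ABCdic),
                              st("ABCDEFG", lattice))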
3.2 Statistical Tokenization and Part of Speech Tagging
This example shows a more complicated tokenization system, using the same sort of set-up as the one from section 3.1, with a disambiguation process using statistics (namely, a bigram model). Our reference for this model is the ChaSen Japanese tokenizer and part of speech tagger documented in (Matsumoto et al., 1999). This example is a high-level description of how to implement a similar system with Sumo.
The set-up for this example adds a new level to the previous example: the "bigram level". The word level is still built by identification using dictionaries, then the bigram level is built by computing a connectivity cost between each pair of tokens. This is the level that will be used for disambiguation or selection of the best solutions.
3.2.1 Exhaustive Segmentation 
All possible segmentations are derived from the character level to create the word level. The resource used for this is a dictionary of the language that maps the surface form of the words (in terms of their characters) to their base form, part of speech, and a cost (ChaSen also adds pronunciation, conjugation type, and semantic information). All this information is stored in the item as attributes, the base form being used as the label for the item. Figure 8 shows the identification of the word "cats" which is identified as "cat", with category "noun" (i.e. @CAT=N) and with some cost k (@COST=k).
cat @CAT=N @COST=k 
Figure 8: Identification of the word "cats"
3.2.2 Statistical Disambiguation 
The disambiguation method relies on a bigram model: each pair of successive items has a "connectivity cost". In the bigram level, the "cost" attribute of an item W will be the connectivity cost of W and a following item X. Note that if the same W can be followed by several items X, Y, etc. with different connectivity costs for each pair, then W will be replicated with a different "cost" attribute. Figure 9 shows a word W followed by either X or Y, with two different connectivity costs.
The implementation of this technique in Sumo is straightforward. Assume there is a function f that, given two items, computes their connectivity cost (depending on both of their categories, individual costs, etc.) and returns the first item
Figure 9: Connectivity costs for W 
with its modified cost. We write the following rule and apply it to the word graph to create the bigram graph:

    _ [$w1 = . @.] _ [$w2 = . @.]
    -> eval(f($w1, $w2))

This rule can be read as: for any word $w1 with any attribute ("." matches any label, "@." any set of attributes) followed by any word $w2 with any attribute ("_" being a context separator), create the item returned by the function f($w1, $w2).
Disambiguation is then performed by selecting the path with optimal cost in this graph; but we can also select all paths with a cost corresponding to a certain threshold, or the n best paths, etc. Note also that this model is easily extensible to any kind of n-grams. A new function f($w1, ..., $wn) must be provided to compute the connectivity cost of this sequence of items, and the above rule must be modified to take a larger context into account.
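The selection of the optimal path can be sketched in Python as a dynamic program over the lattice (an illustration of the idea only, not the ChaSen implementation: word_cost and the connectivity function f are hypothetical stand-ins for the dictionary costs and the bigram costs described above):

    # Cheapest path in a word lattice under word costs plus bigram
    # connectivity costs. States are (position, previous word).
    def best_path(lattice, text_length, word_cost, f):
        best = {(0, None): (0, None)}            # state -> (cost, backpointer)
        for pos in range(text_length):
            for (p, prev), (cost, _) in list(best.items()):
                if p != pos:
                    continue
                for tok in lattice.get(pos, []):
                    state = (pos + len(tok), tok)
                    new = cost + word_cost(tok) + (f(prev, tok) if prev else 0)
                    if state not in best or new < best[state][0]:
                        best[state] = (new, (pos, prev))
        # keep the cheapest state reaching the end of the text, then unwind
        final = min((s for s in best if s[0] == text_length),
                    key=lambda s: best[s][0])
        path, state = [], final
        while state[1] is not None:
            path.insert(0, state[1])
            state = best[state][1]
        return path

Selecting the n best paths or all paths under a cost threshold would simply keep several states per position instead of a single best one.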
3.3 A Formal Example
This last example is more formal and serves as an illustration of some powerful features of Sumo. (Colmerauer, 1970) has a similar example implemented using Q systems. In both cases the goal is to transform an input string of the form a^n b^n c^n, n > 0 into a single item S (assuming that the input alphabet does not contain S), meaning that the input string is a word of this language.
The set-up here is once again to start with a lower level automatically created from the input, then to build intermediate levels until a final level containing only the item S is produced (at which point the input is recognized), or until the process can no longer carry on (at which point the input is rejected).
The building of intermediary levels is handled by the identification rule below:

    # S? a [$A=a*] b [$B=b*] c [$C=c*] #
    -> S $A $B $C
What this rule does is identify a string of the form S?aa*bb*cc*, storing all a's but the first one in the variable $A, all b's but the first one in $B and all c's but the first one in $C. The first triplet abc (with a possible S in front) is then absorbed by S, and the remaining a's, b's and c's are rewritten after S.
Figure 10 illustrates the first application of this rule to the input sequence aabbcc, creating the first intermediate level; subsequent applications of this rule will yield the single item S.
Figure 10: First application of the rule 
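The effect of repeatedly applying this rule can be simulated at the string level (a sketch of the idea, not the Sumo engine): each pass absorbs one a, one b and one c into S and rewrites the remaining symbols after S, and the input is accepted exactly when the process ends on the single item S.

    import re

    # Each application matches S? a a* b b* c c* and keeps the leftover a's,
    # b's and c's after S; the string is in a^n b^n c^n iff we end on "S".
    def recognize(s):
        while True:
            m = re.fullmatch(r"S?a(a*)b(b*)c(c*)", s)
            if m is None:
                return s == "S"
            s = "S" + m.group(1) + m.group(2) + m.group(3)

    print(recognize("aabbcc"))   # True
    print(recognize("aabbc"))    # False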
Conclusion 
We have described the main features of Sumo, a dedicated formalism for segmentation of text. A document is represented by item graphs at different levels of segmentation, which allows multiple segmentations of the same document at the same time. Three detailed examples illustrated the features of Sumo discussed here. For the sake of simplicity some aspects could not be evoked in this paper; they include: management of the segmentation resources, efficiency of the systems written in Sumo, larger applications, and the evaluation of segmentation systems.
Sumo is currently being prototyped by the au- 
thor. 

References 

Salah AÏT-MOKHTAR, "Du texte ASCII au texte lemmatisé : la présyntaxe en une seule étape", in Proceedings of TALN-97, pages 60-69, Grenoble, France, June 1997.

Jan W. AMTRUP, Henrik HEINE and Uwe JEST, What's in a Word Graph: Evaluation and Enhancement of Word Lattices, Verbmobil Report 186, Universität Hamburg, http://www.dfki.de/, December 1997.

Steven BIRD, David DAY, John GAROFOLO, John HENDERSON, Christophe LAPRUN and Mark LIBERMAN, "ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation", in Proceedings of LREC 2000, Athens, Greece, May 2000.

Alain COLMERAUER, Les systèmes Q ou un formalisme pour analyser et synthétiser des phrases sur ordinateur, Publication interne numéro 43, Université de Montréal, 1970.

Thomas H. CORMEN, Charles E. LEISERSON and Ronald L. RIVEST, Introduction to Algorithms, MIT Press, Cambridge, Massachusetts, 1990.

Jin GUO, "Critical Tokenization and its Properties", in Computational Linguistics, 23(4), pages 569-596, December 1997.

B. HABERT, G. ADDA, M. ADDA-DECKER, P. BOULA DE MAREÜIL, S. FERRARI, O. FERRET, G. ILLOUZ and P. PAROUBEK, "Towards Tokenization Evaluation", in Proceedings of LREC-98, pages 427-431, 1998.

Yuji MATSUMOTO, Akira KITAUCHI, Tatsuo YAMASHITA and Yoshitaka HIRANO, Japanese Morphological Analysis System ChaSen version 2.0 Manual, Technical Report NAIST-IS-TR99009, Nara Institute of Science and Technology, Nara, April 1999.

David D. PALMER and Marti A. HEARST, "Adaptive Multilingual Sentence Boundary Disambiguation", in Computational Linguistics, 23(2), pages 241-267, June 1997.

Emmanuel PLANAS, TELA. Structures et algorithmes pour la Traduction Fondée sur la Mémoire, Thèse d'Informatique, Université Joseph Fourier, Grenoble, 1998.

Julien QUINT, "Towards a formalism for language-independent text segmentation", in Proceedings of NLPRS'99, pages 404-408, Beijing, November 1999.

Julien QUINT, "Universal Segmentation of Text with the Sumo Formalism", in Proceedings of NLP 2000, pages 16-26, Patras, Greece, June 2000.

Emmanuel ROCHE, "Two Parsing Algorithms by Means of Finite-State Transducers", in Proceedings of COLING-94, pages 431-435, 1994.

Richard SPROAT, Chilin SHIH, William GALE and Nancy CHANG, "A Stochastic Finite-State Word-Segmentation Algorithm for Chinese", in Computational Linguistics, 22(3), pages 377-404, 1996.
