Language Independent Morphological Analysis 
Yamashita, Tatsuo and Matsumoto, Yuji 
Graduate School of Information Science 
Nara Institute of Science and Technology { 
tatuo-y, matsu} @is. aist-nara, ac.jp 
Abstract 
This paper proposes a framework of  inde- 
pendent morphological analysis and mainly concen- 
trate on tokenization, the first process of morpholog- 
ical analysis. Although tokenization is usually not 
regarded as a difficult task in most segmented lan- 
guages such as English, there are a number of prob- 
lems in achieving precise treatment of lexical entries. 
We first introduce the concept of morpho-fragments, 
which are intermediate units between characters and 
lexical entries. We describe our approach to resolve 
problems arising in tokenization so as to attain a 
 independent morphological analyzer. 
1 Introduction 
The first step in natural  processing is to 
identify words in a sentence. We call this process a 
morphological analysis. Various s exist in 
the world, and strategies for morphological analysis 
differ by types of . Conventionally, mor- 
phological analyzers have been developed in one an- 
alyzer for each  approach. This is a lan- 
guage dependent approach. In contrast, We propose 
a framework of  independent morphologi- 
cal analysis system. We employ one analyzer for 
any  approach. This approach enables a 
rapid implementation of morphological analysis sys- 
tems for new s. 
We define two types of written s: one 
is a segmented , and the other is a non- 
segmented . In non-segmented s 
such as Chinese and Japanese, since words are not 
separated by delimiters such as white spaces, tok- 
enization is a important and difficult task. In seg- 
mented s such as English, since words are 
seemingly separated by white spaces or punctuation 
marks, tokenization is regarded as a relatively easy 
task and little attention has been paid to. Therefore, 
each  dependent morphological analyzer has 
its own strategy for tokenization. We call a string 
defined in the dictionary lexeme. From an algorith- 
mic point of view, tokenization is regarded as the 
process of converting an input stream of characters 
into a stream of lexemes. 
We assume that a morphological analysis consists 
of three processes: tokenization, dictionary look- 
up, and disambiguation. Dictionary look-up gets 
a string and returns a set of lexemes with part-of- 
speech information. This implicitly contains lemma- 
tization. Disambiguation selects the most plausible 
sequence of lexemes by a use of a rule-base model 
or a hidden Markov model (HMM)(Manning and 
Schiitze, 1999). Disambiguation i s already  
independent, since it does not process strings di- 
rectly and therefore will not be taken up. On the 
other hand, tokenization and dictionary look-up are 
 dependent and shall be explained more in 
this paper. 
We consider problems concerning tokenization 
of segmented s in Section 2. To resolve 
these problem, we first apply the method of non- 
segmented s processing to segmented lan- 
guages (Section 3). However, we do not obtain a 
satisfactory result. Then, we introduce the con- 
cept of morpho-fragments to generalize the method 
of non-segmented  processing (Section 4). 
The proposed framework resolves most problems in 
tokenization, and an efficient  independent 
part-of-speech tagging becomes possible. 
2 Problems of Tokenization in 
Segmented Languages 
In segmented s such as English, tokeniza- 
tion is regarded as a relatively easy task and little 
attention has been paid to. When a sentence has 
clear word boundaries, the analyzer just consults 
the dictionary look-up component whether strings 
between delimiters exist in the dictionary. If any 
string exists, the dictionary look-up component re- 
turns the set of possible parts-of-speech. This string 
is known as graphic word which is defined as "a string 
of contiguous alphanumeric characters with space on 
either side; may include hyphens and apostrophes, 
but no other punctuation marks" (Ku~era and Fran- 
cis, 1967). 
Conventionally, in segmented s, an ana- 
lyzer converts a stream of characters into graphic 
words (see the rows labeled "Characters" and 
232 
Sentence Dr. Lee and John's son go to the McDonald's in New York. 
Characters IDIrl. I ILFe\[el lalnldl IJlo\[hJal'\[sl Islolal lglol Nol \]t\[hlel IMlc\]Dlolnlall\]dl'lm\] Iilnl INlelul fflo\]rlkl. I 
Graphic \[Drl I \]Lee I land I IJohn'sl Isonl \]go I Itol Ithel \]McDonald's\[ linl \]Newl IYork\].l Words 
Lexemes \[Dr.\[ ILee I landl \]John\]'s\[ Isonl Igo\] Itol \]thel IScDonald'sl \]inl tNew York\].l 
Morpho- fragments IDr\['\[ ILeel landl \[Johnl'lsl Isonl Igol Itol Ithel \]McDonaldl'\[s\] linl \[Newl IYork\].l 
Figure 1: Decomposition of Sentence in English 
Sentence (He is home for holiday.) 
Characters \[~\[ ~ \[z~k\[ ~t "C" I')~ I~" I L t "C \[ ~ I ~ \] o 
Morpho- 
fragments V I'J  I I L. I I I I o 
Figure 2: Decomposition of Sentence in Japanese 
"Graphic Words" in Figure 1) and searches the dic- 
tionary for these graphic words. However, in prac- 
tice, we want a sequence of lexemes (see the line 
labeled "Lexemes" in Figure 1). We list two major 
problems of tokenization in segmented s be- 
low (examples in English). We use the term segment 
to refer to a string separated by white spaces. 
1. Segmentation(one segment into several lex- 
emes): 
Segments with a period at the end (e.g, "Calif." 
and "etc.") suffer from segmentation ambigu- 
ity. The period can denote an abbreviation, the 
end of a sentence, or both. The problem of sen- 
tence boundary ambiguity is not easy to solve 
(Palmer and Hearst, 1997). A segment with 
an apostrophe also has segmentation ambiguity. 
For example, "McDonald's" is ambiguous since 
this string can be segmented into either "Mc- 
Donald / Proper noun" + " 's / Possessive end- 
ing" or "McDonald's / Proper noun (company 
name)". In addition, "boys' " in a sentence "... 
the boys' toys ..." is ambiguous. The string can 
be segmented into either "boys' / Plural posses- 
sive" or "boys/Plural Noun" ÷ " ' / Punctu- 
ation (the end of a quotation)" (Manning and 
Schiitze, 1999). If a hyphenated segment such 
as "data-base," "F-16," or "MS-DOS" exists in 
the dictionary, it should be an independent lex- 
eme. However, if a hyphenated segment such as 
"55-years-old" does not exist in the dictionary, 
hyphens should be treated as independent to- 
kens(Fox, 1992). Other punctuation marks such 
as "/" or "_" have the same problem in "OS/2" 
or "max_size" (in programming s). 
2. Round-up(several segments into one lexeme): 
If a lexeme consisting of a sequence of segments 
such as a proper noun (e.g., "New York") or 
a phrasal verb (e.g., "look at" and "get up") 
exists in the dictionary, it should be a lexeme. 
To handle such lexemes, we need to store multi- 
segment lexemes in the dictionary. Webster and 
Kit handle idioms and fixed expressions in this 
way(Webster and Kit, 1992). In Penn Tree- 
bank(Santorini, 1990), a proper noun like "New 
York" is defined as two individual proper nouns 
"New / NNP" ÷ "York / NNP," disregarding 
round-up of several:segments into a lexeme. 
The definition of lexemes in a dictionary depends 
on the requirement of application. Therefore, a sim- 
ple pattern matcher is not enough to deal with lan- 
guage independent tokenization. 
Non-segmented s do not have a delimiter 
between lexemes (Figure 2). Therefore, a treatment 
of further segmentation and rounding up has been 
well considered. In a non-segmented , the 
analyzer considers all prefixes from each position in 
the sentence, checks whether each prefix matches 
the lexeme in the dictionary, stores these lexemes 
in a graph structure, and finds the most plausible 
sequence of lexemes in the graph structure. To find 
the sequence, Nagata proposed a probabilistic lan- 
guage model for non-segmented s(Nagata, 
1994)(Nagata, 1999). 
The crucial difference between segmented and 
233 
non-segmented s in the process of morpho- 
logical analysis appears in the way of the dictionary 
look-up. The standard technique for looking up lex- 
emes in Japanese dictionaries is to use a trie struc- 
ture(Fredkin, 1960)(Knuth, 1998). A trie structured 
dictionary gives all possible lexemes that start at 
a given position in a sentence effectively(Morimoto 
and Aoe, 1993). We call this method of word 
looking-up as "common prefix search" (hereafter 
CPS). Figure 3 shows a part of the trie for Japanese 
lexeme dictionary. The results of CPS for "~j~ 
~'7 ~ o "(I go to Ebina.) are "~j~" and "~." To 
get all possible lexemes in the sentence, the analyzer 
has to slide the start position for CPS to the right 
by character by character. 
3 A Naive Approach 
A simple method that directly applies the mor- 
phological analysis method for non-segmented lan- 
guages can handle the problems of segmented 
s. For instance, to analyze the sen- 
tence, "They've gone to school together," we first 
delete all white spaces in the sentence and get 
"They'vegonetoschooltogether." Then we pass it to 
the analyzer for non-segmented s. However, 
the analyzer may return the result as "They / 've / 
gone / to / school / to / get / her / ." inducing a 
spurious ambiguity. Mills applied this method and 
tokenized the medieval manuscript in Cornish(Mills, 1998). 
We carried out experiments to examine the in- 
fluence of delimiter deletion. We use Penn Tree- 
bank(Santorini, 1990) part-of-speech tagged cor- 
pus (1.3M lexemes) to train an HMM and ana- 
lyze sentences by HMM-based morphological ana- 
lyzer MOZ(Yamashita, 1999)(Ymashita et al., 1999). 
We use a bigram model for training it from the cor- 
pus. Test data is the same as the training corpus. 
Table 1 shows accuracy of segmentation and part-of- 
speech tagging. The accuracy is expressed in terms 
of recall and precision(Nagata, 1999). Let the num- 
ber of lexemes in the tagged corpus be Std, the num- 
ber of lexemes in the output of the analyze be Sys, 
and the number of matched lexemes be M. Recall 
is defined as M/Std, precision is defined as M/Sys. 
The following are the labels in Table 1 (sentence for- 
mats and methods we use): 
LXS We isolate all the lexemes in sentences and 
apply the method for segmented s to 
the sentences. This situation is ideal, since 
the problems we discussed in Section 2 do 
not exist. In other words, all the sentences 
do not have segmentation ambiguity. We use 
the results as the baseline. Example sentence: 
"Itu ' suMr. uLeeu ' supenu •" 
NSP We remove all the spaces in sentences and 
apply the method for non-segmented lan- 
guages to the sentences. Example sentence: 
"It ' sMr. Lee ' spen." 
NOR Sentences are in the original normal format. 
We apply the method for non-segmented lan- 
guages to the sentences. Example sentence: 
"It ' SuMr. uLee ' supen." 
Because of no segmentation ambiguity, "LXS" 
performs better than "NSP" and "NOR." The fol- 
lowing are typical example of segmentation errors. 
The errors originate from conjunctive ambiguity and 
disjunctive ambiguity(Guo, 1997). 
conjunctive ambiguity The analyzer recognized 
"away, .... ahead," "anymore," and '~orkforce" 
as "a way," "a head," "any more," and '~ork 
force," respectively. In the results of "NSP," the 
number of this type of error is 11,267. 
disjunctive ambiguity The analyzer recognized 
"a tour," "a ton," and "Alaskan or" as "at our," 
"at on," and "Alaska nor," respectively. In the 
results of "NSP," the number of this type of er- 
ror is 233. 
Since only "NSP" has disjunctive ambiguity, "NOR" 
performs better than "NSP." This shows that white 
spaces between segments help to decrease segmenta- 
tion ambiguity. 
Though the proportional difference in accuracy 
looks slight between these models, there is a con- 
siderable influence in the analysis efficiency. In the 
cases of "NSP" and "NOR," the analyzer may look 
up the dictionary from any position in a given sen- 
tence, therefore candidates for lexemes increase, and 
the analysis time also increase. The results of our 
experiments show that the run time of analyses of 
"NSP" or "NOR" takes about 4 times more than 
that of "LXS." 
4 Morpho-fragments: The Building 
Blocks 
Although segmented s seemingly have clear 
word boundaries, there are problems of further seg- 
mentation and rounding up as introduced in Section 
2. The naive approach in Section 3 does not work 
well. In this section, we propose an efficient and so- 
phisticated method to solve the problems by intro- 
ducing the concept of morpho-/ragments. We also 
show that a uniform treatment of segmented and 
non-segmented s is possible without induc- 
ing the spurious ambiguity. 
4.1 Definition 
The morpho-fragments (MFs) of a  is de- 
fined as the smallest set of strings of the alphabet 
which can compose all lexemes in the dictionary. In 
other words, MFs are intermediate units between 
234 
, O--O, "~/Noun (shrimp) 
I~•-- "~/Proper noun (Ebina City) 
~/Noun (pawn) 
/ 
• -- -:~ (/Verb (walk) 
• i ?~/Noun (sidewalk) I 
~•--,:~/Noun (footbridge) 
LXS 
NSP 
NOR 
MF 
Figure 3: Japanese Trie Structured Dictionary 
Segmentation POS Tagging Analysis Time 
( Recall / Precision ) ( Recall / Precision ) ( Ratio ) 
100 96.98 1.0 
99.52 / 99.67 96.52 / 96.69 4.3 
99.87 / 99.91 96.84 / 96.88 4.2 
99.88 / 99.93 96.85 / 96.91 1.4 
Table 1: Results of Experiments 
characters and lexemes (see Figure 1 and Figure 2). 
MFs are well defined tokens which are specialized 
for  independent morphological analysis. 
For example, in English, all punctuation marks 
are MFs. Parts of a token separated by a punctu- 
ation mark such as "He," "s," and the punctuation 
mark itself, ..... in "He's" are MFs. The tokens in 
a compound lexeme such as "New" and "York" in 
"New York" are also MFs. In non-segmented lan- 
guages such as Chinese and Japanese, every single 
character is a MF. Figure 4 shows decomposition of 
sentences into MFs (enclosed by "\[" and "\]") for sev- 
eral s. Delimiters (denoted "J') are treated 
as special MFs that cannot start nor end a lexeme. 
Once the set of MFs is determined, the dictio- 
nary is compiled into a trie structure in which the 
edges are labeled by MFs, as shown in Figure 5 for 
English and in Figure 3 for Japanese. A trie struc- 
ture ensures to return all and only possible lexemes 
starting at a particular position in a sentence by a 
one-time consultation to the dictionary, resulting in 
an efficient dictionary look-up with no spurious am- 
biguity. 
When we analyze a sentence of a non-segmented 
, to get all possible lexemes in the sentence, 
the analyzer slides the position one character by 
one character from the beginning to the end of the 
sentence and consults the trie structured dictionary 
(Section 2). Note that every character is a MF in 
non-segmented s. In the same way, to an- 
alyze a sentence of a segmented , the an- 
alyzer slides the position one MF by one MF and 
consults the trie structured dictionary, then, all pos- 
sible lexemes are obtained. For example, in Figure 5, 
the results of CPS for "'m in ..." are ..... and "'m," 
and the results for "New York is ..." are "New" and 
"New York." 
Therefore, a morphological analyzer with CPS- 
based dictionary look-up for non-segmented lan- 
guages can be used for the analysis of segmented 
s. In other words, MFs make possible lan- 
guage independent morphological analysis. We can 
also say MFs specify the positions to start as well as 
to end the dictionary look-up. 
235 
Language Sentence Recognized Morpho-fragments 
English 
Chinese 
Korean 
Japanese 
I'm in New York. 
~J~. (He is my little brother.) 
L~--~ ~ot I ~c~. (I go to school.) 
-~g\[~ ~ $ b .~ ~) o (Let's go to school.) 
\[Zl\['\]\[mlu\[±nlu\[Newlu\[Yorkl\[.\] 
-* \] 
-* \] 
Figure 4: Recognition of Morpho-fragments 
0-- I/Pronoun 
A W I 
1 I '/Punctuation m A 'm/Verb 
S 0-- 's/Possessive ending 
New 0, New/Adjective 
York 
"'0 w A New_York/Proper noun 
Figure 5: English Trie Structured Dictionary 
4.2 How to Recognize Morpho-fragments 
The problem is that it is not easy to identify the 
complete set of MFs for a segmented . We 
do not make effort to find out the minimum and com- 
plete set of MFs. Instead, we decide to specify all 
the possible delimiters and punctuation marks ap- 
pearing in the dictionary, these may separate MFs 
or become themselves as MFs. By specifying the fol- 
lowing three kinds of information for the  
under consideration, we attain a pseudo-complete 
MF definition. The following setting not only sim- 
plifies the identification of MFs but also achieves a 
uniform framework of  dependent morpho- 
logical analysis system. 
1. The  type: 
The s are classified into two groups: 
segmented and non-segmented s. 
"Language type" decides if every character in 
the  can be an MF. In non-segmented 
 every character can be an MF. In 
segmented , punctuation marks and 
sequences of characters except for delimiters 
can be an MF. 
2. The set of the delimiters acting as boundaries: 
These act as boundaries of MFs. However, these 
can not be independent MFs (can not start nor 
end a lexeme). For example, white spaces are 
delimiters in segmented s. 
3. The set of the punctuation marks and other 
symbols: 
These act as a boundary of MFs as well as an 
MF. Examples are an apostrophe in "It's," a 
period in "Mr.," and a hyphen in "F-16." 
• Using these information, the process of recogniz- 
ing MFs becomes simple and easy. The process can 
be implemented by a finite state machine or a simple 
pattern matcher. 
The following is the example of the definition for 
English: 
1. Language type: segmented  
2. Delimiters: white spaces, tabs, and carriage- 
returns 
3. Punctuation marks: \[.\] \[,\]\[:\] \[;\] \['\] \["l \[-\] . " "\[0\] \[1\] \[2\]. • • 
As is clear from the definition, "punctuation marks" 
are not necessary for non-segmented , since 
236 
every character is an MF. The following is the ex- 
ample for Japanese and Chinese. 
1. Language type: non-segmented  
2. Delimiters: not required 
3. Punctuation marks: not required 
Though Korean sentences are separated by spaces 
into phrasal segments, Korean is a non-segmented 
 essentially, since each phrasal segment does 
not have lexeme boundaries. We call this type of lan- 
guages incompletely-segmented s. German 
is also categorized as this type. The following is the 
example for Korean. 
1. Language type: non-segmented  
2. Delimiters: spaces, tabs, and carriage-returns 
3. Punctuation marks: not required 
In incompletely-segmented s, such as Ko- 
rean, we have to consider two types of connection of 
lexemes, one is "over a delimiter" and the other is 
"inside a segment" (Hirano and Matsumoto, 1996). 
If we regard delimiters as lexemes, a trigram model 
can make it possible to treat both types. 
The definition gives possible starting positions of 
MFs in sentences of the  and the same 
morphological analysis system is usable for any lan- 
guage. 
We examined an effect of applying the morpho- 
fragments to analysis. Conditions of the experiment 
are almost the same as "NOR." The difference is 
that we use the morpho-fragments definition for En- 
glish. The row labeled "MF" in Table 1 shows the 
results of the analysis. Using the morpho-fragments 
decreases the analysis time drastically. The accuracy 
is also better than those of the naive approaches. 
Well studied  such as English may have 
a hand tuned tokenizer which is superior to ours. 
However, to tune a tokenizer by hand is not suitable 
to implement many minor s. 
4.3 Implementation 
We implement a  independent morphologi- 
cal analysis system based on the concept of morpho- 
fragments(Yamashita, 1999). With an existence of 
tagged corpus, it is straightforward to implement 
part-of-speech taggers. We have implemented sev- 
eral of such taggers. The system uses an HMM. 
This is trained by a part-of-speech tagged corpus. 
We overview the setting and performance of tagging 
for several s. 
English 
An HMM is trained by the part-of-speech 
tagged corpus part of Penn Treebank(Santorini, 
1990) (1.3 million morphemes). We use a tri- 
gram model. The lexemes in the dictionary are 
taken from the corpus as well as from the en- 
try words in Oxford Advanced Learner's Dictio- 
nary(Mitton, 1992). The system achieves 97% 
precision and recall for training data, 95% pre- 
cision and recall for test data. 
Japanese 
An HMM is trained by Japanese part-of-speech 
tagged corpus(Rea, 1998) (0.9 million mor- 
phemes). We use a trigram model. The 
lexemes in the dictionary are taken from the 
corpus as well as from the dictionary of 
ChaSen(Matsumoto et al., 1999), a freely avail- 
able Japanese morphological analyzer. The sys- 
tem achieves 97% precision and recall for train- 
ing and test data. 
Chinese 
An HMM is trained by the Chinese part-of- 
speech tagged corpus released by CKIP(Chinese 
Knowledge Information Processing Group, 
1995) (2.1 million morphemes). We use a bi- 
gram model. The lexemes in the dictionary 
are taken only from the corpus. The system 
achieves 95% precision and recall for training 
data, 91% precision and recall for test data. 
5 Related Work and Remarks 
We address two problems of tokenization in seg- 
mented s: further segmentation and round- 
up. These problems are discussed by several authors 
including Mills(Mills, 1998) Webster & Kit(Webster 
and Kit, 1992). However, their proposed solutions 
are not  independent. 
To resolve the problems of tokenization, we first 
apply the method of non-segmented s pro- 
cessing. However, this causes spurious segmenta- 
tion ambiguity and a considerable influence in the 
analysis times. Therefore, we propose the concept 
of morpho-fragments that minimally comprises the 
lexemes in a . Although the idea is quite 
simple, our approach avoids spurious ambiguity and 
attains an efficient look-up of a trie structured dic- 
tionary. In conclusion, the concept of morpho- 
fragments makes it easy to implemented  
independent morphological analysis. 

References 

Chinese Knowledge Information Processing Group, 1995. ~ ~;~\[ ~ \[~ ~z ~ ~ ~.~ ~ ~ ~ ~ ~ _~ ~j\]. Academia Sinica Institute of Information Science. in Chinese. 

Christopher Fox, 1992. Lexical Analysis and Stoplists, chapter 7, pages 102-130. In Frakes and Baeza-Yates (Frakes and Baeza-Yates, 1992). 

William B. Frakes and Ricardo A. Baeza-Yates, editors. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall. 

Edward Fredkin. 1960. Trie memory. Communications of the ACM, 3(9):490-500, September. 

Jin Guo. 1997. Critical tokenization and its properties. Computational Linguistics, 23(4):569-596, December. 

Yoshitaka Hirano and Yuji Matsumoto. 1996. A proposal of korean conjugation system and its application to morphological analysis. In Proceedings of the 11th Pacific Asia Conference on Language, Information and Computation (PACLING 11), pages 229-236, December. 

Donald E. Knuth. 1998. The Art of Computer Programming : Sorting and Searching, volume 3. Addison-Wesley, second edition, May. 

Henry Ku~era and W. Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press. 

Christopher D. Manning and Hinrich Schiitze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press. 

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita, and Yoshitaka Hirano, 1999. Japanese Morphological Analysis System ChaSen version 2.0 Manual. NAIST Technical Report NAIST-IS-TR99009, April. 

Jon Mills. 1998. Lexicon based critical tokenization: An algorithm. In Euralex'98, pages 213-220, August. 

Roger Mitton, 1992. A Description of A Computer-Usable Dictionary File Based on The Oxford Advanced Learner's Dictionary of Current English. 

K. Morimoto and J. Aoe. 1993. Two tree structures for natural  dictionaries. In Proceedings of Natural Language Processing Pacific Rim Symposium, pages 302-311. 

Masaaki Nagata. 1994. A stochastic japanese morphological analyzer using a forward-dp backward-a* n-best search algorithm. In COLING 94, volume 1, pages 201-207, August. 

Masaaki Nagata. 1999. A part of speech estimation method for japanese unknown words using a statistical model of morphology and context. In 37th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 277-284, June. 

David D. Palmer and Marti A. Hearst. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241-267, June. 

Real World Computing Partnership, 1998. RWC Text Database Report. in Japanese. 

Beatrice Santorini, 1990. Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd printing), June. 

Jonathan J. Webster and Chunyu Kit. 1992. Tokenization as the initial phase in nip. In COLING 92, volume 4, pages 1106-1110, August. 

Tatsuo Yamashita, 1999. MOZ and LimaTK Manual NAIST Computational Linguistics Laboratory, <http://cl.aist-nara.ac.jp/-tatuo-y/ma/>, August. in Japanese. 

Tatsuo Ymashita, Msakazu Fujio, and Yuji Matsumoto. 1999. Language independent tools for natural  processing. In Proceedings of the Eighteenth International Conference on Computer Processing, pages 237-240, March. 
