Extended Models and Tools for High-performance Part-of-speech 
Tagger 
Masayuki Asahara and Yuji Matsumoto 
Graduate School of Information Science, Nara Institute of Science and Technology 
8916-5, Taka.yama-cho, Ikoma-shi, Nara, 630-0101, Japan 
{masayu-a,matsu}@is. aist-nara, ac. jp 
Abstract 
Statistical part-of-st)eeeh(POS) taggers achieve high 
accuracy and robustness when based oil large, scale 
maimally tagged eorl)ora. Ilowever, enhancements 
of the learning models are necessary to achieve bet- 
ter 1)erforma.nce. We are develol)ing a learning 
tool for a Jalmnese morphological analyzer called 
Ch, aScn. Currently we use a fine-grained POS tag 
set with about 500 tags. To al)l)ly a normal tri- 
gram model on the tag set, we need unrealistic size 
of eorl)ora. Even, for a hi-gram model, we ean- 
no~, 1)ret)are a llloderate size of an mmotated cor- 
pus, when we take all the tags as distinct. A usual 
technique to Col)e with such fine-grained tags is to 
reduce the size of the tag set 1)y grouping the set 
of tags into equivalence classes. We introduce the 
concept of position-wise 9rouping where the tag set 
is t)artitioned into dill'el'lint equivalence classes at 
each t)osition in the. conditional 1)rohabilities in the 
Markov Model. Moreover, to eoi)e with the data 
Sl)arsen(?ss prot)lem caused 1) 3, exceptional t)henon> 
ena, we introduce several other techniques such as 
word-level statistics, smoothing of word-level an(l 
P()S-level statistics and a selective tri-gram model. 
To help users determine probabilistic 1)arameters, we 
introduce an error-driven method for the pm'mneter 
selection. We then give results of exl)eriments to see 
the effect of the tools applied to an existing Jat)anese 
morphological analyzer. 
1 Introduction 
Along with the increasing awfilability of mmotated 
eorl)ora, a number of statistic P()S tatters have 
been developed which achieve high accuracy and ro- 
bustness. On the other hand, there is still continu- 
ing demand for the iinprovement of learning lnod- 
els when sufficient quantity of annotated corpora 
are not available in the users domains or s. 
Flexible tools for easy tuning of leanfing models 
are in demand. We present such tools in this pa- 
per. Our tools are originally intended for use with 
the Japanese morphological analyzer, ChaSen (Mat- 
sumoto et al., 1999), which at present is a statis- 
tical tagger based on the w~riable memory length 
Marker Model (lion et al., 1.994). We first give a 
brief overview of the features of the learning tools. 
The t)art-of-speech tag set we use is a slightly 
modified version of tile IPA POS tag set (RWCP, 
2000) with about 500 distinct POS tags. The real 
tag set is even larger since some words are treated 
as distim't P()S tags. The size of the tag set is unre- 
alistic for buihting tri-grmn rules and even bi-gram 
rules which take all the tags as distinct. The usual 
technique for coping with such fine-grained tags is to 
reduce the size of the tag set by groul)ing the set of 
tags into equivalence classes (Jelinek, 1.998). We in- 
troduce the concept of position-wise grouping where 
the tag set is partitioned into different equivalence 
(;lasses at each position in the conditiolml probabili- 
lies in the Marker Model. This feature is especially 
useflfl for ,lapanese  analysis since Jal)anese 
is a highly (:onjugated , where conjugation 
fOl'lllS have a great etfeet on the succeeding mor- 
1)homes, trot have little to do with the t)receding nlor- 
phemes. Moreover, in colloquial , a number 
of eolltrael;ed expressio11s are eoinmon, where two or 
more morphemes are central:ted into a single word. 
The contracted word behaves as belonging to dif- 
ferent t)arts-of-st)eech by connecting to the previous 
word or to the next word. Position-wise grouping en- 
ables users to grouI) such words differently according 
to the positions in which they appear. 
Data sparseness is always a serious problem when 
dealing with a large tag set. Since it is unrealistic to 
adopt a simple POS tri-gram model to our tag set, 
we base our model on a hi-gram model and augment 
it with selective tri-grams. By selective tri-gram, we 
mean that only special contexts are conditioned by 
tri-gram model and are mixed with the ordinary bi- 
grmn model. We also incorporate some smoothing 
techniques for coping with the data sparseness prob- 
lelll. 
By eolnbining these methods, we constructed the 
learning tools for a high-lmrformance statistical mor- 
phological analyzer that are able to learn the prob- 
ability i)arameters with only a moderate size tagged 
('orl)us. 
The rest of this paper is structured as follows. 
21 
Section 2 discusses the basic concet)ts of tile statisti- 
cal morphological analysis and some problems of the 
statistical approach. Section 3 presents the charac- 
teristics of the our learning tools. Section 4 reports 
the result of some experiments and tile accuracy of 
the tagger in several settings. Section 5 discusses re- 
lated works. Finally, section 6 gives conclusions and 
discusses future works. 
Throughout this paper, we use morphological 
analysis instead of part-of-speecll tagging since 
Japanese is an agglutinative . This is the 
standard ternfinology in Japanese literatures. 
2 Preliminaries 
2.1 Statistical morphological analysis 
The POS tagging problem or the Japanese morpho- 
logical analysis problem must do tokenization and 
find the sequence of POS tags T = tl,. •., t:,~ tot the 
word sequence W = wl,..., w,~ in the int)ut string 
S. Tile target is to find T that maxinfizes tile fol- 
lowing probability: 
Using the Bayes' rule of probability theory, 
P(W,T) can be decomposed as a sequence of tile 
products of tag probabilities and word probabilities. 
P (TIW) P(T, W) = argmax P(W) 
= argu~}.xP(T,W) 
= .,': uv/xP(WlT)F(T) 
We assumed that tile word probability is con- 
strained only by its trig, and that the tag probability 
is constrained only by its preceding tags, either with 
the t)i-grmn or the tri-gram model: 
P(WIT) = HP(wilti) 
i=1 
P(T) = fl P(tilti_l) 
i=1 
P(T) = fl P(ti\[ti-2,i-1) ) 
i=1 
The values are estimated from tile frequencies in 
tagged corpora using maximum likelihood estima- 
tion: 
p(w lti) - F(w"L) r(t ) 
F(ti_,,td P(t lt -l) 
- 
F(t;i-2,1,i-1, ti) = 
F(ti-2, ti-1) 
Using these parameters, tim most probable tag se- 
quence is determined using the Viterbi algorithm. 
2.2 Hierarchical Tag Set 
We use the IPA POS tag set (RWCP, 2000). This 
tag set consist of three eleinents: tile part-of speech, 
the type of conjugation and tile form of conjugation 
(the latter two elements are necessary only for words 
that conjugate). 
Tile POS tag set has a hierarchical structure: The 
top POE level consists of 15 categories(e.g., Noun, 
Verb, ... ). The second and lower levels are th, e sub- 
division level. For example, Noun is fllrther subdi- 
vided into common nouns(general), llroper nomls, 
numerals, and so on. Proper Noun is sul)divided 
into General, Person, Organization and Place. Per- 
son and Place are subdivided again. The bottom 
level of tile subdivision level is th.c word level, which 
is conceptually regarded as a part of the subdivision 
level. 
In the Japanese , verbs, adjectives and 
auxiliary verbs have conjugation. These are catego- 
rized into a fixed set of conjugation types(CTYPE), 
each of which has a fxed set of conjugal;ion 
forms(CFORM). It is known that in Japanese that 
the CFORM varies according to the words appear- 
ing in the succeeding position. Thus, at tile condi- 
tional position of the estimated tag probabilities, the 
CFORM plays an important role, while in the case 
of other positions, they need not be distinguished. 
Figure 1 illustrates tile structure of the tag set. 
2.3 Problems in statistical models 
On the one hand, most of the i)rol)lems in statistical 
natural  processing stem fi'om the sparse- 
hess of training data. In our case, tile nuinber of the 
most fine-grained tags (disregarding the word level) 
is about 500. Even when we use the bi-gram model, 
we suffer from the data sparseness problem. The 
situation is nmch worse in the case of the tri-grmn 
model. This may be remedied by reducing tile tag 
set by grouping the tags into a smaller tag set. 
On the other hand, there are various kinds of ex- 
ceptions in  phenomena. Some words have 
different contextual features fi'om others in the same 
tag. Such exceptions require a word or some group of 
words to be taken itself as a distinct part-of-speech 
or its statistics to be taken in distinct contexts. In 
our statistical learning tools, those exceptions are 
handled by position-wise grouping, word-level statis- 
tics, smoothing of word-level and POS-level, and se- 
lective tri-gram model, which are described in turn 
ill the next section. These features enable users to 
22 
POS 
.,o 17571 
"rile subdivision level ~I I I I 
lhe word level GL~ WI~yO\] NL~ .......... 
(;TYPE Godul>"K" G°dan"lS" Illl No Conjugation No Conjugation Type Type 
tbrm form .... No Conjugation No Conjugation 
Figure 1: The examples of the hierarchical tag set 
adjust tile balance between fine and coarse grained 
model settings. 
3 Features of the tools 
This section overviews characteristic timtures of tim 
learning tools for coping with the above, mentioned 
prolflems. 
a.:t Position-wise grouping of POS tags 
Since we use a very fine-grained tag set, it is impor- 
tant to classit'y them into some equiva.lence classes 
l;o reduce the, size. of i)rotmbilistic lmramelers. More- 
over, as is discussed in the 1)revious section, some 
words or P()S bo, lmves ditli;rently according to Ihe 
l)osition they at)pear. In 3at)aneso, tbr instance, 
the CF()I/M play an iml)ortmlt role only to dis- 
mnbiguate the words at their succeeding position. 
In other words, the CFORM should be taken into 
account only when they appear at tim position of 
1,i-1 in either bi-gram or tri-grain model (ti-i in 
I'(t~lt~__~ ) and P(t,\]t~_.,,t~_l)). This means that 
when the statistics of verbs are, taken, they should be 
grouped diflbrently according to the positions. Not(', 
that, we named the positions; The current position 
means the position of ti in the hi-gram statistics 
P(tiIti-1) or the tri-grmn statistics P(till, i_.,, ti-~). 
The preceding position means the position of ti-1. 
The second preceding position means the position of 
ti-.2. 
There are quite a few contracted t~rms ill col- 
loquial expressions. For example, auxiliary verb 
"chau" is a contracted tbrms consisting of two words 
"te(particle) + simau(auxiliary verb)" and behaves 
quite differently from other words. One way to learn 
its statistical behavior is to collect various us~ges of 
the word and add the data to the training data after 
correctly mmotating them. In contrast, the idea of 
point-wise grouping provides a nice alternative so- 
hltion to this problem. By simply group this word 
into the same equivalence class of "te" for the cur- 
relfl; 1)osition I,i and grou I) it into the same equiva- 
le, nt class of "simau" for the t)rece(ling position ti-1 
in P(ti\]ti-. ), it learns the statistical behavior from 
these classes. 
We now describe the point-wise grouping ill a 
nlore precise way. l?or simplicity, we assume, bi- 
gram model. Let 7- = {A,/3,---} be ttw, original 
tag set. \¥e introduce two partitions of the tag set, 
one is fin' the current position T ~ = {A (',/)~,..-}, 
mid the other is for the preceding 1)osition T v = 
{AV,13v, .. .}. We define the equivalence mal>l)ing 
of the current position: I('(T -~ T"), and another 
mapping of the t)rece(ling position: U'('\]~ -4 "yv). 
lqgure, 2 shows an exalnple of the lmrtitiollS by 
those, mapl)ings, where the equivalence mappings 
I c = {el --> A c,L? -9 A",C ~ A':,\]) -+ B",E -5 
W,...} 
fv = {A --+ AV, \]\] -4 AJ', C -4 B v, D --+ B v, E -~ 
Cl,...} 
Supl)ose we express the equivalence class to which 
the tag t, belongs as It\]" for the current position and 
\[t,\]v for the preceding position, then: 
= F(w , \[td 
\[W) 1)(till, i_l) = 
3.2 Word-level statistics 
Seine words behave ditt'erently flom other words 
even in tile same POS. Especially Japanese particles, 
auxiliary verbs a.nd some affixes are known to have 
different e(mtextual behavior. The tools can define 
23 
03 
g 
ku 
o -- 
b (D 
The Preceding Positon Tag Set 
ALB C~Dlc ° FIG\[ H--- -~-- 
B" / D" 
- -~ .......... ', ................ i ...... i ............ 
..... ~ ........... ~ .............. i ...... i ........ 
03 
F- 
o = < 
n "E 
B O 
I- 
The Precedin( Position Tag Set 
Wb 1 
B 
iil..ll,ll.q Wbr, B ,~ 
Figure 2: Position-wise grouping of tags 
some words as distinct POS and their statistics are 
taken individually. 
The tag set T extends to a new tag set T ~xt that 
defines some words as individual POSs (the word 
level). Modification to the probability formulas for 
such word level tags is straightforward. 
Note that the statistics for POS level should be 
modified when some words in the same group are 
individuated. Suppose that the tags A and B are 
defined in tile T and some words l,l~,..., l'l~,~ E A 
and l, tZb~ , . . . , I:F~,,~ C 17 arc individuated in 7" ~xt. \Ve 
define tags Ae,~,l, Bext~ C T ~:t as follows: 
Figure 3: the word extended tag set 
We define two smoothing coetIicients: A~ is the 
smoothing ratio for the current position and /~j, is 
the smoothing ratio of the preceding position. Those 
values can be defined for each word. 
Suppose the word wi is individuated and its POS 
is ti. If the current position is smoothed, then the 
tag probability is defined as follows (note that wi 
itself is an individuated tag): 
/5(wilti_l ) = ((1 - A~)P(tilti_\]) + A~P(wilti_l)) 
If the word at the preceding positions is smoothed 
(assume ti-\] is the POS of wi-1): 
.ao~ = ,4 \ 0v,~,...,~,,~,,} 
Bcxt = /) \ {wv~,...,wt,,,~} 
To estimate the probability for tile comlection 
A-B, tile frequency F(Aea:t, Bext) is used rather than 
the total frequency F(A, B). Figure 3 illustrate the 
tag set extension of this situation. 
These tag set extension is actually a special case of 
position-wise grouping. The equivalence mappings 
are fl'om all word level tags to T ~t. The mapping 
I ~ maps all the words ill A~xt into A~t and maps 
each of {W~,..., 14(~., } into itself. In the same way, 
I p maps all the words in B~.~,t into B~:~t and maps 
each of {Wb,,..., Wb,, } into itself. 
3.3 Smoothing of word and POS level 
statistics 
When a word is individuated while its occurrence 
frequency is not high, x~e have to accumulate in- 
stances to obtain enough statistics. Another solu- 
tion is to smooth the word level statistics with POS 
level statistics. In order to back-off the st)arseness of 
the words, we use the statistics of the POS to which 
the words belong. 
P(ti\]wi-1) = (1 - ~Xp)P(tilti-l) + A~,F(tilwi-~) 
If the both words of the positions is extend: 
~- /~p((1 -- ~c)\]~(ti\[~Oi_l ) ~- )tcr('ll)i\]'ll)i_ 1 )) 
+(1 - Av)((1 - A~)P(tilti_\] ) + A~P(wilti-t)) 
3.4 Selective tri-gram model 
Simple tri-gram models are not feasible for a large 
tag set. As a matter of fact, only limited eases re- 
quire as long contexts as tri-grams. We 1)rot)ose to 
take into account Olfly limited tri-gram instances, 
which we call sclcctive tri-flrams. Our model is 
a mixture of such tri-gram statistics with bi-gram 
ones. 
The idea of mixture of different context length 
is not new. Markov Models with varial)le memory 
length are proposed by Ron(Ron et al., 1994), in 
whictl a mixtm'e model of n-grams with various value 
of n is presented as well as its learning algorithms. 
In such a model, the set of contexts (the set of states 
of the automata) should be mutually disjoint for the 
automata to be deterministic and well-defined. 
We give a little different interpretation to tri-grmn 
statistics. We consider a tri-grmn as an exceptional 
24 
,, ......... ,o,,,o. AIA~ 
C The preceding position A I B 
F- 
B hIE 
CA,.....;;....,. A B I( ....... 
Figure 4: Selective tri-gram 
context. Wheii a bi-grani context and a tri-grani 
context have sonie intersection, the tri-gram context 
is regarded as an exception within the, l)i-graui con- 
text. In this sense, all tim cont, exts are, mutually 
disjoint as well in our niodel, and it is possii)h, to 
convert our model into Ron's tormulal;ion. I{owevei, 
we think that oul" %rnmlation is iilore straighforward 
if the longer COlltex(;s ~-/1"(! interln'eted as exc, el)i;ions 
to (;lie shorter (;onte, xts. 
\Ve, assume that th(; grouping at the current 1)o- 
sit;ion ('7 -~) share the same grouping of the t)i-grain 
case. But for the l)re(:eding l)osition and (;lie s(!(:ond 
1)receding 1)osition, we can deline ditl'erent groupings 
of tag sets fl'om those of the bi-gram case. We intro- 
duce the two new tag sets tbr the preceding positions: 
The tag sol; of the preceding position: 
"P/- {W", S/,...} 
The tag set of the s(!(',()ii(l pre(:(!ding l>().~ition: 
.T),, ' _ { A I',' , H~,#...} 
We define the equiv~tlence mal)l)ing for the 1)re - 
ceding position: I p' (7- 4 "Y p'), and tile nml)ping for 
tile second 1)receding position: I s';/ (7- -4 "Y Jm' ). As- 
stoning that an equivalence classes for 1, detined by 
the mapping I pp' is expressed as \[t\] pj/, the, tri-grani 
t)robal)ility is defined naturally as fl)llows: 
P(tilt~-',, ti-i ) -- s'(\[I,d"lb,<-._,\]'"",\[l,~-,\]'") 
F(\[t.~_.4'"', \[l.~_, \],", It,;\]") 
F(\[ti_~\]m", rl. ,1,.>'~ L 't--J J \] 
Figure 4 shows an image of fl'equency counts for 
tri-gram model. 
In case some hi-gram COlltext overlaps with a tri- 
grmn context, the bi-graln statistics are taken by 
excluding the tri-gram statistics. 
For (:xmnl)h:, if we inchide (.lie tri-grmn context 
A - C - \]7 in our model, then the slat,)sties of the hi- 
grail) COilteX(; C- 13 is taken as folh>ws (F stands for 
true, frequency in training corpora while F' stands 
for estimated frequency to lie used for 1)robat.>ility 
calculation): 
s,"(c, .) = s~'(c, J3) - F(A, C, ix) 
Since selection of tii-gram contexts is not easy 
task, the tools supports the selection based on ml 
error-driven method. We omit the detail because of 
the sl)a(:e limitation. 
3.5 Estiinatioli for unseen words in eori)us 
Since not all the words in (;lie dictionary appear 
in the training corpus, |;lie occurrence probability 
i>f miseen words should 1)e allocated ill $Olil(: way. 
There are a number of method for estimating un- 
se, en events. Our era'rent tool adopts Lidstone's law 
of succession, wlfieh add ;~ fixed (:omit to each obser- 
vat)Oil. 
~'('.,10 = F(,.,, t) + ~ 
E,,~ F(,., t) + ~: . Itl 
At 1)resent, l;he de, fault frequency (:omit, (t is set (o 
0.5. 
4 Experiments and Evaluation 
For evaluating how the 1)rol)osed extension lint)roves 
a normal t)i-glain model, we condllcted several ex- 
periments. We group verl)s according (o the con- 
jugation forms at the preceding i/osition, take word 
level statistics for all l/articles, auxiliary verbs and 
synll)ols, each of which is smoothed with the illlliie,- 
dial:ely higher P()S level. Selective, tri-grani contexts 
are defined for dist:riniinating a few notoriously ;lil/- 
hi~uous particle "no" and auxiliary ve, i'l)s "nai" and 
"aruY This is a very simple extension but suffices 
tbr evaluating the ett'ect of the learning tools. 
We use 5-tbld cross ewfluation over (;he RWCP 
tagged corpus (RXVCP, 2000). The corpus (:late size 
is 37490 sentences(958678 words). The errors of the 
corpus are manually rood)tied. The annotated cor- 
pus is divided into the traiifing data set(29992 sen- 
tences, 80%) and the test data set(7498 se, ntence, s, 
20%). Experinients were repeated 5 times, and the 
reslllts \v(;r(; averaged. 
The, evaluation is done at the following 3 levels: 
• le, vell: only word segmentation (tokenizati(m) 
is ewduated 
• level2: word segmentation mid (;lie toI) level 
part-of-speech are ewfluated 
• level3: all infornmtion is taken into at, count for 
evaluation 
Using the tools, we create the following six models: 
D: hernial bi-granl model 
D,,,: D + word level statistics for particles, etc. 
25 
Table 1: Results for test data (F-value %) 
dataset 
D 98.69 98.12 96.91 
Dw 98.75 98.24 97.22 
Dw.~t 98.80 98.26 97.20 
Dws 98.76 98.27 97.23 
Dwqt 98.78 98.35 97.27 
Table 2: Results for learning data (F-value %) 
dataset 
D 98.84 98.36 92.36 
D,o 98.96 98.58 97.81 
Dwq 98.92 98.46 97.6t 
Dw~ 98.96 98.58 97.80 
D,vgt 98.92 98.55 97.70 
Dwo: Dw + groupilLg 
Dw,: D,o + smoothing of word level with POS 
level 
D,~,at: Dwo + selective tri-grmn 
The smoothing rate between the part-of-sl)eech 
and the words is fixed to 0.9 for each word. 
To evahmte the results, we use tile F-value defined 
by the tbllowing formulae: 
number of correct words 12ccall = 
number of words in corpus 
number of correct words Precision = 
number of words by system output 
(/32 + 1) - l~,ecall. Precision 
F~ = /32 • (Precision + 12ccall) 
For each model, we evaluate the F-value (with ~ = 
1) for tim learlfing data and test data at; each level. 
The results are given in the Tables 1 and 2. 
From the results the tbllowing observation is pos- 
sible: 
Smoothing improve on grouping dataset in test 
data slightly. But in tile other enviromnents the 
accuracy isn't improved. In this experiment, the 
smoothing rate for all words is fixed. We need to 
make the different rate for each word in the future 
work. 
The grouping performs good result for the test 
dataset. It is natural that the grouping is not good 
for learning dataset since all the word level statistics 
are learned in the case of learning dataset. 
Finally, the selective tri-gram (only 25 rules 
added) achieves non-negligible improvement at 
level2 and level3. Compared with the normal bi- 
gram Inodel, it improves about 0.35% on level3 and 
about 0.2% on level2. 
5 Related work 
Cutting introduced grouping of words into equiv- 
a.lence classes based on the set of possible tags to 
reduce the number of the parameters (Cutting et 
al., 1992) . Schmid used tile equivaleuce classes for 
smoothing. Their classes define not a partition of 
POS tags, but mixtures of some POS tags (Schmid, 
1995). 
Brill proposed a transfbrmation-based method. In 
the selection of tri-gram contexts we will use a sim- 
ilar technique (Brill, 1995) . 
Haruno constructed variable length models based 
on the mistake-driven methods, and mixed these tag 
models. They do not have grouping or smoothing 
facilities (Haruno and Matsumoto, 1997). 
Kitauchi presented a method to determine refine- 
meat of the tag set by a mistake-driven technique. 
Their inethod determines the tag set according to 
the hierarchical definition of tags. Word level dis- 
crimination and grouping beyond the hierarchical 
tag structure are out of scope of their method (Ki- 
tauchi el; al., 1999). 
6 Conclusion and Future works 
We proposed several extensions to the statistical 
model for Japanese morphological analysis. We also 
gave preliminary experiments and showed tile effects 
of the extensions. 
Counting some words individually and smooth- 
ing them with POS level statistics alleviate the data 
sparseness problem. Position-wise grouping enables 
an eflk~ctive refiimment of the probability parameter 
settings. Using selective tri-grain provides an easy 
description of exceptional  phenomena. 
In our future work, we will develol) a method to re- 
fine the models automatically or semi-automatically. 
For example, error-driven methods will be applica- 
ble to the selection of the words to be individuated 
and the useflfl tri-gram contexts. 
For the morphological analyzer Ch, aSen, we are us- 
ing the mixture modeh Position-wise grouping used 
for conjugation. Smoothing of tile word level and 
the POS level used tbr particles. 
The analyzer and the learning tools are available 
publicly i . 

References 

E. Brill. 1995. Transformation-Based Error-Driven 
Learning and Natural Language Processing: A 
Case Study in Part-of-Speech Tagging. Computational Linguistics, 21(4):543 565. 

D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 
1992. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing. 

M. Haruno and Y. Matsumoto. 1997. Mistake Driven Mixture of Hierarchical Tag Context Trees. 
In 35th, Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 230-237, July. 

F. Jelinek. 1998. Statistical Methods for @cech 
Recognition. MIT Press. 

A. Kitauchi, T. Utsm'o, and Y. Matsmnoto. 1999. 
Probabilistic Model Le~arning tbr JatmlmSC' Mor- 
1)hological Analysis 1)3' lgrror-driven Feat;,rc Se- 
lection (in .lal)mmse). 'J}'(t~t.sa(:l/io'l~, of \]'nfi)rmatio'n 
l'roccssi'ng Sci('ty of ,\]apa'n, 40(5):2325 2337, 5. 

Y. Matsmnoto, A. Kitau('hi, T. Ymmtshita, Y. 1\]i- 
mno, H. M~tsuda, and 54. Asahm'a. 1999. 
Japanese MorphologicM Analyzer ChaSen Users 
Ma.mml version 2.0. Technical l/el)oft NAIST-IS- 
T1~99012, Nma Institute of Science mM ~ibx:lmol- 
ogy ~l~(:lmicM lR,eport. 

l). I~.on, Y. Singer, and N. Tishby. 1994. \]A~A/I'll - 
ing Prolml)ilistic Automal a with Vm'iM)lc Memory 
Length. In COLT-g4, tinges 35 ~16. 

\]/,WCP. 2000. I{\YC Tt~xt l)atabas(~. 
http://www, rwcp. or. j p/wswg/rwcdb/text/. 

H. Schmid. 1995. Improvements In part-of-speech 
Tagging With an Application To German In 
EACL SIGDAT workshop, pages 17-50. 
