Robust Segmentation of Japanese Text into a Lattice for Parsing 
Gary Kaemarcik, Chris Brockett, Hisami Suzuki 
M icrosofl Research 
One Microsoft Way 
Redmond WA, 98052 USA 
{ garykac,chrisbkt, hisamis } @in i croso ft.com 
Abstract 
We describe a segmentation component that 
utilizes minimal syntactic knowledge to produce a 
lattice of word candidates for a broad coverage 
Japanese NL parser. The segmenter is a finite 
state morphological analyzer and text normalizer 
designed to handle the orthographic variations 
characteristic of written Japanese, including 
alternate spellings, script variation, vowel 
extensions and word-internal parenthetical 
material. This architecture differs from con- 
ventional Japanese wordbreakers in that it does 
not attempt to simultaneously attack the problems 
of identifying segmentation candidates and 
choosing the most probable analysis. To minimize 
duplication of effort between components and to 
give the segmenter greater fi'eedom to address 
orthography issues, the task of choosing the best 
analysis is handled by the parser, which has access 
to a much richer set of linguistic information. By 
maximizing recall in the segmenter and allowing a 
precision of 34.7%, our parser currently achieves a 
breaking accuracy of ~97% over a wide variety of 
corpora. 
Introduction 
The task of segmenting Japanese text into word 
units (or other units such as bunsetsu (phrases)) 
has been discussed at great length in Japanese NL 
literature (\[Kurohashi98\], \[Fuchi98\], \[Nagata94\], 
et al.). Japanese does not typically have spaces 
between words, which means that a parser must 
first have the input string broken into usable units 
before it can analyze a sentence. Moreover, a 
variety of issues complicate this operation, most 
notably that potential word candidate records may 
overlap (causing ambiguities for the parser) or 
there may be gaps where no suitable record is 
found (causing a broken span). 
These difficulties are commonly addressed using 
either heuristics or statistical methods to create a 
model for identifying the best (or n-best) sequence 
of records for a given input string. This is 
typically done using a connective-cost model 
(\[Hisamitsu90\]), which is either maintained 
laboriously by hand, or trained on large corpora. 
Both of these approaches suffer fiom problems. 
Handcrafted heuristics may become a maintenance 
quagmire, and as \[Kurohashi98\] suggests in his 
discussion of the JUMAN scgmenter, statistical 
models may become increasingly fi'agile as the 
system grows and eventually reach a point where 
side effects rule out fiwther improvements. The 
sparse data problem commonly encountered in 
statistical methods is exacerbated in Japanese by 
widespread orthographic variation (see §3). 
Our system addresses these pitfalls by assigning 
completely separate roles to the segmeuter and the 
parser to allow each to delve deeper into the 
complexities inherent in its tasks. 
Other NL systems (\[Kitani93\], \[Ktu'ohashi98\]) 
have separated the segmentation and parsing 
components. However, these dual-level systems 
are prone to duplication of effort since mauy 
segmentation ambiguities cannot be resolved 
without invoking higher-level syntactic or 
semantic knowledge. Our system avoids this 
duplication by relaxing the requirement that the 
segmenter identify the best path (or even n-best 
paths) through the lattice of possible records. The 
segmenter is responsible only for ensuring that a 
correct set of records is present in its output. It is 
the filnction of the parsing component to select the 
best analysis from this lattice. With tiffs model, 
our system achieves roughly 97% recall/precision 
(see \[Suzuki00\] for more details). 
1 System Overview 
Figure shows a simple block diagram of our 
Natural Language Understanding system for 
Japanese, the goal of which is to robustly produce 
syntactic and logical forms that allow automatic 
390 
Word Segmentation 
l{ 
\[ Dcrivational .,\sscmhl~ I 
Syntactic Analysis \] 
\[ \[,ogical Form \] 
,0 ( )rthograph.v 
l.exicon 
Syntax 
I.cxicon 
% 
Figure 1: Block diagram of Japanese NL system 
extraction of semantic relationships (see 
\[Richardson98\]) and support other lirlguistic 
projects like information retrieval, NL interfaces 
and dialog systems, auto-.summarization and 
machine translation. 
The segmenter is the frst level of' processing. This 
is a finite-state morphological analyzer responsible 
for generating all possible word candidates into a 
word lattice. It has a custom lexicon (auto: 
matically derived from the main lexicon to ensure 
consistency) that is designed to facilitate the 
identification of orfllographic variants. 
Records representing words and morphemes are 
handed off by the segmenter to the derivational 
assembly component, which uses syntax-like rules 
to generate additional derived forms that are then 
used by the parser to create syntax trees and logical 
forms. Many of the techniques here are similar to 
what we use in our Chinese NI., system (see 
\[Wu98\] for more details). 
The parser (described exterisively in \[Jensen93\]) 
generates syntactic representatioris arm logical 
forms. This is a bottomoup chart parser with 
binary rnles within the Augnmnted Phrase 
Structure Grammar formalism. The grammar rules 
are --specific while the core engine is 
shared among 7 s (Chinese, Japanese, 
Korean, English, French, German, Spanish). The 
Japanese parser is described in \[Suzuki00\]. 
2 Recall vs° Precision 
In this architecture, data is fed forward from one 
COlnponent to the next; hence, it is crucial that the 
base components (like the segmenter) generate a 
minimal number of omission errors. 
Since segmentation errors may affect subsequent 
components, it is convenient to divide these errors 
into two types: recoverable and non-recoverable. 
A ram-recoverable error is one that prevents the 
syntax (or any downstream) component from 
arriving at a correct analysis (e.g., a missing 
record). A recoverable error is one that does not 
interfere with the operation of following 
components. An example of the latter is the 
inchision of an extra record. This extra record 
does not (theoretically) prevent the parser from 
doing its lob (although in practice it may since it 
eonsunles resotlrces). 
Using standard definitions of recall (R) and 
precision (P): 
*~ Jr R - Seg~,,,.,.,.,., p = Seg~,,,.,.~.~., 
7bg,,,,,/ &g,,,,,,i 
where Segcor~ec t and .<,egmxal are the number q/" "'cotwect" 
and total number o/'segments returned by the segmentet; 
and "\['agto~a I is the total Jlttmber of "correct" segments 
fi'om a tagged corpus, 
we can see that recall measures non-recoverable 
errors and precision measures recoverable errors. 
Since our goal is to create a robust NL system, it 
behooves us to maximize recall (i.e., make very 
few non-recoverable errors) in open text while 
keeping precision high enough that the extra 
records (recoverable errors) do not interfere with 
the parsing component. 
Achieving near-100% recall might initially seem to 
be a relatively straightforward task given a 
sufficiently large lexicon - simply return every 
possible record that is found in the input string, in 
practice, tile mixture of scripts and flexible 
orthography rules of Japanese (in addition to the 
inevitable non-lexicalized words) make the task of 
identifying potential lexical boundaries an 
interesting problem in its own right. 
3 Japanese Orthographic Variation 
Over tile centuries, Japanese has evolved a 
complex writing system that gives tile writer a 
great deal of flexibility when composing text. 
Four scripts are in common use (kanji, hiragana, 
katakana and roman), and can co-occur within 
lexical entries (as shown ill Table 1). 
Some mixed-script entries could be handled as 
syntactic compounds, for example, ID ~a---1-" /at 
dii kaado="ID card'7 could be derived fl'om 
1DNotJN + 79-- I ~ NOUN. tlowever, many such items 
are preferably treated as lexical entries because 
391 
i!i \[~ ~ ' \[atarashii ,: "'new "\] 
Kanji-Iliragana I~LII~J \[ho~(vtnn'ui = "'mammal"\] 
~ 7/" "7 :./\[haburashi -~ "'toothbrush "\] 
................... ......... 
K'}n.!i-<~!P!\]!! ...................... E(!S '!t(!Z./~(:GS tm!'i t? ::c:(';S's:vstet!* :'1 ........ 
12 ) J \[lmtmL, atsu - "December"/ 
Kallji-Synlbol v ,{'~ \[gcmma sen = "'~amma rays "\] 
.i-3 1" 4" t/ \[otoim - "'toilet "'\] Mixed kana 
............................. \[.-/")3 hl~¢!,,!! £ .;;(9 c¢f~,,,~/5' ;7/ ........... 
II3 )-~ -- b" \[aidtt kaado = "lP card"\] 
Kana-Alpha ..t .y-t~ .>"-5;-V --RNA \[messe~gaa RA'A = 
................................................................. T ie~t.~:s~'~lxe~ ~U.:I. '7 .................................. 
7, J- U 2/-)'- ~) 2, 90 \[suloronchiumu 90 - 
"'Strontiunt 90 "\] 
Kana-Symbol I~ 2_ 7~ 40 ° \[hoeru yonjuu do = "'roaring 
............................................................................ fot:!!~,,;.7 ......................................................... 
~i~ b ~" ~, \[keshigomu - "'eraser "1 
a >,I- 21" ~ e/ ~) ~: \[artiSt kentauri set : Other mixed 
"'Alpha Centauri "\] 
\[. ~: ~ \[togaki = "'stage directions"\] 
Table 1: Mixed-script lexical entries 
they have non-compositional syntactic or semantic 
attributes. 
In addition, many Japanese verbs and adjectives 
(and words derived from them) have a variety of 
accepted spellings associated with okurigana, 
optional characters representing inflectional 
endings. For example, the present tense of e)j b ~;¢; 
~- (kiriotosu = "to prune ") can be written as any 
of: ~)Ji:~~J --, ~;)J b .."¢,:t, ¢:JJTf; & ,~, ~JJb.~s&~-, ~ ~:; ~s ~ I 
or even ~ ~) ~'g ~ 4-, -~ 9 7~;-~-. 
Matters become even more complex when one 
script is substituted for another at the word or sub- 
word level. This can occur for a variety of 
reasons: to replace a rare or difficult ka@ (.~£~ 
\[rachi= "kMnap"\] instead of f,):~); to highlight a 
word in a sentence ( ~ >" t2 -h, o ~_ *) \[henna 
kakkou = "strange appearance '7); or to indicate a 
particular, often technical, sense ( 7 Y o -c \[watatte 
="crossing over"\] instead of iA~o-c, to emphasize 
the domain-specific sense of "connecting 2 
groups" in Go literature). 
More colloquial writing allows for a variety of 
contracted tbrms like ~ t~j~.\-~ ~ ~ t~ + !=t \[ore~ 
tacha = ore-tachi + wa = "we'" + TOPIC\] and 
phonological inutations as in ~d.~:--~- = -d'4 \[dee- 
su ~ desu = "is "\]. 
This is only a sampling of the orthographic issues 
present in Japanese. Many of these variations pose 
serious sparse-data problems, and lexicalization of 
all variants is clearly out of the questioi1. 
II.\]'~., ~lJ <..~ lt,l~L~l~;'~q ¢gJ \[H/ikc,kkoku "'every 
repeat moment "'\] 
characters Ill ~ ~., e L I~, .~ HI :!:~ til-!~ U ~ ' \[kaigaishu 
"'diligent "'\] 
distribution of t: " '>" v~- ~ t:":#v\]- \[huleo "vMeo"\] 
voicina nl,qrks 
halt\vidth & 
lhllwidth 
composite 
symbols 
F M b2J~" ~ FM )/ZJ~ \[I"M housml ~ "FM 
broadcast "'\] 
;~ (i'?" ~U, "-~ 5" 4, "V <e" :>, 2, \[daivaguramu : 
"'diagram 'J 
;~; "-~ ..... L" 2/ 1- \[paasento =: "percent :;\] 
rtz ,. ../~, . 
"'incorporated 'i\] 
N\] ~ 2 8 FI \[n!jmthactH niciu = "28 'j' day of the 
month "\] 
Table 2: Character type normalizations 
4 Segmenter Design 
Given the broad long-term goals for' the overall 
system, we address the issues of recall/precision 
and orthographic variation by narrowly defining 
the responsibilities of the segmenter as: 
(i) Maximize recall 
(2) Normalize word variants 
4.1 Maxinffze Recall 
Maximal recall is imperative, Any recall mistake 
lnade in the segmenter prevents the parser from 
reaching a successful analysis. Since the parser in 
our NL system is designed to handle ambiguous 
input in the fbrm of a word lattice of potentially 
overlapping records, we can accept lower precision 
if that is what is necessary to achieve high recall° 
Conversely, high precision is specifically not a 
goal for the segmenter. While desirable, high 
precision may be at odds with the primary goal of 
maximizing recall. Note that the lower bound for 
precision is constrained by the lexicon. 
4°2 Normalize word variants 
Given tile extensive amount of orthographic 
variability present in Japanese, some form of 
normalization into a canonical form is a pre- 
requisite for any higher-.order linguistic processing. 
The segmenter performs two basic kinds of 
nomlalization: \[,emmatization of inflected forms 
and Orthographic Norlnalization. 
392 
,~kurieana .... n),: g.). z~ -+ i,J,: ~- ~),~ :~ \[lhkmuk~ :: "drafty"/ ;5'./~ <% J)-tJ- ;'3_ \[11i1:i:;5,.,7~-\]\[ ~", ,. ,,,,.. ~ ' >\]-~J-'"o kammwaseru = 7o 
................... ~ !o{ ~ c!i ~s /./!,a!!{,6! : ::e ,!:e!l?!el<, 71 .... ........................... engage (gear.w i' 
a mll.~'llmori - "'~111 non-standard tA'ct) ::~ "~ ~-0~ \]'- \[onnanoko :: "girl"/ .;.'/. <) ~, l) \[ ~u:<;'~-\]\[ ~\[:~-). {,. ~') \] estimate "' 
script +~:t ,-t - "+ 0:4 7, :~ \[d,s'uko : "disco"/ II) ):2 I" \[li',;4- lll\]):<fi4 --i:DZ i D: a/ih, kaado2:"lDciml '" 
• D, I\] "+ " "~ Jl \[tkka,4etsu :: "one month"/ ................ 9:0 variants kagi tahako "'smf/'f 
--I, I!:~ " \['~\] \[ktl.~'uml~aseki = "Kasunli~aseki "\] 
numerals fi~r 5 i\[i~ .+ ti\]!i~ \[gorm : "'Olympics "\] 
kaltji 1 ),, ~ ")v \[hitori : "one person"/ 
:t.; i Z -- ~ ,t~, k. ~ -~J { 2 b ' ~5 /~ \[oniisan = "'older vovcel 
extensions brother "'\] 
-7 -,- 4' \]" 4 "+ 7 -," 4" I" \[/'aito 'ftght "'\] 
'Y-/ 4- ~- 9 "../-i~ i<.( ~- 9 "-./\[batorm - "'vloti~F7 kalakana 
-)" 9 :E- -- Iv ..+ 9 ~ . O • :E- --" F' \[aramoode "'h la v ari all Is 
too& "'\] 
IPL~(!rt)2_ ~ + ll~),t 2. ".'~ \[haeru = "'to shine "7 
inline yomi ~ ~_lt(f:l: ~o ~I> ") )\]~(i "+l/'~ ~t JL¢\[ \[hachuurui 
D,3 '/, < :~' II~';L D ', ~ II:t:~'i 0:: 'J "< -~ I toDac~'o" 
/g;'~; V, 1-{{:/~ D;\]\[!i,~:l' '1 na~ai "'a long visit" 
Table 4: Orthography lattices 
lexical entry. Examples are given in "fable 3. Two 
cases of special interest are okurigana and inline 
yomi/kanji normalizations. The okurigana 
normalization expands shortened forms into fully 
specified forms (i.e., fornls with all optional 
................................. 7"~'!~t!?C'\] ..................................................................................... characters present). Tile yomi/kanji handling takes inline kanji lYo(N)~:iil2d -'~ tgs ~ft~ \[~essh~rul =: "'rodent"/ 
Table 3: Script normalizations 
4.Z 1 Lemmatization 
LEMMATIZATION iri Japanese is the same as that 
for any  with inflected forms - a lemma, 
or dictionary form, is returned along with the 
inflection attributes. So, a form like ~:~-7~ \[tabeta 
= "ate "J would retum a lemma of f,~ ~<~; \[taberu = 
"'eat"\] along with a PAST attribute. 
Contracted forms are expanded and lemmatized 
individually, so that f,~ ~<-<o ~:~ .> o ?= \[tabe~ 
tecchatta = "has ectten and gone'7 is returned as: 
f~ ~-Z. 7 0 G\[-RUND -F (, x < GERUND -F L. +E ") PASr 
\[taberu: "eat" + iku--++go" F s\]Timazl=..iSpE(7\[.'\]. 
4..2.2 Orthographic Normalizatio,, 
ORTIIOGRAPttlC NORMALIZATION smoothes out 
orthographic variations so that words are returned 
in a standardized form. This facilitates lexical 
lookup and allows tile system to map the variant 
representations to a single lexicon entry. 
We distinguish two classes of orqthographic 
normalization: character type normalization and 
script normalization. 
CIIARAC'IER TYPE NORMAI.IZATION takes tile 
various representations allowed by the Unicode 
specification and converts them into a single 
consistent form. Table 2 summarizes this class of 
normalization. 
SCR.II'T NORMAI,IZAI'ION rewrites the word so that 
it conforms to tile script and :~pelling used in the 
infixed parenthetical material and normalizes it out 
(after using the parenthetical infommtion to verify 
segmentation accuracy). 
5 Lexicon Structures 
Several special lexicon structures were developed 
to support these features. Tile most significant is 
an orthography lattice* that concisely encapsulates 
all orthographic variants for each lexicon entry and 
implicitly specifies the normalized form. This has 
the advantage of compactness and facilitates 
lexicon maintenance since lexicographic inform- 
ation is stored in one location. 
The orthography lattice stores kana inforrnation 
about each kanji or group of kanji in a word. For 
example, the lattice far the verb Y~:-<~D \[taberu = 
"eat'7 is \[~:#_\]-<~, because the first character 
(ta) can be written as either kanji 5~ or kana 1=. A 
richer lattice is needed for entries with okurigana 
variants~ like LJJ 0 i'~:~ 4 \[kiriotosu = "'to prune "\] 
cited earlier: commas separate each okurigana 
grouping. The lattice for kiriotosu is \[OJ:~, 0 \]\[i"#: 
~, E \]~j-. Table 4 contains more lattice examples. 
Enabling all possible variants can proliferate 
records and confiise the analyzer (see \[Kurohashi 
94\]). We therefore suppress pathological variants 
that cause confusion with more common words 
and constructions. For example, f:L-t,q- \[n~gai = "a 
long visit'7 never occurs as I.~ ~' since this is 
ambiguous with the highly fi'equent adjective ~-~v, 
/nasal - "l<mg'7. Likewise, a word like !t 
' Not to be confiised with the word lattice, which is the 
set of records passed fi'om the segmenter to the parser. 
393 
\[nihon = ",Aq)an "7 is constrained to inhibit invalid 
variants like 124< which cause confusion with: {c 
I'OSl' + # NOUN \[tIi::I'.-tRTICI.I:" + /1on = "book "\]. 
We default to enabling all possible orthographies 
for each ennT and disable only those that are 
required. This saves US from having to update the 
lexicon whenever we encounter a novel 
orthographic variant since the lattice anticipates all 
possible variants. 
6 Unknown Words 
Unknown words pose a significant recall problem 
in s that don't place spaces between 
words. The inability to identify a word in the input 
stream of characters can cause neighboring words 
to be misidentified. 
We have divided this problem space into six 
categories: variants of lexical entries (e.g., 
okurigana variations, vowel extensions, et al.); 
non-lexiealized proper nouns; derived forms; 
foreign Ioanwords; mimetics; and typographical 
errors. This allows us to devise focused heuristics 
to attack each class of unfound words. 
The first category, variants of lexical entries, has 
been addressed through the script normalizations 
discussed earlier. 
Non-lexicalized proper nouns and derived words, 
which account for the vast majority of unfound 
words, are handled in the derivational assembly 
component. This is where compounds like -: ~ >i 
x ~':, ffuransugo = "French ()"\] are 
assembled from their base components ;1 5~ J x 
\[furansu : "France "\] and at~ \[go = " "J. 
Unknown foreign Ioanwords are identified by a 
simple maximal-katakana heuristic that returns the 
longest run of katakana characters. Despite its 
simplicity, this algorithm appears to work quite 
reliably when used in conjunction with the other 
mechanisms in our system. 
Mimetic words in Japanese tend to follow simple 
ABAB or ABCABC patterns in hiragana or 
katakana, so we look for these patterns and 
propose them as adverb records. 
The last category, typographical errors, remains 
mostly the subject for future work. Currently, we 
only address basic : (kanji) ~-~ -: (katakana) and 
i-, (hiragana) +~ :'- (katakana) substitutions. 
50% 
40% 
30% 
20% 
"10% 
0% 
15 25 35 45 55 65 75 85 95 105 115 
---~Japanese =-~t-=Chinese \] 
• ..... 72 .27_~_z z zs? 27 ~ 7 ~Lz77 ~z ~25z ~ 2 7~ ........ 
Figure 2: Worst-case segmenter precision (y-axis) versus 
sentence length (x-axis - in characters) 
7 Eva|uation 
Our goal is to improve the parser coverage by 
improving the recall in the segmenter. Evaluation 
of this component is appropriately conducted in the 
context of its impact on the entire system, 
Z 1 Parser Evaluation 
Running on top of our segmenter, our current 
parsing system reports ~71% coverage + (i.e, input 
strings for which a complete and acceptable 
sentential parse is obtained), and -,97% accuracy 
for POS labeled breaking accuracy° A full 
description of these results is given in \[Suzuki00\]. 
Z 2 Segmenter Evaluatkm 
Three criteria are relevant to segmenter per- 
formance: recall precision and speed. 
Z Z 1 Recall 
Analysis of a randonlly chosen set of tagged 
sentences gives a recall of 99.91%. This result is 
not surprising since maxindzing recall was a 
prinlary focus of our efforts. 
The breakdown of the recall errors is as follows: 
missing proper nouns = 47%, missing nouns = 
15%.. missing verbs/adjs = 15%, orthographic 
idiosyncrasies = 15%, archaic inflections = 8%. 
It is worth noting that for derived forms (those that 
Tested on a 15,000 sentence blind, balanced corpus. 
See \[SuzuldO0\] for details. 
394 
3000 \[i 
2000 I 
1 
<':> ,# ,~, ¢> e <# ~,~, e e @,,,+>,e,e 
Figure 3: Characters/second (y~axis) vs. sentence 
length (x-axis) for se~<ginenter alone (upper curve) 
and our NL system as a whole (lower curve) 
are tiandled in the derivational assembly corn-. 
ponent), tim segmenter is considered correct as 
long as it produces the necessary base records 
needed to build the derived fom-t. 
ZZ2 Precision 
Since we focused our effbrts on maximizing recall,, 
a valid concern is the impact of the extra records 
on the parser, that is, the effect of lower segmenter 
precision oll the system as a whole. 
Figure 2 shows the baselirie segrnenter precision 
plotted against sentence length using the 3888 
tagged sentences ~: For compaiison~ data for 
Chinese ~ is included. These are baseline vahles in 
the sense they represent the riumber of records 
looked up in the lexicon without application of ariy 
heuristics to suppress invalid records. Thus, these 
mnnbers represent worst--case segmenter precision. 
The baseline precisior, for the Japariese segmenter 
averages 24.8%, whicl-i means that a parser would 
need to discard 3 records for each record it used in 
the final parse. TMs value stays fairly constant as 
the sentence length increases. The baseline 
precision for Chir, ese averages 37.1%. The 
disparity between the Japanese and Chinese worst- 
case scenario is believed to reflect the greater 
ambiguity inherent in the Japanese v<'riting system, 
owing to orthographic w~riation and the use of a 
syllabic script. 
++ The " <,<," o .~ t<%~,% was obtained by usin,, the results of the 
parser on untagged sentences. 
39112 sentences tagged in a sirnilar fashion using our 
Chinese NI,P system. 
100% 
70% "-:-: 5 ~ ::.::~::,~::-5.. ,'i ,': ~ -"r.'-~,'~7,:'s~'-,.: ~: .~ ~ ::,~ x;K< ~ 
50% 
40% ~ ~ - . ---..-~ 
30% ::::::::::::::::::::::::::::::::::::::::::::::::::::::: 
20% :::i:!)?i:~:~)}ii!:i\]::{i)~:,x::i!illii.:i!:'-.~!!~\]:!21{7-i\[.g{:!:'7:7:~::?. ....................... ,~< ............ 
10% !::::::ii'::::ii!i'ii{f!}{ii'.".'::iii::::ii 
0% ~'~S 2 
15 25 35 45 55 65 75 85 95 105 115 125 135 
\[BSegmenter \[\]Lexical EIDeriv BOther El Parser 
Figure 4: Percentage of time spent in each component (y- 
axis) vs. sentence length x-axis) 
Using conservative pruning heuristics, we are able 
to bring the precision tip to 34.7% without 
affecting parser recall. Primarily, these heuristics 
work by suppressing the hiragana form of shork 
ambiguous words (like ~ \[ki="tree, air, .slJirit, 
season, record, yellow,... '7, which is normally 
written using kanji to identify the intended sense). 
Z2..3 Speed 
Another concern with lower precision values has to 
do with performance measured in terms of speed. 
Figure 3 summarizes characters-per.-second per- 
formance of the segmentation component and our 
NL system as a whole (irmluding the segmentation 
component). As expected, the system takes more 
time for longer senterlces. Crucially, however, the 
system slowdowri is shown to be roughly linear, 
Figure 4 shows how nluch time is spent in each 
component during sentence analysis. As the sen- 
tence length increases, lexical lookup+ derivational 
morphology and '+other" stay approximately con- 
starit while the percentage of time spent in the 
parsing component increases. 
Table 5 compares parse time performance for 
tagged and untagged sentences. This table 
qnantifies the potential speed improvement that the 
parser could realize if the segmenter precision was 
improved. Cohunn A provides baseline lexical 
lookup and parsing times based on untagged input. 
Note that segmenter time is not given this table 
because it would not be comparable to tile hypothetical 
segmenters devised for columns P, and C. 
395 
A 
Lexical processing 7.66 s 
Parsing 3.480 s 
Other 4. 95 s 
Total 25.336 s 
Overall 
Percent Lexical 
Improvement I'arsing 
Other 
B c 
2.5 0 s 2.324 s 
8.865 s 7. 79 s 
3.620 s 3.5 9 s 
4.995 s 3.022 s 
40.82% 48.60% 
67.24% 69.66% 
34.24% 46.74% 
3.7 % 6. % 
Table 5: Summary of performance (speed) 
experiment where untagged input (A) is compared 
with space-broken input (B) and space-broken input 
with POS tags (C). 
Columns B and C give timings based on a 
(hypothetical) segmenter that correctly identifies 
all word botmdaries (B) and one that identifies all 
word boundaries and POS (C) 1'I". C represents the 
best-case parser performance since it assumes 
perfect precision and recall in the segmenter. The 
bottom portion of Table ,5 restates these 
improvements as percentages. 
This table suggests that adding conservative 
pruning to enhance segmenter precision may 
improve overall system performance. It also 
provides a metric for evaluating the impact of 
heuristic rule candidates. The parse-time 
improvemeuts from a rule candidate can be 
weighed against the cost of implementing this 
additional code to determine the overall benefit to 
the entire system. 
8 Future 
Planued near-term enhancements include adding 
context-sensitive heuristic rules to the segmenter as 
appropriate. In addition to the speed gains 
quantified in Table 5, these heuristics can also be 
expected to improve parser coverage by reducing 
resource requiremeuts. 
Other areas for improvement are unfotmd word 
models, particularly typographical error detection, 
and addressing the issue of probabilities as they 
apply to orthographic variants. Additionally, we 
are experimenting with various lexicon formats to 
more efficiently support Japanese. 
tt For the hypothetical segmenters, our segmenter was 
modified to return only the records consistent with a 
tagged input set. 
9 Conclusion 
The complexities involved in segmenting Japanese 
text make it beneficial to treat this task 
independently from parsing. These separate tasks 
are each simplified, thcilitating the processing of a 
wider range of phenomenon specific to their 
respective domains. The gains in robustness 
greatly outweigh the impact on parser performance 
caused by the additional records. Our parsing 
results demonstrate that this compartmentalized 
approach works well, with overall parse times 
increasing linearly with sentence length. 

References 

\[Fuchi98\] Fuchi,T., Takagi,S., "Japanese 
Morphological Analyzer using Word Co-occurrence", 
ACL/COLING 98, pp409-4 3, 998. 

\[Hisamitsu90\] Hisamitsu,T., Nitta, Y., 
Morphological Analyis by Minimum Connective-Cost 
Method", SIGNLC 90-8, IEICE pp 7-24, 990 (in 
Japanese). 

\[Jensen93\] Jensen,K., Heidorn,G., Richardson,S, 
(eds.) "Natural Language Processing: The PLNLP 
Approach", Kluwer, Boston, 993. 

\[Kitani93\] Kitani,T., Mitamura,T., "A Japanese 
Preprocessor for Syntactic and Semantic Parsing", 9 th 
Conference on AI in Applications, pp86-92, 993. 

\[Kurohashi94\] Kurohashi,S., Nakamura,Y., 
Matsumoto,Y., Nagao,M., "hnprovements of Japanese 
Morphological Analyzer JUMAN", SNLR, pp22-28, 
994. 

\[Kurohashi98\] Kurohashi,S., Nagao,M., "Building a 
Japanese Parsed Corpus while hnproving the Parsing 
System", First LREC Proceedings, pp7 9-724, 998. 

\[Nagata94\] Nagata,M., "A Stochastic Japanese 
Morphological Analyzer Using a Forward-DP 
Backward-A* N-Best Search Algorithm", COL1NG, 
pp20-207, 994. 

\[Richardson98\] Richardson,S.D., Dolan,W.B., 
Vanderwende,L., "MindNet: Acquiring and Structuring 
Semantic Information from Text", COLING/ACL 98, 
pp 098- 02, 998. 

\[Suzuki00\] Suzuki,H., Brockett,C., Kacmarcik,G., 
"Using a broad-coverage parser for word-breaking in 
Japanese", COLING 2000. 

\[Wu98\] Wu,A., Zixin,J., "Word Segmentation in 
Sentence Analysis", Microsoft Technical Report MSR- 
TR-99- 0, 999. 
