Automatic Semantic Sequence Extraction 
from Unrestricted Non-Tagged Texts 
Shiho Nobesawa and Hiroaki Saito and Masakazu Nakanishi
Dept. of Computer Science 
Keio University 
3-14-1 Hiyoshi Kohoku, Yokohama 223-8522, Japan 
{shiho, hxs, czl}@nak.ies.keio.ac.jp 
Abstract 
Morphological processing, syntactic parsing and other useful
tools have been proposed in the field of natural language
processing (NLP). Many of those NLP tools take dictionary-based
approaches. Thus these tools are often not very efficient with
texts written in casual wordings or texts which contain many
domain-specific terms, because of the lack of vocabulary.
In this paper we propose a simple method to obtain
domain-specific sequences from unrestricted texts using
statistical information only. This method is
language-independent.
We conducted experiments on sequence extraction from email
texts in Japanese, and succeeded in extracting significant
semantic sequences in the test corpus. We also ran ChaSen, a
Japanese dictionary-based morphological parser, on the test
corpus, and examined how well our system extracts semantic
sequences which ChaSen did not recognize. Our system detected
69.06% of the unknown words correctly.
1 Introduction 
Recognition of contained words is an important preprocessing
step for syntactic parsing. Word recognition is mostly done
based on dictionary lookup, and unknown words often cause parse
errors. Thus most research has been done on fixed corpora with
special dictionaries for the domain.
Part-of-speech (POS) tags are often used for term recognition.
This kind of preprocessing is often time-consuming and causes
ambiguity. When it comes to a corpus with a high rate of
unknown words, it is not easy to parse it fairly with
dictionaries and rules.
Obtaining the contained terms and phrases correctly can be an
efficient preprocessing step. In this paper we propose a method
to recognize domain-specific sequences with simple and
non-costly processing, which enables the use of unrestricted
corpora for NLP tools.
We concentrate on building a tool for extracting meaningful
sequences automatically with less preparation. Our system only
needs a fair-sized non-tagged training corpus of the target
language. No restriction is required for the training corpus.
We do not need any preprocessing for the training corpus.
We conducted experiments on email messages in Japanese, and our
system could recognize 69.06% of the undefined sequences of the
test corpus.
2 Japanese Characters and Terms 
Taking a word as a basic semantic unit simplifies the confusing
tasks of processing real languages. However, single words are
often not a good unit regarding the meaning of the context,
because of the polysemy of words (Fung, 1998). Instead, a
phrase or a term can be taken as the smallest semantic unit.
Most phrase/term extraction systems are based on recognizing
noun phrases, or domain-specific terms, from large corpora.
Argamon et al. (1998) proposed a memory-based approach for noun
phrases, which learns patterns with several sub-patterns.
Ananiadou (1994) proposed a methodology based on term
recognition using morphological rules.
2.1 Term Extraction in Japanese 
Japanese has no separator between words. Much research has been
done on noun phrase extraction in Japanese as well, in both
stochastic and grammatical ways. Among stochastic approaches
the n-gram is one of the most fascinating models. Noun phrase
extraction (Nagao and Mori, 1994), word segmentation (Oda and
Kita, 1999) and diction extraction are the major issues. There
is also much research on segmentation according to entropy.
Since Japanese has a great number of characters, the use of the
information of letters is also a very interesting issue.
2.2 Characters in Japanese 
Unlike English, Japanese has a great number of characters for
daily use. Japanese is special not only for its huge set of
characters but also for its use of three character types.
Hiragana is a set of 71 phonetic characters, which are mostly
used for function words, inflections and adverbs. Katakana is
also a set of phonetic characters, each related to a hiragana
character. Its use is mainly restricted to the representation
of foreign words. It is also used to represent pronunciations.
Kanji is a set of characters of Chinese origin. There are
thousands of kanji characters, and each kanji holds its own
meaning. They are used to represent content words. We also use
alphabetical characters and Arabic numerals.
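The character types described above can be told apart mechanically. The following is a minimal sketch, not part of the system described in this paper: the function names `letter_type` and `split_by_type` are hypothetical, and the Unicode block ranges used for hiragana, katakana and kanji are an assumption about how the text is encoded.

```python
import unicodedata
from itertools import groupby


def letter_type(ch):
    """Classify one character into a coarse letter type.

    The Unicode ranges below are an assumption of this sketch,
    not something specified in the paper.
    """
    # Fold full-width alphabet/digits and half-width katakana
    # to their ordinary forms when that yields a single character.
    norm = unicodedata.normalize("NFKC", ch)
    c = norm if len(norm) == 1 else ch
    code = ord(c)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the extension mark U+30FC
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if c.isascii() and c.isalpha():
        return "alphabet"
    if c.isdigit():
        return "numeral"
    return "symbol"


def split_by_type(text):
    """Split a string into maximal runs of the same letter type."""
    return ["".join(run) for _, run in groupby(text, key=letter_type)]
```

For instance, `split_by_type("メールした")` separates the katakana run from the hiragana run. Such a split is only a surface heuristic; it does not by itself find meaningful sequences.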
3 Overview 
This system takes Japanese sentences as input. It processes
sentences one by one, and outputs the segments of each sentence
which are recognized as meaningful sequences. The flow of this
system is shown in Figure 1. Our system takes one sentence as
input at a time, and calculates the scores between each pair of
neighboring letters according to the statistical data derived
from the training corpus. After scoring, the system decides
which sequences to extract.
Figure 1: The Flow of the System. Training: input (sentences)
-> cooccurrence information. Extraction: input (a sentence) ->
linking score calculation -> sequence extraction -> output
(sequences).
3.1 Automatic Sequence Extraction 
Nobesawa et al. (1996; 1999) proposed a system which estimates
the likelihood that a string of letters is a meaningful block
in a sentence. This method does not need any lexical knowledge,
and they showed that it was possible to segment sentences in a
meaningful way with only statistical information between
letters. The experiment was in Japanese, and they also showed
that the cooccurrence information between Japanese letters
carried enough information for estimating the connection of
letters.
We build on this result in this paper and conducted experiments
on extracting meaningful sequences from email message texts to
make up for the lack of vocabulary in dictionaries.
3.2 Scoring 
Our system introduces the linking score, which indicates the
likelihood that two letters are neighboring as (a part of) a
meaningful string (Nobesawa et al., 1996).
With neighboring bigrams only, it is impossible to distinguish
the event 'XY' in 'AXYB' from that in 'CXYD'. Thus we introduce
the d-bigram, which is bigram cooccurrence information that
takes distance into account (Tsutsumi et al., 1993).
Expression (1) calculates the score between two neighboring
letters:
$$UK(i) = \sum_{d=1}^{d_{max}} \sum_{j=i-(d-1)}^{i} MI_d(w_j, w_{j+d}; d) \times g(d) \qquad (1)$$
where $w_i$ is an event, $d$ is the distance between two
events, $d_{max}$ is the maximum distance used in the
processing (we set $d_{max} = 5$), and $g(d)$ is the weight
function on distance (for this system $g(d) = d^{-2}$ (Sano et
al., 1996), to decrease the influence of the d-bigrams as the
distance gets longer (Church and Hanks, 1989)). When
calculating the linking score between the letters $w_i$ and
$w_{i+1}$, the d-bigram information of the letter pairs around
the target two (such as $(w_{i-1}, w_{i+2}; 3)$) is added.
Expression (2) calculates the mutual information between two
events with d-bigram data:
$$MI_d(x, y; d) = \log_2 \frac{P(x, y; d)}{P(x)\,P(y)} \qquad (2)$$
where $x$ and $y$ are events, $d$ is the distance between the
two events, and $P(x)$ is the probability.
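As a rough illustration of Expressions (1) and (2), the sketch below counts d-bigrams in a raw training corpus and computes the linking score for one position. It reads Expression (1) as summing MI_d(w_j, w_{j+d}; d) x g(d) over d = 1..d_max and j = i-(d-1)..i, with g(d) = d^-2. Several details are assumptions of this sketch and are not stated in the paper: the class name `DBigramModel`, estimating all probabilities as relative frequencies over the total letter count, the base-2 logarithm, and letting unseen pairs contribute 0.

```python
import math
from collections import Counter


class DBigramModel:
    """Letter and d-bigram counts from untagged sentences (a sketch)."""

    def __init__(self, sentences, d_max=5):
        self.d_max = d_max
        self.unigrams = Counter()
        self.pairs = Counter()  # key: (x, y, d)
        for s in sentences:
            self.unigrams.update(s)
            for j in range(len(s)):
                for d in range(1, d_max + 1):
                    if j + d < len(s):
                        self.pairs[(s[j], s[j + d], d)] += 1
        self.total = sum(self.unigrams.values())

    def mi(self, x, y, d):
        """Mutual information of x and y at distance d, Expression (2)."""
        pair = self.pairs.get((x, y, d), 0)
        if pair == 0:
            return 0.0  # assumption: unseen pairs contribute nothing
        p_xy = pair / self.total
        p_x = self.unigrams[x] / self.total
        p_y = self.unigrams[y] / self.total
        return math.log2(p_xy / (p_x * p_y))

    def linking_score(self, s, i):
        """Score between s[i] and s[i + 1] (0-based), Expression (1)."""
        score = 0.0
        for d in range(1, self.d_max + 1):
            weight = d ** -2  # g(d) = d^-2
            # every pair at distance d that spans the gap after s[i]
            for j in range(i - (d - 1), i + 1):
                if j >= 0 and j + d < len(s):
                    score += self.mi(s[j], s[j + d], d) * weight
        return score
```

A call such as `DBigramModel(training_sentences).linking_score(sentence, i)` then gives one point of the score graph. Pairs that would fall outside the sentence are simply skipped here, which is also an assumption.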
3.3 Sequence Extraction 
Using the linking score calculated according to the statistical
information, our system searches for the sequences to extract
(thus we call our system the LSE (linky sequence extraction)
system).
Figure 2 shows an example graph of the linking scores for a
sentence. Each alphabet letter on the x-axis stands for a
letter in a sentence.
Figure 2: The Score Graph
The linking scores between neighboring letters are plotted on
the y-axis of the graph. Since the linking score gets higher
when the pair has a stronger connection, the mountain-shaped
lines may be regarded as unsegmentable blocks of letters. The
linking scores of the pairs in longer words/phrases can be
higher under the influence of the statistical information of
other letter pairs around them. On the other hand, the linking
score between two letters which are accidentally neighboring
gets lower, and it makes a valley-shaped point on the score
graph. Our system extracts the mountain-shaped parts of the
sentence as the 'linky sequences', which are considered to be
meaningful according to the statistical information. In the
example of Figure 2, the strings AB, CDEF and HIJK might be
extracted.
The height of the mountains is not fixed; it depends on the
likelihood that the letters form a string. Thus we need a
threshold to decide which strings to extract according to the
required size and strength of connection. With a higher
threshold the strings get shorter, since a higher threshold
means that neighboring letters are connected only when they
have a stronger connection between them.
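The thresholding step can be sketched as follows. This is one possible reading, not the system's actual procedure: the function name `extract_sequences` is hypothetical, a cut is made wherever the linking score falls below the threshold, and single letters left over between two cuts are dropped. The scores in the usage note are invented for illustration and are not taken from Figure 2.

```python
def extract_sequences(sentence, scores, threshold):
    """Cut the sentence wherever the linking score is below the threshold.

    scores[i] is the linking score between sentence[i] and sentence[i + 1].
    Dropping single-letter leftovers is an assumption of this sketch.
    """
    if not sentence:
        return []
    if len(scores) != len(sentence) - 1:
        raise ValueError("need one score per pair of neighboring letters")
    blocks = []
    start = 0
    for i, score in enumerate(scores):
        if score < threshold:  # a valley: the two letters are not linked
            blocks.append(sentence[start:i + 1])
            start = i + 1
    blocks.append(sentence[start:])
    return [block for block in blocks if len(block) > 1]
```

With the made-up scores `[3, -2, 4, 5, 4, -1, -3, 2, 6, 3]` for the sentence `"ABCDEFGHIJK"`, a threshold of 0 yields `AB`, `CDEF` and `HIJK`, while raising the threshold to 4 leaves only `CDEF` and `IJ`, which shows how a higher threshold shortens the extracted strings.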
3.4 How the System Uses the Statistical Information
Figure 3 shows an example graph for the sentence "お元気ですか?
[o-gen-ki-de-su-ka-?]" (: How are you?) (Sano, 1997). Each
graph line indicates the linking scores of the sentence after
learning some thousands of sentences of the target domain (for
this graph we used a postcard corpus as the target domain, and
a newspaper corpus as the base domain). When the system has no
information on the postcard domain, it can indicate only that
the pair of letters "元気 (gen-ki)" is connectable (there is a
mountain-shaped line for this pair). As the system obtains
information from the postcard corpus, the linking score of
every pair in this sentence gets bigger, making the mountain
higher. The shape of the mountain also changes, from a steep
mountain with deep valleys to one flat mountain which contains
the whole sentence.
Figure 3: Score Graph for "@@お元気ですか?
(@@-o-gen-ki-de-su-ka-?: How are you?)"
4 Experiments 
We conducted experiments on extracting semantic sequences based
only on letter cooccurrence information.
We also ran a dictionary-based Japanese morphological parser,
ChaSen ver. 1.51 (1997), on the test corpus to check for
sequences which a dictionary-based parser cannot recognize.
4.1 Corpus 
We chose email messages as the corpora for the experiments on
our system. Email messages are mostly written in a colloquial
style, especially when they are written by younger people to
send to their friends. In Japanese, colloquial style has casual
wording which differs from literary style. Casual wording
contains emphasis and terms not in dictionaries, such as slang.
In English an emphasized word may be written in capital
letters, such as in "it SURE is not true", which is easily
connected to the basic word "sure". We make the same kind of
letter type changes in Japanese for emphasis; however, since
the relationship between letter types is not the same as in
English, it is not easy to connect the emphasized terms with
the basic terms.
4.1.1 Training Corpus 
The training corpus we used to extract statistical information
is a set of email messages sent between young female friends
from 1998 to 1999. This corpus does not cover the one used as
the test corpus. All the messages were sent to one receiver,
and the number of senders is 17. The email corpus contains 351
email messages, which have 7,865 sentences (176,380 letters,
i.e. 22.4 letters per sentence on average).
We did not include quotations of other emails in the training
corpus, to avoid double-counting of the same sentences, though
email messages often contain quotations.
4.1.2 Test Corpus 
The test corpus is a set of email messages sent between young
female friends during 1999. This corpus is not a part of the
training corpus. All the messages were sent to one receiver,
and the number of senders is 3. This corpus contains 1,118
sentences (24,160 letters, i.e. 21.6 letters per sentence on
average).
4.2 Preliminary Results 
Figure 4 shows the distribution of the linking 
scores. The average of the scores is 0.34. The 
pairs of letters with higher linking scores are 
treated as highly 'linkable' pairs, that is, pairs 
with strong connection according to the statistical
information of the domain (actually of the
training corpus).
Figure 4: Score Distribution
Pairs of letters with high scores are mainly found
in high-scored sequences (Table 1).
Table 1 shows a part of the sequences ex- 
tracted with our system using letter cooccur- 
rence information. The threshold of extraction 
for Table 1 is 5.0. 
Table 1: Sequences Extracted Based on Letter Cooccurrence
(with scores over 5.0)

sequence            meaning         frequency
... (a)             ......          72
[illegible]         so              52
[illegible]         but             48
[illegible]         therefore       43
[illegible]         mail            39
(笑)                (laugh)         36
[illegible] (b)     I               29
[illegible]         it              26
.... (a)            ......          25
[illegible]         myself          20
[illegible] (a)     net/Internet    20
!! (a)              !!              16
[illegible]         link            15
[illegible]         friend          13

(a) casual wording
(b) representation change (written in katakana)
These frequently extracted sequences are the ones often
used in the target domain.
4.3 Undefined Words with ChaSen 
Since ChaSen is a dictionary-based system, it outputs unknown
strings of letters as they are, with the tag 'undefined word'.
Table 2 shows the number of sequences which ChaSen output as
'undefined words'. The row 'undefined words' indicates the
sequences which were labeled as 'undefined word' by ChaSen, and
the row 'parsing errors' stands for the sequences which ChaSen
did not label as undefined words but did not segment
correctly¹. The extraction threshold is 0.5.
ChaSen had 627 undefined words as its output. Since the test
corpus contains 1,118 sentences, this amounts to 0.56 undefined
words per sentence on average (627/1,118 = 56.08%). As it is
impossible to divide an undefined sequence into two undefined
words, when two or more undefined sequences are neighboring
they are often connected into one undefined word². Thus the
real number of undefined sequences can be more than counted.
Table 2 shows that our system, based on statistical
information, can help to recover 69.06% of the undefined
sequences detected by ChaSen.
¹ Since our system does not assign POS tags, we do not count
tagging errors made by ChaSen (i.e., the 'parsing errors' do
not include tagging errors).
² ChaSen can divide two neighboring undefined sequences when
the letter types of the sequences differ.
Table 2: Undefined Words with ChaSen

             undefined words   w/ LSE system
frequency    #total            suc. (a)   part. (b)   failed
over 10      281               230        7           44
3 - 9        143               100        13          30
2            56                43         4           9
1            147               60         44          43
total        627               433        68          126
                               69.06%     10.85%      20.10%

(a) suc.: succeeded to extract
(b) part.: partially extracted
Table 2 also shows that this system has better precision with
the sequences with larger frequency. For the sequences with
frequency over 10 (in the test corpus), 81.85% of the sequences
were extracted correctly. Ignoring sequences which appeared in
the test corpus only once, the rate of correct extraction rose
to 77.71%.
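The three rates quoted from Table 2 can be reproduced from its counts. The short check below copies the rows of Table 2; the names `TABLE_2` and `percent` are only illustrative.

```python
# (total, succeeded, partially extracted, failed) per frequency band,
# copied from Table 2
TABLE_2 = {
    "over 10": (281, 230, 7, 44),
    "3-9": (143, 100, 13, 30),
    "2": (56, 43, 4, 9),
    "1": (147, 60, 44, 43),
}


def percent(part, whole):
    """Percentage rounded to two decimals, as printed in the paper."""
    return round(100 * part / whole, 2)


total = sum(row[0] for row in TABLE_2.values())
succeeded = sum(row[1] for row in TABLE_2.values())

overall = percent(succeeded, total)                 # all undefined words
frequent = percent(TABLE_2["over 10"][1], TABLE_2["over 10"][0])
once_total, once_succeeded = TABLE_2["1"][0], TABLE_2["1"][1]
without_singletons = percent(succeeded - once_succeeded, total - once_total)

print(overall, frequent, without_singletons)  # 69.06 81.85 77.71
```

The last figure is (433 - 60) / (627 - 147) = 373 / 480, i.e. the success rate after removing the sequences that occur only once.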
Table 3 shows how our system worked with the sequences which
were output as undefined words by the ChaSen parsing system.
The threshold for extraction is 0.5.

Table 3: Categories of Undefined Words

                   undefined words   w/ LSE system
category           #total            suc. (a)   part. (b)   failed
proper nouns       60                39         17          4
new words          70                48         12          10
letter additions   119               89         4           26
changes (c)        276               194        28          54
term. marks (d)    58                43         0           15
smileys            15                9          6           0
etc.               29                12         1           16
total              627               433        68          126

(a) suc.: succeeded to extract
(b) part.: partially extracted
(c) changes: representation changes
(d) term. marks: termination marks

Table 3 shows that the biggest reason for the undefined words
is the problem of representation. As described in Section
4.3.2, we change the way of description when we want to
emphasize a sequence. The pronunciation extension by adding
extra vowels or extension marks is done for the same reason.
Adding these two categories, 356 sequences out of 627 undefined
words (56.78%) are caused by this emphasizing.
Termination marks as undefined words contain sequences such as
"......" and "!!". Termination marks not in the dictionary
often indicate an impression, such as surprise, happiness,
consideration and so on.
New words, including proper nouns, are the actual 'undefined
words'. ChaSen had 130 of them in its output, that is, 20.73%
of the undefined words.
4.3.1 Letter Types in Undefined Words 
Table 4 shows the types of letters included in 
the 'undefined words' with ChaSen. The figures
indicate the numbers of letters.
We had 627 undefined words in the test corpus with ChaSen
(Table 2), which contain 1,493 letters in total. The average
length of the undefined words is thus 2.38 letters. 70.40% of
the letters in undefined words were katakana letters (Table 4),
which are phonetic and often used for describing new words.
Katakana letters are also often used for emphasizing sequences.

Table 4: Letter Types of Undefined Words

                     undefined words   w/ LSE system
type       variety   #total            suc. (a)   part. (b)   failed
kanji      1         19                19         0           0
hiragana   12        200               155        7           38
katakana   73        1051              712        188         151
numeral    1         1                 0          1           0
alphabet   23        122               43         72          7
symbol     22        100               39         37          24
total                1493              968        305         220

(a) suc.: succeeded to extract
(b) part.: partially extracted
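The two figures quoted in this paragraph follow directly from the totals in Tables 2 and 4; as a worked check (the `\text` macro assumes the amsmath package):

```latex
\frac{1493 \text{ letters}}{627 \text{ undefined words}} \approx 2.38
\qquad
\frac{1051 \text{ katakana letters}}{1493 \text{ letters}} \approx 0.7040 = 70.40\%
```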
On the other hand, there was only one letter each for kanji and
numerals. That is because each kanji letter and numeral has its
own meaning, and those letters are mostly found in the
dictionary, even though the tags are not semantically correct.
Kanji letters are also sometimes tagged with incorrect
segmentation³. Thus undefined words in kanji letters are mostly
not counted as 'undefined words'; instead they cause
segmentation failure (Section 4.4).
4.3.2 Representation Changes 
Since Japanese has two phonetic character sets, we have several
ways to represent one term: in kanji (if there is any for the
term), in hiragana, in katakana, or in several character types
mixed. It is also possible to use Romanization to represent a
term.
³ "この [ko-no] (: this) / 世界 [se-kai] (: the world)" is
incorrectly segmented as "この世 [ko-no-yo] (: the present
life) / 界 [kai] (: world)"; "kono yo" is a fixed phrase, and
"kai" is a suffix for a noun which adds the meaning "the world
of", e.g. "[illegible] (: the literary world)".
Table 5 shows the numbers of ChaSen errors according to the
representation change.

Table 5: Undefined Words because of Representation Changes

                    undefined words   w/ LSE system
subcategory         #total            suc. (a)   part. (b)   failed
term changes        40                33         3           4
katakana            137               102        12          23
change & katakana   55                34         10          11
etc.                44                25         3           16
total               276               194        28          54

(a) suc.: succeeded to extract
(b) part.: partially extracted

Most dictionaries have only one basic representation for each
term as its entry⁴. However, in casual writing we sometimes do
not use the basic representation, either to emphasize the term
or just to simplify the writing.
4.3.3 Pronunciation Extension 
In the Japanese language we have many kinds of function words
to put at the end of sentences (or sometimes of 'bunsetsu'
blocks). The sentence-final function words change the sound of
the sentences to represent friendliness, ordering, and other
emotions. These function words are basically used not in
written texts but in colloquial sentences.
In the Japanese language we put extra letters to represent the
lengthening of a phone. Since almost all Japanese phones have
vowels, to lengthen a phone for emphasis we put extra vowels or
extension marks after the letter.

Table 6: Extra Letters Output as Undefined Words

letter (romanized)   a    i    u    e    o    t    t    n    total
suc. (a)             39   2    5    32   7    3    1    0    89
part. (b)            0    0    0    0    0    4    0    0    4
failed               5    1    4    2    1    7    5    1    26
total                44   3    9    34   8    14   6    1    119

(a) suc.: succeeded to extract
(b) part.: partially extracted

Table 6 shows that 74.79% of the small letters which were
output as undefined words by ChaSen could be salvaged as parts
of semantic sequences with our system.
⁴ Dictionaries may have phonetic representations for the
entries, but not as headings.
The small letters in this table are extra letters to change the
pronunciation; i.e. they are mostly not included in the
dictionary. However, they are actually a part of the word,
since they cannot be separated from the preceding sequences.
4.4 Segmentation Failure with ChaSen 
Table 7 shows the result of the extraction of sequences for
which ChaSen made parsing errors. It indicates that our system
could recognize 70.88% of the sequences for which ChaSen made
parsing errors.
Table 7: Segmentation Failure with ChaSen

           undefined words   w/ LSE system
category   #total            suc. (a)   part. (b)   failed
A          42                41         1           0
B          60                35         10          15
C          92                81         5           6
D          11                2          5           4
E          8                 4          3           1
F          176               106        37          33
G          19                10         5           4
H          257               154        73          30
I          253               233        6           14
J          115               82         19          14
total      941               667        159         115
                             70.88%     16.90%      12.22%

A: sequences incl. alphabetical characters
B: sequences incl. numeral figures
C: proper nouns
D: new words excl. proper nouns
E: fixed locutions
F: sequences with representation changes
G: sequences in other character types
H: emphasized expressions
I: termination marks
J: parsing errors
(a) suc.: succeeded to extract
(b) part.: partially extracted
Category F is for the sequences which changed their
representations according to the terms' pronunciation changes
for casual use. For example, "やっぱ [ya-p-pa]" is a casual
form of "やはり [ya-ha-ri] (: as I thought)". In casual talk,
using the original term "yahari" sounds a little too polite.
Some common casual forms are in dictionaries, but not all.
For category B, our system failed to extract 25 sequences. All
the sequences in B have counting suffixes. 12 sequences out of
the 25 could not be connected with the counting suffixes, e.g.
"30日 [3-0-nichi] (: 30 days, or, the 30th day)" got
over-segmented between zero and the suffix. We have a big
variety of counting suffixes in Japanese, and since our system
relies only on letter cooccurrence information we could not
avoid the over-segmentation.
Category G indicates the sequences which are written in other
character types for emphasis. The major changes are: (1)
writing in hiragana characters instead of kanji characters, and
(2) writing in katakana characters to emphasize the term.
5 Conclusion 
Dictionary-based NLP tools often have worse precision with
texts written in casual wordings and texts which contain many
domain-specific terms. A term recognition system available for
any corpus as a preprocessing step enables the use of NLP tools
on many kinds of texts.
In this paper we proposed a simple method for term recognition
based on statistical information. We conducted experiments on
extracting semantically meaningful sequences according to the
statistical information drawn from the training corpus, and our
system recognized 69.06% of the sequences which were tagged as
undefined words by a conventional morphological parser.
Our system was efficient in recognizing different
representations of terms, proper nouns, and other casual
wording phrases. This helps to salvage semantically meaningful
sequences not in dictionaries, and can be an efficient
preprocessing step.
6 Future Work 
In this paper we proposed a simple term recognition method
based only on statistical information. There may be several
ways to combine the extracted sequences with the dictionaries.
We may need to attach POS tags to the sequences for use with
other NLP tools. We expect that we can use tagging tools for
this.
The system we proposed is language-independent. For example, we
can use this system on English to extract English sequences
which appear frequently in the training corpus, such as proper
nouns.

References 

Sophia Ananiadou. 1994. A Methodology for Automatic
Term Recognition. Coling-94, pages 1034-1038.

Shlomo Argamon, Ido Dagan, and Yuval Krymolowski.
1998. A Memory-Based Approach to Learning Shallow
Natural Language Patterns. Coling-ACL'98, pages 67-73,
August.

Kenneth W. Church and Patrick Hanks. 1989. Word
Association Norms, Mutual Information, and
Lexicography. The 27th Annual Conference of the
Association of Computational Linguistics.

Pascale Fung. 1998. Extracting Key Terms from Chinese
and Japanese texts. The International Journal on
Computer Processing of Oriental Language, Special
Issue on Information Retrieval on Oriental Languages.

Yuji Matsumoto, Akira Kitauchi, Tatsuo Yamashita,
Yoshitaka Hirano, Osamu Imaichi, and Tomoaki Imamura.
1997. Japanese Morphological Analysis System ChaSen
1.51 Manual. Technical report, Nara Institute of
Science and Technology.
http://cactus.aist-nara.ac.jp/lab/nlt/chasen.html.

Makoto Nagao and Shinsuke Mori. 1994. A New Method of
N-gram Statistics for Large Number of n and Automatic
Extraction of Words and Phrases from Large Text Data
of Japanese. Coling-94, pages 611-615, August.

Shiho Nobesawa, Junya Tsutsumi, Da Jiang Sun, Tomohisa
Sano, Kengo Sato, and Masakazu Nakanishi. 1996.
Segmenting Sentences into Linky Strings Using D-bigram
Statistics. Coling-96, pages 586-591, August.

Shiho Nobesawa, Hiroaki Saito, and Masakazu Nakanishi.
1999. String Extraction Based Only on Statistic
Linkability. ICCPOL'99, pages 23-28, March.

Hiroki Oda and Kenji Kita. 1999. A Character-Based
Japanese Word Segmenter Using a PPM*-Based Language
Model. ICCPOL'99, pages 527-532, March.

Tomohisa Sano, Junya Tsutsumi, Da Jiang Sun, Shiho
Nobesawa, Kengo Sato, Kumiko Omori, and Masakazu
Nakanishi. 1996. An Experiment on Good Usages of
D-bigram Statistics in Natural Language Evaluation.
2nd Annual Meeting of the ANLP (NLP96), pages 185-188.
Written in Japanese.

Tomohisa Sano. 1997. Natural Language Processing Using
Dynamic Statistical Information. Master's thesis, Keio
University. Written in Japanese.

Junya Tsutsumi, Tomoaki Nitta, Kotaro Ono, and Shiho
Nobesawa. 1993. A Multi-Lingual Translation System
Based on A Statistical Model. JSAI Technical report,
SIG-PPAI-9302-2, pages 7-12. Written in Japanese.
