Segmenting Sentences into Linky Strings 
Using D-bigram Statistics 
Shiho Nobesawa 
Junya Tsutsumi, Sun Da Jiang, Tomohisa Sano, Kengo Sato 
Masakazu Nakanishi 
Nakanishi Laboratory, Keio University 
3-14-1 Hiyoshi, Kohoku-ku
Yokohama 223 Japan 
shiho@nak.math.keio.ac.jp 
Abstract 
Segmentation plays an important role in natural language
processing (NLP), especially for languages whose sentences
are not easily separated into morphemes. In this study we
propose a method of segmenting a sentence. The system
described in this paper does not use any grammatical
information or knowledge in processing. Instead, it uses
statistical information drawn from a non-tagged corpus of
the target language. Most segmenting systems aim to pick
out conventional morphemes, which are defined for human
use. However, we still do not know whether those
conventional morphemes are good units for computational
processing.
In this paper we explain our system's 
algorithm and its experimental results 
on Japanese, though this system is not 
designed for a particular language. 
1 Characteristics of Japanese 
Text 
1.1 Letters in Japanese 
Japanese text is composed of four kinds of characters:
kanji, hiragana, katakana, and others such as alphabetic
characters and numeral characters. Hiragana is used for
Japanese words, inflections and function words, while
katakana is used for words from foreign languages and for
other special purposes.
Table 1 shows example proportions of those four kinds of
characters in texts (Teller and Batchelder, 1994). The bus
corpus consists of a set of newspaper articles on business
ventures from Yomiuri. The ed corpus contains a series of
editorial columns from Asahi Shinbun.
Table 1: Character Rates in Japanese Text
                   bus     ed
size (K chars)     42      275
% hiragana         30.2    58.0
% kanji            47.5    34.6
% katakana         19.3    4.8
% num/alph         2.9     2.6
1.2 Morphemes in Japanese 
Segmenting a Japanese text is a difficult task. The phrase
"勉強していました (was studying)" can be a single lexical unit
or can be separated into as many as six elements (Teller
and Batchelder, 1994):
'study' 'do' particle progressive polite past
Because of this flexibility, acquiring "morphemes" from
Japanese text is not a simple task.
2 Linky Strings 
This paper is on dividing non-separated language 
sentences into meaningful strings of letters with- 
out using any grammar or linguistic knowledge. 
Instead, this system uses the statistical informa- 
tion between letters to select the best ways to seg- 
ment sentences in non-separated languages. 
It is not very hard to divide a sentence with the help of
a suitable dictionary. The problem is that such a dictionary
is not easily obtainable. There never is a perfect
dictionary which holds all the words that exist in the
language. Moreover, building a dictionary is very hard work,
since there are no perfect automatic dictionary-making
systems.
However, machine-readable dictionaries are needed anyway.
For this reason, we propose a new method for picking out
meaningful strings. Our purpose is not to segment a sentence
into conventional morphemes. We introduce a concept for a
type of language unit for machine use. We named the unit a
'linky string'. A linky string is a series of letters
extracted from a corpus using statistical information only.
It is a series of letters which share a strong statistical
relationship.
3 LINKING SCORE 
3.1 Linking Score 
To pick out linky strings, we need to find highly
connectable letters in a sentence. We introduce the linking
score, which shows the linkability between two neighboring
letters in a sentence. This score is estimated using
d-bigram statistics.
3.2 D-bigram 
The idea of bigrams and trigrams is often used in studies
on NLP. An n-gram is the information on the association
between n certain events. In this study we use d-bigram
data (Tsutsumi et al., 1993), which is a kind of bigram
data with the concept of distance between events
(Figure 1). D-bigram is equal to bigram when d = 1; thus
d-bigram data includes the conventional bigram relation.
Figure 1: D-bigram
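The d-bigram data described above can be collected with a simple counting pass. The following is a minimal sketch, assuming that a sentence is a plain string of letters and that counts are kept per ordered triple (first letter, second letter, distance); the function name `count_dbigrams` and the toy input are hypothetical and not part of the original system.

```python
from collections import Counter


def count_dbigrams(sentences, dmax):
    """Count single letters and (a, b, d) triples, where b follows a at distance d."""
    unigrams = Counter()
    dbigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        for i, a in enumerate(sentence):
            for d in range(1, dmax + 1):
                if i + d < len(sentence):
                    dbigrams[(a, sentence[i + d], d)] += 1
    return unigrams, dbigrams


# Hypothetical toy corpus of two "sentences".
unigrams, dbigrams = count_dbigrams(["abcab", "abd"], dmax=2)
print(dbigrams[("a", "b", 1)])  # 3
print(dbigrams[("a", "c", 2)])  # 1
```

With `dmax=1` the triples reduce to ordinary bigram counts, which matches the remark that d-bigram data includes the conventional bigram relation.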
3.3 Calculation 
Mutual Information with Distance
Expression (1) calculates the mutual information between
two events (Nobesawa et al., 1994):

\[ MI(a_i, b_j, d) = \log \frac{P(a_i, b_j, d)}{P(a_i)\,P(b_j)} \tag{1} \]

a_i : a letter
P(a_i) : the probability that the letter a_i appears
P(a_i, b_j, d) : the probability that a_i and b_j appear
together at distance d in a sentence
The parameter d shows the distance between two events. In
Figure 2, the distance between "a" and "pen" is 1, and the
distance between "is" and "pen" is 2. Since the order of
events is meaningful, the distance between "pen" and "a" is
defined as -1 in this case.
[Example sentence: "This is a pen ." with distance labels d = 2 and d = 3]
Figure 2: D-bigram Example
The larger the value of MI, the stronger the association
between the two events.
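Expression (1) can be illustrated with a short sketch. The source does not say how the probabilities are estimated or which logarithm base is used, so the relative-frequency estimates, the base-2 logarithm, the function name `mutual_information` and the toy counts below are all assumptions made for illustration.

```python
import math


def mutual_information(unigrams, dbigrams, a, b, d):
    """MI of letters a and b at distance d, from relative frequencies (an assumption)."""
    total_letters = sum(unigrams.values())
    total_pairs = sum(n for (_, _, dist), n in dbigrams.items() if dist == d)
    joint = dbigrams.get((a, b, d), 0)
    if joint == 0:
        return None  # unseen pair: the log is undefined, so the caller must decide
    p_joint = joint / total_pairs
    p_a = unigrams[a] / total_letters
    p_b = unigrams[b] / total_letters
    return math.log2(p_joint / (p_a * p_b))


# Hypothetical toy counts for the two strings "abcab" and "abd".
unigrams = {"a": 3, "b": 3, "c": 1, "d": 1}
dbigrams = {("a", "b", 1): 3, ("b", "c", 1): 1, ("c", "a", 1): 1, ("b", "d", 1): 1}

print(mutual_information(unigrams, dbigrams, "a", "b", 1))  # log2(32/9), about 1.83
```

A positive value means that the pair occurs at that distance more often than independent letters would; how unseen pairs should be scored is left open here.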
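Linking Score
Expression (2) calculates the linking score between two
letters in a sentence (footnote 1):

\[ UK(i) = \sum_{d=1}^{d_{max}} \sum_{j=i-(d-1)}^{i} MI(w_j, w_{j+d}, d) \cdot g(d) \tag{2} \]

UK(i) : the linking score between the letters w_i and w_{i+1}
d_{max} : the maximum distance used
w_i : the i-th letter in the sentence w
g(d) : a weight for MI according to the distance between letters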
The information between two remote words has less meaning
in a sentence when it comes to semantic analysis (Church
and Hanks, 1989). Following this idea, we put g(d) in the
expression so that nearer pairs contribute more to the
score.
Figure 3: Calculation of Linking Score
A pair of far-away letters does not have a strong relation,
either syntactically or semantically. For this reason we
use dmax, and in this paper we set the dmax value
(footnote 2) to 5 and to 1. When dmax is 1, the MI used in
the calculation is bigram data only.
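The double sum of the linking score can be sketched as follows. This is only an illustration under stated assumptions: positions are 0-based, the MI values come from a hypothetical lookup table in which unseen pairs score 0.0, and g(d) = 1/d is a made-up weight, since the source does not give the actual weighting function.

```python
def linking_score(sentence, i, mi, dmax, g):
    """Linking score between sentence[i] and sentence[i + 1] (0-based positions)."""
    score = 0.0
    for d in range(1, dmax + 1):
        # Pairs (j, j + d) that straddle the gap between positions i and i + 1.
        for j in range(i - (d - 1), i + 1):
            if j < 0 or j + d >= len(sentence):
                continue  # the pair would fall outside the sentence
            score += mi(sentence[j], sentence[j + d], d) * g(d)
    return score


# Hypothetical MI table; unseen pairs default to 0.0 (an assumption).
mi_table = {("A", "B", 1): 2.0, ("B", "C", 1): 1.0, ("A", "C", 2): 0.5}


def mi(a, b, d):
    return mi_table.get((a, b, d), 0.0)


def g(d):
    return 1.0 / d  # hypothetical weight that favors nearer pairs


print(linking_score("ABC", 0, mi, dmax=2, g=g))  # 2.25
print(linking_score("ABC", 0, mi, dmax=1, g=g))  # 2.0 (bigram data only)
```

With `dmax=2` the pair A-C at distance 2 adds 0.5 * g(2) = 0.25 to the score of the gap between A and B; with `dmax=1` only the adjacent pair contributes.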
Footnote 1: We coined a Japanese word for the word "linky"
and use its pronunciation, "UK" [ju:kei], in the expression.
Footnote 2: We had experiments for finding a good value for dmax,
Figure 5: The Score Graph
4 THE SYSTEM LSS
4.1 Overview 
This system is called LSS, a "linky string segmen- 
tor". This system takes a corpus made of non- 
separated sentences as its input and segments it 
into linky strings using d-bigram statistics. 
Figure 4 shows the flow of LSS's processing. 
1. Input sentences to segment.
2. Calculate the linking score of each pair of neighboring letters.
3. Check the score graph to see where to segment.
4. Pick out each linky string found in the given corpus.
Figure 4: System Processing Flow
In this paper we used a fixed score for the starting score,
so that LSS can decide whether the first letter should be a
one-letter linky string.
4.2 The Score Graph 
What a Score Graph Is 
To segment a sentence into statistically meaningful
strings, we use the linking scores to locate boundaries
between linky strings. A score graph has the letters of a
sentence on the x-axis and linking scores on the y-axis
(Figure 5). We get one score graph for each sentence.
Figure 5 shows two sentences (one above and one below),
each of 14 letters (including an exclamation/question mark
as the sentence terminator).
When the linking score between a pair of neighboring
letters is high, we assume they are part of the same word.
When it is low, we assume that the letters, though
neighbors, are statistically independent of one another.
In a score graph, a series of scores in the shape of a
mountain (e.g. the A-B and C-F parts in Figure 5) becomes a
linky string, and a valley (e.g. between the letters B and
C in Figure 5) is a spot to segment.
Score-Graph Segmenting Algorithm 
The system LSS finds the valley points in a sentence and
segments the sentence there into strings. The rules for
finding the segmenting points in a sentence are as follows.
1. Do not segment in a mountain. 
2. Segment at the valley point. 
3. Cut before and after a one-lettered linky string. 
One-Lettered Linky String 
A one-lettered linky string needs to (a) be located at a
valley point, and (b) look flat (footnote 3) in the score
graph. In Figure 5, the one-lettered linky strings are
G, L, N (footnote 4), O, Y, Z and ?.
Mountain Threshold 
A linky string takes a mountain shape because of high
linking scores. Note that a linky string is not equal to a
morpheme in human-handmade grammars. When a certain pair of
morphemes occurs in a corpus very often, the system
recognizes the pair's high linking score and puts them
together into one linky string. For example,
"ブッシュ大統領 (President Bush)" is often treated as a linky
string, since "ブッシュ (Bush)" and "大統領 (president)"
appear next to each other very frequently.
The mountains of letters are not always simple hat shapes;
most of the time they have other smaller mountains in them.
This means that there can be shorter strings within one
linky string. In the linky string
"ブッシュ大統領 (President Bush)", there must be two smaller
mountains, just like H-I and J-K in the mountain H-K in
Figure 5. To control the size of linky strings we introduce
a mountain threshold, which is shown in the lower sentence
in Figure 5. When the score of a valley point is higher than
the mountain threshold, the system judges that the point is
not a segmenting spot. In this paper the mountain threshold
value is 5.0.
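The segmenting rules and the mountain threshold can be sketched as below. This is a simplified reading, not the authors' implementation: a valley is taken to be a score that is no higher than either neighboring score, a missing neighbor at the sentence edge is treated as infinitely high (the paper instead uses a fixed starting score), and the constant flatness threshold for one-lettered strings is not modelled. The name `segment` and the example scores are hypothetical.

```python
def segment(sentence, scores, mountain_threshold=5.0):
    """Split a sentence at valley points whose score does not exceed the threshold.

    scores[i] is the linking score between sentence[i] and sentence[i + 1].
    """
    if len(scores) != len(sentence) - 1:
        raise ValueError("need one score per pair of neighboring letters")
    cuts = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        is_valley = s <= left and s <= right
        # A valley above the mountain threshold stays inside its mountain.
        if is_valley and s <= mountain_threshold:
            cuts.append(i)
    pieces, start = [], 0
    for i in cuts:
        pieces.append(sentence[start:i + 1])
        start = i + 1
    pieces.append(sentence[start:])
    return pieces


scores = [9.0, 2.0, 8.0, 6.0, 8.5]  # hypothetical linking scores for "ABCDEF"
print(segment("ABCDEF", scores))                          # ['AB', 'CDEF']
print(segment("ABCDEF", scores, mountain_threshold=7.0))  # ['AB', 'CD', 'EF']
```

In the first call the valley between D and E (score 6.0) lies above the threshold of 5.0, so CDEF stays one string containing two smaller mountains. Because the comparison is non-strict, two neighboring valley points of equal score cut on both sides of a letter, which yields a one-lettered string.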
Footnote 3: We use a constant value as the threshold.
Footnote 4: N is a special one-lettered linky string,
placed at the beginning of a sentence.
[Figure 6 panels: "bigram only" and "d-bigram sum up"]
Figure 6: The Difference between D-bigram and Bigram
Table 3: Output
                                    d-bigram      bigram
                                    (dmax = 5)
# of input sentences                302           302
# of linky strings                  6,145         7,098
# of linky strings per sentence     20.35         23.50
# of over-segmented spots           454           689
over-segmented spots                7.39%         9.71%
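The derived rows of Table 3 follow from the raw counts: the per-sentence figure is the number of linky strings divided by the number of sentences, and the over-segmentation percentage is the number of over-segmented spots divided by the number of linky strings. The short check below reproduces them; the helper name `summarize` is hypothetical.

```python
def summarize(sentences, linky_strings, over_segmented):
    """Return (linky strings per sentence, over-segmented percentage), rounded to 2 places."""
    per_sentence = linky_strings / sentences
    over_rate = 100 * over_segmented / linky_strings
    return round(per_sentence, 2), round(over_rate, 2)


print(summarize(302, 6145, 454))  # d-bigram: (20.35, 7.39)
print(summarize(302, 7098, 689))  # bigram:   (23.5, 9.71)
```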
4.3 Corpus 
LSS accepts any non-separated sentences with little
preparation. All we need is a certain amount of corpus in
the target language for training. In this paper we show the
experimental results on Japanese. The corpus prepared for
this paper is taken from the Asahi Shinbun Newspaper.
5 RESULTS 
5.1 Experimental Results 
Experiment Condition 
LSS takes a set of non-separated sentences as its input and
segments them into linky strings. For the test corpus we
chose sentences at random from the training corpus.
Table 2: Training Corpus Condition
language:                               Japanese
form:                                   non-separated kanji-kana mixed sentences
corpus:                                 Asahi Shinbun Newspaper
# of sentences for training corpus:     7,502
# of sentences for test corpus:         302

To see the efficacy of d-bigram, we compare the experimental
results obtained with two kinds of data: d-bigram data and
bigram data.
Experimental Results
As shown in Table 3, with d-bigram information only 7.39%
of the segmenting spots are over-segmented. Table 3 also
shows that a sentence is separated into 20-25 linky strings
on average (footnote 5), and that in one sentence there are
only one or two spots on average which break a morpheme
into meaningless strings. With no linguistic knowledge,
this can be said to be quite a good result.
It is hard to check whether an extracted linky string is
the right one. However, it is not that difficult to find
over-segmented strings, since a linky string needs to keep
its meaning. We check for over-segmented linky strings
against a dictionary, Iwanami Kokugo Jiten.
Table 4 shows the numbers of over-segmented spots. Each
figure is the number of over-segmented spots, not the
number of over-segmented morphemes (footnote 6). In Table 4,
A and B are neighboring letters in a sentence which are
forced apart. The row "kanji hiragana" stands for
over-segmented spots between a kanji letter and a hiragana
letter.
Table 4: Over-Segmented Morphemes by Character Types and
Segmentation Methods
A           B           d-bigram      bigram
                        (dmax = 5)
kanji       kanji       59            65
kanji       hiragana    29            43
hiragana    kanji       18            22
hiragana    hiragana    333           507
katakana    katakana    15            52
total                   454           689
The number of over-segmented morphemes for each part of
speech is shown in Table 5. 'K' stands for kanji, 'h' for
hiragana and 'k' for katakana. There was no missegmentation
between katakana and other character types. There also was
not any missegmentation concerning alphabetic characters,
numeral characters or other symbols.
Footnote 5: The number of linky strings found in a sentence
ranges from 5 to 60 with d-bigram and from 6 to 66 with
bigram.
Footnote 6: Thus a morpheme gets counted twice when it is
divided into three strings.

Table 5: Over-Segmented Morphemes in Output with D-bigram
Counts are listed in the column order A-B = K-K, K-h, h-K,
h-h, k-k, followed by the row total after the bar; empty
cells are omitted, so rows with fewer than five counts do
not show which columns they belong to.
noun             49 19 6 49 6 | 129
proper noun      5 8 | 13
pronoun          16 | 16
verb             1 3 12 84 | 100
aux. verb        60 | 60
adjective        4 12 | 16
adj. verb        4 13 | 17
adverb           1 53 1 | 55
rentai-shi       11 | 11
conjunction      7 | 7
function word    15 | 15
suffix           1 4 | 5
compound word    1 15 | 16
total            59 29 18 333 15 | 454
5.2 A Linky String 
Characteristics of Linky Strings 
Linky strings in Japanese are not equal to conventional
morphemes in Japanese. As discussed in section 1.2, it is
not easy to decide on an absolutely correct segmenting spot
in a Japanese sentence. That is one of the reasons that we
decided to extract linky strings instead of conventional
morphemes. However, if those linky strings do not keep
their meanings, they are useless.
The result shows that the linking score works well enough
not to segment sentences too much (Table 3). That is, we
succeeded in extracting meaningful strings using only
statistical information. Figure 7 shows some examples of
extracted linky strings.
[Japanese example strings; English glosses and judgments:]
bank(s)            meaningful
move/shift to      meaningful
action of          meaningful
did                meaningful
(?)                over-segmented
(?)                over-segmented
Figure 7: Examples of Linky Strings (1)
Sometimes LSS extracts strings that look too long
(Figure 8). This is not a bad result, though. When a linky
string contains several morphemes, extracting it is
something like picking out an idiom. A linky string with
several morphemes may be a compound word, an idiom, or a
fixed locution.
[Japanese example strings; English glosses:]
help
London Summit
nuclear non-proliferation treaty
at the end of 17th century
Japan Railway Kyoto Station
Figure 8: Examples of Linky Strings (2)
The Concept of the Linky Strings 
Grammar-based NLP systems generally specify a target
language. On the other hand, statistically-based approaches
do not need rules or knowledge. This makes a
statistically-based approach suitable for multilingual
processing.
LSS is not only for Japanese. With a corpus of
non-separated sentences of any language, LSS can perform
the same kind of segmentation.
To deal with natural languages most systems use
conventional morphemes or words as their processing units.
That is, most systems need to recognize morphemes or words
in sentences, and they need a fairly good morphological
analysis before the main processing. We have been
processing natural languages in linguistic ways, though we
do not know whether that is the right way in computational
linguistics. A linky string is extracted with statistical
information only, using neither grammars nor linguistic
knowledge. The system does not need to behave like a native
speaker of the target language; all it has to do is check
statistical information, which is what computers are good
at. We expect that linky strings can be a key to solving
problems of NLP.
Compound Words 
The results show that the system has 7.39% incorrect
segmentation. This result is based on a Japanese
dictionary: when a morpheme listed in the dictionary gets
separated, we count it as over-segmented. However, a
dictionary often holds compound words. That is, some of the
segmented spots which we have counted as "over-segmented"
are not really over-segmented. From this point of view, the
percentage of over-segmentation is actually even lower.
Inflections 
Verbs, adjectives, adverbs and auxiliary verbs are
inflected in Japanese. In the experimental results, 89.7%
(with d-bigram data) of the over-segmented spots between
kanji and hiragana occur in inflective morphemes. We
decided the correct segmenting spots for inflective
morphemes according to a Japanese dictionary. According to
statistical information, however, the way of segmenting
inflective morphemes differs from the grammatical one. So
most of these over-segmented spots can be treated as
correct segmenting spots according to statistical
information.
5.3 D-bigram Statistics 
According to Table 3, it seems that with the bigram method
the output is apt to be more segmented than with the
d-bigram method.
This happens because bigram cannot pick out long strings.
Bigram does not hold information between remote letters
(actually, letters more than one letter away). That makes
long strings of letters easily segmented. When LSS checks a
three-letter morpheme ABC, with bigram data it can see the
string only as A-B and B-C. If the strings AB and BC do not
appear often, the linking scores get low and LSS decides to
segment between A and B and between B and C. However, with
d-bigram data LSS can get the information between A and C
as well, which helps it recognize that A, B and C often
occur together. This happens frequently between two
katakana letters (Table 4), because of the way katakana
letters are used in Japanese.
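The ABC argument can be made concrete by listing which letter pairs contribute to each gap. The sketch below enumerates, for the gap after position i (0-based), every pair at distance d ≤ dmax that straddles that gap; the helper name `spanning_pairs` is hypothetical.

```python
def spanning_pairs(sentence, i, dmax):
    """List the (letter, letter, distance) pairs that straddle the gap after position i."""
    pairs = []
    for d in range(1, dmax + 1):
        for j in range(max(0, i - (d - 1)), i + 1):
            if j + d < len(sentence):
                pairs.append((sentence[j], sentence[j + d], d))
    return pairs


print(spanning_pairs("ABC", 0, dmax=1))  # [('A', 'B', 1)]
print(spanning_pairs("ABC", 0, dmax=2))  # [('A', 'B', 1), ('A', 'C', 2)]
print(spanning_pairs("ABC", 1, dmax=2))  # [('B', 'C', 1), ('A', 'C', 2)]
```

With bigram data (dmax = 1) each gap is judged from one adjacent pair only, whereas with dmax = 2 the pair A-C supports both gaps, so a frequent A...C co-occurrence can keep ABC together even when AB and BC are individually rare.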
This does not mean that with the d-bigram method sentences
are less likely to be segmented. As shown in Figure 9, the
distribution is not so different between the two methods.
The x-axis shows the number of linky strings in a sentence
and the y-axis shows the number of sentences with x linky
strings.
[Line graph; x-axis: strings (10 to 60); y-axis: sentences (up to 20); series: d-bigram, bigram]
Figure 9: The Number of Strings in Output Sentences
According to Figure 9, the distributions of sen- 
tences are not so different between the method 
with d-bigram and the one with bigram. 
6 CONCLUSION 
This paper shows that the automatic segmenting system LSS
is quite efficient for segmentation of non-separated
language sentences. LSS does not use any grammatical
information to divide input sentences into linky strings,
a new unit for NLP. According to the results of the
experiments, LSS can segment almost all the sentences
'correctly', with strings keeping their meanings.
This remarkable result of the statistics-based system LSS
shows that d-bigram statistical information can be a key to
extracting meaningful strings. This result also shows that
the linky string is an interesting concept for NLP. We
expect that the linky string can be a unit for machine
translation systems, key word/phrase extraction systems,
and other NLP systems.
References 
[1] Tsutsumi, J., Nitta, T., Ono, K. and Nobesawa, S. A
Multi-Lingual Translation System Based on a Statistical
Model (written in Japanese). JSAI Technical Report,
SIG-PPAI-9302-2, pages 7-12, 1993.
[2] Nobesawa, S., Tsutsumi, J., Sun, D. J., Sano, T.,
Sato, K. and Nakanishi, M. Automatic Extraction of Linky
Strings in Natural Languages (written in Japanese). 62nd
Annual Meeting of the ANLP (NLP96), pages 181-184, 1996.
[3] Nobesawa, S., Tsutsumi, J., Nitta, T., Ono, K., Sun,
D. J. and Nakanishi, M. Segmenting a Japanese Sentence into
Morphemes Using Statistical Information between Words.
Coling-94, pages 227-233, 1994.
[4] Teller, V. and Batchelder, E. O. A Probabilistic
Algorithm for Segmenting Non-Kanji Japanese Strings. AAAI,
1994.
[5] Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D.,
Jelinek, F., Mercer, R. and Roossin, A. A Statistical
Approach to Language Translation. Coling-88, pages 71-76,
1988.
[6] Church, K. and Hanks, P. Word Association Norms, Mutual
Information, and Lexicography. In Proceedings of the 27th
Annual Conference of the Association for Computational
Linguistics, 1989.
[7] Nishio, M., Iwabuchi, E. and Mizutani, S.
Japanese-Japanese Dictionary, The 3rd Edition. Iwanami
Shoten, 1985.
