A New Method of N-gram Statistics for
Large Number of n
and Automatic Extraction of Words and Phrases
from Large Text Data of Japanese
Makoto Nagao, Shinsuke Mori
Department of Electrical Engineering
Kyoto University
Abstract 
In the process of establishing information theory,
C. E. Shannon proposed the Markov process as a good model
to characterize a natural language. The core of this idea
is to calculate the frequencies of strings composed of
n characters (n-grams), but this statistical analysis of
large text data for a large n has never been carried out
because of the memory limitation of computers and the
shortage of text data. Taking advantage of recent powerful
computers, we developed a new algorithm for calculating
n-grams of large text data for an arbitrarily large n, and
successfully calculated, within a relatively short time,
n-grams of some Japanese text data containing between two
and thirty million characters. From this experiment it
became clear that the automatic extraction or determination
of words, compound words and collocations is possible by
mutually comparing n-gram statistics for different values
of n.
category: topical paper, quantitative linguistics,
large text corpora, text processing
1 Introduction 
Claude E. Shannon established information theory in
1948 [1]. His theory included the concept that a language
could be approximated by an n-th order Markov model, with
n extended to infinity. Since his proposal there have been
many attempts to calculate n-grams (statistics of
n-character strings of a language) for large text data of
a language. However, computers up to the present could not
calculate them for a large n because the calculation
required a huge amount of memory space and time. For
example, the frequency calculation of 10-grams of English
requires at least 26^10 ≈ 1.4 × 10^14 words of memory
space, that is, on the order of 10^5 giga words. Therefore
the calculation was done at most for n = 4 or 5 with a
modest text quantity.
We developed a new method of calculating n-grams for
large n's. We do not prepare a table for an n-gram. Our
method consists of two stages. The first stage sorts the
substrings of a text and finds the length of the prefix
parts which are the same for adjacent substrings in the
sorted table. The second stage is the calculation of an
n-gram when it is asked for a specific n. Only the existing
n-character combinations require table entries for the
frequency count, so that we need not reserve a big space
for an n-gram table. The program we have developed requires
7l bytes for an l-character text of two-byte code such as
Japanese and Chinese texts, and 6l bytes for an l-character
text of English and other European languages. With the
present program n can be extended up to 255. The program
can be changed very easily for larger n, if it is required.
We performed n-gram frequency calculations for three
different text data. We were not so much interested in the
entropy value of a language but were interested in the
extraction of varieties of language properties, such as
words, compound words, collocations and so on. The
calculation of the frequency of occurrences of character
strings is particularly important to determine what is a
word in such languages as Japanese and Chinese, where there
are no spaces between words and the determination of word
boundaries is not so easy. In this paper we will explain
some of our results on those problems.
2 Calculation of n-grams for an
arbitrary large number of n
It was very difficult to calculate n-grams for a large
number of n because of the memory limitation of a computer.
For example, the Japanese language has more than 4000
different characters, and if we want to have 10-gram
frequencies of a Japanese text, we must reserve 4000^10
entries, which is on the order of 10^36. Therefore only
3- or 4-grams were calculated so far.
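The size of such a dense table can be checked with a few lines of
arithmetic. The following is only an illustrative sketch; the
function name dense_table_entries is ours and does not come from the
original program.

```python
def dense_table_entries(alphabet_size: int, n: int) -> int:
    """Entries needed if every possible n-character string gets its own slot."""
    return alphabet_size ** n


# 10-grams over 26 English letters and over 4000 Japanese characters
english_entries = dense_table_entries(26, 10)
japanese_entries = dense_table_entries(4000, 10)

print(f"English 10-grams:  {english_entries:.2e} entries")
print(f"Japanese 10-grams: {japanese_entries:.2e} entries")
```

Both figures are far beyond any realistic memory, which is why a
table indexed by all possible n-grams is not an option for large n.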
The new method we developed can calculate n-grams for an
arbitrarily large n with a reasonable memory size in a
reasonable calculation time. It consists of two stages.
The first stage is to get a table of alphabetically sorted
substrings of a text string and to get the coincidence
number of prefix characters of adjacently sorted strings.
The second stage is to calculate the frequency of n-grams
for all the existing n-character strings from the sorted
strings for a specific number n.
2.1 First stage 
(1) When a text is given it is stored in a computer as
one long character string. It may include sentence
boundaries, paragraph boundaries and so on if they are
regarded as components of the text. When a text is composed
of l characters it occupies 2l bytes of memory because a
Japanese character is encoded by a 16-bit code. We prepare
another table of the same size (l entries), each entry of
which keeps the pointer to a substring of the text string.
This is illustrated in Figure 1.
[Figure 1: Text string and the pointer table to substrings.
The text string holds l characters (2l bytes); each entry of
the pointer table is 4 bytes.]
The substring pointed to by pointer i-1 is defined as
composed of the characters from the i-th position to the
end of the text string (see Figure 1). We call this
substring a word. The first word is the text string itself,
and the second word is the string which starts from the
second character and ends at the final character of the
text string. Similarly the last word is the final character
of the text string.
As the text size is l characters, a pointer must have at
least p bits where 2^p ≥ l. In our program we set p = 32
bits so that we can accept a text size up to 2^32 ≈ 4 giga
characters. The pointer table represents a set of l words.
We apply the dictionary sorting operation to this set of
l words. It is performed by utilizing the pointers in the
pointer table. We used comb sort [2], which is an improved
version of bubble sort. The sorting time is of the order
O(l log l). When the sorting is completed the result is a
change of pointer positions in the pointer table, and there
is no replacement of the actual words. As we are interested
in n-grams of n less than 255, the actual sorting of words
is performed on the leftmost 255 or fewer characters of the
words.
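The pointer sorting described above can be sketched in a few lines.
This is a minimal illustration, not the original program: it uses
Python's built-in sort in place of comb sort, compares characters by
code point, and the names build_pointer_table and max_len are
hypothetical.

```python
def build_pointer_table(text: str, max_len: int = 255) -> list[int]:
    """Return the start positions of all "words" (suffixes of text),
    sorted in dictionary order of their leftmost max_len characters."""
    pointers = list(range(len(text)))  # pointer i -> word starting at position i
    # Only the pointers are reordered; the text string itself is not modified.
    # Python's built-in sort stands in for the comb sort used in the paper.
    pointers.sort(key=lambda i: text[i:i + max_len])
    return pointers


text = "abracadabra"
pointers = build_pointer_table(text)
for p in pointers:
    print(p, text[p:])
```

The temporary slices used as sort keys are a convenience of this
sketch; an implementation concerned with memory would compare the
characters in place through the pointers.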
(2) Next we compare two adjacent words in the pointer
table, and count the length of the prefix parts which are
the same in the two words. For example, when "extension to
the left side ..." and "extension to the right side ..."
are two words placed adjacent, the number is 17. This is
stored in the table of coincidence number of prefix
characters. This is shown in Figure 2. As we are interested
in 1 < n < 255, one byte is given to an entry of this
table. The total memory space required for this first stage
operation is 2l + 4l + l = 7l bytes. For example, when the
text size is 10 mega Japanese characters, 70 mega bytes of
memory must be reserved. This is not difficult for
present-day computers.
[Figure 2: Sorted pointer table and table of coincidence
number of characters. Each coincidence entry is 1 byte and
each pointer is 4 bytes; the text string holds l characters
(2l bytes).]
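The coincidence numbers and the memory estimate of the first stage
can be sketched as follows. The function names and the cap parameter
are assumptions made for this illustration; the convention that the
last entry is 0 (the last word has no successor) is also ours.

```python
def common_prefix_length(a: str, b: str, cap: int = 255) -> int:
    """Number of leading characters shared by a and b, at most cap."""
    length = 0
    for x, y in zip(a, b):
        if x != y or length == cap:
            break
        length += 1
    return length


def coincidence_table(text: str, pointers: list[int], cap: int = 255) -> list[int]:
    """Entry k compares the k-th and (k+1)-th sorted words; the last entry is 0."""
    table = []
    for p, q in zip(pointers, pointers[1:]):
        table.append(common_prefix_length(text[p:p + cap], text[q:q + cap], cap))
    if pointers:
        table.append(0)  # the last word has no successor
    return table


def first_stage_bytes(num_chars: int, bytes_per_char: int = 2) -> int:
    """Text itself + 4-byte pointers + 1-byte coincidence numbers."""
    return bytes_per_char * num_chars + 4 * num_chars + num_chars


text = "abracadabra"
pointers = sorted(range(len(text)), key=lambda i: text[i:i + 255])
print(coincidence_table(text, pointers))
print(common_prefix_length("extension to the left side",
                           "extension to the right side"))
print(first_stage_bytes(10_000_000))
```

With bytes_per_char set to 1 the same estimate gives the 6l bytes
quoted for English and other European languages.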
We developed two software versions, one using main memory
alone, and the other using a disc memory, where the
software has the additional operations of a disc merge
sort. With the disc version we can handle a Japanese text
of more than 100 mega characters. The software was
implemented on a SUN SPARC Station.
2.2 Second stage 
The second stage is the calculation of the n-gram
frequency table. This is done by using the pointer table
and the table of coincidence number of prefix characters.
Let us fix n to a certain number. We first read out the
first n characters of the first word in the pointer table,
and look at the number in the table of coincidence number
of prefix characters. If this is equal to or larger than n,
it means that the second word has at least the same n
prefix characters as the first word. Then we look at the
next entry of the coincidence number of prefix characters
and check whether it is equal to or larger than n or not.
We continue this operation until we meet the condition that
the number is smaller than n. The number of words checked
up to this point is the frequency of the n prefix
characters of the first word. At this point the first n
prefix characters of the next word are different, and so
the same operation is performed from here, that is, we
check the numbers in the table of coincidence number of
prefix characters to see whether they are equal to or
larger than n or not, and so on. In this way we get the
frequency of the second n prefix characters. We perform
this process until the last entry of the table. These
operations give the n-gram table of the given text. We do
not need any extra memory space in this operation when we
print out every n-gram string and its frequency as soon as
they are obtained.
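The scan described above can be sketched as a single pass over the
coincidence numbers. This is a minimal sketch under our own
assumptions: the first stage is rebuilt inside the function so that
the example is self-contained, Python's built-in sort replaces comb
sort, and words shorter than n (the last few suffixes of the text)
are skipped because no n-gram starts there.

```python
def ngram_frequencies(text: str, n: int, cap: int = 255) -> list[tuple[str, int]]:
    """n-gram frequency table from sorted pointers and coincidence numbers."""
    if not 1 <= n <= cap:
        raise ValueError("n must be between 1 and cap")
    size = len(text)

    # First stage (rebuilt here so that the sketch is self-contained)
    pointers = sorted(range(size), key=lambda i: text[i:i + cap])
    coincidence = []
    for p, q in zip(pointers, pointers[1:]):
        shared = 0
        while (shared < cap and p + shared < size and q + shared < size
               and text[p + shared] == text[q + shared]):
            shared += 1
        coincidence.append(shared)
    coincidence.append(0)  # the last word has no successor

    # Second stage: words k..j share their first n characters
    table = []
    k = 0
    while k < size:
        j = k
        while j < size - 1 and coincidence[j] >= n:
            j += 1
        start = pointers[k]
        if start + n <= size:  # skip words shorter than n
            table.append((text[start:start + n], j - k + 1))
        k = j + 1
    return table


for gram, freq in ngram_frequencies("abracadabra", 2):
    print(gram, freq)
```

Because the pointers are sorted, the n-grams come out in dictionary
order, and the same two tables serve every n up to cap without being
rebuilt.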
We calculated n-grams for some different Japanese texts
which were available in electronic form in our laboratory.
These were the following.
1. Encyclopedic Dictionary of Computer Science
(3.7 M bytes)
2. Journalistic essays from Asahi Newspaper
(8 M bytes)
3. Miscellaneous texts available in our laboratory
(59 M bytes)
The first two texts were not large and could be managed
in the main memory. The third one was processed by using a
disc memory and applying a merge sort program three times.
The first stage mentioned above took one hour and two hours
for the first two texts on a standard SUN SPARC Station.
The third text required about twenty-four hours.
Calculation of the n-gram frequency (the second stage) took
less than an hour including print-out.
3 Extraction of useful linguistic
information from n-gram frequency data
3.1 Entropy 
Everybody is interested in the entropy value of a
language. Shannon's theory says that the entropy is
calculated by the formula [3]
$$H_n(L) = -\frac{1}{n} \sum_{w} P(w) \log_2 P(w)$$
where P(w) is the probability of occurrence of w, and the
summation is over all the different strings w of n
characters appearing in a language. The entropy of a
language L is
$$H(L) = \lim_{n \to \infty} H_n(L)$$
We calculated H_n(L) for the texts mentioned in Section 2
for n = 1, 2, 3, ... The results are shown in Figure 3.
Unlike our initial expectation that the entropy would
converge to a certain constant value between 0.6 and 1.3,
which C. E. Shannon estimated for English, it continued to
decrease to zero. We checked in detail whether our method
had something wrong, but there was nothing doubtful. Our
conclusion for this strange phenomenon was that a text
quantity of a few mega characters was too small to get
meaningful statistics for a large n, because we have more
than 4000 different characters in the Japanese language.
For English and many other European languages, which have
alphabetic sets of less than fifty characters, the
situation may be better. But still a text quantity of a few
giga bytes or more will be necessary to get a meaningful
entropy value for n = 10 or more.
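The small-sample effect can be reproduced with a short experiment.
The sketch below assumes the per-character form of the entropy,
H_n = -(1/n) Σ P(w) log2 P(w), with P(w) estimated from the observed
n-gram counts; the function names and the sample sentence are ours.
A text of l characters contains only l - n + 1 n-grams, so the
estimate can never exceed log2(l - n + 1) / n, and this bound itself
goes to zero as n grows.

```python
import math
from collections import Counter


def per_character_entropy(text: str, n: int) -> float:
    """Empirical n-gram entropy of text in bits, divided by n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    block_entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return block_entropy / n


def sample_size_bound(text_length: int, n: int) -> float:
    """Upper limit of per_character_entropy for a text of this length (n <= length)."""
    return math.log2(text_length - n + 1) / n


text = "the quick brown fox jumps over the lazy dog " * 3
for n in (1, 2, 4, 8, 16):
    print(n,
          round(per_character_entropy(text, n), 3),
          round(sample_size_bound(len(text), n), 3))
```

With more than 4000 distinct characters the bound is reached at a
much smaller n than with an alphabet of fewer than fifty letters,
which is consistent with the explanation given above.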
[Figure 3: Entropy curve by n-gram]
3.2 Obtaining the longest compound 
word 
From the n-gram frequency table we can get much
interesting information. When we have a string w (length n)
of high frequency as shown in Figure 4, we can try to find
the longest string w' which includes w by the following
process, using the n-gram frequency table.
[Figure 4: Obtaining the longest word w' from a high
frequency word fragment w]
(1) extension to the left: We cut off the last character
of w and add a character x to the left of w. We call this
a cut-and-pasted word. We look for the character x which
gives the maximum frequency to the cut-and-pasted word.
We repeat the same operation step by step to the left and
draw a frequency curve for these words. This operation is
stopped when the frequency curve drops to a certain value.
This process is performed by looking at the n-gram
frequency table alone.
(2) extension to the right: The same operation as (1) is
performed by cutting the left character and adding a
character to the right.
(3) extraction of high frequency part: From the frequency
curve as shown in Figure 4 we can easily extract a high
frequency part as the longest string. An example is shown
in Figure 5.
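Steps (1) to (3) can be sketched as a sliding n-character window
that moves away from w in both directions. This is only an
illustration under our own assumptions: the stopping value is passed
as a fixed threshold, a max_steps limit guards against periodic
texts, ties between candidate characters are broken arbitrarily, and
the function names and the sample text are hypothetical.

```python
from collections import Counter


def ngram_table(text: str, n: int) -> Counter:
    """Frequency of every n-character string in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def slide(w: str, table: Counter, threshold: int, to_left: bool,
          max_steps: int = 50) -> str:
    """Characters gained by sliding the window away from w in one direction."""
    alphabet = sorted({ch for gram in table for ch in gram})
    if not alphabet:
        return ""
    window, gained = w, []
    for _ in range(max_steps):
        if to_left:   # cut the last character, paste x on the left
            candidates = [x + window[:-1] for x in alphabet]
        else:         # cut the first character, paste x on the right
            candidates = [window[1:] + x for x in alphabet]
        best = max(candidates, key=lambda gram: table[gram])
        if table[best] < threshold:  # the frequency curve has dropped
            break
        gained.append(best[0] if to_left else best[-1])
        window = best
    return "".join(reversed(gained)) if to_left else "".join(gained)


def longest_string(w: str, table: Counter, threshold: int) -> str:
    """High-frequency stretch around w, using the n-gram table alone."""
    return slide(w, table, threshold, True) + w + slide(w, table, threshold, False)


text = "aaXis knownYbb ccPis knownQdd eeRis knownSff"
table = ngram_table(text, 3)
print(longest_string("kno", table, threshold=3))
```

In this toy text the phrase "is known" occurs three times in
different surroundings, so the window keeps a frequency of 3 inside
the phrase and drops to 1 as soon as it crosses either boundary.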
The strings extracted in this way are very often
compound words of postpositions in Japanese. Postpositional
phrases are usually composed of one to three words, and are
used as if they are compound postpositions. Some extracted
examples are,
[Figure 5: Frequencies of partial strings and obtaining the
longest word. Frequencies of the six successive partial
strings as printed: 101, 1689, 1310, 784, 784, 770.]
(must do ...) 
(it is known that ...) 
(can do ...) 
(can ask ...) 
3.3 Word extraction 
After getting high frequency character strings by the
above method, we can consult dictionaries for these
strings. We then find many strings which are not included
in the dictionaries.
Some are phrases (collocations, idiomatic expressions),
some others are terminology words, and unknown (new) words.
From the text data of the Encyclopedic Dictionary of
Computer Science we extracted many terminological words.
In general the frequencies of n-grams become smaller as n
becomes larger. But we sometimes had relatively high
frequency values in n-grams of large n's. These were very
often terminological words or terminological phrases. We
extracted such terminological phrases as,
(programs written by (...) language)
(problem solving in artificial intelligence)
(page replacement algorithm)
(partial correctness of programs)
3.4 Compound word 
We can get more interesting information when we compare
data of different n's. When we have a character string
(length n) of high frequency, which we may be able to
define as a word (w), we are recommended to check whether
two substrings (w1 and w2) of the lengths n1 and n2
(n1 + n2 = n) as
[Table 1: Determination of compound word. Columns: compound
word; proper segmentation; improper segmentation. Figures
in parentheses are frequencies in the Encyclopedic
Dictionary of Computer Science.]
Figure 6: Possible segmentation of a word into two
components
shown in Figure 6 have high frequency appearance in the
n1-gram and n2-gram tables. If we can find such a
situation by changing n1 (and n2), we can conclude that
the original character string w is a compound word of w1
and w2. Some examples are shown in Table 1.
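The check over all split points can be sketched as follows. The
scoring rule used here, the smaller of the two component
frequencies, is our own simplification of "both components have high
frequency"; the function names and the spaceless English sample text
are hypothetical stand-ins for Japanese data.

```python
def frequency(text: str, s: str) -> int:
    """Number of (possibly overlapping) occurrences of s in text."""
    return sum(text.startswith(s, i) for i in range(len(text) - len(s) + 1))


def segmentations(text: str, w: str) -> list[tuple[int, str, str]]:
    """All splits w = w1 + w2, scored by min(freq(w1), freq(w2)), best first."""
    result = []
    for n1 in range(1, len(w)):
        w1, w2 = w[:n1], w[n1:]
        score = min(frequency(text, w1), frequency(text, w2))
        result.append((score, w1, w2))
    return sorted(result, reverse=True)


text = "datasystem databank datasystem filesystem dataflow"
for score, w1, w2 in segmentations(text, "datasystem"):
    print(score, w1, w2)
```

An improper split yields a component that occurs only inside w, so
its frequency cannot exceed that of w itself, whereas the components
of a proper split also occur in other contexts.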
3.5 Collocation 
We can see whether a particular word w has strong
collocational relations with some other words from the
n-gram frequency results. We can get an n-gram table where
n is sufficiently large, w is the prefix of these n-grams,
and some words (w', w'', ...) may appear in relatively high
frequency. This is shown in Figure 7. We can easily find
from this figure that w - w' and w - w'' are two
collocational expressions. For example, we have the word
for "effect" and find that the strings for "receive effect"
and "give effect" have relatively high frequencies, and
there are no other significant combinations in the n-gram
table with the word for "effect" as the prefix. The word
for "in and out of hospital" is almost always followed by
the phrase for "repeat", and so we will be able to judge
that the combined string is an idiomatic expression.
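Collecting the n-grams that share the prefix w can be sketched as
below. The function name, the choice of n, and the hyphenated English
sample text (which imitates the Japanese order of noun followed by
verb) are assumptions made for this illustration only.

```python
from collections import Counter


def following_strings(text: str, w: str, n: int) -> Counter:
    """Count the n-grams that start with w, keyed by the characters after w."""
    counts = Counter()
    start = text.find(w)
    while start != -1:
        gram = text[start:start + n]
        if len(gram) == n:  # ignore an occurrence cut off by the end of the text
            counts[gram[len(w):]] += 1
        start = text.find(w, start + 1)
    return counts


text = ("effect-receive. effect-give. effect-receive. "
        "effect-give. effect-receive. effect-lost.")
w = "effect-"
tails = following_strings(text, w, len(w) + 4)
print(tails.most_common())
```

A few continuations that dominate the counts, with no other
significant ones, correspond to the situation drawn in Figure 7.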
4 Conclusions 
We developed a new method and software for n-gram
frequency calculation for n up to 255, and calculated
n-grams for some large text data of Japanese. From these
data we could derive words, compound words and collocations
automatically.
[Figure 7: Finding collocational word pairs w - w' and
w - w'']
We think that this method is equally useful for languages
like Chinese, where there are no word spaces in a sentence,
and for European languages as well, and also for speech
phoneme sequences to get more detailed HMM models.
Another possibility is that when we get large text data
with part-of-speech tags, we can extract high frequency
part-of-speech sequences by this n-gram calculation over
the part-of-speech data. These may be regarded as grammar
rules of the primary level. By replacing these
part-of-speech sequences by single non-terminal symbols we
can calculate new n-grams, and will be able to get higher
level grammar rules. These examples indicate that large
text data with varieties of annotations are very important
and valuable for the extraction of linguistic information
by calculating n-grams for larger values of n.
References 
[1] C. E. Shannon: A mathematical theory of
communication, Bell System Tech. J., Vol. 27,
pp. 379-423, pp. 623-656, (1948).
[2] Stephen Lacey, Richard Box: Nikkei BYTE,
November, pp. 305-312, (1991).
[3] N. Abramson: Information theory and coding,
McGraw-Hill, (1963).
