A Statistical Method for Extracting Uninterrupted and 
Interrupted Collocations from Very Large Corpora 
Satoru Ikehara, Satoshi Shirai and Hajime Uchino 
NTT Communication Science Laboratories 
Take 1-2356, Yokoshuka-shi, Japan 
(E-mail:{ikehara, shirai, uchino}@nttkb.ntt.jp) 
Abstract 
In order to extract rigid expressions with a high frequency
of use, a new algorithm that can efficiently extract both
uninterrupted and interrupted collocations from very large
corpora is proposed.
The statistical method recently proposed for calculating
N-grams of arbitrary N can be applied to the extraction of
uninterrupted collocations. However, this method extracts
such a large volume of fractional and unnecessary
expressions that it was impossible to extract interrupted
collocations by combining the results. To solve this
problem, this paper proposes a new algorithm that restrains
the extraction of unnecessary substrings, followed by a
method that makes it possible to extract interrupted
collocations.
The new methods were applied to Japanese newspaper
articles comprising 8.92 million characters. For
uninterrupted collocations with a string length of 2 or more
characters and a frequency of 2 or more, the N-gram method
extracted 4.4 million types of expressions (total frequency
of 31.2 million). In contrast, the new method reduced this
to 0.97 million types (total frequency of 2.6 million), a
substantial reduction in fractional and unnecessary
expressions. For interrupted collocations, combining the
substrings with a frequency of 10 or more extracted by the
first method yielded 6.5 thousand types of substring pairs
with a total frequency of 21.8 thousand.
1. Introduction
In natural language processing, the importance of
large-volume corpora has been pointed out, together with the
need for technology to analyze such linguistic data. For
example, in machine translation, there are many expressions
that are difficult to translate literally. Phrase
translations or pattern translations based on phrase or
pattern dictionaries are considered very useful for
translating these expressions.
In order to realize such translation, it is necessary to
identify high-frequency phrases and patterns of expressions
from corpora. Many methods have been proposed to extract
rigid expressions from corpora, such as methods focusing on
the binding strength of two words (Church and Hanks 1990),
the distance between words (Smadja and McKeown 1990), and
the number of combined words and frequency of appearance
(Kita 1993, 1994). But it was not easy to identify and
extract expressions of arbitrary length and high frequency
of appearance from very large corpora. Thus, conventional
methods had to introduce some kind of restriction, such as
limiting the kind of chains or the length of chains to be
extracted (Smadja 1993, Shinnou and Isahara 1995).
Recently, a new method which can calculate n-gram
statistics for arbitrary n over very large corpora has been
proposed (Nagao and Mori 1994). This method has made it
possible to automatically and quickly extract and tabulate
substrings of any length used in source texts.
Unfortunately, this method extracts so many fractional
substrings that are grammatically and semantically
inconsistent that it was difficult to extract combinations
of expressions collocated at separate locations (i.e.
interrupted collocations), which requires a search of the
source text by combining the strings thus extracted. Thus,
such analyses had to be limited to small texts (Colier 1994).
To overcome these problems, this paper first proposes a
method that can automatically extract and tabulate
uninterrupted collocational substrings without omission
from the corpora, in the order of substring length and
frequency, under the condition that fractional substrings
are excluded. Second, using the results of the first method,
it also proposes a method that can automatically extract and
tabulate interrupted collocational substrings.
2. N-gram Method and the Problem Involved 
(1) Conditions for Collocational Substring Extraction
In order to extract uninterrupted collocation without 
omission and to minimize extraction of fractional sub- 
strings, we will introduce the following three conditions. 
1st Condition: Substrings can be extracted in the order of
the number of matching characters (string length).
2nd Condition: Substrings can be extracted in the order of 
frequency of use. 
3rd Condition: Substrings should be extracted according to 
the principle of the longest match. 
Fig. 1 Substrings to be Extracted 
Here, the 3rd condition means that when a string (for
instance α in Fig. 1) is extracted from a certain location
within the source text, any substring (β, γ) that is
included within the string (α) is not subject to extraction.
But should such a substring be located in a separate or
overlapping position, it is to be extracted.
(2) Conventional Algorithm for N-gram Statistics 
Before discussing an algorithm which satisfies the
previous conditions for uninterrupted collocational
substrings, let's consider the algorithm proposed by Nagao
and Mori for N-gram statistics.
[Statistical Method for N-gram]
Assume that the total number of characters in a
source text (corpus) is N.
Procedure 1: Preparation of Pointer Table
Prepare PT-0 (Pointer Table-0) of N records of SP
(Source Pointer), with the values 0, 1, 2, ..., i, ..., N-1.
Here, the value i represents String-word i, which is the
substring from position i to the last character (address
N-1) in the source text.
Procedure 2: Pointer Table Sorting
The records of PT-0 are sorted in the order of the
corresponding String-words to obtain SPT-0 (Sorted Pointer
Table-0).
Procedure 3: Counting of Matching Characters
The characters of String-word i are compared with those of
the next String-word i+1 from the beginning. The number
of matched characters is registered in the NMC
(Number of Matching Characters) field of record i.
Procedure 4: Extraction of Substrings
Comparing the value of NMC of record i with that of
record i+1 of the SPT from i=1 to i=N-1, substrings are
extracted and their frequencies are determined*1.
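The procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Nagao and Mori's implementation: the function name `ngram_statistics`, the `min_freq` parameter and the in-memory sort of whole suffixes are choices made here for brevity, and positions are 0-based.

```python
from collections import Counter


def ngram_statistics(text, min_freq=2):
    """Count every substring of `text` that occurs at least `min_freq` times."""
    n = len(text)
    # Pointer table sorted by the String-word (suffix) each pointer denotes.
    spt = sorted(range(n), key=lambda i: text[i:])
    # NMC: number of characters shared with the next record of the sorted table.
    nmc = []
    for a, b in zip(spt, spt[1:]):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        nmc.append(k)
    nmc.append(0)  # the last record has no successor
    # Extraction: a run of `run` consecutive records with NMC >= length
    # means that run + 1 String-words share the same prefix of that length.
    counts = Counter()
    for length in range(1, max(nmc) + 1):
        run = 0
        for r in range(n):
            if nmc[r] >= length:
                run += 1
            else:
                if run:
                    counts[text[spt[r]:spt[r] + length]] = run + 1
                run = 0
    return {s: c for s, c in counts.items() if c >= min_freq}
```

For the toy string "abcabc" the sketch returns every repeated substring, including the fragments "a", "ab" and "bc" alongside "abc". A production version would sort pointers indirectly rather than materializing suffixes.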
(3) Problems of N-gram Statistics 
Nagao and Mori's method obviously fulfills the
requirements of Conditions 1 and 2, but not Condition 3.
One might expect that the accurate frequency of any
substring α could be obtained by subtracting from its
frequency the frequency of the other substring β which is
included in substring α*2. Unfortunately, this does not
satisfy Condition 3. By the time the list of extracted
substrings has been compiled, information regarding the
mutual relationships between the extracted substrings within
the original text has been lost, rendering such
calculations impossible.
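A small brute-force illustration of the problem, assuming hypothetical helper names `all_substring_counts` and `fragments_of` that do not appear in the paper: when one expression is repeated, plain substring statistics also report every fragment of it with the same frequency.

```python
from collections import Counter


def all_substring_counts(text, min_freq=2):
    """Brute-force frequency table of every substring (for tiny inputs only)."""
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_freq}


def fragments_of(longest, table):
    """Extracted strings that are merely pieces of `longest`."""
    return sorted(s for s in table if s != longest and s in longest)


table = all_substring_counts("abcXabc")
# "abc" occurs twice, and so do its five fragments.
print(table["abc"], fragments_of("abc", table))
```

Under Condition 3 only "abc" should survive here, since every occurrence of the fragments lies inside an occurrence of "abc".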
3. Extraction of Uninterrupted Collocation 
3.1 Invalidation of Extracted Substrings
(1) Co-relations between Extracted Substrings
In order to satisfy the requirement of Condition 3,
consider the extraction of an n-gram substring after
extracting an m-gram substring. The problem arises when
there is a certain overlap between them as shown in Fig.1.
The Case of Absorbed Relation (Case 1) can be classified
into three sub-cases as shown, but in every one of them
the m-gram substring is absorbed in the n-gram substring
and therefore there is no need to extract such an m-gram
substring. Thus, when extracting n-gram strings, there is a
need to invalidate the related records of the SPT so that
the m-gram strings do not become involved in the processes
that follow.
[Fig. 2 Relationships between Extracted Substrings.
Case 1, absorbed relation: (case 1-1) coincided beginning,
(case 1-2) wholly included, (case 1-3) coincided ending.
Case 2, overlapped relation: (case 2-1) preceded by m-gram,
(case 2-2) preceded by n-gram.]
The Case of Partially Joint Relation (Case 2) can be
further classified into two sub-cases. But in either
situation, the m-gram string and the n-gram string merely
overlap, and therefore they need to be extracted
separately.
(2) Necessity of Validity Check for String-words 
When one substring is extracted, in order not to extract
an absorbed string from the same part of the source text
where the substring was already extracted (Case 1), the
related records of the SPT need to be checked for validity
before extracting the next substring.
For example, when the substring of 6 characters in
String-word 3 shown in Fig. 3 is extracted, the substrings
of String-words 3, 4, 5, ..., 8 need to be set as invalid
for lengths equal to or less than 6, 5, 4, ..., 1
characters from the beginning, respectively.
Source Address:  1 2 3 4 5 6 7 8 9 10 11 ...
Source Text:     A B C D E F G H I J  K  ...
                 (6-gram C D E F G H extracted at address 3)
Address   Invalid Range   String-word
   4        ≤ 5 ch        D E F G H I J K ...
   5        ≤ 4 ch        E F G H I J K ...
   6        ≤ 3 ch        F G H I J K ...
   7        ≤ 2 ch        G H I J K ...
   8        ≤ 1 ch        H I J K ...
Fig. 3 Example of Validity Check
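The validity check can be sketched as follows. The names `mark_invalid` and `is_valid`, the per-position table `invalid_upto` and the 0-based indexing are illustrative assumptions, not part of the paper; the sketch only encodes the rule that after an extraction of a given length, the String-word d positions later is invalid for lengths up to that length minus d.

```python
def mark_invalid(invalid_upto, start, length):
    """Record that the substring text[start:start+length] was extracted.

    invalid_upto[p] holds the largest length that is invalid for the
    String-word starting at position p (0 means nothing is invalid yet).
    """
    for offset in range(length):
        pos = start + offset
        if pos < len(invalid_upto):
            invalid_upto[pos] = max(invalid_upto[pos], length - offset)
    return invalid_upto


def is_valid(invalid_upto, start, length):
    """A candidate is valid only if it reaches beyond the absorbed range."""
    return length > invalid_upto[start]


# Fig. 3 with 0-based positions: the 6-gram starts at index 2 ("C").
table = mark_invalid([0] * 11, 2, 6)
print(table)  # index 3 ("D") is invalid for lengths up to 5, and so on
```

A 5-character candidate starting at "D" is rejected as absorbed (Case 1), while a 6-character candidate from the same position extends past the extracted string and is kept as an overlapped string (Case 2).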
3.2 Extracting Algorithm 
Here, we propose an algorithm which satisfies Condition
3 as well as Conditions 1 and 2.
[Preparation]
Fields of NSC (Number of Significant Characters) and
RN (Record Number) are added to SPT-0 (Sorted Pointer
Table-0) used for N-gram statistics.
[Algorithm (see Fig. 4)]
Procedures 1 through 3: Same as for N-gram statistics.
Procedure 4: Significant Character Determination
The length of the substring to be extracted is decided
from NMC and written in the NSC field of SPT-0.
Procedure 5: Preparation of Augmented PT
After sorting SPT-0 back into the original order, add a
VF (Validity Flag) field to obtain PT-1.
*1 Extraction is conducted based on the relation between the values of consecutive NMCs. Details are in (Nagao and Mori 1994).
*2 Recently, a calculation combining the frequencies of related substrings was conducted (Kita et al. 1993) to obtain
frequencies which satisfy Condition 3. But accurate results cannot be obtained by this method.
Procedure 6: Validity Determination
According to the method shown in 3.1(2), check the
validity of the substring pointed to by each record of
PT-1 in the order of the record number and write the
result in the VF field.
Procedure 7: Re-sorting of PT-1
Re-sort PT-1 in the order of the values of the SP fields
to obtain SPT-1.
Procedure 8: Extraction and Tabulation
By referring to SPT-1, the strings to be extracted
are determined and their frequencies are calculated.
An example of the algorithm is shown in Fig. 4. In this
example, the conventional algorithm extracts 24 types of
substrings with a total frequency of 72. In contrast, with
the method proposed in this paper, these numbers are
reduced to 5 and 10 respectively.
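The following is a simplified reading of Procedures 1 through 8, offered as a sketch rather than the authors' implementation. Several points are assumptions made here: NSC is taken as the longest prefix a String-word shares with either neighbour in the sorted table, comparison stops at punctuation (as Section 4.1 describes), a candidate is treated as valid when it ends beyond every earlier candidate, and the sample text is a reconstruction from the romanized gloss of Fig. 4. The names `longest_match_substrings`, `PUNCT`, `min_len` and `min_freq` are likewise hypothetical.

```python
from collections import Counter

PUNCT = set("。、.,")  # assumed punctuation set


def common_prefix_len(text, a, b):
    """Matching characters of two String-words, stopping at punctuation."""
    k, n = 0, len(text)
    while (a + k < n and b + k < n
           and text[a + k] == text[b + k] and text[a + k] not in PUNCT):
        k += 1
    return k


def longest_match_substrings(text, min_len=2, min_freq=2):
    n = len(text)
    # Procedures 1-3: sorted pointer table and NMC.
    spt = sorted(range(n), key=lambda i: text[i:])
    nmc = [common_prefix_len(text, a, b) for a, b in zip(spt, spt[1:])] + [0]
    # Procedure 4: NSC, indexed by source position.
    nsc = [0] * n
    for r, sp in enumerate(spt):
        previous = nmc[r - 1] if r > 0 else 0
        nsc[sp] = max(previous, nmc[r])
    # Procedures 5-8: scan in source order, keep candidates that are not
    # absorbed by an earlier one, then tabulate the survivors.
    counts = Counter()
    furthest_end = 0
    for sp in range(n):
        end = sp + nsc[sp]
        if nsc[sp] > 0 and end > furthest_end:
            counts[text[sp:end]] += 1
        furthest_end = max(furthest_end, end)
    return {s: c for s, c in counts.items()
            if len(s) >= min_len and c >= min_freq}


# Reconstructed sample of Fig. 4 (an assumption, see the note above).
sample = "むかしむかしのおかしなおかし。おかしのはなしはおかしなおはなし。"
result = longest_match_substrings(sample)
print(len(result), sum(result.values()))
```

On this reconstructed sample the sketch yields 5 types with a total frequency of 10, which agrees with the totals quoted for Fig. 4, though the individual strings in the figure could not be checked against it.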
4. Extraction of Interrupted Collocation 
4.1 Conditions for Extraction 
Here, let's consider combinations of 2 or more
uninterrupted collocational substrings in different
locations within a single sentence, together with a method
of determining their frequency. In this case, the boundary
conditions of sentences and the mutual relationships
between the extracted substrings need to be considered.
(1) Boundary Conditions of Sentences 
When considering the collocation of substrings within a 
sentence, combinations of expressions spread over borders 
of sentences need to be excluded. But when a single 
sentence includes other sentences, the extraction of the 
combinations in units of sentences poses complications. 
To simplify matters, we first assume that substrings
which contain any kind of punctuation mark are not
extracted in the procedure of uninterrupted collocation
extraction. This can easily be achieved by stopping the
comparison in Procedure 3 as soon as a punctuation mark is
found. Second, we assume that when a left quote character
is found within a sentence, all characters are ignored up
to the right quote character that forms a pair with it.
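The two simplifying assumptions can be illustrated with a short sketch. The function names `strip_quoted` and `segments`, the choice of 「 」 as the quote pair and the punctuation set are assumptions for illustration only; the paper does not list the characters it used.

```python
import re


def strip_quoted(sentence, left="「", right="」"):
    """Drop everything between a left quote and its matching right quote."""
    kept = []
    depth = 0
    for ch in sentence:
        if ch == left:
            depth += 1
        elif ch == right and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def segments(sentence, punctuation="。、,."):
    """Split the unquoted text at punctuation so no substring spans a mark."""
    pattern = "[" + re.escape(punctuation) + "]"
    return [s for s in re.split(pattern, strip_quoted(sentence)) if s]


print(segments("彼は「行く。」と言った、そして帰った。"))
```

Nested quotes are handled by the depth counter, so an embedded sentence is removed as a whole before any comparison takes place.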
(2) Relationships between Extracted Substrings 
In the extraction of interrupted collocations, substrings
that are linked to or partially overlap one another are
excluded from the scope of extraction. Let's consider
substrings α and β which have been extracted from the same
sentence. Their positioning is one of the three cases shown
in Fig. 5. Case (c), in which substrings α and β are
separate from one another, is a case for extracting
interrupted collocations; Cases (a) and (b) are not*3.
(3) Order of Substring Appearance 
In the case of extracting interrupted collocations, the 
order of appearance of substrings should be considered. 
Hence, collocational substrings are extracted and counted 
taking notice of the order of the appearance of each 
substring. 
[Fig. 5 Relations between Two Extracted Substrings, drawn
between the beginning and the end of a sentence:
(a) Connected, (b) Overlapped, (c) Separated.
α, β, γ: extracted substrings.]
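The three positional cases can be told apart from start positions and lengths alone. The sketch below uses a hypothetical function `relation` and assumes the first substring starts no later than the second, with 0-based positions.

```python
def relation(a_start, a_len, b_start, b_len):
    """Classify substring b against an earlier substring a (a_start <= b_start)."""
    a_end = a_start + a_len  # first position after a
    if b_start < a_end:
        return "overlapped"   # case (b)
    if b_start == a_end:
        return "connected"    # case (a)
    return "separated"        # case (c): candidate interrupted collocation


print(relation(0, 3, 3, 2), relation(0, 3, 2, 2), relation(0, 3, 5, 2))
```

Only pairs classified as separated are passed on to the counting of interrupted collocations.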
4.2 Extraction Algorithm 
[Preparation]
A sequential number is given to each of the substrings
extracted in Chapter 3, in the order of extraction. These
numbers are registered in the NES (Number of Extracted
Substring) field of the respective record in SPT-1.
Procedure 9: Re-sorting of SPT-1
SPT-1 is sorted back into the original order of the values
of the SP fields.
Procedure 10: Numbering of the Sentences
An SN (Sentence Number) field is added for entering the
number of the original sentence to which each record
belongs.
Procedure 11: Table Condensation
The table obtained is condensed by the following steps
to obtain SPT-2*.
(1) All fields other than the four fields SN, NES, NSC
and RN are deleted.
(2) All records with no value in the NES field are deleted.
Procedure 12: Extraction of Interrupted Collocation
Here, k is the number of substrings which compose an
interrupted collocational expression. All of the
combinations of k NESs for every sentence are written to a
file and sorted, and the number of occurrences of each
identical combination of NESs is counted.
Thus, the substring list of interrupted collocations can
be obtained. If the sentence number is attached to every
combination of NESs in the list, the sentences
corresponding to an extracted interrupted collocation can
easily be identified.
The lower part of Fig. 4 shows the application of this
method for k=2. In this case, there are 25 possible
combinations of the 5 types of uninterrupted collocational
substrings obtained in Chapter 3. Out of these
combinations, 7 were extracted as combinations which
collocate twice or more within the same sentence, and their
total frequency amounts to 14.
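Procedure 12 amounts to enumerating ordered combinations per sentence and counting identical ones. The sketch below is an illustration under assumptions: each sentence is given as a list of hypothetical (start, length, NES) triples, combinations whose members touch or overlap are skipped, and the name `interrupted_collocations` and the `min_freq` threshold are not from the paper.

```python
from collections import Counter
from itertools import combinations


def interrupted_collocations(sentences, k=2, min_freq=2):
    """Count ordered combinations of k separated substrings per sentence."""
    counts = Counter()
    for occurrences in sentences:
        ordered = sorted(occurrences)  # order of appearance in the sentence
        for combo in combinations(ordered, k):
            # Keep the combination only if every consecutive pair is
            # separated by at least one character.
            if all(a[0] + a[1] < b[0] for a, b in zip(combo, combo[1:])):
                counts[tuple(item[2] for item in combo)] += 1
    return {c: f for c, f in counts.items() if f >= min_freq}


sentences = [
    [(0, 3, 1), (5, 3, 2)],             # sentence 1: NES 1 ... NES 2
    [(0, 3, 1), (4, 3, 2), (7, 2, 3)],  # sentence 2: NES 2 and 3 are connected
]
print(interrupted_collocations(sentences))
```

Because the tuples keep the order of appearance, the pair (1, 2) is counted separately from (2, 1), in line with Section 4.1(3). The number of combinations grows quickly with k and with the number of substrings per sentence, which is one reason to restrict the components to frequent substrings.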
5. Experiments 
5.1 Uninterrupted Collocational Substrings 
Applying the proposed method to three months of newspaper
articles from the Nikkei Industrial News (8.92 million
characters), uninterrupted and interrupted collocational
substrings were extracted. In this experiment, a XEROX
ARGOSS 5270 (OS4.1.3) was used. The memory capacity
was 48 MB.
*3 In the case of (a), there would be a combination of substrings which is regarded as an interrupted collocation. However, the
frequency of such a pair is limited to 1, so there is no need to consider it.
[Fig. 4 Example of Uninterrupted and Interrupted Collocational Substring Extraction.
Source text (romanized): "mukasi mukasi no okasina okasi. okasi no hanasi ha okasina ohanasi."
Meaning: This is a story of cakes in very old days. The story of the cake is a strange story.
The figure traces Procedures 1 through 12 over the tables PT-0, SPT-0, PT-1, SPT-1 and SPT-2*,
lists the extracted substrings with their frequencies (total 10 for the proposed method, total 72
for the N-gram method), and shows the pairs of former and latter substrings with the sentence
list for each pair of interrupted collocation.
SP : Source Pointer
RN : Record Number
NMC : Number of Matched Characters
NSC : Number of Significant Characters
VF : Validity Flag
NES : Number of Extracted Substring
SN : Sentence Number]
(1) Characteristics of Extracted Substring 
From the viewpoint of length and frequency, the number
of extracted substrings is compared with that of the
N-gram method and summarized in Table 1 and Table 2. Some
examples of extracted substrings are shown in Table 3, and
examples of substrings with high frequency are shown in
Table 4.
Table 1. Length and Number of Extracted Substrings
           Proposed Method            N-gram Statistics           Ratio
Length     a: Extracted  b: Total     c: Extracted  d: Total
(gram)     Substrings    Frequency    Substrings    Frequency     a/c      b/d
 2~          970,203     2,613,704    4,374,141    31,178,897    22.2%     8.38%
 5~          591,901     1,476,922    2,960,487    10,808,458    20.0%    13.7%
10~           52,214       114,270      673,601     1,550,817     7.75%    7.37%
20~            1,792         3,692      177,298       359,810     1.01%    1.03%
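The ratio columns of Table 1 follow directly from the counts. The snippet below recomputes them as a sanity check; the dictionary `TABLE_1` and the function `ratios` are names introduced here, and the figures are copied from the table.

```python
# Minimum length -> (a, b, c, d) as given in Table 1.
TABLE_1 = {
    2: (970_203, 2_613_704, 4_374_141, 31_178_897),
    5: (591_901, 1_476_922, 2_960_487, 10_808_458),
    10: (52_214, 114_270, 673_601, 1_550_817),
    20: (1_792, 3_692, 177_298, 359_810),
}


def ratios(a, b, c, d):
    """Return (a/c, b/d) as percentages."""
    return 100 * a / c, 100 * b / d


for length, row in TABLE_1.items():
    type_ratio, freq_ratio = ratios(*row)
    print(f"{length:>2}~  a/c = {type_ratio:.2f}%  b/d = {freq_ratio:.2f}%")
```

The recomputed values agree with the printed ratios to the precision given in the table.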
Table 2. Frequency and Number of Extracted Substrings
           Proposed Method            N-gram Statistics           Ratio
           a: Extracted  b: Total     c: Extracted  d: Total
Freq.      Substrings    Frequency    Substrings    Frequency     a/c      b/d
  2~         970,203     2,613,704    4,377,087    39,588,291    22.2%     6.60%
  5~          67,321       551,441      882,217    31,288,701     7.63%    1.76%
 10~          12,351       217,934      372,291    28,050,199     3.32%    0.78%
 20~           2,288        92,804      169,375    25,871,964     1.35%    0.36%
 50~             285        37,850       62,991    22,209,875     0.45%    0.17%
100~              76        24,167       30,316    19,961,961     0.25%    0.12%
200~              20        16,771       14,363    17,759,432     0.14%    0.07%
From these results, the following observations can be
made.
① Compared with the N-gram method, most of the fractional
substrings have been deleted, and both the number of types
and the total number of extracted substrings have been
greatly reduced. For example, in the extraction of
substrings with a length of 2 or more and a frequency of 2
or more, the number of substring types is reduced to
22.2% and their total frequency to 8.38%. This effect
increases with substring length. In the case of substrings
of 20 or more characters, these numbers are reduced to
about 1%.
② Most of the substrings extracted by the proposed method
form expressions that are syntactic or semantic units, and
there are few fractional substrings.
Table 3 Examples of Extracted Substrings (in the order of frequency): Proposed Method
(entries are given by their English glosses, followed by the frequency)
5 gram:  (make it that ~) 436, (EC) 277, (for this purpose) 158,
         (market share) 141, (consider that ~) 141, (motors) 133,
         (emphasized that ~) 130, (on the contrary) 126, (subsequently ~) 112
         [190,925 types, total 499,653 times]
         glosses without a surviving frequency: (to be ~ing), (second)
10 gram: (it seems to do ~) 19, (82 Japan shop) 17,
         (wonder if ~ do ~) 16, (Washington 19) 14
         [21,155 types, total 47,336 times]
(2) Processing Time 
It took about 40 hours to make SPT-0*4. But the
subsequent processes were performed very quickly (within
one hour).
5.2 Interrupted Collocational Substrings 
(1) Characteristics of Extracted Substrings 
Interrupted collocational substrings were extracted for
every pair of substrings which had appeared 10 or more
times in the source text*5. The results are shown in
Table 5. Examples of pairs with high frequency and with
many characters in total are shown in Table 6.
Table 5 Number of Extracted Pairs of Substrings
Frequency              No. of Pairs of Substrings   Total Frequency of Pairs
... or more times               6,544                      21,829
 5 or more times                  941                       9,057
10 or more times                  237                       4,556
20 or more times                   61                       2,291
From these results, it can also be seen that expressions 
typical to newspapers have been extracted. Thus, using the 
output results, we can easily obtain interrupted collo- 
cational expressions as well as uninterrupted ones. 
Table 3 (continued) Examples of Extracted Substrings (in the order of frequency): N-gram Statistics
(●: fractional substring; entries are given by their English glosses, followed by the frequency)
5 gram:  (became to be ~) 3710, (be ~ing but ~) 2827, (according to ~) 2753,
         (speaking about ~) 2721, (be done) 2334, ● 2286,
         ● 2079, (explain that ~) 1997, (57 fiscal year) 1849
         [748,172 types, total 3,793,077 times]
10 gram: (from what ~ do) 273, ● 223,
         (no gloss) 223, (no gloss) 222,
         (according to that ~ was) 222, (second research party) 208
         [132,865 types, total 345,232 times]
Table 4 Examples of Substrings with High Frequency (frequency > 200)
(to say that) 586, (said that) 512, (set as) 436, (again) 325, (is that) 324, (photography) 315,
(but) 302, (said that) 283, (Tokyo) 281, (Price) 278, (EC) 277, (however) 274, (Point) 269,
(one word) 264, (sell term) 259, (moreover) 236, (this is) 220, (for this sake) 204, (yet) 201
*4 Indirect sorting is conducted. When this process is executed in memory on a computer which has a compare
instruction with indirect addressing for fields of arbitrary length, the sorting time will be greatly shortened.
*5 It is expected that when the frequency of each substring is small, the frequency of their co-occurrence is smaller still.
Table 6 Pairs of Substrings with High Frequency
(entries are given by their English glosses, followed by the frequency where it survives)
Collocations of Compound Nouns:
  (price ~ sell time) 257, (General ~ Motors) 117,
  (Summit ~), (EC ~ the European Community),
  (Iran ~ Japan Oil Industry) 80
Collocations of Sentence Patterns:
  (did ~ but said that) 9, (In the answer to ~ said ~) 9,
  (we talk that ~), (the contents is such that ~),
  (moreover the minister said that), (doing ~ said ~),
  (the contents is ~ and so), (did ~ also about ~),
  (as if ~ looks ~), (namely ~ is ~) 4, (either ~ or) 4
(2) Processing Time 
In the case of interrupted collocational substring
extraction, the processing time depends heavily on the
number of component substrings. In this experiment, the
turnaround time was 1 or 2 hours when the components of
the collocations to be extracted were limited to substrings
with a frequency of 10 or more.
6. Conclusion 
Methods of automatically identifying and extracting
uninterrupted and interrupted collocations from very large
corpora have been proposed.
First, from the viewpoint of collocational expression
extraction, the problems of Nagao and Mori's algorithm for
calculating N-grams of arbitrary length have been pointed
out. Then, under the condition that the extraction of
fractional substrings is restrained, a new method of
automatically extracting and tabulating all of the
uninterrupted collocational substrings has been proposed.
Next, using these results, a method for automatically
extracting interrupted collocational substrings has been
proposed. In this method, combinations of uninterrupted
collocational substrings which collocate at different
positions within a sentence are extracted and counted.
The method was applied to newspaper articles
comprising some 8.92 million characters. The results for
uninterrupted collocations were compared with those of
N-gram statistics. In the case of substring extraction with
2 or more characters, the conventional method yielded 4.4
million types of substrings with a total frequency of 31.2
million. In contrast, the method proposed in this paper
extracted 0.97 million types of substrings, and their total
frequency was reduced to 2.6 million. In the case of
interrupted collocational substring extraction, combining
the substrings with a frequency of 10 or more extracted by
the first method, 6.5 thousand types of pairs of substrings
with a total frequency of 21.8 thousand were extracted.
From these results, it can be said that, viewed from
the point of extraction of collocational expressions (as
units of syntactic and semantic expressions), the substrings
obtained by conventional methods include a very large
number of fractional substrings. In contrast, the method
proposed in this paper removes many of such fractional
substrings and condenses the output into a group of
substrings that can be regarded as units of expression. As
a result, it has become possible to easily calculate
interrupted collocations together with phrase templates and
other basic data regarding sentence structure.
This paper used Japanese character chains to examine
the algorithm. Yet this algorithm can be applied to
arbitrary symbol chains. Various types of applications are
possible, such as word chains, syntactic element chains
obtained from the results of morphological analysis, or
semantic attribute chains in which each word is converted
to its semantic attributes. As shown in this paper,
applications to Japanese character chains still involve the
output of some amount of fractional strings. But when the
method is applied to word chains or syntactic element
chains, a further reduction of unnecessary elements is
anticipated.
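As a hedged illustration of the remark on arbitrary symbol chains, the same sorted-pointer idea carries over to a list of word tokens simply by comparing list slices instead of string slices. The function name `repeated_word_ngrams` and the whitespace tokenization are assumptions for this sketch; it reports only the longest sequence shared by each pair of neighbouring String-words and does not include the validity check of Section 3.

```python
def repeated_word_ngrams(tokens, min_len=2):
    """Longest word sequences shared by adjacent suffixes of a token list."""
    n = len(tokens)
    # Sorted pointer table over word positions instead of character positions.
    spt = sorted(range(n), key=lambda i: tokens[i:])
    found = set()
    for a, b in zip(spt, spt[1:]):
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k >= min_len:
            found.add(tuple(tokens[a:a + k]))
    return found


tokens = "in order to go in order to stay".split()
print(repeated_word_ngrams(tokens))
```

With words as symbols, a fragment can no longer cut a word in half, which is consistent with the expectation that fewer unnecessary elements are produced.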
References: 
Church, K. W. and Hanks, P. (1990): Word Association
Norms, Mutual Information and Lexicography,
Computational Linguistics, Vol.16, No.1, pp.22-29
Colier, R. (1994): N-gram Cluster Identification during
Empirical Knowledge Representation Generation, The
Computation and Language E-Print Archive
Kita, K., Ogura, K., Morimoto, T. and Ueno, Y. (1993):
Automatically Extracting Frozen Patterns from Corpora
Using Cost Criteria, Journal of Information Processing,
Vol.34, No.9, pp.1937-1943
Kita, K., Kato, Y., Omoto, T. and Yano, Y. (1994): A
Comparative Study of Automatic Extraction of Collocations
from Corpora: Mutual Information vs. Cost Criteria,
Journal of Natural Language Processing, Vol.1, No.1,
pp.21-33
Nagao, M. and Mori, S. (1994): A New Method of N-gram
Statistics for Large Number of n and Automatic
Extraction of Words and Phrases from Large Text Data of
Japanese, The Proceedings of the 15th International
Conference on Computational Linguistics, pp.611-615
Shinnou, H. and Isahara, H. (1995): Automatic Extraction
of Frozen Patterns to Act as a Postpositional Particle
by Pseudo N-gram, Journal of Information Processing,
Vol.36, No.1, pp.32-40
Smadja, F. A. and McKeown, K. R. (1990): Automatically
Extracting and Representing Collocations for Language
Generation, Proceedings of the 28th Annual Meeting of
the Association for Computational Linguistics, pp.252-259
Smadja, F. (1993): Retrieving Collocations from Text:
Xtract, Computational Linguistics, Vol.19, No.1,
pp.143-177
