CHINESE STRING SEARCHING USING THE KMP ALGORITHM
Robert W.P. Luk 
Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong 
E-mail: csrluk@comp.polyu.edu.hk 
Abstract 
This paper describes modifications of the KMP
(Knuth, Morris and Pratt) algorithm for string
searching of Chinese text. The difficulty lies in searching
through a text string that mixes single- and multi-byte
characters. We show that the input must be decoded
as a sequence of characters rather than bytes.
The standard KMP algorithm can easily be
modified for Chinese string searching, but with a
worst-case time complexity of O(3n) in terms of the
number of comparisons. The finite-automaton
implementation achieves a worst-case time
complexity of O(2n), but the cost of constructing the
transition table depends on the size of the alphabet, Σ,
which is large for Chinese (for Big-5, |Σ| > 13,000). A mapping
technique reduces the size of the alphabet to at most |P|,
where P is the pattern string.
1. Introduction 
The alphabet size of Chinese (more precisely,
Hanyu) is relatively large (e.g. about 55,000 characters in Hanyu
Da Cidian) compared with Indo-European languages.
Various internal codes (e.g. GB, Big5, and Unicode)
have been designed to represent a selected subset
(5,000-16,000 characters), each of which requires two
or more bytes. For compatibility with existing single-byte
text, the most significant bit of the first byte is used to
distinguish multi-byte characters from single-byte
characters. For instance, Web browsers (e.g.
Netscape) cannot interpret annotations represented
by their equivalent 2-byte characters. Thus, Chinese
string searching algorithms have to deal with a
mixture of single- and multi-byte characters.
This paper focuses on 2-byte characters because
their internal codes are widely used. Two modified
versions of the KMP algorithm are presented: the
classical one and the finite-automaton implementation.
Finally, we discuss practical situations in Chinese
string searching.
2. The Problem 
Directly applying existing fast string searching
algorithms (Knuth et al., 1977; Boyer and
Moore, 1977) to on-line Chinese text can lead to
errors in identification, as happens when using the find
option of Netscape in a Chinese window. For example, a pattern
string, P, consisting of a single 2-byte character (AA,AA in
hexadecimal) can successfully match the second and third bytes of
a text string, T, of two 2-byte characters (A4,AA,AA,43 in
hexadecimal), which is incorrect. The error occurs
because the second byte of the first character in T is
interpreted as the first byte of the pattern character.
Thus, it is necessary to decode the input data as
characters.
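The false match described above can be reproduced in a few lines. The following is a minimal sketch, assuming (as in the Introduction) that a byte with its most significant bit set starts a 2-byte character. The helper names decode_chars and find_chars are hypothetical, and the character-level search is a plain brute-force scan used only to show the contrast with byte-level matching.

```python
def decode_chars(data: bytes) -> list:
    """Split a byte string into 1- and 2-byte characters.

    Assumption: a byte >= 0x80 (most significant bit set) is the first
    byte of a 2-byte character; any other byte is a 1-byte character.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def find_chars(text: bytes, pattern: bytes) -> int:
    """Return the 0-based byte offset of the first character-aligned match, or -1."""
    t, p = decode_chars(text), decode_chars(pattern)
    offset = 0
    for k in range(len(t) - len(p) + 1):
        if t[k:k + len(p)] == p:
            return offset
        offset += len(t[k])
    return -1


text = bytes([0xA4, 0xAA, 0xAA, 0x43])   # two 2-byte characters
pattern = bytes([0xAA, 0xAA])            # one 2-byte character

print(text.find(pattern))         # byte-level search reports offset 1 (false match)
print(find_chars(text, pattern))  # character-level search reports -1 (no match)
```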
Two well-known string searching algorithms were
discovered by Knuth, Morris and Pratt (1977) (KMP),
and Boyer and Moore (1977) (BM). The KMP
algorithm has better worst-case time complexity
whereas the BM algorithm has better average-case
time complexity. Recently, there has been some
interest in improving the time complexity of the BM
algorithm (Hume and Sunday, 1991; Crochemore et al., 1994)
or proving a smaller bound on it (Cole, 1994), as well as
in the efficient construction of the BM algorithm
(Baeza-Yates et al., 1994). These algorithms derived from BM
assume that, given the positional index, i, of the
text string, T, the data T[i] can be accessed and interpreted as
a character. However, in a text string of single- and
multi-byte characters, i can point to the first byte or
the second byte of a 2-byte character, and the
computer cannot determine which in the middle of the text
string. It has to scan left or right until a one-byte
character, the beginning of the text string or the end of
the text string is encountered. For example, the BM
algorithm moves to position i = 4 (= ||P||) for
matching in Table 1. At this position, T[4] (= A4) does
not match P[4]. Since the computer cannot
determine whether T[4] is the first or second byte of
a 2-byte character, it cannot use the delta tables to
determine the next matching state. Even worse, for
some internal codes (e.g. Big-5), it is not possible to
directly convert the byte sequence into the
corresponding character sequence in the backward
direction. Thus, as a first step, we focus on modifying
the KMP algorithm for Chinese string searching.
i       1   2   3   4   5   6   7   8
T[i]    A4  A3  A4  A0  A4  A7  A4  DF
P       <   (2-byte)    >
P[i]    3C  A4  A4  3E
Table 1: Matching between the text string, T, and the pattern
string, P, where P is a 2-byte character enclosed in < and >.
Here, T[] and P[] show the hexadecimal value of each byte in
T and P.
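To make the boundary problem concrete, the sketch below checks whether a byte offset begins a character by decoding forward from the start of the text. This is an illustration only, under the assumption that a byte with its most significant bit set starts a 2-byte character; the function name is_char_start is hypothetical. The forward scan costs time proportional to the offset, which is exactly the cost that a jump in the BM algorithm is meant to avoid.

```python
def is_char_start(data: bytes, pos: int) -> bool:
    """Return True if 0-based byte offset pos is the first byte of a character.

    Assumption: a byte >= 0x80 starts a 2-byte character. The answer is
    found by decoding from offset 0, so the cost grows with pos.
    """
    i = 0
    while i < pos:
        i += 2 if data[i] >= 0x80 else 1
    return i == pos


# Bytes of T as listed in Table 1 (four 2-byte characters).
T = bytes([0xA4, 0xA3, 0xA4, 0xA0, 0xA4, 0xA7, 0xA4, 0xDF])

# Position i = 4 in the paper is 1-based, i.e. offset 3 here.
print(is_char_start(T, 3))   # False: offset 3 is a second byte
print(is_char_start(T, 4))   # True
```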
3. Knuth-Morris-Pratt Algorithm. 
3.1 Searching 
Figure 1 lists the modified version of the
KMP algorithm (Knuth et al., 1977) for searching
Chinese strings. Here, i is the positional index of the
text string, with the position specified in terms of
bytes. By comparison, j is the positional index of the
pattern string, P, with the position in terms of
characters. Characters in P are stored in two arrays,
P1[] and P2[]. Here, P1[] stores the first byte and
P2[] stores the second byte of two-byte characters in
P. If there are single-byte characters in P, they are
stored in P1[] and the data in the corresponding positions
of P2[] are undefined. Here, we assume that a NULL
character is patched at those positions. For example, if
P = <X<XY>, where X and Y stand for the 2-byte characters
with bytes A4,A3 and A5,69 (hexadecimal), then the values
in P1[] and P2[] are as shown in Table 2.
1  function Chinese_KMP
2  {
3    int i=1; j=1;
4    while ((j <= |P|) && (i <= ||T||)) {
5
6      if (one-byte-character(T[i]))   /* decode single- or 2-byte characters */
7      { while ((j != 0) && (T[i] != P1[j]))   /* 1-byte character matching */
8          j = next[j];   /* failure link */
9        i++;   /* update i position */
10     }
11     else { while ((j != 0) && ((T[i] != P1[j]) || (T[i+1] != P2[j])))   /* matching */
12         j = next[j];   /* failure link */
13       i += 2;   /* update i position */
14     }
15     j += 1;   /* update j position */
16   }   /* while-loop ends */
17   if (j > |P|) then return(i - ||P||);   /* compute matched position */
18   else return(0);   /* no matched position */
19 }
Figure 1: A modified version of KMP for Chinese string
searching. The function one-byte-character determines
whether the current input is a single- or 2-byte character by
testing whether the converted integer value of T[i] is
positive or negative. If the converted value is negative, then
T[i] is the first byte of a 2-byte character. Here, |T| and ||T||
are the lengths of the text string, T, in terms of characters and
bytes, respectively.
The program in Figure 1 determines (in line 6)
whether the current input character is a single- or
two-byte character. If it is a single-byte character, the
standard KMP algorithm operates on that single-byte
character, T[i], in lines 7 to 10. Otherwise, i is pointing
at a two-byte character. This implies that: (a) matching
of 2-byte characters is carried out, where the data in
T[i+1] is the second byte of the character (line 11);
and (b) i is incremented by 2 instead of 1, because it
counts in terms of bytes (line 13). Since j counts
in terms of characters, the increment of j (line 15) is
one whether the characters in P are single or two bytes.
When the pattern string is found in T, the position of
the first matched character in T is returned. Since the
position is in terms of bytes, it is the last matched
position, i, minus the length of P in terms of bytes (i.e.
||P||).
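The walk-through above can be tried out with a runnable sketch. It follows the structure of Figure 1 but makes several assumptions that are not in the paper: characters are held as Python byte strings rather than in P1[] and P2[], next[] is computed directly on the decoded characters with the table-building procedure of Knuth et al. (1977) instead of through the mapping f(), and the text index is kept 0-based internally. The names decode_chars, compute_next and chinese_kmp are hypothetical.

```python
def decode_chars(data):
    """Split bytes into 1- and 2-byte characters (a byte >= 0x80 starts a 2-byte one)."""
    chars, i = [], 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def compute_next(p):
    """Build the KMP next[] table; p is 1-based and p[0] is unused padding."""
    m = len(p) - 1
    nxt = [0] * (m + 1)
    j, t = 1, 0
    while j < m:
        while t > 0 and p[j] != p[t]:
            t = nxt[t]
        t += 1
        j += 1
        nxt[j] = nxt[t] if p[j] == p[t] else t
    return nxt


def chinese_kmp(text, pattern):
    """Return the 1-based byte position of the first match, or 0 if there is none."""
    p = [None] + decode_chars(pattern)   # j counts characters, as in Figure 1
    m = len(p) - 1
    nxt = compute_next(p)
    i, j = 0, 1                          # i is a 0-based byte offset into text
    while j <= m and i < len(text):
        width = 2 if text[i] >= 0x80 else 1   # decode a single- or 2-byte character
        ch = text[i:i + width]
        while j != 0 and ch != p[j]:
            j = nxt[j]                   # follow the failure link
        i += width                       # i advances in bytes
        j += 1                           # j advances in characters
    if j > m:
        return i - len(pattern) + 1      # bytes consumed minus ||P||, made 1-based
    return 0


X, Y = b"\xa4\xa3", b"\xa5\x69"          # the 2-byte characters of Table 2
P = b"<" + X + b"<" + X + Y + b">"
print(chinese_kmp(b"<" + X + b"<" + X + b"<" + X + Y + b">", P))   # 4
print(chinese_kmp(bytes([0xA4, 0xAA, 0xAA, 0x43]), b"\xaa\xaa"))   # 0
```

Because the text is decoded one character at a time before any comparison, the second call does not report the byte-level false match discussed in section 2.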
Character P[j]   <     X     <     X     Y     >
j                1     2     3     4     5     6
P1[j]            3C    A4    3C    A4    A5    3E
P2[j]            NULL  A3    NULL  A3    69    NULL
f(P[j])          <     a     <     a     b     >
next[j]          0     1     0     1     3     0
Table 2: The values of the patterns indexed by j. Here, X and Y
stand for the 2-byte characters with bytes A4,A3 and A5,69
(hexadecimal), and P[] is a conceptual array which can hold both
single- and 2-byte characters. This array is implemented as two
arrays, P1[] and P2[], which store the first and second byte of the
2-byte characters, respectively. The function, f(), maps two-byte
characters into single-byte characters, simplifying the
generation of values in the array, next[], and the failure links
in fl[].
3.2 Generating next[]
The array, next[], contains the failure link values,
which can be generated by existing algorithms
(Standish, 1980) for single-byte characters. The basic
idea is to map the 2-byte characters in P to single-byte
characters and then use existing algorithms. The
mapping is implemented as an array, f[]. Each
character in P is scanned from left to right. Whenever
an unseen 2-byte character is found, it is assigned a character
value that is the negative of the number of different
2-byte characters seen so far. For example, the third
unseen 2-byte character is mapped to a one-byte
character, the value of which is (char) -3.
The mapping scheme is practical. First, the number
of different characters that can be represented with a
negative value is 127 and usually |P| < 128 characters.
Second, the mapping can be done in O(||P||) time, i.e.
in linear time with respect to |P| and in
constant time with respect to |T|. This is important
because it is added to the total time complexity of
searching. To achieve O(||P||), the function, found(),
uses an array, f[], of size |Σ| (where Σ is the alphabet)
to store the equivalent single-byte characters. A
perfect hash function (section 4), h(), converts the
2-byte characters into an index of f[]. After searching, it
is necessary to clear f[]. This can be done in O(||P||)
by assigning NULL characters to the locations in f[]
corresponding to 2-byte characters in P.
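The mapping can be sketched as follows. This is an illustration under stated assumptions, not the paper's data structure: a Python dict stands in for the array f[] and its perfect hash function h(), the pattern is supplied as an already decoded list of characters, and the names build_mapping, to_single_byte and NULL are hypothetical. The name found() is taken from the text, but its signature here is assumed.

```python
NULL = 0   # value returned for 2-byte characters that do not occur in P


def build_mapping(pattern_chars):
    """Scan P left to right and give each unseen 2-byte character the next
    negative value: -1 for the first, -2 for the second, and so on."""
    f = {}
    for ch in pattern_chars:
        if len(ch) == 2 and ch not in f:
            f[ch] = -(len(f) + 1)
    return f


def found(f, first, second):
    """Look up the single-byte equivalent of the 2-byte character (first, second)."""
    return f.get(bytes([first, second]), NULL)


def to_single_byte(pattern_chars, f):
    """Form the single-byte version of P; 1-byte characters keep their own value."""
    return [f[ch] if len(ch) == 2 else ch[0] for ch in pattern_chars]


# The pattern of Table 2: < X < X Y > with X = A4,A3 and Y = A5,69.
P = [b"<", b"\xa4\xa3", b"<", b"\xa4\xa3", b"\xa5\x69", b">"]
f = build_mapping(P)
print(to_single_byte(P, f))     # [60, -1, 60, -1, -2, 62]
print(found(f, 0xA5, 0x69))     # -2
print(found(f, 0xA4, 0xAA))     # 0, i.e. NULL
```

The values -1 and -2 play the role of the letters a and b in the f(P[j]) row of Table 2. With a dict, clearing the mapping after a search is a single f.clear() call; the array version in the text instead resets only the entries touched by P.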
4. Finite automaton implementation. 
Since ||T|| is large, reducing its multiplicative factor
in the time complexity would be attractive. In Knuth et
al. (1977), this was done using a finite automaton
which searches in O(||T||) instead of O(2||T||).
Standish (1980) provided an accessible algorithm to
build the automaton, M. First, failure link values are
computed (similar to computing values in next[]) as in
Algorithm 7.4 (Standish, 1980) and then the state
transitions are added as in Algorithm 7.5 (Standish,
1980). A direct approach is to compute the conceptual
automaton, Mc, which regards the 2-byte characters as
one-byte, and then convert the automaton for multi-byte
processing. Since the space-time complexity of
constructing the automaton depends on the size of the
alphabet (i.e. O(|Σ| × |Qc|), where Qc is the set of states
of Mc), which is large, this approach is not attractive.
For instance, if |Qc| = 10 and |Σ| ≈ 10,000, then about
100,000 units of storage (integers) are needed! Further
processing is needed to convert the automaton for
2-byte processing!
4.1 Automaton implementation.
Another approach uses the different characters in P
as the reduced alphabet, Σr, whose size is much smaller
than |Σ|. We use a mapping function as discussed in
section 3.2 to build a mapping of 2-byte characters to
one-byte characters. These one-byte characters and the standard
one-byte characters (e.g. ASCII) form Σr. The NULL
character, λ, represents all the characters in Σ but not
in Σr - {λ} (i.e. Σ - (Σr - {λ})). Given that the
multi-byte string, P, is transformed into a single-byte string,
P', existing algorithms can be used to construct the
automaton.
For each pattern string, P, string searching will
execute the following steps:
(a) convert 2-byte characters to one-byte in P to form
P' (i.e. form Σr) using mapping as in section 3.2;
(b) compute the failure link values of P' using
Algorithm 7.4 in (Standish, 1980);
(c) compute the success transitions and store them in
δ() as in (Standish, 1980);
(d) compute the failure transitions from the failure
link values using Algorithm 7.5 in (Standish, 1980)
and store the transitions in δ();
(e) use the automaton, M, with state transition function
δ(), to search for P' in T;
(f) output the matched position, if any;
(g) clear the mapping function that forms Σr using P.
4.2 Constructing the automaton. 
For steps (c) and (d), the operation of Algorithm 7.5
was illustrated with an example of a binary alphabet in
(Standish, 1980). Here, we illustrate the use of a larger
alphabet, Σr, with λ ∈ Σr. Suppose the pattern string, P,
is as shown in Table 2, which also contains the
corresponding P' and failure link values, fl[]. The
success transitions are added to δ() as δ(j-1, P'[j]) ← j
(e.g. δ(0,<) ← 1 and δ(1,a) ← 2). The failure transitions
are computed from state 0 to |P'| because fl[j] < j. For state
0, δ(0,α) ← 0 if α ≠ P'[1] and α ∈ Σr (i.e. δ(0,a) ← 0,
δ(0,b) ← 0, δ(0,>) ← 0, δ(0,λ) ← 0 but δ(0,<) ← 1).
For other states, δ(j,α) ← δ(fl[j],α) if α ≠ P'[j+1] and
α ∈ Σr (e.g. δ(1,b) ← δ(fl[1],b) = δ(0,b) = 0 and δ(1,<) ←
δ(fl[1],<) = δ(0,<) = 1). Effectively, the entries in
δ(fl[j],.) are copied across to the corresponding entries
in δ(j,.) except for the success transition from j.
Figure 2 illustrates how a row of entries in δ(fl[j],.) is
copied across to compute δ(j,.).
[Figure 2: state transition table of M with columns for the
symbols of Σr and for fl[j]. Key: arrows copy state transitions
from one location to the other; the failure link points back to
previous state transitions for copying.]
Figure 2: An illustration of constructing the failure
transitions of M. Here, j = 4 and the failure link of j (i.e. fl[4]
= 2) is used to determine which of the previous rows of the
state transition table, δ(), is used for updating the values of
the current row in δ(). The underlined entries are the success
transitions.
Figure 3 shows the program that computes the state
transitions using the failure links. The program
computes the transitions for state 0, the last state and
the other states separately. The last state is distinguished
because it has no success transition, whereas each of the
other states has one. The program for generating failure links
is not given because:
(1) it is similar to computing next[];
(2) a version is available (Algorithm 7.4 in Standish,
1980) which does not need any modification.
void buildtransitions()
{
  int i=0, j=0, k=0;

  for (i=-|Σ2|; i<=|Σ1|; i++)   /* build transitions at j = 0 */
    if ((char) i == P'[1]) δ(0,i) = 1;
    else δ(0,i) = 0;
  for (j=1; j < |P'|; j++) {   /* build other transitions which have success transitions */
    k = fl[j];
    for (i=-|Σ2|; i<=|Σ1|; i++)
      if ((char) i == P'[j+1]) δ(j,i) = j+1;
      else δ(j,i) = δ(k,i); }
  k = fl[|P'|];   /* failure transitions for j = |P'| */
  for (i=-|Σ2|; i<=|Σ1|; i++)   /* there is no success transition in this case */
    δ(j,i) = δ(k,i);
}
Figure 3: Building the state transitions given that the failure
links are known. Note that the algorithm assumes that Σr =
Σ1 ∪ Σ2, where Σ1 and Σ2 are the one-byte (e.g. ASCII)
character alphabet and the transformed 1-byte character
alphabet representing the different two-byte characters in P,
respectively. Furthermore, since |Σ2| < 128 and Σ2 ⊂ Σ, a
multiplicative factor of the space-time complexity can be
reduced if mapping is also carried out for single-byte as well
as 2-byte characters in P. The correctness of the above
program can be shown by mapping all the characters not in
Σr to λ because they have identical state transition values
(i.e. dividing the alphabet into equivalence classes of identical
transition values).
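A runnable sketch of the construction is given below for the mapped pattern P' of Table 2. It is an illustration under assumptions rather than the program of Figure 3: the alphabet is restricted to the symbols that occur in P' plus λ (the variant mentioned in the caption, where single-byte characters are also reduced), λ is represented by the code 0, each row of δ() is a Python dict, and fl[] is computed with the usual prefix-function recurrence in place of Algorithm 7.4. The names failure_links, build_transitions and LAMBDA are hypothetical.

```python
LAMBDA = 0   # code standing for every character that does not occur in P'


def failure_links(p):
    """fl[j] = length of the longest proper prefix of p[:j] that is also its suffix."""
    m = len(p)
    fl = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and p[j - 1] != p[k]:
            k = fl[k]
        if p[j - 1] == p[k]:
            k += 1
        fl[j] = k
    return fl


def build_transitions(p):
    """Return delta, where delta[state][symbol] is the next state (p is 0-based)."""
    m = len(p)
    alphabet = set(p) | {LAMBDA}
    fl = failure_links(p)
    delta = [dict() for _ in range(m + 1)]
    for a in alphabet:                       # state 0
        delta[0][a] = 1 if m > 0 and a == p[0] else 0
    for j in range(1, m):                    # states that have a success transition
        for a in alphabet:
            delta[j][a] = j + 1 if a == p[j] else delta[fl[j]][a]
    if m > 0:                                # last state: copy the failure row only
        for a in alphabet:
            delta[m][a] = delta[fl[m]][a]
    return delta


# P' for Table 2: '<' = 60, a = -1, b = -2, '>' = 62.
p_mapped = [60, -1, 60, -1, -2, 62]
fl = failure_links(p_mapped)
delta = build_transitions(p_mapped)
print(fl[4])            # 2, the failure link used in Figure 2
print(delta[4][-2])     # 5, the success transition from state 4
print(delta[4][60])     # 3, copied from row fl[4] = 2
```

Rows are filled in increasing order of j, so the row delta[fl[j]] is always complete before it is copied, which relies on fl[j] < j as noted in section 4.2.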
4.3 Searching. 
Searching is implemented as state transitions of M 
(Figure 4). Initially, the state of the automaton, M, is 
set to 0. The next state is determined by the current 
character read from the text string, T, at position i and 
the current state. If the current state is equal to |P'|,
then P is in T at position i - ||P||.
int FAKMP()
{
  int i=1; state=0;
  while ((state != |P'|) && (i <= ||T||)) {
    if one-byte-character(T[i])   /* decoding front-end */
      input_character = (int) T[i];
    else { input_character = found(T[i], T[i+1]);
           i++; }   /* update for 2-byte character */
    state = δ(state, input_character);
    i++;
  }
  if (state == |P'|) return (i - ||P||);
  else return(0);
}
Figure 4: String searching of multi-byte characters using the 
finite automaton. 
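Steps (a) to (f) of section 4.1 can be put together in one runnable sketch. As before, this is an illustration under assumptions and not the paper's implementation: every distinct character of P (single- or 2-byte) is mapped to a small positive code through a dict, code 0 plays the role of λ, δ() is a list of rows indexed by code, and a byte with its most significant bit set is taken to start a 2-byte character. The names decode, build_automaton and fa_kmp are hypothetical.

```python
LAMBDA = 0   # code for any character that does not occur in the pattern


def decode(data):
    """Yield (byte_offset, character) pairs; a byte >= 0x80 starts a 2-byte character."""
    i = 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        yield i, data[i:i + width]
        i += width


def build_automaton(pattern):
    chars = [ch for _, ch in decode(pattern)]
    code = {}                                  # step (a): map characters to codes 1, 2, ...
    for ch in chars:
        code.setdefault(ch, len(code) + 1)
    p = [code[ch] for ch in chars]
    m = len(p)
    fl = [0] * (m + 1)                         # step (b): failure links
    k = 0
    for j in range(2, m + 1):
        while k > 0 and p[j - 1] != p[k]:
            k = fl[k]
        if p[j - 1] == p[k]:
            k += 1
        fl[j] = k
    delta = [[0] * (len(code) + 1) for _ in range(m + 1)]
    for j in range(m + 1):                     # steps (c) and (d): fill each row
        for a in range(len(code) + 1):
            if j < m and a == p[j]:
                delta[j][a] = j + 1            # success transition
            elif j > 0:
                delta[j][a] = delta[fl[j]][a]  # copy from the failure row
    return code, delta, m


def fa_kmp(text, pattern):
    """Return the 1-based byte position of the first match, or 0 if there is none."""
    code, delta, m = build_automaton(pattern)
    if m == 0:
        return 1
    state = 0
    for offset, ch in decode(text):            # step (e): one table lookup per character
        state = delta[state][code.get(ch, LAMBDA)]
        if state == m:
            return offset + len(ch) - len(pattern) + 1   # step (f)
    return 0


X, Y = b"\xa4\xa3", b"\xa5\x69"
P = b"<" + X + b"<" + X + Y + b">"
print(fa_kmp(b"<" + X + b"<" + X + b"<" + X + Y + b">", P))   # 4
print(fa_kmp(bytes([0xA4, 0xAA, 0xAA, 0x43]), b"\xaa\xaa"))   # 0
```

Step (g) needs no code here because the dict is local to each call; with a shared array f[] it would have to be cleared explicitly, as described in section 3.2.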
5. Practical considerations. 
The KMP algorithm (Knuth et al., 1977) is
considered to perform better when the pattern string
has recurrence patterns. Otherwise, it is about the same
as the brute-force implementation with quadratic
time complexity. For Chinese string searching, it is
not uncommon to search for reduplicating words (e.g.
~3"'S~.3 and §O§(31AOIAO) (Chen et al., 1992), which have
recurrence patterns. Such repetition to form words is
used for emphasis as well as being an essential part of
yes-no questions. Otherwise, recurrence patterns in P
occur only incidentally (e.g. nn~j~n~WA~3Aq"t,
translated as the Department of Chinese, Chinese
University of Hong Kong).
Apart from recurrence, if there are many backing-up
operations, the KMP algorithm would perform
better than the brute-force implementation. Such cases
occur where a proper prefix of the pattern string has a
high occurrence frequency in the text string (e.g.
function words). In Chinese string searching, this will
happen for technical terms that have a high-frequency
prefix constituent. For instance, Chinese law articles
have many terms beginning with the word ~°~ (i.e.
China). A search through the Chinese law text for
P:~%~H will require much backing up (i.e.
committing false starts) in the brute-force
implementation when words or phrases like ~D%"k<ffS,
c~%"°>)fi~g, cm°OkDAv and c~c~%',D~k are encountered.
Sometimes, patterns which are words can match
text where the matched string is not
functioning as a word. For example, nj.\[ (which means
conference) can be regarded as a word but in the
phrase, 2"~j.l¶}~@¶i.s"°~Abe, the first character
(underlined) of the matched string (in italics) is part of
a name and the second character (in italics) functions as
a verb. Thus, Chinese text is often pre-segmented and
string searching has to patch delimiters to the
beginning and end of the pattern, P. However, the
searching accuracy depends on the segmentation
algorithm, which is usually implemented as a
dictionary look-up procedure. If a dictionary has poor
coverage, the text tends to be over-segmented (Luk,
1994) and the recall performance of searching will
drop drastically. Such cases occur if a general
dictionary is used in segmenting technical articles (e.g.
in law, medicine, computing, etc).
REFERENCES 
BAEZA-YATES, R.A., C. CHOFFRUT & G.H. GONNET (1994)
"On Boyer-Moore automata", Algorithmica, 21, pp. 268-292.
BOYER, R. & S. MOORE (1977) "A fast string searching
algorithm", Communications of ACM, 20, pp. 762-772.
CHEN, F-Y., R-P. J. MO, C-R. HUANG, K-J. CHEN (1992)
"Reduplication in Mandarin Chinese: their formation rules,
syntactic behavior and ICG representation", Proc. of R.O.C.
Computational Linguistics Conference V, Taipei, Taiwan,
pp. 217-233.
COLE, R. (1994) "Tight bounds on the complexity of the
Boyer-Moore string matching algorithm", SIAM Journal of
Computing, 23:5, pp. 1075-1091.
COLUSSI, L. (1994) "Fastest pattern matching in strings",
Journal of Algorithms, 16, pp. 163-189.
CROCHEMORE, M., A. CZUMAJ, L. GASIENIEC, S. JAROMINEK,
T. LECROQ, W. PLANDOWSKI & W. RYTTER (1994)
"Speeding up two string-matching algorithms",
Algorithmica, 12, pp. 247-267.
HUME, A. & D.M. SUNDAY (1991) "Fast string searching",
Software - Practice and Experience, 21:11, pp. 1221-1248.
KNUTH, D.E., J. MORRIS & V. PRATT (1977) "Fast pattern
matching in strings", SIAM Journal of Computing, 6, pp.
323-350.
LUK, R.W.P. (1994) "Chinese word segmentation using
maximal matching and bigram techniques", Proc. of R.O.C.
Computational Linguistics Conference VII, Hsinchu,
Taiwan, pp. 273-282.
STANDISH, T.A. (1980) Data Structure Techniques,
Addison-Wesley: Reading, Mass.
