LINGUISTIC ERROR CORRECTION OF JAPANESE SENTENCES 
Tsutomu Kawada, Shin-ya Amano and Kunio Sakai 
Information Systems Lab., Research and Development Center, Toshiba Corporation
1 Komukai Toshiba-cho, Saiwai-ku, Kawasaki 210, JAPAN 
Abstract 
This paper describes a newly developed lin- 
guistic error correction system, which can cor- 
rect errors and rejections of Japanese sentences 
by using linguistic knowledge. 
Conventional optical character readers 
(OCR) need human assistance to correct their 
recognition errors and rejections. An operator 
must teach the OCR correct answers whenever an 
illegible character pattern occurs. If this 
error correction operation is mechanized, the 
throughput of the OCR will increase. 
This linguistic error correction system offers a means of automated
error correction by linguistically analysing the sentences in the OCR
output. The system selects grammatically legal letters from among
candidates which cannot be decided uniquely by pattern recognition
alone, and recommends grammatically and semantically meaningful
letters for an illegible letter.
1. Introduction
More than 2,000 different Chinese characters are used in Japanese
newspapers and publications. The large repertory of letters and the
structural complexity of each character pattern are the main
difficulties in recognizing letters by OCR. Mutually similar character
patterns are the main cause of recognition errors and rejections. The
differences between such similar letters are usually concentrated in a
local area [example pairs of similar kanji are not legible in this
copy]. The output of the OCR contains these ambiguities.
It is necessary to use the contextual in- 
formation contained in every kind of natural 
text. The application of context makes it pos- 
sible to detect errors or even to correct them. 
The main object of this error correction system 
is to resolve these ambiguities on the basis of 
linguistic knowledge. To resolve these ambiguities, each candidate
letter is put into its sentence, and the sentence is analyzed
syntactically and semantically. When the resulting sentence is
acceptable, that letter is preferred.
There have been some contextual postprocessing systems (1, 2), but
these systems are designed to read only postal addresses. The words
which are included in a postal address are restricted to a relatively
narrow domain. A postal address usually consists of a person's name,
place name, postal code and numbers. On the other hand, this newly
developed kanji OCR can read actual Japanese texts, and its contextual
error correction system has been designed to deal with many kinds of
words of various parts of speech (noun, verb, adjective, adverb,
pronoun, etc.). This error correction system has a practical
25,000-word dictionary, and can correct 53.8 percent of the errors
which are included in the outputs of the kanji OCR.
2. Restrictions 
This system imposes two restrictions on input data. One is that the
input must consist of grammatical Japanese sentences so that syntax
analysis is applicable. This system is not effective for texts
consisting only of numeric data or for a mere list of words.
The other restriction is that the texts to be dealt with must be
limited to a special field. By this restriction we can limit the
number of technical terms which are used in the field.
The corpus used for the experiment is 1,700 claims from patent
gazettes of the Japan Patent Office. These gazettes concern the
manufacturing technology of LSI devices over thirteen years
(1964 - 1976). Figure 1 shows an example. This corpus includes
306,000 words, of which about five thousand are different words. The
distribution of the various categories is as follows:
    noun                 3603
    verb                  832
    adjective              75
    adverb                 61
    conjunction            30
    functional word        90
    suffix and prefix     110
As twenty thousand common words are added to these words, the
dictionary contains about twenty-five thousand words.
3. Features of Patent Sentences 
Written Japanese consists of kanji, kana (hiragana and katakana) and
alphanumeric letters. Kana are phonetic symbols and kanji are
ideographs. Each kana set (hiragana or katakana) consists of 48
letters. More than 2,000 different kanji letters are in daily use.
Japanese people write a sentence as one continuous string of letters
with no spaces (see Fig. 1). Japanese is different from western
languages in this point. To analyze a Japanese sentence, it is first
necessary to identify words in the continuous string of letters.
Figure 1. Actual Patent Gazette [a gazette page consisting of a
bibliography followed by a claim sentence]
Figure 2. Construction of a Pause Group [a string of kanji (kj) and
kana (kn) letters divided into pause groups]
Figure 2 shows the construction of a pause group, which is the minimum
meaningful unit of a Japanese sentence. The prefix, the independent
word and the suffix are usually written in kanji or katakana letters.
The dependent word is written in kana letters. Changes of letter type
as well as punctuation symbols give useful clues to the boundaries
where a long letter string can be separated into shorter manageable
units (pause groups). This correction system detects words by using
such conditions for these boundaries (Fig. 2).
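The boundary conditions described above can be illustrated with a
minimal sketch. It assumes that a pause group ends wherever a hiragana
letter is followed by a letter of another type, and that punctuation
always ends a group. The function names, the Unicode ranges and the
sample sentence are illustrative assumptions, not part of the system
described in this paper.

```python
def letter_type(ch: str) -> str:
    """Classify one character by Unicode block (a simplification)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch in "、。,.":
        return "punct"
    return "other"


def split_pause_groups(text: str) -> list[str]:
    """Cut at punctuation and wherever hiragana is followed by a
    non-hiragana letter (assumed start of the next independent word)."""
    groups, current, prev = [], "", None
    for ch in text:
        kind = letter_type(ch)
        if kind == "punct":
            if current:
                groups.append(current)
            current, prev = "", None
            continue
        if prev == "hiragana" and kind != "hiragana":
            groups.append(current)
            current = ""
        current += ch
        prev = kind
    if current:
        groups.append(current)
    return groups


# Hypothetical sample sentence, chosen only for illustration.
print(split_pause_groups("制御回路に信号を加える。"))
```

A rule of this kind fails when an independent word is itself written
in hiragana, which is the situation handled by the retry process in
Example 5.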
Experiments were conducted on the claim sentences of the patents. A
claim sentence has a particular style. Most claims consist of a single
sentence. An analytical study (3) of the claim sentences showed that
all sentences can be categorized into 14 sentence patterns by their
coordinate phrases. The average length of a sentence is 180 words. The
sentence is one big noun phrase, constructed from many coordinate
adjective or adverbial phrases which modify the same word. The claim
sentence is so long that it is practical to analyze it on the basis of
these phrases.
4. Kanji OCR and Its Errors
The large number of character categories as well as the structural
complexity of each character pattern are the dominant difficulties in
kanji character recognition. A two-stage recognition method (4) has
been developed to cope with these difficulties. This method employs an
efficient candidate selection prior to a precise individual
recognition. Fig. 3 shows a diagram of the two-stage recognition
method.
Figure 3. Two-Stage Recognition Method Diagram [first stage: candidate
selection unit; second stage: individual recognition unit]
In the first recognition process stage, 
feature extraction is carried out on the input 
pattern. Candidate characters are obtained ac- 
cording to their geometrical features. In the 
second stage, pattern matchings are carried out 
between the input pattern and each reference 
pattern in selected candidate characters. The 
decision is made on the basis of their similar- 
ity values. 
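The two stages can be sketched as follows. This is only an
illustration under stated assumptions: the coarse feature (fraction of
black cells), the cosine similarity, the rejection threshold and the
tiny reference patterns are all hypothetical, since the paper does not
specify the actual features or similarity measure.

```python
import math


def cosine(a, b):
    """Similarity value between two equal-length patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def coarse_feature(pattern):
    """Hypothetical geometrical feature: fraction of black cells."""
    return sum(pattern) / len(pattern)


def recognize(pattern, references, k=2, reject_below=0.9):
    # Stage 1: keep the k references whose coarse feature is closest.
    f = coarse_feature(pattern)
    candidates = sorted(
        references, key=lambda name: abs(coarse_feature(references[name]) - f)
    )[:k]
    # Stage 2: precise matching against the selected candidates only.
    scored = sorted(
        ((cosine(pattern, references[n]), n) for n in candidates), reverse=True
    )
    best_score, best = scored[0]
    if best_score < reject_below:
        return None, scored  # rejection: the letter is illegible
    return best, scored


references = {"A": [1, 1, 0, 0], "B": [1, 0, 1, 0], "C": [1, 1, 1, 1]}
print(recognize([1, 1, 0, 0], references)[0])
print(recognize([1, 0, 0, 1], references)[0])
```

The scored candidate list returned on rejection corresponds to the
candidate letters that the error correction system later examines.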
Mutually similar patterns as well as low print quality cause
recognition errors and rejections. These illegible letters have low
similarity values. The recognition speed of this kanji OCR is 100
characters per second. A correct recognition rate of more than 99
percent was obtained for actual data. The average letter count of the
claim sentences is 450 letters. Consequently this system encounters
about one illegible letter every second, and three or four in each
claim sentence.
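The arithmetic behind these figures can be checked with a short
calculation. The 1 percent rate is an assumption: it is the upper
bound implied by "more than 99 percent", so the results are upper
bounds as well.

```python
chars_per_second = 100    # recognition speed stated in the text
letters_per_claim = 450   # average claim length stated in the text
illegible_rate = 0.01     # assumed upper bound (more than 99 percent correct)

per_second = chars_per_second * illegible_rate
per_claim = letters_per_claim * illegible_rate
print(per_second, per_claim)
```

A rate slightly below 1 percent gives the three or four illegible
letters per claim sentence reported above.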
As the illegible letters have low similarity values, this correction
system can find doubtful letters easily. If this error correction
system checked all the letters contained in a text, it would need much
time to process them. Instead, it picks out only the phrases which
contain illegible letters, and analyzes their grammatical legality. By
this restriction the error correction system decreases the processing
time and becomes practical.
5. Error Correction Method 
The error correction system has three an- 
alysis functions (Fig. 4). 
a) word analysis function 
b) syntax analysis function 
c) wording analysis function 
Two notations are used here. When one let- 
ter can not be recognized uniquely, the candi- 
dates for the letter are enclosed in parentheses. 
A letter which can not be recognized at all is 
expressed by a question mark. 
Figure 4. Error Correction Method Diagram [blocks: segmentation, word
analysis, syntax analysis, wording analysis]
5.1 Word Analysis Function 
When it encounters an ambiguous letter in a sentence, the word
analysis program searches the dictionary to find a grammatically and
semantically valid candidate.
For example, '[パ バ]ターン認識 ([PA BA]TANNINSIKI)' shows that the
first letter is ambiguous. In this case, two candidates are tested.
The candidate 'パターン認識 (pattern recognition)' is meaningful but
'バターン認識' is not. So 'パ' is determined as the unique answer.
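The candidate test for the [PA BA] example can be sketched as a
dictionary lookup over every combination of candidate letters. The
slot representation, the function names and the toy dictionary are
assumptions made for illustration only.

```python
from itertools import product


def expand(slots):
    """slots: one list of candidate letters per position in the word."""
    return ["".join(choice) for choice in product(*slots)]


def resolve(slots, dictionary):
    """Keep only the readings that are dictionary words."""
    return [word for word in expand(slots) if word in dictionary]


dictionary = {"パターン", "認識"}               # toy entries
slots = [["パ", "バ"], ["タ"], ["ー"], ["ン"]]  # first letter is ambiguous
print(resolve(slots, dictionary))
```

When exactly one reading survives, as here, the ambiguity is resolved
at the word level; otherwise the remaining readings are passed on to
the later analysis steps.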
Some Japanese letters closely resemble one another.
Example 1.   Letter   Meaning
             －       dash
             一       one
             ー       symbol for long vowel
The selection from these similar patterns depends on their context.
'ー (a long vowel)' is reasonable for 'ベ[－ 一 ー]ス (BEESU, base)'.
Two or more words are frequently connected without any conjunction or
preposition in Japanese sentences. In this case the word analysis
program calls the compound word analysis subprogram, which looks up
the word dictionary and builds a compound word from two or three words
in order to analyze it. The above example is a compound word:
'パターン認識 (pattern recognition)' is constructed from 'パターン
(pattern)' and '認識 (recognition)'. This subprogram can match not
only a full string but also a sub-string (Fig. 5).
Figure 5. Sub-string Matching of a Compound Word [KEIRYO (computation)
+ GENGO (language) + GAKU (study); GENGO GAKU (linguistics); KEIRYO
GENGO GAKU (computational linguistics)]
When a letter has no candidate letter, the word analysis program
consults the dictionary and searches for words which fill in the
illegible letter.
Example 2.  Meaning          Part of speech
            (the kanji of the candidate words are not legible in this copy)
            innovation       noun, verb
            proof            noun, verb
            manifestation    noun, verb
            civilization     noun
            lighting         noun, verb
            transparency     adjective verb
            become clear     noun, verb
In this case seven letters can fill in the ?. The selection from these
candidates is performed in the next step, syntax analysis. This word
analysis program is not valid for consecutive illegible letters. As
most Japanese words are one or two kanji letters long, consecutive
illegible letters do not give any clue for searching the dictionary.
When we are given consecutive illegible letters '??', we can hardly
guess what they are.
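Filling an illegible letter from the dictionary can be sketched as a
wildcard match, where ? stands for exactly one letter. The function
name and the toy dictionary are assumptions for illustration; the
entries are not taken from the system's dictionary.

```python
def fill_illegible(pattern, dictionary, unknown="?"):
    """Return dictionary words matching `pattern`, where `unknown`
    stands for exactly one illegible letter."""
    def matches(word):
        return len(word) == len(pattern) and all(
            p == unknown or p == w for p, w in zip(pattern, word)
        )
    return sorted(word for word in dictionary if matches(word))


dictionary = {"証明", "照明", "透明", "文明", "認識"}  # toy entries
print(fill_illegible("?明", dictionary))   # one legible letter narrows the search
print(fill_illegible("??", dictionary))    # no legible letter: every word matches
```

The second call shows why consecutive illegible letters give no clue:
every two-letter word in the dictionary remains a candidate.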
5.2 Syntax Analysis Function 
When the word analysis cannot resolve the ambiguities, the syntax
analysis is applied to them. In Example 2, there are still seven
candidates which were selected by the word analysis function. The
syntax analysis program refers to the contextual information.
Example 3.  (a phrase in which the word containing ? is followed by
            'な (NA)'; the correct reading means 'transparent terminal')
This program first conducts a morphological analysis of the given
pause group, and analyzes the syntactic role of each pause group in
its phrase. A noun or verb does not conjugate with 'な (NA)'; only an
adjective verb can conjugate with 'な (NA)'. So the adjective verb
meaning 'transparency' is selected uniquely.
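The selection in Example 3 can be sketched as a filter on parts of
speech. The lexicon, the rule table that says which parts of speech
may take each ending, and the function name are hypothetical; they
only illustrate the idea that the ending NA admits an adjective verb
and nothing else.

```python
# Toy lexicon: candidate word -> parts of speech (illustrative entries).
CANDIDATES = {
    "証明": {"noun", "verb"},
    "文明": {"noun"},
    "照明": {"noun", "verb"},
    "透明": {"adjective verb"},
}

# Hypothetical rule table: parts of speech that may take each ending.
ENDING_ALLOWS = {"な": {"adjective verb"}, "する": {"verb"}}


def filter_by_ending(candidates, ending):
    """Keep the candidates whose part of speech can take `ending`."""
    allowed = ENDING_ALLOWS[ending]
    return [word for word, pos in candidates.items() if pos & allowed]


print(filter_by_ending(CANDIDATES, "な"))
print(filter_by_ending(CANDIDATES, "する"))
```

With the ending NA only one candidate survives, whereas a verbal
ending would leave several candidates for later steps.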
Example 4.  KIBAN [GA KA] ARU   (There is a base.)
In this case 'KIBAN (base)' is a noun, 'ARU (be)' is a verb, 'GA' is a
particle that indicates the subject, and 'KA' is a particle that
indicates an alternative or a question. Only the particle 'GA' makes
the sentence grammatical, so 'KIBAN GA ARU' is the unique answer.
This syntax analysis program performs the morphological analysis on
the segmented pause groups (Fig. 2). If the segmentation is incorrect,
this program cannot analyze the phrase or sentence. In that case it
retries the segmentation of the input string in order to obtain a
successful analysis.
Example 5.  a), b): two segmentations of the same string [kanji not
            legible in this copy]
            (control circuit) (each) (add)
This example shows the retry process. The segmentation program first
segments the string at the points where the letter type changes (b).
This segmentation is not correct: the first pause group is not
grammatical. The program then assumes that this pause group may be a
compound pause group, and searches all possible separations from left
to right. It finds an embedded adverb 'それぞれ (SOREZORE; each)', and
with this segmentation the sentence can be analyzed successfully. The
other candidate letter cannot make a grammatical sentence.
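The retry step can be sketched as a left-to-right search for a known
word embedded in a pause group that failed analysis. The function
name, the one-word lexicon and the sample string are assumptions; the
actual string of Example 5 is not used here.

```python
def find_embedded(group, lexicon):
    """Scan all separations from left to right and report each known
    word found inside `group` as (before, word, after)."""
    hits = []
    for start in range(len(group)):
        for end in range(start + 1, len(group) + 1):
            if group[start:end] in lexicon:
                hits.append((group[:start], group[start:end], group[end:]))
    return hits


lexicon = {"それぞれ"}  # adverb written in hiragana
print(find_embedded("制御回路にそれぞれ", lexicon))
```

Each hit proposes a new segmentation, which would then be passed back
to the morphological analysis to see whether the sentence becomes
grammatical.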
5.3 Wording Analysis Function 
In the sentences of patent gazettes, important words or key words are
repeatedly used with anaphoric pronouns. This fact is a very important
clue for finding an anaphora or for guessing an ambiguous letter. The
arrows in Fig. 6 show the anaphoric relations of words in a text. Some
particular anaphoric pronouns appear in patent texts: (above-mentioned),
'GAI (such)', 'DOU (same)' and 'KONO (this)'. When an illegible letter
occurs in an anaphoric word, the wording analysis program searches for
the indicated word and corrects the illegible letter with the matched
letter. In Fig. 6, the phrase glossed '(above-mentioned connected
area)' has an anaphoric pronoun. The word containing the ? is compared
with the indicated word, and the ? is corrected to the matching
letter. The wording analysis program automatically prints out a
glossary of texts. This glossary is used to augment the dictionary of
the error correction system.
Figure 6. Anaphoric Relations in a Sentence [arrows link repeated key
words in a claim sentence]
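The wording analysis can be sketched as a match between the word that
follows an anaphoric pronoun and the words seen earlier in the same
text. The pronoun list, the sample words and the function name are
illustrative assumptions; the sketch accepts a correction only when
exactly one earlier word matches.

```python
# Illustrative anaphoric prefixes; the real list is not reproduced here.
ANAPHORIC_PREFIXES = ("上記", "該", "同", "この")


def correct_by_antecedent(phrase, earlier_words, unknown="?"):
    """Fill illegible letters in `phrase` from a unique earlier word."""
    for prefix in ANAPHORIC_PREFIXES:
        if phrase.startswith(prefix):
            body = phrase[len(prefix):]
            break
    else:
        return None  # no anaphoric pronoun: nothing to refer back to
    matches = [
        word for word in earlier_words
        if len(word) == len(body)
        and all(b in (unknown, c) for b, c in zip(body, word))
    ]
    return prefix + matches[0] if len(matches) == 1 else None


earlier = ["接続領域", "拡散領域"]  # hypothetical key words seen earlier
print(correct_by_antecedent("上記接?領域", earlier))
print(correct_by_antecedent("上記??領域", earlier))
```

The second call returns None because two earlier words fit, so the
sketch declines to guess rather than risk a wrong correction.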
Numeric expressions are also used frequently. They are analyzed by
using semantic relations of words in their vicinity. As the
bibliography of a patent contains names of persons, places and
affiliated organizations, the correction system needs to switch from
the common dictionary to a proper noun dictionary. In a proper noun
pause group, it is more important to analyze the semantic relation
among the words (5).
Example 6.  KANAGAWAKEN KAWASAKI SHI  (KAWASAKI city, KANAGAWA prefecture)
            candidates: (name of city), (person's name)
This phrase describes an address, and 'SHI (city)' is a suffix for the
name of a place, which does not connect with a person's name. So the
illegible letter can be decided uniquely.
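The semantic check in Example 6 can be sketched as a compatibility
test between a suffix and the semantic class of the word before it.
The class labels, the table contents and the candidate words are
hypothetical and serve only to illustrate the rule that a city suffix
does not attach to a person's name.

```python
# Hypothetical proper noun dictionary: word -> semantic classes.
WORD_CLASS = {"川崎": {"city"}, "山崎": {"person"}}

# Hypothetical suffix table: the class each suffix attaches to.
SUFFIX_REQUIRES = {"市": "city", "県": "prefecture", "氏": "person"}


def choose_by_suffix(candidates, suffix):
    """Keep the candidates whose class is compatible with `suffix`."""
    required = SUFFIX_REQUIRES[suffix]
    return [w for w in candidates if required in WORD_CLASS.get(w, set())]


print(choose_by_suffix(["川崎", "山崎"], "市"))
```

Only the place name is compatible with the city suffix, so the
illegible letter is decided by the surviving candidate.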
6. System Configuration and Experimental Results 
Fig. 7 shows the kanji OCR and linguistic 
error correction system. Fig. 8 shows the con- 
figuration of this system. The error correction 
system is programmed on a mini-computer (TOSBAC- 
40). The text editing terminal is a newly de- 
veloped Japanese word processor. The operator 
of this system can confirm the error correction 
results on the CRT display, change the form of 
the text by versatile editing functions, store 
and transfer them to the host machine. 
Figure 7. Overall View of Error Correction System
Figure 8. System Configuration [OCR, error correction, dictionary,
text editing terminal]
The experimental results for 250 actual patent texts were as follows:
    effective correction      53.8 percent
    ineffective correction    38.5 percent
    wrong correction           7.7 percent
The ineffective correction rate is the percentage of letters which
this system cannot correct.
Example 7.  Wrong correction
    ? KEN SEI KYU NO ...        (What we claim is ...)
    (... BUTU) KEN SEI KYU NO   -- wrong [candidate list partly illegible]
    [not legible in this copy]  -- right
This example shows a case of wrong correction. The first letter was
illegible, and the next letter was misread as 'KEN'; the correct
letter is 'KYO'. The kanji OCR had made an error. The error correction
system tried to correct the ? letter by using the misread letter 'KEN'
as the clue for correction, and so made a wrong correction.
7. Conclusion 
This error correction system can correct about fifty percent of the
errors and rejections in kanji OCR outputs, and is effective in
increasing the total throughput of the kanji OCR.
The kanji OCR reads letters according to their geometrical features,
and this linguistic error correction system reads a sentence according
to linguistic knowledge. The combination of the kanji OCR and the
linguistic error correction system realizes a practical Japanese text
reader and can cope with the increasing demand for input of Japanese
document information. The throughput rate of the OCR, combined with
this linguistic error correction system, is about 10 times higher than
that of conventional manual data entry.
8. Acknowledgments 
Parts of the research and development of the system were carried out
under contract with the Ministry of International Trade and Industry
on the Pattern Information Processing System (PIPS) Project.

References 

(1) S. Viresh, "An Approach to Address Identification from Degraded
Address Data", Proc. NCC, pp. 779-783, 1977.

(2) E. M. Riseman, A. R. Hanson, "A Contextual Postprocessing System
for Error Correction Using Binary n-Grams", IEEE Trans. on Computers,
Vol. C-23, No. 5, pp. 480-493, May 1974.

(3) H. Saiito, M. Noyori, "Patterns of Claim of Japanese Patent
Sentences", Computational Linguistics, IPSJ, pp. 1-10, Feb. 1978.

(4) K. Sakai, S. Hirai, T. Kawada and S. Amano, "An Optical Chinese
Character Reader", Proc. Third IJCPR, pp. 122-126, 1976.

(5) T. Kawada, S. Amano, K. Mori and K. Kodama, "Japanese Word
Processor JW-10", Proc. COMPCON '79, pp. 238-242, Sept. 1979.
