CRITAC - A JAPANESE TEXT
PROOFREADING SYSTEM 
Koichi Takeda 
Tetsunosuke Fujisaki 
Emiko Suzuki 
Japan Science Institute 
IBM Japan, Ltd.
5-19 Sanban-cho, Chiyoda-ku, 
Tokyo 102, Japan 
Abstract 
CRITAC (CRITiquing using ACcumulated knowledge) is an 
experimental expert system for proofreading Japanese text. 
It detects mistypes, Kana-to-Kanji misconversions, and 
stylistic errors. This system combines Prolog-coded heuristic 
knowledge with conventional Japanese text processing tech- 
niques which involve heavy computation and access to large 
language databases. 
1. Introduction 
Current advances in Japanese text processing are mainly due 
to the remarkable growth of the word processor market. 
Machine readable Japanese text can now be easily prepared 
and distributed. This trend has spurred the research and development
of further text processing applications such as machine
translation and text-to-speech conversion \[SAKAS8407\]
\[MIYAG8310\]. However, some fundamental text processing
procedures are missing for Japanese text. For example, 
counting the number of words in text is a difficult task since 
words are not separated by blanks. 
Our experimental system CRITAC (CRITiquing using 
ACcumulated knowledge) tries to overcome Japanese lan- 
guage problems. Proofreading (or critiquing, to some extent) 
\[CHER80\] \[HEIDJ82\] has been chosen as our research do- 
main because it involves many text processing techniques and 
is one of the most important functions currently required and 
lacking. In this paper, we introduce CRITAC concepts and 
facilities including a conceptual representation of Japanese 
text called "structured text" to handle meaningful objects 
(e.g., sentences and words) and proofreading using heuristic 
rules for the structured text. The structured text consists of 
a set of Prolog \[CLOCM81\] facts and predicates, each of
which represents an object or a class of objects in the text. 
Because of this high-level representation, human proofreading 
knowledge can be easily mapped into Prolog rules. Two 
user-friendly representations of text, called "source" and
"KWIC" (Key-Word-In-Context) views, are derived from the
structured text. CRITAC provides users with editing and 
proofreading functions defined over these views. 
The notion of structured text, we believe, is not restricted only 
to the Japanese language. Discussions on our approach for 
languages other than Japanese will be given in the Conclu- 
sion. 
2. CRITAC System Overview 
In this section we discuss the outline of CRITAC and its 
underlying concepts. As shown in Figure 1, the heart of 
CRITAC lies in its architecture. It consists of three major 
components: a user interface, a preprocessor, and a proof- 
reading knowledge base. 
[Figure 1 diagram not reproduced. Machine-readable text from a word processor reaches the preprocessor (segmentation, word recognition) by file transfer and is stored as structured text, which is used by the proofreading knowledge base and by the user interface (two views, text editor, text compiler, SQL/DS dictionary server).]
Figure 1. CRITAC Configuration: The preprocessor generates the structured
text from given text. The proofreading knowledge base
currently consists of about 30 Prolog proofreading rules for the
structured text. The user interface handles two external views
and provides access to the SQL/DS online dictionary server and text
compiler.
User Interface 
The user interface is built upon an editor, an SQL/DS online 
dictionary server, and a text compiler. 
The Editor and External Views 
The editor provides source and KWIC views as user-friendly
external text representations, and the text can be modified
through these views. When a user asks the system to apply the
proofreading rules during editing, diagnostic messages appear
on the screen with possible errors underlined in the text.
[Screen image not reproduced: a source view of Japanese text.]
Figure 2. CRITAC Source View 
The source view is just a replica of the original text except
that it is not formatted. That is, original formatting such as
pagination and indentation is dropped. Figure 2 shows an
example of a source view screen.
The KWIC view displays text in the KWIC format, which
extracts all the content words from the text as keywords and
arranges them in phonetic (pronunciation) order along
with their contexts. This is an extremely useful tool for users
to find homonym errors \[YAMAS8310\] caused by misconversion¹
and lack of conformity (e.g., "center" and
"centre" in English) in the text \[FUJIM8504\]. Since the external
views are virtual views of the structured text, updates made
by the user are checked and reflected in the structured text
through a mechanism that synchronizes updates across the two
views.
Figure 3 shows a KWIC view. Each line has one
keyword in the middle, with the preceding and succeeding contexts
shown on its left and right.
[Screen image not reproduced: a KWIC listing of Japanese text, one keyword per line with its left and right contexts.]
Figure 3. CRITAC KWIC View 
SQL/DS Dictionary Server 
Online access to system dictionaries or an encyclopedia
\[WEYEB8501\] is one of the most user-friendly facilities in an
advanced text processing system. CRITAC is connected to a
dictionary server implemented on SQL/DS. SQL/DS (Structured
Query Language/Data System \[IBM8308\]) is a relational
\[CODD7006\] database management system. It has
been used mainly for business data processing, such as
purchase-order files. The user language of SQL/DS
is based on the relational calculus, which can be easily
incorporated into Prolog \[IBM8509\]. Furthermore, SQL/DS
can support multiple ways of accessing tables. For example, a user can
access tables in terms of KANJI values or PRONUNCIATION
values. This greatly enhances the "associative
memory" access of online dictionaries. That is, we can
retrieve all the related information by giving some known values.
1 Japanese word processors currently limit the number of keys on
the keyboard by using a phonetic character set, Hiragana or Katakana, to
enter text. Once the pronunciation of a word or a phrase is given, it is
converted to the most likely Kanji expression. This process is called
Kana-to-Kanji conversion. Because of the large number of homonyms \[TANA8310\]
in the Japanese language, this conversion is liable to generate an unintended
Kanji expression. This is referred to as misconversion.
We include some "canned" queries to the dictionary. A typ- 
ical access pattern is searching for homonyms as shown in the 
section "A Sample CRITAC Session" (Figure 10), which 
helps users correct misconversions. 
Canned queries also retrieve synonyms, antonyms, related
words, and upper/lower concepts of a given word. The
conceptual hierarchy of words is obtained by combining the
following two SQL queries.
SELECT X.CATEGORY-NUMBER
FROM   SEMANTIC-CLASSIFICATION-TABLE X
WHERE  X.KANJI = givenWord

Get the stem number of the returned category-number.
X.CATEGORY-NUMBER is like "1.2.39".
stemNum = STEM(X.CATEGORY-NUMBER)
......................................................................
SELECT X.KANJI, X.YOMI, Y.CATEGORY-NUMBER
FROM   KANJI-PRIMITIVE-WORD-TABLE X,
       SEMANTIC-CLASSIFICATION-TABLE Y
WHERE  X.KANJI = Y.KANJI
AND    Y.CATEGORY-NUMBER LIKE 'stemNum%'
ORDER BY CATEGORY-NUMBER
The first query returns the category number, say "1.2.39",
from a table called "SEMANTIC-CLASSIFICATION-TABLE". We need the
stem number ("1" in this case) of this category. The stem
number is then used to find the words whose category numbers
have the same stem number. This is what the second query
retrieves. The result is arranged by category number.
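The two-step lookup can be sketched in Python with the standard sqlite3 module standing in for SQL/DS. This is only an illustration under stated assumptions: the underscore table and column names, the toy words, and their category numbers are invented here, and stem() simply takes the first dotted component. Unlike the printed LIKE 'stemNum%', the sketch appends "." to the stem so that stem "1" does not also match a category such as "10.3".

```python
import sqlite3


def stem(category_number):
    """Leading component of a dotted category number, e.g. "1.2.39" -> "1"."""
    return category_number.split(".")[0]


def related_words(conn, given_word):
    """Run the two queries in sequence and return (kanji, yomi, category) rows."""
    row = conn.execute(
        "SELECT category_number FROM semantic_classification WHERE kanji = ?",
        (given_word,),
    ).fetchone()
    if row is None:
        return []
    # Append "." so that stem "1" does not also match categories such as "10.3".
    pattern = stem(row[0]) + ".%"
    return conn.execute(
        "SELECT x.kanji, x.yomi, y.category_number "
        "FROM kanji_primitive_word x, semantic_classification y "
        "WHERE x.kanji = y.kanji AND y.category_number LIKE ? "
        "ORDER BY y.category_number",
        (pattern,),
    ).fetchall()


# Toy data: the words and category numbers are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE kanji_primitive_word (kanji TEXT, yomi TEXT);
    CREATE TABLE semantic_classification (kanji TEXT, category_number TEXT);
    INSERT INTO kanji_primitive_word VALUES
        ('犬', 'いぬ'), ('猫', 'ねこ'), ('走る', 'はしる');
    INSERT INTO semantic_classification VALUES
        ('犬', '1.2.39'), ('猫', '1.2.40'), ('走る', '2.1.5');
    """
)
rows = related_words(conn, "犬")
```

Here rows holds the two words whose category numbers begin with "1.", ordered by category number; the word filed under "2.1.5" is excluded.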
Note that users can make use of not only canned queries but 
also ad hoc queries. Since the SQL query language and the 
conceptual scheme of the dictionaries are easy to understand, 
users can make queries like those above to get ad hoc infor- 
mation. Such a set of customized queries is also stored in 
SQL/DS. Users can also define views for dictionaries to re- 
name and select columns as well as records. 
The Text Compiler 
CRITAC also provides a "text compiler" facility. It is analo- 
gous to the compilers of programming languages. The user 
gives text to this text compiler and the compiler can provide 
the following. 
• A source list of text and diagnostic messages with line
  reference numbers.
• A list of segments and Kanji primitive words with their
  occurrence counts and pronunciation. A cross-reference
  list of words and text (KWOC: Key Word Out of Context)
  and other useful statistics can also be obtained.
• Various formats of the text, including KWIC format.
  Additional information (word boundaries, pronunciation,
  etc.) may be added to the text. If an application uses a
  specific format, the text compiler can be made to generate
  application input in this format.
Preprocessor 
The preprocessor decomposes continuously typed Japanese
text into a sequence of tokenized primitive words and particles.
The basic unit of a Japanese sentence is a content word
followed by zero or more function words, which is called a
segment or a phrase (see \[MIYAG8310\], for example, for
more details). This preprocessing is required because Japanese
text has no explicit delimiters (blanks) between words.
The preprocessor also gathers information such as pronunciation²,
part of speech, total number of occurrences in the
text, and base form for each of the words
and particles. This information, together with the original text,
is stored as a set of Prolog facts to be used in later processing.
The steps of the preprocessing are as follows (see Figure 4).
\[Japanese Text Preprocessing\] 
1. Japanese text is a collection of sentences. Each sentence 
is just a continuous string of characters. 
2. A segmentation algorithm is applied to the sentence. This 
algorithm contains about 100 heuristic rules each of 
which specifies the cases where a segment boundary usu- 
ally appears. The accuracy of this segmentation algo- 
rithm is about 97.5%. 
3. Content words in the segments are recognized by looking 
them up in a primitive word dictionary. If a content word 
is a compound word, it is decomposed into primitive 
words. Since many Kanji compound words have ambi- 
guities of decomposition, we apply a stochastic estimation 
algorithm and a Kanji primitive word dictionary with 
statistics \[FUJI8509\] to find the most likely decompos- 
ition. Our algorithm is obtained from stochastic esti- 
mation algorithms in \[FORN7303\], \[BAHLJ8303\], and 
\[FUJI8407\] with slight modification. The accuracy of our 
algorithm is about 96.5%. 
4. Function words in each segment are identified. The 
connectivity of these function words is described by an 
automaton \[OKOC8112\]. A correct sequence of function 
words is obtained by observing the transition over this 
automaton. 
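Step 3 can be illustrated with a small dynamic-programming search in the spirit of the Viterbi algorithm. This is a sketch under simplifying assumptions, not the CRITAC algorithm itself, whose details the paper does not give: it scores a decomposition as the product of independent word probabilities, and the dictionary entries and probabilities below are invented.

```python
import math


def best_decomposition(compound, word_prob):
    """Return the most probable split of `compound` into dictionary words, or None."""
    n = len(compound)
    # best[i] = (log-probability, words) for the best split of compound[:i]
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = compound[start:end]
            if best[start] is None or piece not in word_prob:
                continue
            score = best[start][0] + math.log(word_prob[piece])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return None if best[n] is None else best[n][1]


# Invented statistics: "東京都" can be read as 東京/都 or as 東/京都.
word_prob = {"東京": 0.02, "都": 0.01, "東": 0.01, "京都": 0.01}
split = best_decomposition("東京都", word_prob)
```

With these numbers the split 東京/都 (0.02 x 0.01) outscores 東/京都 (0.01 x 0.01), so split is ["東京", "都"]. A string containing a character that no dictionary word covers yields None.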
[Figure 4 diagram not reproduced: a sample sentence decomposed into segments, with function-word labels such as (case:OBJ) and (verb.conj, particle). Legend: w: primitive word, p: prefix, s: suffix.]
Figure 4. Preprocessing Japanese Text 
2 Roughly speaking, Katakana and Hiragana characters are phonetic and each 
of them corresponds to one phoneme. Kanji characters are ideographic and 
there is a many-to-many mapping between the Kanji character set and a set 
of phoneme sequences. 
Preprocessing starts with the given text at step 1 above. The text
is gradually analyzed and decomposed into fragments in the
succeeding three steps. Details of the algorithms used here
are beyond the scope of this paper.
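As a rough illustration of step 4 only, the connectivity check over function words can be pictured as a walk over a finite-state automaton. The states, the category names, and the transition table below are hypothetical; the actual automaton is described in \[OKOC8112\] and is not reproduced in this paper.

```python
# Hypothetical connectivity table: state -> {function-word category -> next state}
TRANSITIONS = {
    "content": {"verb_conj": "conj", "case": "case"},
    "conj": {"auxiliary": "aux", "particle": "end"},
    "aux": {"particle": "end"},
    "case": {"topic": "end"},
}
# In this toy automaton a segment may stop in any state.
ACCEPTING = {"content", "conj", "aux", "case", "end"}


def accepts(categories, start="content"):
    """True if the category sequence follows the transitions from `start`."""
    state = start
    for category in categories:
        next_state = TRANSITIONS.get(state, {}).get(category)
        if next_state is None:
            return False  # no transition: the sequence is not connectable
        state = next_state
    return state in ACCEPTING


ok = accepts(["verb_conj", "auxiliary", "particle"])
bad = accepts(["case", "verb_conj"])
```

A candidate reading of the function words in a segment is kept only when its category sequence is accepted.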
The above high-level objects (segments and words) of
Japanese text are conceptually expressed in terms of four
types of facts and three types of predicates, as shown in
Figure 5. We map <segment>, <content word>, <function
word>, and punctuation marks into the facts seg(), head(),
tail(), and punc(). Other fragments of text can be defined
from those basic facts. This is called structured text.
seg(I,J,K,X)          A character string X is the K-th segment
                      in the J-th sentence of the I-th paragraph.
                      I, J, and K denote the same indexes below.
head(I,J,K,U,Y,G,L)   U is a content word (possibly a Kanji
                      compound word) of the segment X above,
                      with pronunciation Y and part-of-speech
                      G. L is a list of labels to denote prefixes,
                      primitive words, and suffixes in U if U is
                      a Kanji compound word.
tail(I,J,K,V,H)       V is a list of function words in the segment
                      X and the part-of-speech of the last
                      function word is H.
punc(I,J,K,D)         D is either a period or a comma of the
                      segment X, if any.
sent(I,J,S)           S is a sequence of segments seg(I,J,K,X).
para(I,P)             P is a sequence of sentences sent(I,J,S).
text(T)               T is a sequence of paragraphs para(I,P).
Figure 5. Types of Facts and Predicates for the Structured Text: The first
four predicates are facts that represent basic objects in Japanese
text. The rest of the predicates represent derived fragments of the
text.
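To make the relation between the basic facts and the derived predicates concrete, here is a minimal Python sketch. The Seg tuple and the three functions are stand-ins named after the predicates of Figure 5, not CRITAC code, and the sample segments are invented.

```python
from collections import namedtuple

# One seg(I,J,K,X) fact: paragraph i, sentence j, segment k, surface string x.
Seg = namedtuple("Seg", "i j k x")


def sent(segs, i, j):
    """sent(I,J,S): the segments of sentence j in paragraph i, in order."""
    chosen = [s for s in segs if (s.i, s.j) == (i, j)]
    return [s.x for s in sorted(chosen, key=lambda s: s.k)]


def para(segs, i):
    """para(I,P): the sentences of paragraph i, in order."""
    sentence_ids = sorted({s.j for s in segs if s.i == i})
    return [sent(segs, i, j) for j in sentence_ids]


def text(segs):
    """text(T): all paragraphs, in order."""
    return [para(segs, i) for i in sorted({s.i for s in segs})]


facts = [
    Seg(1, 1, 2, "本を"), Seg(1, 1, 1, "私は"), Seg(1, 1, 3, "読む"),
    Seg(1, 2, 1, "面白い"),
]
whole = text(facts)
```

Only the seg facts are stored; sentences, paragraphs, and the whole text are computed from them on demand, which is the sense in which they are derived.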
The structured text enables us to generate the two external views
described in the previous subsection (Figure 6). The mapping between
the structured text and the source view is straightforward. The
KWIC view can be denoted by the following symbols.
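$$
\begin{array}{lll}
Seg_{1,1}\ Seg_{1,2}\ \ldots\ Seg_{1,k_1-1} & Key_1\ Tail_1 & Seg_{1,k_1+1}\ \ldots\ Seg_{1,w_1} \\
Seg_{2,1}\ Seg_{2,2}\ \ldots\ Seg_{2,k_2-1} & Key_2\ Tail_2 & Seg_{2,k_2+1}\ \ldots\ Seg_{2,w_2} \\
\vdots & \vdots & \vdots \\
Seg_{n,1}\ Seg_{n,2}\ \ldots\ Seg_{n,k_n-1} & Key_n\ Tail_n & Seg_{n,k_n+1}\ \ldots\ Seg_{n,w_n}
\end{array}
$$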
The i-th row of the KWIC view is a sentence which has $w_i$
segments. $Seg_{i,j}$ is a shorthand notation for the segment X
appearing in the i-th row of the KWIC view and satisfying
seg($p_i$, $s_i$, j, X) for some $p_i$ and $s_i$. Each sentence of the text
appears in the KWIC view as many times as the number of
its segments, because each segment has one keyword that appears
in a distinct row of the view. For example, the first sentence
appears three times in the KWIC view of Figure 6. $Key_x$
equals the content word of $Seg_{x,k_x}$ and $Tail_x$ is the remaining
function words of $Seg_{x,k_x}$. If $Key_x$ is a compound word
consisting of y primitive words $p_1, p_2, \ldots, p_y$, they are shown in y
separate rows instead of a single row with $Key_x$.
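The construction of KWIC rows can be sketched as follows. This is an illustrative simplification: each segment is a plain dictionary that merges its seg, head, and tail information, compound keywords are not split into primitive words, and phonetic order is approximated by comparing Hiragana code points, which only roughly follows dictionary order. The sample sentence is invented.

```python
def kwic_rows(segments):
    """Build one KWIC row per segment and sort the rows by pronunciation.

    Each segment is a dict with keys i, j, k (indexes), x (surface string),
    key (content word), yomi (pronunciation), and tail (function words).
    """
    sentences = {}
    for s in segments:
        sentences.setdefault((s["i"], s["j"]), []).append(s)
    rows = []
    for segs in sentences.values():
        segs.sort(key=lambda s: s["k"])
        for pos, s in enumerate(segs):
            rows.append({
                "yomi": s["yomi"],
                "left": [t["x"] for t in segs[:pos]],       # preceding context
                "key": s["key"],
                "tail": s["tail"],
                "right": [t["x"] for t in segs[pos + 1:]],  # succeeding context
            })
    rows.sort(key=lambda r: r["yomi"])
    return rows


sentence = [
    {"i": 1, "j": 1, "k": 1, "x": "私は", "key": "私", "yomi": "わたし", "tail": "は"},
    {"i": 1, "j": 1, "k": 2, "x": "本を", "key": "本", "yomi": "ほん", "tail": "を"},
    {"i": 1, "j": 1, "k": 3, "x": "読む", "key": "読", "yomi": "よ", "tail": "む"},
]
rows = kwic_rows(sentence)
```

The three-segment sentence produces three rows, one per keyword, each carrying the rest of the sentence as context.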
[Figure 6 not reproduced. It lists structured-text facts such as seg(1,1,1,...), head(1,1,1,...), tail(1,1,1,...), seg(1,1,2,...), seg(1,2,6,...), head(1,2,6,...), tail(1,2,6,...), and punc(1,2,6,...), whose Japanese strings were lost in extraction, and maps them to the source view and to the KWIC view with its preceding-context, keyword, and succeeding-context columns.]
Figure 6. The relationship between the structured text and the two external
views
Proofreading Knowledge Base
CRITAC proofreading rules are written in terms of facts
and predicates, including those used for the structured text
and built-in predicates \[IBM8509\]. There are two types of
rules: source rules and KWIC rules. By source rules we mean
those rules which involve qualification over segments, content
words, and function words in the source view of text. The
KWIC rules involve the qualification of adjacent keywords
as well as their ordering.
Source Rules 
One category of source rules aims at finding excessively
complicated or ambiguous noun phrases. This is effective for
Japanese because Japanese
grammar allows nouns and some particles to be strung together
into a long sequence, and such sequences are used frequently.
Basically we have two types of phrases to detect:
one is repeated noun modifiers of the same kind and the other
is an ambiguous dependency among segments. For example,
phrases like
My Mother's Father's Company's Location is...
(repetition of possessive noun phrases)
Applying equations A or B and C,
(overly ambiguous modification)
are detected.
Another category of source rules aims at detecting 
typographical errors and inconsistent use of words. Some 
sample proofreading rules are 
A rule for detecting an incorrect sentence ending
rule('terminative') <- tail(I,J,K,T,H)
    & end-of-sentence(E)
    & category('terminative',H)
    & NOT punc(I,J,K,E)
    & warning1('missing punctuation',I,J,K).
This rule scans the function words of the text. If a segment ends
with a function word (the last element of the list T in "tail")
that usually implies the end of a sentence, but is not
followed by a punctuation mark (period) E, then a
warning is given to the author.
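The same check can be written as a short Python sketch over dictionaries that stand in for the tail and punc facts. The category label "terminative", the choice of "。" as the end-of-sentence mark, and the sample data are assumptions made for illustration.

```python
TERMINATIVE = {"terminative"}  # hypothetical part-of-speech labels
END_OF_SENTENCE = "。"


def missing_punctuation(tails, puncs):
    """Warn for segments whose last function word is terminative but unpunctuated.

    tails: {(i, j, k): (function_words, last_part_of_speech)}
    puncs: {(i, j, k): punctuation_mark}
    """
    warnings = []
    for position, (_words, last_pos) in sorted(tails.items()):
        if last_pos in TERMINATIVE and puncs.get(position) != END_OF_SENTENCE:
            warnings.append(("missing punctuation",) + position)
    return warnings


tails = {
    (1, 1, 3): (["ます"], "terminative"),
    (1, 2, 1): (["は"], "case"),
    (1, 2, 2): (["です"], "terminative"),
}
puncs = {(1, 1, 3): "。"}
found = missing_punctuation(tails, puncs)
```

Only segment (1,2,2) is reported: it ends like a sentence but has no period, whereas (1,1,3) is correctly punctuated and (1,2,1) is not terminative.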
A rule for detecting an inconsistent use of numeric prefixes
(one is an alphabetical prefix and the other is a Chinese numeric prefix)
rule('numbers')
    <- head(I1,J1,K1,U1,Y1,G1,L1)
    & number-prefix(K1,L1,P1,W1)
    & head(I2,J2,K2,U2,Y2,G2,L2)
    & number-prefix(K2,L2,P2,W1)
    & char-type(P1,T1)
    & char-type(P2,T2)
    & ne(T1,T2)
    & warning2('numbers',I1,J1,K1,I2,J2,K2).
This rule detects inconsistent usage of Kanji and Roman
numeric prefixes for a content word. If some primitive
word W1 is preceded by numeric prefixes P1 and P2 denoting
the same number but not of the same character type, then
a warning is given to the author. "Number-prefix(K,L,P,W)"
succeeds if there is a number prefix P preceding a primitive
word W in the compound word K, where the constituent
types are listed in L. "Char-type(P,T)" succeeds if T is the
single character type ("Kanji", "Hiragana", "Katakana", or
"Roman") of the 2-byte character string P, or if P mixes
character types and T is "mixed".
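A character-type test and the prefix comparison can be sketched in Python as below. Several points are assumptions: character types are decided from Unicode ranges rather than from a 2-byte code, ASCII and full-width digits and letters are both counted as "Roman", and, like the rule as printed, the sketch does not verify that the two prefixes denote the same number. The sample occurrences are invented.

```python
from itertools import combinations

# Full-width digits, upper-case letters, and lower-case letters.
FULLWIDTH = [(0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)]


def _one_char_type(ch):
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "Kanji"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if (ch.isascii() and ch.isalnum()) or any(lo <= cp <= hi for lo, hi in FULLWIDTH):
        return "Roman"
    return "other"


def char_type(s):
    """Single character type of s, or "mixed" if s combines several types."""
    kinds = {_one_char_type(ch) for ch in s}
    return kinds.pop() if len(kinds) == 1 else "mixed"


def inconsistent_prefixes(occurrences):
    """occurrences: list of (position, number_prefix, primitive_word)."""
    warnings = []
    for (pos1, p1, w1), (pos2, p2, w2) in combinations(occurrences, 2):
        if w1 == w2 and char_type(p1) != char_type(p2):
            warnings.append(("numbers", pos1, pos2))
    return warnings


occurrences = [((1, 1, 2), "三", "章"), ((2, 1, 1), "3", "章"), ((2, 3, 1), "二", "節")]
found = inconsistent_prefixes(occurrences)
```

The first two occurrences attach a Kanji numeral and a Roman-type numeral to the same primitive word, so that pair is reported.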
KWIC Rules 
KWIC rules, in contrast with the source rules, refer to 
keywords of the KWIC view rather than the "segment" or 
"head" above. 
In the KWIC view, as explained in the previous subsection,
if $Key_1, \ldots, Key_k$ are in phonetic order, homonyms are
arranged adjacently. This greatly reduces the time needed to detect
homonym errors (conversion errors) because the system only
has to scan the keywords $Key_i$ once, examining the pronunciation
of $Key_i$ and $Key_{i+1}$ and possibly some other local conditions.
For example, we can express possible conversion
errors as follows.
ruleKWIC('misconversion')
    <- is-ordered('pronunciation')
    & key(I1,U1,Y1,G1)
    & next-key(I1,U2,Y1,G2)
    & ne(U1,U2)
    & pred(I1,P) & pred(I2,P)
    & succ(I1,S) & succ(I2,S)
    & warningKWIC('misconversion',I1).
If there are distinct keywords U1 and U2 with the same pronunciation
Y1 in the same context P (preceding primitive
word) and S (succeeding primitive word), then one of them
is possibly a misconversion. "Key(I,U,Y,G)" holds if $Key_I$ = U
with pronunciation Y and lexical category G.
"Next-key(I,U,Y,G)" unifies with key(I1,U,Y,G) such that
I1 = I+1.
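The single pass over adjacent keywords can be sketched in Python as follows. The row layout (keyword, pronunciation, preceding word, succeeding word) and the sample rows are assumptions for illustration; the context of the second keyword is read directly from the adjacent row.

```python
def misconversions(rows):
    """Scan KWIC rows sorted by pronunciation for suspicious adjacent homonyms.

    rows: list of (keyword, pronunciation, preceding_word, succeeding_word).
    """
    warnings = []
    for index in range(len(rows) - 1):
        u1, y1, p1, s1 = rows[index]
        u2, y2, p2, s2 = rows[index + 1]
        # Same pronunciation and same context, but a different written form.
        if y1 == y2 and u1 != u2 and p1 == p2 and s1 == s2:
            warnings.append(("misconversion", index))
    return warnings


# Invented rows: 機会 and 機械 are both pronounced きかい.
rows = [
    ("機会", "きかい", "工作", "工業"),
    ("機械", "きかい", "工作", "工業"),
    ("本", "ほん", "この", "棚"),
]
found = misconversions(rows)
```

Because homonyms sit next to each other once the rows are sorted, one linear scan suffices and no pairwise comparison of all keywords is needed.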
If $Key_1, \ldots, Key_k$ are ordered by either their preceding or
succeeding word, we can detect some lack of conformity in word
usage. For example, if keywords are ordered by their
succeeding words, the following case will be detected.
... be a spelling error in such cases ...
... can find style errors as well as ...
... to detect stylistic errors in the text ...
ruleKWIC('inconsistency')
    <- ordered('successor')
    & key(I1,U1,Y1,G1)
    & next-key(I1,U2,Y1,G2)
    & ne(U1,U2)
    & stem(U1,S) & stem(U2,S)
    & warningKWIC('inconsistency',I1).
Here we assume both stem('style','style') and 
stem('stylistic','style') are successful. 
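A Python sketch of this check on the English example follows. The STEMS table stands in for the stem predicate; its "style" and "stylistic" entries come from the text, while the "spelling" entry and the row layout are invented. The sketch compares only written forms and stems of adjacent keywords.

```python
# Stand-in for stem(U,S); only the first two entries come from the text.
STEMS = {"style": "style", "stylistic": "style", "spelling": "spell"}


def inconsistencies(rows):
    """Scan (keyword, successor) rows, sorted by successor, for mixed word forms."""
    warnings = []
    for index in range(len(rows) - 1):
        u1, u2 = rows[index][0], rows[index + 1][0]
        stem1, stem2 = STEMS.get(u1), STEMS.get(u2)
        if u1 != u2 and stem1 is not None and stem1 == stem2:
            warnings.append(("inconsistency", index))
    return warnings


rows = [("spelling", "error"), ("style", "errors"), ("stylistic", "errors")]
found = inconsistencies(rows)
```

The adjacent keywords "style" and "stylistic" share a stem but differ in form, so the row at index 1 is reported.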
3. A Sample CRITAC Session 
This section illustrates a typical CRITAC session consisting
of four actions. That is, 
1. Detect an error in the source view (Figure 7). 
2. Display the explanation of the error (Figure 8). 
3. Locate the error candidate in the KWIC view (Figure 9). 
4. Get the homonyms of a keyword by invoking the dic- 
tionary server (Figure 10). 
These actions are implemented as basic functions of the system.
Locating a word of interest in different views helps users
judge efficiently whether system-detected errors are real
errors.
[Screen image not reproduced: the CRITAC source view of Japanese text lines 97-106, with error codes such as 15 and 24 printed in the underlines, above a function-key menu.]
Figure 7. A Source View with Errors Underlined: Erroneous segments are
underlined. The numbers in the underlines are classification codes
of errors.
[Screen image not reproduced: the CRITAC source view with an explanation message box displayed above text line 101.]
Figure 8. The Source View Explaining One of the Errors: The system
explains that error code 36 means the underlined segment
has a Kanji compound word with a very rare combination of
primitive words.
[Screen image not reproduced: the CRITAC KWIC view showing lines around line 970 of the Japanese KWIC listing, above a function-key menu.]
Figure 9. A KWIC View Switched from the Source View: This KWIC 
view shows the number of occurrences of a primitive word. The 
user wonders whether the primitive word caused the previous 
error 36 of the Kanji compound word. 
[Screen image not reproduced: the CRITAC KWIC view around line 1007 with a dictionary-server window listing homonyms.]
Figure 10. The Invocation of the Dictionary Server: Given a primitive
word as a key, the dictionary server displays a list of its
homonyms. The user might then find the correct primitive
word in the list.
4. Conclusion 
We have introduced a Japanese text proofreading system,
CRITAC, based on text preprocessing and Prolog-coded
knowledge representation. The user interface of CRITAC is
developed based on the following concepts.
• support of external views of text for users
• SQL/DS online dictionary server
• text compiler
Although CRITAC was initially designed for the Japanese language,
the authors believe it can be applied to other languages
as well. For example, the KWIC view can be applied to detect
conversion errors in any kind of text prepared by
phonetic-to-ideographic conversion or voice-input methods. The
notions of structured text and preprocessing not only add
textual information to the original text to reduce the overhead
of applications but also give a high-level view of the
text. The application designer no longer has to code his/her
own lexical analyzer to deal with character strings. Character
strings are replaced with higher-level objects such as segments,
sentences, or paragraphs.
We also believe that CRITAC as a domain-specific (or 
application-specific) interface to applications (e.g., document 
retrieval or machine translation) will help users define a set 
of valid input by heuristic rules in addition to a grammar. 
Rules of CRITAC can filter out or rewrite some portions of 
the text and guarantee appropriate text to be given to the 
applications. 
Future work includes
• Evaluation of the system with respect to accuracy, usability,
  and so on.
• Simple grammar checking based on case grammar
  \[FILL68\].
• Enhancement of the SQL/DS dictionary server facilities.
  SQL/DS can also be used to manage the document
  \[MISE8103\] itself.
• Interface to other applications such as a machine translation
  system or a text-to-voice generation system.
Acknowledgments 
We are grateful to Hiroshi Maruyama for building a prototype
of the CRITAC knowledge base and for many valuable
discussions. We also wish to thank Tetsuro Nishino for
implementing a complicated noun analyzer on a CRITAC prototype,
M. Arthur Ozeki for his assistance in developing
proofreading rules, and Linore Cleveland and Kaoru
Hosokawa for their helpful comments on the earlier draft of
this paper.
References 

\[BAHLJ8303\] Bahl, L.R., Jelinek, F. and Mercer, R.L.: "A Maximum
Likelihood Approach to Continuous Speech Recognition", IEEE Trans.
on PAMI, Vol.PAMI-5, No.2, pp.179-190, Mar. 1983

\[CHER80\] Cherry, L.L.: "Writing Tools - The STYLE and DICTION
programs", Bell Laboratories Computing Science Technical Report,
No.9, 1980

\[CLOCM81\] Clocksin, W.F. and Mellish, C.S.: Programming in Prolog,
Springer-Verlag, 1981

\[CODD7006\] Codd, E.F.: "A Relational Model of Data for Large Shared
Data Banks", CACM, Vol.13, No.6, pp.377-387, June 1970

\[FILL68\] Fillmore, C.: "The Case for Case" in Universals in Linguistic
Theory (Eds. Bach, E. and Harms, R.T.), Holt, Rinehart and Winston,
New York, pp.1-88, 1968

\[FORN7303\] Forney, G.D.Jr.: "The Viterbi Algorithm", Proc. of the IEEE,
Vol.61, No.3, pp.268-278, March 1973

\[FUJI8407\] Fujisaki, T.: "A Stochastic Approach to Sentence Parsing",
Proc. of Coling84, pp.16-19, July 1984

\[FUJIM8504\] Fujisaki, T. and Morohashi, M.: "Text Processing in
Kotodama" (in Japanese) in Word Processors and Japanese Language
Processing (Eds. Ishida, H. et al.), "bit" special issue, Kyoritsu
publishing co., pp.96-107, April 1985

\[FUJI8509\] Fujisaki, T.: "Studies on Handling of Ambiguities in Natural
Languages" (in Japanese), Doctoral dissertation, Tokyo University,
Sept. 1985

\[HEIDJ82\] Heidorn, G.E., Jensen, K., Miller, L.A., Byrd, R.J. and
Chodorow, M.S.: "The EPISTLE text-critiquing system", IBM Systems
Journal, Vol.21, No.3, pp.305-326, 1982

\[IBM8308\] IBM Corp.: SQL/Data System Concepts and Facilities (2nd
Ed.), GH24-5013, Aug. 1983

\[IBM8509\] IBM Corp.: VM/Programming in Logic - Program
Description/Operations Manual, Sept. 1985

\[MIYAG8310\] Miyazaki, M., Goto, S. and Shirai, S.: "Linguistic Processing
in a Japanese-Text-to-Speech-System", Proc. of Intl. Conf. on Text
Processing with a Large Character Set, pp.315-320, Oct. 1983

\[MISE8103\] Misek-Falkoff, L.D.: "Data-Base and Query Systems: New
and Simple Ways to Gain Multiple Views of the Patterns in Text", IBM
Research Report, RC8769, March 1981

\[OKOC8112\] Okochi, M.: "Japanese Morphological Rules for Kana-to-Kanji
Conversion: Concepts" (in Japanese), TSC Technical Report,
N:G318-1560, IBM Japan, Dec. 1981

\[SAKAS8407\] Sakamoto, Y., Satoh, M. and Ishikawa, T.: "Lexicon Features
for Japanese Syntactic Analysis in Mu-Project-JE", Proc. of
Coling84, pp.42-47, July 1984

\[TANA8310\] Tanaka, Y.: "Analysis of Homonyms in Individual Fields",
Proc. of Intl. Conf. on Text Processing with a Large Character Set,
pp.397-402, Oct. 1983

\[YAMAS8310\] Yamashita, M., Shiratori, Y. and Obashi, F.: "Kana-to-Kanji
Translation Method Using Pattern Characteristics of Kanji
Characters", Proc. of Intl. Conf. on Text Processing with a Large
Character Set, pp.113-117, Oct. 1983

\[WEYEB8501\] Weyer, S.A. and Borning, A.H.: "A Prototype Electronic
Encyclopedia", ACM Trans. on Office Information Systems, Vol.3,
No.1, pp.63-88, Jan. 1985
