Lexicon Features for Japanese Syntactic Analysis in Mu-Project-JE 
Yoshiyuki Sakamoto 
Electrotechnical 
Laboratory 
Sakura-mura, 
Niihari-gun, 
Ibaraki, Japan
Masayuki Satoh 
The Japan Information 
Center of Science and 
Technology 
Nagata-cho, Chiyoda-ku
Tokyo, Japan 
Tetsuya Ishikawa 
Univ. of Library & 
Information Science 
Yatabe-machi,
Tsukuba-gun,
Ibaraki, Japan 
0. Abstract
In this paper, we focus on the features of a
lexicon for Japanese syntactic analysis in
Japanese-to-English translation. Japanese word
order is almost unrestricted and Kakujo-shi
(postpositional case particle) is an important
device which acts as the case label(case marker)
in Japanese sentences. Therefore case grammar is
the most effective grammar for Japanese syntactic
analysis.
The case frame governed by Yougen and having
surface case(Kakujo-shi), deep case(case label)
and semantic markers for nouns is analyzed here to
illustrate how we apply case grammar to Japanese
syntactic analysis in our system.
The parts of speech are classified into 58 
sub-categories. 
We analyze semantic features for nouns and 
pronouns classified into sub-categories and we 
present a system for semantic markers. Lexicon 
formats for syntactic and semantic features are 
composed of different features classified by part 
of speech. 
As this system uses LISP as the programming
language, the lexicons are written as S-expressions
in LISP, punched onto tapes, and stored as files
in the computer.
1. Introduction
The Mu-project is a national project
supported by the STA(Science and Technology
Agency), the full name of which is "Research on a
Machine Translation System(Japanese - English) for
Scientific and Technological Documents."*
We are currently restricting the domain of
translation to abstracts of papers in scientific and
technological fields. The system is based on a
transfer approach and consists of three phases:
analysis, transfer and generation.
In the first phase of machine translation,
analysis, morphological analysis divides the
sentence into lexical items, and semantic analysis
then proceeds on the basis of case grammar in
Japanese. In the second phase, transfer, lexical
features are transferred and at the same time, the
syntactic structures are also transferred by
matching tree patterns from Japanese to English. In
the final generation phase, we generate the
syntactic structures and the morphological
features in English.
2. Concept of a Dependency Structure based on
Case Grammar in Japanese
In Japan, we have come to the conclusion that
case grammar is the most suitable grammar for Japanese
syntactic analysis for machine translation
systems. This type of grammar had been proposed
and studied by Japanese linguists before
Fillmore's presentation.
As word order is heavily restricted in
English syntax, ATNG(Augmented Transition Network
Grammar) based on CFG(Context Free Grammar) is
adequate for syntactic analysis in English. On the
other hand, Japanese word order is almost
unrestricted and Kakujo-shi play an important role
as case labels in Japanese sentences. Therefore
case grammar is the most effective grammar for
Japanese syntactic analysis.
In Japanese syntactic structure, the word
order is free except for a predicate(verb or verb
phrase) located at the end of a sentence. In case
grammar, the verb plays a very important role
during syntactic analysis, and the other parts of
speech only perform in partnership with, and
equally subordinate to, the verb.
That is, syntactic analysis proceeds by
checking the semantic compatibility between verb
and nouns. Consequently, the semantic structure of
a sentence can be extracted at the same time as
syntactic analysis.
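The order-independence described above can be illustrated with a small sketch. It is not the Mu-project analyzer: the frame for the hypothetical verb "okuru", the marker names HUMAN and THING, and the function `assign_cases` are assumptions made only for illustration. Each noun phrase is a (noun, particle, semantic markers) triple, and a slot of the verb's frame is filled when both the particle and at least one semantic marker agree.

```python
from itertools import permutations

# Hypothetical case frame for one verb: each slot names the particle
# (Kakujo-shi), the deep case label, and the semantic markers it accepts.
FRAME_OKURU = [
    {"particle": "GA", "case": "SUB", "markers": {"HUMAN"}},
    {"particle": "WO", "case": "OBJ", "markers": {"THING"}},
    {"particle": "NI", "case": "REC", "markers": {"HUMAN"}},
]

def assign_cases(phrases, frame):
    """Map each (noun, particle, markers) phrase to a deep case label.

    Returns None when some phrase is not semantically compatible with
    any slot of the frame that is still unfilled.
    """
    result = {}
    free = list(frame)
    for noun, particle, markers in phrases:
        slot = next(
            (s for s in free
             if s["particle"] == particle and s["markers"] & markers),
            None,
        )
        if slot is None:
            return None
        free.remove(slot)
        result[slot["case"]] = noun
    return result

phrases = [
    ("sensei", "GA", {"HUMAN"}),
    ("hon", "WO", {"THING"}),
    ("gakusei", "NI", {"HUMAN"}),
]

# Analyze every ordering of the noun phrases that precede the verb.
analyses = [assign_cases(list(p), FRAME_OKURU) for p in permutations(phrases)]
```

All six orderings yield the same mapping from case labels to nouns, which is the sense in which the particles, rather than the positions, carry the case information in this sketch.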
3. Case Frame Governed by Yougen
The case frame governed by Yougen and having
Kakujo-shi, case label and semantic markers for
nouns is analyzed here to illustrate how we apply
case grammar to Japanese syntactic analysis in our
system.
Yougen consists of verb,
Keiyou-shi(adjective) and Keiyoudou-shi(adjectival
noun). Kakujo-shi include inner case and outer
case markers in Japanese syntax. But a single
Kakujo-shi corresponds to several deep cases: for
instance, "NI" indicates more than ten case labels
including SPAce, Space-TO, TIMe, ROLe, MANner,
GOAl, PARtner, COMponent, CONdition, RANge, ...
We analyze relations between Kakujo-shi and case
labels and write them out manually according to
the examples found in sample texts.
* This project is being carried out with the aid of a special grant for the promotion of science and
technology from the Science and Technology Agency of the Japanese Government.
As a result of categorizing deep cases, 33 
Japanese case labels have been determined as shown 
in Table I. 
Table I. Case Labels for Verbal Case Frames
[The Japanese labels and the examples are illegible in the source; entries (1) and (31) are missing.]
No.    English Label
(2)    OBJect
(3)    RECipient
(4)    ORIgin
(5)    PARtner
(6)    OPPonent
(7)    TIMe
(8)    Time-FRom
(9)    Time-TO
(10)   DURation
(11)   SPAce
(12)   Space-FRom
(13)   Space-TO
(14)   Space-THrough
(15)   SOUrce
(16)   GOAl
(17)   ATTribute
(18)   CAUse
(19)   TOOl
(20)   MATerial
(21)   COMponent
(22)   MANner
(23)   CONdition
(24)   PURpose
(25)   ROLe
(26)   COnTent
(27)   RANge
(28)   TOPic
(29)   VIEwpoint
(30)   COmpaRison
(32)   DEGree
(33)   PREdicative
Note: The capitalized letters form the
English acronym for that case label.
When semantic markers are recorded for nouns
in the verbal case frames, each noun appearing in
relation to Yougen and Kakujo-shi in the sample
text is looked up in the noun lexicon.
The process of describing these case frames
for lexicon entry is given in Figure 1.
For each verb, Keiyou-shi and Keiyoudou-shi,
the Kakujo-shi and case labels able to accompany
it are described, and the semantic markers for
the nouns which precede that Kakujo-shi
are described.
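A case frame entry of the kind described above might be represented as follows. This is a minimal sketch under stated assumptions, not the project's lexicon format: the class names `CaseSlot` and `CaseFrame`, the hypothetical verb "iku", and the marker names HUMAN, SPACE and TIME are invented for illustration. Only the case labels (SUB, STO for Space-TO, TIM for TIMe) follow the acronym convention of Table I. The sketch shows how one Kakujo-shi listed under several case labels can be resolved by the semantic marker of the preceding noun.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CaseSlot:
    particle: str          # surface case (Kakujo-shi), e.g. "NI"
    label: str             # deep case label, e.g. "TIM"
    markers: frozenset     # semantic markers allowed for the noun

@dataclass
class CaseFrame:
    verb: str
    slots: list = field(default_factory=list)

    def labels_for(self, particle, noun_markers):
        """Return the deep case labels whose slot accepts this noun."""
        return [
            s.label for s in self.slots
            if s.particle == particle and s.markers & set(noun_markers)
        ]

# Hypothetical entry: the same particle "NI" is listed under two labels.
iku = CaseFrame("iku", [
    CaseSlot("GA", "SUB", frozenset({"HUMAN"})),
    CaseSlot("NI", "STO", frozenset({"SPACE"})),   # Space-TO
    CaseSlot("NI", "TIM", frozenset({"TIME"})),    # TIMe
])

space_reading = iku.labels_for("NI", {"SPACE"})
time_reading = iku.labels_for("NI", {"TIME"})
```

Here a noun marked SPACE before "NI" selects the Space-TO slot, while a noun marked TIME selects the TIMe slot; a particle with no matching slot yields an empty list.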
4. Sub-categories of Parts of Speech
according to their Syntactic Features
The parts of speech are classified into 13
main categories:
nouns, pronouns, numerals, affixes, adverbs,
verbs, Keiyou-shi, Keiyoudou-shi,
Rentai-shi(adnoun), conjunctions, auxiliary verbs,
markers and Jo-shi(postpositional particles). Each
category is sub-classified and divided into 56
sub-categories(see Appendix A); these are
mainly based on syntactic features, and
additionally on semantic features.
For example, nouns are divided into 11
sub-categories: proper nouns, common nouns, action
nouns 1 (Sahen-meishi), action nouns 2 (others),
adverbial nouns, Kakujo-shi-teki-meishi (noun with
case feature), Setsuzokujo-shi-teki-meishi (noun
with conjunction feature), unknown nouns,
mathematical expressions, special symbols and
complementizers. Action nouns are classified into
Sahen-meishi (a noun that can be a
noun-plus-SURU(doing) composite verb) and other
verbal nouns, because action noun 1 is also used
as the word stem of a verb.
[Figure 1. Block Diagram of Process of Describing Verbal Case Frames.
Steps recoverable from the diagram: identify the taigen-bunsetsu
(substantive phrase) governed by yougen; determine the voice (ACTIVE,
PASSIVE, CAUSATIVE, POTENTIAL, TEARU) and convert anything other than
active voice to active voice; replace kakarijo-shi with kakujo-shi;
fill in the kakujo-shi and antecedent noun for a verb phrase in a
relative clause; construct the case frame format.]
Adverbs are divided into 4 sub-categories for
modality, aspect and tense. In Japanese, the
adverb agrees with the auxiliary verb.
Chinjutsu-fuku-shi agrees with the aspect, tense
and mood features of a specific auxiliary verb,
Joukyou-fuku-shi agrees with aspect and
tense, and
Teido-fuku-shi agrees with gradability.
Auxiliary verbs are divided into 5
sub-categories based on modality, aspect, voice,
cleft sentence and others.
Verbs may be classified according to their 
case frames and therefore it is not necessary to 
sub-classify their sub-categories. 
5. Semantic Marking of Nouns
We analyze semantic features, and assign
semantic markers to Japanese words classified as
nouns and pronouns. Each word can be given up to
five semantic markers.
The system of semantic markers for nouns is
made up of 10 conceptual facets based on 44
semantic slots, and 38 plural filial slots at the
end (see Figure 2).
[Figure 2, first part. Recoverable labels: facet Thing/Object with
slots Nation-Organization, Animate (Plant, Animal) and Inanimate
(Natural, Artificial); facet Commodity/Ware with slots Material and
Product.]
5.1 Concept of semantic markers 
The 10 conceptual facets are listed below.
1) Thing or Object
This conceptual facet contains things and
objects; that is, actual concrete matter. This
facet consists of such semantic slots as
Nation/Organization, Animate object, Inanimate
object, etc.
2) Commodity or Ware
This conceptual facet contains commodities and
wares; that is, artificial matter useful to
humans. This facet consists of such semantic slots
as Material, Means/Equipment, Product, etc.
3) Idea or Abstraction
This conceptual facet contains ideas and
abstractions; that is, non-matter as the result of
intellectual activity in the human brain. This
facet consists of such semantic slots as Theory,
Conceptual object, Sign/Symbol, etc.
4) Part
This conceptual facet contains parts; that
is, structural parts, elements and contents of
things and matter.
[Figure 2, continued. Recoverable labels: Phenomenon (Experiment,
Social, Political-Economic, Custom-Social Convention, Power-Energy);
Doing/Action (Movement-Reaction, Effect-Operation); Idea/Abstraction
(Theory, Sign-Symbol); Sentiment/Mental Activity (Emotion,
Recognition-Thought); Part (Part, Element-Content); Attribute
(Property-Characteristic, Form-Shape, Status-Figure, State-Condition);
Measure (Number, Unit, Standard); Time-Space (Space-Topography, Time
Point, Time Duration, Time Attribute). The marker codes and Japanese
labels are illegible in the source.]
Figure 2. System of Semantic Markers for Nouns
5) Attribute
This conceptual facet contains attributes;
that is, properties, qualities or features
representative of things. This facet consists of
semantic slots such as Property/Characteristic,
Status/Figure, Relation, Structure, etc.
6) Phenomenon
This conceptual facet contains phenomena;
that is, physical, chemical and social actions
without human activity. This facet consists of
semantic slots such as Natural phenomenon,
Artificial phenomenon/Experiment, Social
phenomenon, Power/Energy, etc.
7) Doing or Action
This conceptual facet contains human doing
and actions. This facet consists of such semantic
slots as Action/Deed, Movement/Reaction,
Effect/Operation, etc.
8) Mental activity
This conceptual facet contains operations of
the mind and mental processes. This facet consists
of semantic slots such as Perception, Emotion,
Recognition/Thought, etc.
9) Measure
This conceptual facet contains measure; that
is, the extent, quantity, amount or degree of a
thing. This facet consists of semantic slots such
as Number, Unit, Standard, etc.
10) Time and Space
This conceptual facet contains space,
topography and time.
5.2 Process of semantic marking 
The semantic marker for each word is 
determined by the following steps. 
1) Determine the definition and features of a
word.
2) Extract semantic elements from the word.
3) Judge the agreement between a semantic slot
concept and each extracted semantic element, word
by word, and attach the corresponding semantic
markers.
4) As a result, one word may have many
semantic markers. However, the number of semantic
markers for one word is restricted to five. If
there are plural filial slots at the end, the
higher family slot is used for semantic
featurization of the word.
It is easy to decide semantic markers for 
technical and specific words. But, it is not easy 
to mark common words, because one word has many 
meanings. 
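Step 4 of the procedure above can be sketched as follows. This is one possible reading, not the authors' procedure: it assumes that "plural filial slots" means several matched child slots under the same parent, which are then replaced by that parent, and it assumes that when more than five markers remain the first five are kept (the text does not say which ones are dropped). The `PARENT` table is a hypothetical fragment of the slot hierarchy, and `assign_markers` is an invented name.

```python
MAX_MARKERS = 5  # limit stated in the text

# Hypothetical fragment of the slot hierarchy: filial slot -> parent slot.
PARENT = {
    "Plant": "Animate",
    "Animal": "Animate",
    "Natural": "Inanimate",
    "Artificial": "Inanimate",
}

def assign_markers(matched_slots):
    """Turn the slots matched in step 3 into at most five markers."""
    # Collect the distinct filial slots matched under each parent.
    siblings = {}
    for slot in matched_slots:
        parent = PARENT.get(slot)
        if parent is not None:
            siblings.setdefault(parent, set()).add(slot)

    markers = []
    for slot in matched_slots:
        parent = PARENT.get(slot)
        # Several filial slots of one parent collapse into that parent.
        if parent is not None and len(siblings[parent]) > 1:
            slot = parent
        if slot not in markers:
            markers.append(slot)
    # Assumption: keep the first five when more remain.
    return markers[:MAX_MARKERS]

collapsed = assign_markers(["Plant", "Animal", "Material"])
single = assign_markers(["Plant", "Material"])
```

With these assumptions a word matching both Plant and Animal is marked with the higher slot Animate, while a word matching only Plant keeps the filial slot.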
6. Lexicon Format for Syntactic Analysis
Lexicon formats for syntactic and semantic 
features are composed of different features 
classified by part of speech. 
1) Features of verb:
Subject code: verb used in a specific field
(only electrical in our experiment)
Part of speech in syntax: verb
Verb pattern: for classifying the verbal case
frame, a categorized marker like Hornby's verb
pattern is planned to be used.
Entry to lexical unit of transfer lexicon
Aspect: stative, semi-stative, continuative,
resultative, momentary or progressive/transitive
Voice: passive, potential, causative or
"TEARU"(perfective/stative)
Volition: volitive, semi-volitive or
volitionless
Case frame: surface case, deep case, semantic
marker for noun and inner-outer case
classification
Idiomatic usage: to accompany the verb(ex.
catch a cold), syntax, verb pattern
2) Features of Keiyou-shi and Keiyoudou-shi:
both syntactic features are described in
almost the same format.
Sub-category of part of speech: emotional,
property, stative or relative
Gradability: measurability and polarity
Nounness grade: nounness grade for
Keiyou-shi(++, +, -, --)
3) Features of noun: sub-category of
noun(proper, common, action, adverbial, etc.),
lexical unit for transfer lexicon, semantic
markers, thesaurus code, and usage.
4) Features of adverb: sub-category of
adverb(Joukyou, Teido, Chinjutsu, Suuryou)
considering modality, aspect, tense and
gradability
5) Features of other taigen: sub-category of
Rentai-shi(demonstrative, interrogative,
definitive, or adjectival) and conjunction(phrase
or sentence)
6) Features of Jodou-shi(auxiliary verb):
Jodou-shi are sub-classified by sub-category
on semantic feature:
Modality(negation, necessity, suggestion,
prohibition, ...)
Aspect(past, perfect, perfective stative,
progressive, continuative, finishing,
experiential, ...)
Voice(passive or causative)
Cleft sentence(purpose and reason)
etc('T~WlRlr . "TENISEI~U" , "TEOhLi" , "SOKQ\;Ri" 
and "TEII@2~U" ) 
7) Features of Jo-shi:
Sub-category of Jo-shi: case, conjunctive,
adverbial, collateral, final or [illegible]
Case: features of surface case(ex. "GA", "WO",
"NI", "TO", ...), modified relation(Rentai or
Renyou modification)
Conjunctive: sub-category of semantic
feature(cause/reason, conditional/provisional,
accompaniment, time/place, purpose, collateral,
positive or negative conjunction, etc.)
7. Data Base Structure of the Lexicon
As this system uses LISP as the programming
language, the lexicons are punched up as
S-expressions and input to computer files (see
Figure 3).
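To make the idea of a lexicon entry stored as an S-expression concrete, the following sketch reads a small entry into nested lists. It is written in Python rather than LISP only so that it can be run stand-alone. The entry and its attribute names ($particle, $case, $markers) are hypothetical and do not reproduce the project's actual file format, and the reader assumes well-formed input.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Build nested Python lists from a token list (consumed in place)."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    return token

# Hypothetical entry; the attribute names are invented for this sketch.
ENTRY = """
(okuru
  (($particle GA) ($case SUB) ($markers HUMAN))
  (($particle NI) ($case REC) ($markers HUMAN)))
"""

tree = parse(tokenize(ENTRY))
verb, slots = tree[0], tree[1:]
# Each slot becomes a dict mapping an attribute name to its values.
frames = [{attr[0]: attr[1:] for attr in slot} for slot in slots]
```

After parsing, `verb` holds the head atom of the entry and `frames` holds one dictionary per case slot, so the features can be looked up by attribute name.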
For the lexicon data base used for syntax
analysis, only the lexical items are held in main
storage; syntactic and semantic features are
stored in VSAM random access files on disk(see
Figure 4).
[Figure 3 shows a sample lexicon entry written as a LISP S-expression.
The Japanese attribute names are illegible in the source. The
recoverable content shows three verb patterns (V1, V2, V3), each with
case slots carrying a deep case label (SUB, OBJ, PAR, REC) and semantic
marker codes (e.g. OF OH; IT IC CO).]
Figure 3. Lexicon File Format in LISP S-expression
[Figure 4 shows an entry vector and a header list pointing to
morphological feature records (MFR) and to the lexicon records for
syntactic analysis. The remaining labels are illegible in the source.]
Figure 4. Lexicon Data Base Structure for Analysis
The head character of the lexical unit is 
used as the record key for the hashing algorithm 
to generate the addresses in the VSAM files. 
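The division between main storage and disk, and the use of the head character as the hashing key, can be sketched as follows. This is an in-memory illustration only and not VSAM access: the bucket count, the modulo hash, and the class `LexiconStore` are assumptions, and a list of dictionaries stands in for the disk file.

```python
N_BUCKETS = 64  # hypothetical number of address slots

def bucket_of(lexical_unit):
    """Hash the head (first) character of a lexical unit to a bucket."""
    return ord(lexical_unit[0]) % N_BUCKETS

class LexiconStore:
    """Lexical items stay in memory; features live in 'disk' buckets."""

    def __init__(self):
        self.items = set()                                  # main storage
        self.buckets = [dict() for _ in range(N_BUCKETS)]   # stands in for disk

    def put(self, lexical_unit, features):
        self.items.add(lexical_unit)
        self.buckets[bucket_of(lexical_unit)][lexical_unit] = features

    def get(self, lexical_unit):
        if lexical_unit not in self.items:   # cheap in-memory check first
            return None
        return self.buckets[bucket_of(lexical_unit)][lexical_unit]

store = LexiconStore()
store.put("送る", {"pos": "verb"})
store.put("送信", {"pos": "action noun 1"})
```

Because only the head character is hashed, lexical units that begin with the same character fall into the same bucket, so an unknown word is rejected from main storage without touching the feature records.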
8. Conclusion
We have reached the opinion that it is 
necessary to develop a way of allocating semantic 
markers automatically to overcome the ambiguities 
in word meaning confronting the human attempting 
this task. 
Likewise, there remain the problems of how to
find an English term corresponding to a Japanese
technical term not stored in the dictionary, how to
collect a large number of technical terms
effectively and decide the length of compound
words, and how to edit this lexicon data base
easily, accurately, safely and speedily.
In lexicon development for a huge volume of
Yougen, it is quite important that we have a way
of automatically collecting many usages of verbal
case frames, and we suppose that different
case frames exist in different domains.
Acknowledgements
We would like to thank Mrs. Mutsuko
Kimura(IBS), Toyo Information Systems Co. Ltd.,
Japan Convention Service Co. Ltd., and the other
members of the Mu-project working group for the
useful discussions which led to many of the ideas
presented in this paper.
[Appendix A: content illegible in the source.]
