Structural disambiguation of morpho-syntactic categorial parsing 
for Korean * 
Jeongwon Cha and Geunbae Lee 
Department of Computer Science & Engineering 
Pohang University of Science & Technology 
Pohang, Korea 
{himen, gblee}@postech.ac.kr 
Abstract 
The Korean Combinatory Categorial Grammar 
(KCCG) tbrmalism can unitbrmly handle word 
order variation among arguments and adjuncts 
within a clause as well as in complex clauses 
and across clause boundaries, i.e., long distance 
scrambling. Ill this paper, incremental pars- 
ing technique of a morpheme graph is devel- 
oped using the KCCG. We present techniques 
for choosing the most plausible parse tree us- 
ing lexical information such as category merge 
probability, head-head co-occurrence heuristic, 
and the heuristic based on the coverage of sub- 
trees. The performance results for various mod- 
els for choosing the most plausible parse tree are 
compared. 
1 Introduction 
Korean is a non-configurational, t)ostpositional, 
agglutinative . Postpositions, such as 
noun-endings, verb-endings, and prefinal verb- 
endings, are morphemes that determine the 
fnnctional role of NPs (noun phrases) and VPs 
(verb phrases) in sentences and also transform 
VPs into NPs or APs (adjective phrases). Since 
a sequence of prefinal verb-endings, auxiliary 
verbs and verb-endings can generate hundreds 
of different usages of the same verb, morpheme- 
based grammar modeling is considered as a nat- 
ural consequence for Korean. 
There have been various researches to dis- 
ambiguate the structural ambiguities in pars- 
ing. Lexical and contextual information has 
been shown to be most crucial for many pars- 
ing decisions, such as prepositional-phrase at- 
tachment (Hindle and Rooth, 1993). (Charniak, 
1995; Collins, 1996) use the lexical intbrmation 
* This research was partially supported by KOSEF spe- 
cial basic resem'ch 1)rogram (1997.9 ~ 2000.8). 
and (Magerman and Marcus, 1991; Magerman 
and Weir, 1992) use the contextual information 
for struct;nral disambiguation. But, there have 
been few researches that used probability intbr- 
marion for reducing the spurious ambiguities in 
choosing the most plausible parse tree of CCG 
formalism, especially for morpho-syntactic pars- 
ing of agglutinative . 
In this paper, we describe the probabilistic 
nmthod (e.g., category merge probability, head- 
head co-occurrence, coverage heuristics) to re- 
duce the spurious atnbiguities and choose the 
most plausible parse tree for agglutinative lan- 
guages such as Korean. 
2 Overview of KCCG 
This section briefly reviews the basic KCCG for- 
malism. 
Following (Steedman, 1985), order-preserving 
type-raising rules are used to convert nouns in 
grammar into the functors over a verb. The 
following rules are obligatorily activated during 
parsing when case-marking morphemes attach 
to nora1 stems. 
• Type Raising Rules: 
np + case-marker 
feature\]) 
v/(v\np\[case- 
This rule indicates that a noun in the pres- 
ence of a case morpheme becomes a functor 
looking for a verb on its right; this verb is also 
a flmctor looking for the original noun with the 
appropriate case on its left. Alter tile noun 
functor combines with the appropriate verb, the 
result is a flmctor, which is looking for the re- 
maining arguments of the verb. 'v' is a w~ri- 
able tbr a verb phrase at ally level, e.g., the 
verb of a matrix clause or the verb of an em- 
bedded clause. And 'v' is matched to all of 
1002 
the "v\[X\]\Args" patterns of the verl, categories. 
Since all case-marked ilouns in Korean occur in 
front of the verb, we don't need to e, mploy the 
directional rules introduced by (Hoffman, 1995). 
We extend the combinatory rules ibr uncm'- 
ried flmctions as follows. The sets indicated by 
braces in these rules are order-free. 
• Forward Application (A>): 
x/(args u {Y}) Y X/Args 
• Backward Application (A<): 
Y X\(Args U {Y}) ==4- X\Args 
Using these rules, a verb can apply to its 
arguments in any order, or as in most cases, 
the casednarked noun phrases, which are type- 
raised flmctors, can apply to the, al)t)roi)riate 
verbs. 
Coordination constructions are moditied to 
allow two type-raised noml 1)hrases that are 
looking tbr the saxne verb to combine together. 
Since noun phrases, or a noun phrase and ad- 
verb phrase, are fimctors, the following compo- 
sition rules combine two flmctions with a set 
vahle al'gulnents. 
• Forward Composition (B>): 
X/(X\Ar.q.sx) Y/(Y\Arg.sy) ==~ 
x/ (X\ ( A,.:j<,: u ) ), 
Y = X\Arqsx 
• Backward Comi)osition (B<): 
Y\Arg.sy X\(Ar.q.sx U {Y}) ===> 
X\(A'rgsx U Arosy) 
• Coordination (~): 
X CONJX ~ X 
3 Basic morph-syntactic chart 
parsing 
Korean chart parser has been developed based 
on our KCCG modeling with a 10(},0()0 mor- 
pheme dictionary. Each morpheme entry in 
the dictionary has morphological category, mor- 
photactics connectivity and KCCG syntax (:at- 
egories tbr the morpheme. 
In the morphological analysis stage, a un- 
known word treatment nmthod based on a mor- 
pheme pattern dictionary and syllable bigrams 
is used after (Cha et al., 1998). POS(part-of 
speech) tagger which is tightly coupled with 
the morphological analyzer removes the irrele- 
wmt morpheme candidates from the lnorpheme 
graph. The morpheme graph is a compact 
representation method of Korean morphologi- 
cal structure. KCCG parser analyzes the mor- 
pheme graph at once through the morpheme 
graph embedding technique (Lee et al., 1996). 
The KCCG parser incrementally analyzes the 
sentence, eojeol by eojeol :1 Whenever an eo- 
jeol is newly processed by the morphological an- 
alyzer, the morphenms resulted in a new mor- 
pheme graph are embedded in a chart and an- 
alyzed and combined with the previous parsing 
results. 
4 Statistical structured 
disambiguation for KCCG parsing 
Th(' statistics which have been used in the ex- 
perinlents have been collected fronl the KCCG 
parsed corpora. The data required for train- 
ing have been collected by parsing the stan- 
dard Korean sentence types 2, example sentences 
of grammar book, and colloquial sentences in 
trade interview domain 3 and hotel reservation 
domain 4. We use about; 1500 sentences for 
training and 591 indq)endent sentences for eval- 
uation. 
The evaluation is based on parsewfl 
method (Black el, a\]., 1991). In the evalu- 
ation, "No-crossing" is 1;11o number of selltellces 
which have no crossing brackets between the 
result and |;tie corresponding correct trees of 
the sentences. "Ave. crossing" is the average 
number of crossings per sentence. 
4.1 Basic statistical model 
A basic method of choosing the nlost plausible 
parse tree is to order the prot)abilities by the lex- 
ical preib, rences 5 and the syntactic merge prob- 
ability. In general, a statistical parsing model 
defines the conditional probability, 1"(71S), for 
each candidate tree r tbr a sentence S. A gener- 
ative model uses the observation that maximis- 
ing P(% S) is equivalent to maximising P(rIS) 6. 
1Eojeol is a spacing unit in Korean and is similar to 
an English word. 
2Sentences of length < 11. 
aSentences of length < 25. 
4Sentences of hmgth _< 13. 
5The frequency with which a certain category is as- 
sociated with a morpheme tagged for part-of-speech. 
c'P(S) is constmlt. 
1003 
Thus, when S is a sentence consisted of a se- 
quence of morphemes tagged for part-of-speech, 
(w~, t~), (w2, t2), ..., (w,,, tu), where wi is a i th 
morpheme, ti is the part-of-speech tag of the 
morpheme wi, and cij is a category with rela- 
tive position i, j, the basic statistical model will 
be given by: 
r* = arg,~x P(rl,S' ) (1) 
(2) = argn~x P(S) 
,~ argmaxP(T,S ). (3) T 
The r* is the probabilities of the optimM parse 
tree. 
P(r, S) is then estimated by attaching proba- 
bilities to a bottom-up composition of the tree. 
P(r,S) = II P(cij) (4) 
cij ~T 
= H (P(eiilcik'ck+'J) 
cij ET 
xP(cik)P(cl~+lj)), (5) 
i<k<j, 
if cij is a terminal, 
the,  P(c j) = 
and 
frcquency(cij, ti, wi) 
frequency(ti, wi) ' (6) 
frequency(eli, cik, Ch+lj) (7) 
P(eijleik, C~+lj) ~ frequency(cik, ck+lj) 
The basic statistical model has been applied 
to morpheme/part-of-speech/category 3-tuple. 
Due to the sparseness of the data, we have 
used part-of-speech/category pairs 7 together, 
i.e., collected the frequencies of the categories 
associated with the part-of-speeches assigned to 
the morpheme. Table 1 illustrates the sample 
entries of the category probability database. In 
table, 'nal (fly)' has two categories with 0.6375 
mid 0.3625 probability respectively. Table 2 il- 
lustrates the sample entries of the merge prob- 
ability database using equation 7. 
frequency (old ,tl ) 7We define this as P(cljltl) ~ fvcq ...... y(tD " 
Table 3: 
Model 
Results fl'om the Basic Statistical 
Total sentences 
No-crossing 
Ave. crossing 
Labeled Recall 
Labeled Precision 
591 
74.62% 
1.00 
77.02 
79.15 
Figure 1: Sub-constituents for head-head co- 
occurrence heuristics 
Table 3 summarizes the results on an open 
test set of 591 sentences. 
4.2 Head-head co-occurrence heuristics 
In the basic statistical model, lexicM depen- 
dencies between morphemes that take part in 
merging process cannot be incorporated into the 
model. When there is a different morpheme 
with the same syntactic category, it can be a 
miss match on merging process. This linfita- 
tion can be overcome through the co-occurrence 
between the head morphemes of left and right 
sub-constituent. 
When B h is a head morphenm of left sub- 
constituent, r is a case relation, C h is a head 
morpheme of right sub-constituent as shown in 
figure 1, head-head co-occurrence heuristics are 
defined by: 
p(B,LI,. ,Ch ) ~ frequency(B h,r, C h) 
frequency(r, C h) " (8) 
Tile head-head co-occurrence heuristics have 
been augmented to equation 5 to model the lex- 
ical co-occurrence preference in category merg- 
ing process. Table 4 illustrates the sample en- 
tries of the co-occurrence probability database. 
In Table 4, a morpheme 'sac (means 'bird')', 
which has a "MCK (common noun)" ms POS 
tag, has been used a nominative of verb 'hal 
(means 'fly')' with 0.8925 probability. 
1004 
Table 1: Sample entries of the category probal)ility database ('DII' Ineans an '1' irregular verb.) 
P()S, morpheme category probability 
DII, nal v\[D\]\ {np\[noln\]} 0.6375 
DI1, hal v\[D\]\{np\[noln\],nl)\[acc\]} 0.362,5 
DI1 v \[D\] \ {rip \[nora\] } 0.3079 
DI1 v\[D\]\ {np\[llOm\],np\[acc\] } 0.2020 
Table 2: Sample entries of' syntactic merge probability database 
left; category 
~, / ( ~ \,u,\[,,o,,l\]) 
~,/(~, \,,p lace\]) 
right category 
v\[D\]\ {np\[noml,np\[acc\]} 
v\[D\]\ {,,p \[,lo,,,\],,u,\[acd } 
inerged category 
v\[D\]\{,,p\[acd} 
v\[D\]\ {ni)\[nonl\] } 
probability 
0.0473 
0.6250 
nl, (v/(v\nont))\n t , v/(v\np\[nom\]) I).2197 
The modified model has been tested Oil the 
same set of the open sentences as in the 1)asic 
model ext)eriment. 'l~fl)le 5 smnmarizes the re- 
sult of these expcwiments. 
• Ezperimcnt: (linear combination af th, c ba- 
sic model and the head-h, cad co-occurrence 
heuristics). 
P(% s) 
eij { r 
+/~p(\]/' I,,., c*')) 
× P(~,ik)~'(~,k+,;)), (9) 
i < k < j, 
if cij is a terminal, 
~J,.,;',~ p(c#i) = P(c.~:i I~g, td. 
Ta,bh; 5: Results from the Basic: Statistical 
Model t)lns head-head co-occurrence heuristics 
Total sentences 591 
No-crossing 81.05% 
Ave. crossing 0.70 
Labeled Recall 84.02 
Labeled Precision 85.30 
4.3 The coverage heuristics 
If" there is a case relation or a modification re- 
lation in two constituents, coverage heuristics 
designate it is easier to add the smaller tree to 
the larger one ttlan to merge the two medium 
sized trees. On the contrary, in the coordination 
relation, it is easier to nmrge two medium sized 
trees. We implemented these heuristics using 
/;tie tbllowing coverage score: 
Case relation, modification relation: 
COV_scorc = 
left subtrec coverage + riqh, t sub/roe coverage. (j_()~ 
4 × ~7~ ,~,,bt,.~,.,~ ~o,,,,',,e ;< ','i:jl,,i ~,,b>'.e eo,,~',',,.~,'. " 
Coo~d'iuatio'n: 
COV_sco'rc = 
'e x x/left .~,a,l.,'~c. ~o.,,,.~,.,,.:,.. x ,'#lht .~,O,l,,.,, ,:o.,,~,.,,,~ 1~ 
leJ't subtree cove,.aqe + R~ ~b~r('.e ~;.~t . 
A coverage heuristics are added to the basic: 
model to model the structural preferences. Ta- 
ble 6 shows the results of the experinlents on 
the same set of the open sentences. 
• Ezpcriment: (the basic model to th, c 
COV_scorc heuristics). We have used (;tie 
COV_.sco're as the exponent weight feature 
for this experiment since the two nmnl)ers 
arc; in the different nature of statistics. 
P(7-, S) = H (P(ciJ\] cik, Ok+l J) l-COV-'sc°rc 
eij CT 
×p(~k)p(c~+,j)), (1~,) 
i<k<j, 
if Cij iS a terminal, 
o,:,, P(~.j) = 1)(c~.jl~,~, ~d. 
1005 
Table 4: Sample entries of co-occurrence probability database. 
head-head co-occurrence probability 
(MCC <ganeungseong>,np\[nom\],HIl.< nob>) 0.8932 
(MCK<sae>,np\[nom\],DIl<nal>) 0.8925 
(MCK<galeuchim>,np\[acc\],DIeu<ddaleu>) 0.8743 
Table 6: Results from the Basic Statistical 
model plus Coverage heuristics 
Total sentences 591 
No-crossing 80.13% 
Ave. crossing 0.81 
Labeled Recall 82.59 
Labeled Precision 83.75 
5 Summary 
We developed a morpho-syntactic categorial 
parser of Korean and devised a morpheme- 
based statistical structural disambiguation 
s(;henles. 
Through the KCCG model, we successthlly 
handled difficult Korean modeling problems, in- 
chtding relative free-word ordering, coordina- 
tion, and case-marking, during the parsing. 
To extract the most plausible parse trees ti'om 
the parse forest, we have presented basic statis- 
tical techniques using the lexical and contextual 
information such as morpheme-category proba- 
bility and category merge probability. 
Two different nature of heuristics, head-head 
co-occurrence and coverage scores, are also de- 
veloped and tested to augment the basic statis- 
tical model. Each of them demonstrates reason- 
able t)ertbrmance increase. 
The next step will be to devise more heuristics 
and good combination strategies tbr the differ- 
ent nature of heuristics. 

References 

E. Black, S. Abney, D. Flickenger, C. Gdaniec, 
R. Grishman, P. Harrison, D. Hindle, R. In- 
gria, F. Jelinek, J. Klavans, M. Liberman, 
M. Marcus, S. Roukos, B. Santorini, and 
T. Strzalkowski. 1991. A Procedm'e for 
Quantitatively Comparing the Syntactic Coy- 
erage of English Grammars. In Prec. of 
Fourth DARPA Speech and Natural Lan- 
guage Workshop. 

Jeongwon Cha, Gcunbae Lee, and Jong-Hyeok 
Lee. 1998. Generalized unknown morpheme 
guessing for hybrid pos tagging of korean. 
In Pwceedings of Sixth Workshop on Very 
Large Corpora in Coling-ACL 98, Montreal, 
Canada. 

E. Charniak. 1995. Prsing with Context-Free 
Grammars and Word Statistics. Technical 
Report CS-95-28, Brown University. 

M. Collins. 1996. A New Statistical Parser 
Based on Bigram Lexical Dependencies. In 
Proceedings of th, e 3/tth Annual Meeting of the 
A CL, Santa Cruz. 

D. Hindle and M. Rooth. 1993. Structural am- 
biguity and lexical relations. Computational 
Linguistics, 19(1):103-120. 

B. Hoffman. 1995. ~7~,c Computational Analy- 
sis of the Syntax and Interpretation of 'if;roe" 
Word Order in Turkish. Ph.D. thesis, Univer- 
sity of Pennsylwmia. IRCS Report 95-17. 

Wonil Lee, Gennb:m Lee, and Jong-Hyeok Lee. 
1996. Chart-driven connectionist categorial 
t)arsing of spoken korean. Computer process- 
ing of oriental s, Vol 10, No 2:147-- 
159. 

D. M. Magerman and M. P. Marcus. 1991. 
Parsing the voyager domain using t)earl. In 
In Prec. Of the DARPA Speech and Natural 
Language Workshop~ pages 231-236. 

D. M. Magerman and C. Weir. 1992. Ef 
ficiency, robustness and accuracy in picky 
chart parsing. In In Prec. Of the 30th An- 
nual Meeting of the Assoc. For Computa- 
tional Linfluisties(ACL-92), pages 40 47. 

Mark Steedman. 1985. Dependency and Coor- 
dination in the Grammar of Dutch and En- 
glish. Language, 61:523 568. 
