PROCESSING CLINICAl NARRAI\]}VES IN IIUNGAR\]:AN 
G4bor PrGsz~.ky 
National Lduca Liona\] Library arid Museum 
Computer Oepar l:hqen 
Flonv~.d u. \]9. 
II.- \].055 Budal~{.'st 
IIUNGARY 
ABSTRACT 
IIHs pape\]' describes a systemU~at extracl:s in1:ormal:ion 
lrom 14urlgarlan descriptive texLs o£ medical domain. 
lexLs of cZinJca\] nam'atJves defim; a sublanguao e 
that uses \]imite.d synLax bui: holds the main character- 
istJcs o£ t:he language, namely #roe word order and 
r;ich mm'pho\]ogy. We offer a fairly general parsirlo 
method for :\[ree word orde£ \]anguaoes and the way how 
to use ~-1: for parsing Hungarian c\]in~ca\] texts, lhe 
system can hand\]e si.mple cases e\[ el.\]ipses, anaphoT'a, 
Luqknown woFd8 al/{\] LypJ ca:\[ abbrev:i.at\]ons ef e\] \] n:iea\] 
practice. lhe system trans\].ates texts {d! anamneses, 
paLJeet v:isiLs, \]ahoral:ory l:esl:s, medlcai examinations 
aed discharge sumnlaries Jnl:o an JnJ:ornlaL\]on :\[ormaL 
u'.~abZI.u for a ined\]ea:l exper I: aystem. Szm:i.\] aKly to Lhis 
expert system, Uqe :in1:ormal;:i.on formattinO prograln has 
been wr:il:Len in MPROI OG language and JLs experJmenta\] 
versj.on \['lllls OIl PROPER-\]6, a tlungarian made (\]BM-XI 
compatib\]e) m\]c~:ucomputer. 
\]. OVERV\]:I:W 
.l:n the pasl: :Jew years we Imve deve\]oped a comHutatJona\] 
system te analyze I lungariar~ I:exLs oeJog a morpholo- 
gica\] ana:l.yzer (Ih'risx~ky e\[ a\] 1982) and a genre:a\] 
pars\]no tn:o!jram ca\]led ANAGRAMMA (=ANA\]yt\](: \[;RAMMAr) 
(Pr6szdky 1984). Ihe who\]e sysLeln for JllJ'Orlllat:Jon 
fc\]rll\]a L L\] \[1N is hosed on Ltlese modules alld co, isis Ls oY' 
five coMsoquer/t parts: (i) morpho\]ogJr:ai ana:l.ysJs, 
(J J ) nornlalizal;ion, (~ i:i ) pak'sJrl!j, (iv) evallJa LJcJN, 
(v) mappino into the Jflformal:h)n formaL. Ihe lasl: 
\[\]\]ock JIS all eperaLJon that converts the output o\[ 
ANAGRAMMA, which Js (\]J)+(JJi)~(Jv), to a :\[OFIIlat Lhal: 
can be use.d by a ned:lea\] export oystem. \]he approach 
leads to a structure shown by Figure 1. 
l   TFATff  u,p  t ae.ten e 
~A ---- ~_.L~ ~ ......... , COMMON " " -f ~---. -- \[..~A\[L, UHO ACIUAL 
1 MORPIt It OGJ.CAI_ ANAl YZEk~ SPFCZAI 
sequellce ei' categori -- -~SA -~ 
NAGRAMMA | NORMA T~--~r~-- ~.I~ULE SYSTEM\] 
NORMAL. RULES \] 
J JJ normalized sequence of 
I II categories 
I MI'I ..... ----------~__L~ CYCL ICAL RUlE SYS. 
Jill ~ PARSING RUL\[{S 
1oil sequence o:\[ parLs of speeel 
....... -~ EVALUAI ING ~-~ 
........ VVALUATIN6 R~ 
IF .....    MAPPIN  
II f e MAPPZN{; , 
LJ ormatt. n.\[ormat\]on 
\[\]ASk- ~-----'} 
Figure \] 
2. MORPIIOI.OGICAL ANALYSIS 
II/0 £Jrtit phase :i.,s the \[mrphoi!ogical ana\].vs;Js o\[' w\[~rd 
forms. Ilunflarian \]oa free word order language~ there- 
:\[LiFe the I'o\]_e oJ: .~;u:\[f:i xes is very important #rom 
the viewpoJ.nL of ideM:i1:ying phrasal eons LJtuents. A 
lot o1: oynl;actJc and ~\](.mlanLJc in£ormaUon (iRJlnber, 
person, possessien, case, Lense, mood etc.) are 
car;.'\]ed by these e\]emenl;s. The cone~Yl:enwL\]oo of stem 
and suf1:Jxes J s someti.mes \]'atller complex: U/ere ace 
suffixes that have di1:fererYL stem--dependent 1:erms 
a\[id stem'.:\] that have different suffix-dependent forms. 
lhus the \]exJ.coH musL conLaJri a\]:l Lhe peso:iDle vari- 
anl:s o1: the sLems as ir/depenclenL enLc~os or we have 
I:o def\]fle an a\].!\]or\]thm feF collstrucLirlo I:hR real 
stem,.; \[rom Lhe arclfi.l~honemea o1." Lhe \]ex:\]enn. We have 
eheuen the 1:ormer a\]/:ernaL~ve. 
Ihe :jex:ieoll consists of four parts but only con- 
cepLua\]\].y. From l:he point o\]: view of the a\].gorJthm, 
J'L J s an Jrltegra\] who l.e. The reasons why wo dJ.s- 
L:iHgu:ish :its parts are. as 1:ellows: 
(J) All the N\[ processing proorams o1: an aggluLinaL- 
J ve :\[aiigl.ta!\]e nlus L knew all_.1: :!.he g.~lElmaLical~jgr~hemes 
of l:ix.' \]anfluage. 
(\] \] ) File d:ictJ oI~ary n£ c onlmo~LxpressJuer\]s J s not 
i,~cPssary hut JL is a useful part of ai\] NI sysLems. 
lhis modu\]e can be eli\[a\]:'!jud by the user. 
\[ i I ) II1,. cib bd(J \[ I L.A El,till ~,\[Jiil,{\]\]~) I,IU\] {~ 0;: i (.,)~ \[\] ~ J 
I:he IexJna\]. elements l:haL is needed for Lhe acl:ua\] 
I:ype of applicaL:ion (I)\[I queryillg, updaLJno, inform- 
al:Joe exl:racl:Jon, Lrans\]al:Jon etc.). 
(iv) The ~peeial \]exi.con conl:ains terms ef l:he aeLua\] 
applLca\[Jon field ~ our case the \[e£111s o£ medical 
scJ once). Ih:i s fllodu\] e Call, O\[ course, he en\] arced hy 
I;he user. 
After updatJn!\] the :lexicon, eni:ries w\]\]\] he ar- 
ranged in a\]phahel:Jca\] ord{~r. 
The inorpho\]oqica} analyzer Js a f\]lqJ I;e stale aul:o- 
mates (F~ZTA-~?"I~\[.~:CGR~ ~-~r3-te hc ana\]yzed from the 
:inpul: sequence of words and searches the dictionary 
in order t\[\] find the :i.npuL werd. :\[f the \]eft part of 
the word matches a dietiunary entry, Ule enLry's Jn- 
formatJona\] iJari: must be cepJed I:o the worl<\]ng buffer. 
lhe content of this buffer wJl\] be the input to l:he 
,sy~rLactical analyzer. Then the automaton begins to 
work from right i:o \].e\[t. Its oul:put is the sequence 
of l:he Jnformatiena\] parts of i:he grammatical i/lOr- 
pheme,s sLanding al:ter Lhe sLem w\[e ideITLi.fied a short 
while ago. \]:2 l:lle Jn\[ormatJon of t:he stem arld l:he 
suffixes are nol: compaUb\]e or there remained an un- 
prooessed part in l:he word, the a\].go\].'il:l-m\] tries Lo 
ana\]yze the word as a compound once more and if 
this proce,as faJ\]s then J L asks the user what l:o do. 
FLoure 2 J.s an J\]\]ustratlon o£ this llrocess. (The 
"origin" of the erll:ries is marked by G, C, A and S, 
l:hat is (jrammal:Jcal, eommon, actual and special \].exJ-- 
con, respeetive.ly. ) 
3. PROBLEMS \[\]F PARSING 
The we\].\] known nlothods .Henerally ul:i\]ized for pars\]n O 
NLs are no I: eonvP.nlenl: for treal:ing lanouages like 
IlunoarJan , Finnish, Fsthon\]an or Japanese, c1:. (NelJ- 
Inarkka el: a\] i984),(lsujli et a\] 1984),(l'rOsz@ky \[984). 
Zn I:hese \]anguages, the sol:fixes carry out II/OSL of i:he 
365 
LEXICON: 
~lkalommala PERS sznB ~ > G 'his/her/its' <ADV1 times> G 'times' 
~a~_~ <N2 person ... C 'father' 
infarktus <N2 Vdisease S 'infarctior1' 
PERS sing 3> G 'his/her/its' 
k~t NUN det 72 G '-two' 
nak GAS1 dat> g (dative) 
t FIN1 past sing~ G (past sg 3rd pets. 
Vol <V cop g 'was' 
Apj4nak k@t alkalommal volt infarktusa. INPUf: 
'His father had infarction two times. -m- 
N--'ORPHOLOGI~A--~ /~k I I A arkt/~us IANALYSISI /.( \, 
a#lanaK k~i; alkalommal volt inf a 
OUTPUT : PEm ASqr " I FA°V 1 r 1FINq\[ N2 I pER  
ersl~n s~Hat |/~2 / bime~ ~°dp?s~FdisJ ~n~ J eU 
Figure 2 
task of marking gramrnatical function, therefore, the 
word order -- strictly speaking, the phrase order -- 
will be relatively free. So we must turn our attention 
to (i) the internal structure of the phrases and (it) 
the order of phrases (and ale intonation, of course 
only in speech) that plays an ~mportant role in ex- 
pressing communicative functions. 
The basic idea of -the strategy we propose builds on 
the invariants of the sentence struchlre of free word 
order languages, that is, (i) the first thing to do is 
to recognize -the internal structure of the parts of 
speech and (ii) Wm second Js to interpret their rela- 
tive order. \[his order is connected with the communi- 
cative ,o.,.~- ~ (topic, focus etc.) of thc structurc. 
The sang tactic analysis of free word order,sentences 
is based upon -the morphemes identified by morphological 
analysis. The lexicon cannot help us to give the actual 
functional role of a morpheme because of two reasons: 
(in All possible functional roles of a morpheme cannot 
be listed. 
(it) If there were severa\] possible roles in the de- 
scription of morphemes nobody would know which of them 
to use actually. 
4. UNKNOWN ELEMENTS 
The problems of the unknown elements can arise not only 
in the ease of computational analysis, since people may 
read/hear morphemes never read/heard before, yet they 
can identify the actual syntactic role of them without 
any knowledge of any previous syntactical categorization. 
The~or word class of a word is statistical in- 
formation about its occurrence in particular syntactic 
positions. For example, the word 'beteg' can be a noun 
('patient') or an adjective ('sick','ill') in Hungarian. 
It is an adjective in adjectival use, that is without 
inflections or beZore adjectival suffixes: 
'E16zSleg soha nem volt beteg.' 
('He has never been ill before.') 
'Hat napja fekszik betegen.' 
('He has been laid up since six days.') 
The same morpheme can, however, be a noun before nom- 
inal suffixes: 
'A betegnek hem volt infarktusa.' 
('The patient has had no infarctions.') 
Although we consider categorization as a syntactic 
generalization, we do not claim that there are no in- 
dependent syntactical categories. In agglutinative 
languages such categories are e.g.-the nominal suffixes 
just mentioned. These categories are not arbitrary, 
because one cannot introduce a new sufZix to the 
language, but can, however, use new stems Jn the 
366 
sentence. If the parser knows these regularii;ies, then 
lexical categories will be used for control only. 
5. SENTENCE SIRUCTURE IN AGGLUTINATIVE LANGUAGES 
Below we will make use of Hungarian examples to show 
-the most important properties of a typical agglutinat- 
ive language. In a simple serltence there can be only 
one finite verbal suffix. If we have a sentence con-- 
raining -two of them, then we have to do with co-ordinate 
clauses or one sentence with a subordinate clause. 
Naturally, the finite suffix is J.mmed~ately preceded 
by a verbal stem. If the sentence has no finite verba\] 
suffixes, (in it contains a O-copula that is rather 
frequent, not orlly in medical -texts but also in the 
every-day Hungarian or (ii) there :i.s ~Jsis in the 
sentence. The non-finite verbal suffixes are also pre-- 
ceded by a verbal stem. These elements can behave 
differently accord4ng to whether or not they influence 
the word order of other elements. 
We consider t, he noun as an element that stands 
before a nominal suffix ~. Sometimes the Iexieon 
does not categorize theJs morpheme as a noun. We con- 
sider this situation as a case of a missin 0 noun. Re- 
generation of missir/g elements is important because 
of identifying elliptical constructions. For example~ 
Hungarian adjectives can have nominal endings when no 
noun occurs in the structure. 
As it seems, most of the morphemes do net have a 
f~xed lexiea\], cateoory ~ because their positions in the 
sentence actually define their functional role. But we 
have some important lexical features: 
(i) Sterns . lhey are closed morphologically to the 
\].eft and open to the right (formal\]y: <stem ). "Open" 
means an abi:lity to join other elements. In the case 
u i" ,,uu, l-lLhe u,,uz~ Liluue "OLIiL~L" elUIilL~llLW aL'U1 .~uL" 
example, Wqe case suf£ixes. 
(it) guZfixes, rhey are closed morphologically to the 
right and open to the left (suffix) ), e.g. the case 
endings. 
(iii) _Open endings. They are open morphologically on 
both sides ( open ), e.g. the morphemes markingplura\]- 
ity or possessivity. 
(iv) Closed elemnts. They are closed on both s~des 
((clo~, e.g. adjectives, numerals, adverbials. 
So, if a closed side immediately precedes an open one 
or an open one a closed one, the parser has to correct 
the "wrong" sequence inserting an empty morpheme: 
(an (stem <closed> --~-<stem suffix><closed> 
(b) < closed> suffix)--~-<c\]osed> {stem suffix# 
Instance (an carl be, for example, a genitive case- 
insertion (as this case ending can sometimes have an 
empty form in Hungarian) and instance (b) can be a 
noun insertion between an adjectival stem and a nominal 
suffix. 
6. PARTS OF SPEECH 
The surface scheme of a Hungarian sentence Js the 
following: 
( <A~<S NT<V NF)')~<V F> ( ~A>'<S N~<V NE>*) ~ 
where A stands for adverbials, S for nominal and V for 
verbal stems, N for nominal case endings, F for finite 
and NE for non-finite verbal suffixes. Hence the t~ 
of the constituents are as follows: 
(in independent' adverbials (without any suffix), 
(it) non-finite verbs (e.g. infinitive, gerund), 
(iii) nominal groups with case ending, 
(iv) a verb plus a finite suffix (the main verb of 
the sentence). 
Having made clear the internal structure of the 
constituents, the parser can deal with the formal 
evaIuation of the connections between the constituents 
(e.g. verb and complements, possessives and possessors 
etc.). In the first part of the parsing we do not \]feed 
any S-symbols, more precisely, ally string over a 
particu\].ar set, the~of~ can serve as S- 
symbol, l he ca:in parts of speech can be described with 
a llelp of the schemes (\])-(iv), but in fact, only (iJ) 
and (iii) are imporLanL. AdverbJa\].s of type (i)usually 
consist of one e\].emellt and every sewLence has one and 
only one strucl:ure of type (Jr). Sentences rather 
frequently consist of more than two constJtuenLs, but 
in a free word order language there is no)~_yj\]ical 
@ri~ ord_ ej'_ of Lhe,~;e constiLuui11:s. Our method is 
based Oll this observatJon. We do not describe the 
structural relations Jn the sentence sequentially from 
the J eft to I:he right end of the selrtencc, gut rules 
fo;:m blocks and these blocks are used Jn an order de- 
pending on tile elenlenLs of Uqe arLual sentence. 
7. ANAGRAMMA 
Between the morpho:LogJca\] ana\]y.sis and the parsing we 
nced a nornla\] izatJon procedure that iliserts the ndssiog 
morphemes ell the basis of the £ornlal lex\]cal properties 
of thc el emciF~.s of the ~nput string. I£ the input s l;rinO 
includes Jnt_e\]~ective signs or words and (J) thJ.s sion 
means subordJnatJon, then Jt seems to be obvious Lo 
take the embedded string out and handle it. \]:iko ao el- 
dependent selrLence, or (ii) Lids sign or ward means 
co-ordJnat:i.orl, then we wJ.\]l elaboraLe the co-ordinate 
structures paralle\]ly. 
S(\], to analyze a simple selYLeuce of Ilungarian, the 
A_NAGR_A_MMA~rsn~ would beg:in wiLh tile quesLlon 'le 
there any subsi:rJng ef tile sol\]bunco ta hR parsed LhaL 
has the form of the first rule's :lct'L hand side?'. \[f 
Lhe answer ls 'yes', the r:ight side of the same rule 
is substituted as many tJme, u as the subs LrJn 0 occurs 
irl the seni:ence. For exampl.e: 
fhe rule: AOJ N2 --,~ N1 
lhe sentence to be parsed: I\]ET ~ADJ N2 GAS\] I}E\[ ~AI\]J N2 CAS1 V FIN1 
lhe result: gEl N\] gAS\] oKr N\] gAS1 V I-IN\]. 
:1:£ the sentence does not contain the suhstrilkg, the 
next rule follows. In this way all rules can be applied 
once orlly, although we would prohab\]y have Lo use them 
mere LhaH onc;c. The repeated use of the l:'ules Call be 
realized with the he\].p of c__~c\]es: 
/~. CASI V AI\]JI --~ ADJ 
": AOJ N2 .... N\] U 
: I)Er N1 --~ N 
7: N CAS\] --~ GAS 
: AOJ CAS\]. --~ CAS 
9: V FIN\] --~ FIN 
70: ... 
Figure 3 
The kernel of the cycle is a sequential rule package 
and its condition is the quantity of rules applled al: 
the last pass over the cycle. I£ it is not O, then the 
algorithm continues at the first rule of the package. 
\]:f it :is O, \]:hat is, there we\]x.' no such app\].ications, 
the rule of the next number has -to be applied. 
A trace of an ANAGRAMS: 
OET ~T ~I\] ~ CAS\] V/~Jl N2 CAST OFT/~I\] N2 CASI V FINI 
5: OET DEI N1 E~SI V ADJI N2 0%ql DET NI EASI V FIN\] 
6: DET N CAST V ADJI ~2 CASI N CASI V FINI 
7: \[\]El CAS V ~I\]\] N2 ~ASI CAL\] V FIN1 
4: \[\]El All\] N2 CASI C#~ V FIN\]. 
5: OET N1 CASI gAS V FINI 
6: N OASI CAS V FIN\] 
7: GAS CAS V FIN\] 
9: CAS C/~ FIN 
Fi,g ~ra 4 
Ihe parsing is over if (i) all o\]emerlts of Lhe actual. 
string to be parsed are \[rOlll tile dJst:ingu~shed set 
(e. O. \[;AS and FIN in the above examplE), or (ii) the 
algorithm ls after tile last rule and there is no accept- 
able cycle-end after this rule. We say that the algo- 
rithm canr~ol: interpret the sentence \]f there have re- 
mained other than distinguished elements, lhe parser 
can operal;e more quickly Jf the ru\].es in the same 
package give the descrJption of the same gramrnatical 
pheHomenon. Such modulcs consist o£ rules \]:he \]eft 
sides of which are sJmJ.\]ar. If a packaoe contains 
only rules whose left side does not contain any e\]emeet 
of Lhe senterlce to be parsed, then it carl be emJtl:ed. 
We can use this method o~ s:imp\] ificatJon wi thou\]; much 
ado, ow:ing to an }{-\]:ike l:ormalism that guarantees that 
llO flew symbols can be bo\['n ae a result of app\]JcatJoi~ 
of the rewriting rules. We use decreasing bar levels 
alike the formal derivation process does wJthe~ent.s. 
8. I:VALUAT\]:\[\]N 
Tho evaluation illodule iU essentially a paLl:ernmatching 
aluori Lhm that identifies the \].Jnl< between (2) bhe 
predicates and their argumenLs, (\].i) the al/aphoric 
elemen I;s and l:heir anLecedents, and (J.iJ) the "para\] \] el " 
structures separated by the norma\].Jzer. The lexica\] 
fornls of predicates contain the surface case endings 
and tile secant:it role of \]:he needed cons I:\] LuenLs, IJ~mc.- 
\]'ere Uqe algor:il:hm has te \].ook for \];hose constituents 
and order the new features giveH to them by bhepre-. 
d\] ca Le. The i den t\] f icai, i on of I:lle allLRcedents ef ana- 
phol:iC clenlenLs \],s similar, but antecudents often occur 
in previous sentences. Therefore tile evaluator can se~ 
up a connection wi.th the analyzed form of the same 
paragraph. 
9. MAPP\]:NG INTO INFORMAlION I ORMA\[ 
After some consultation w:i.th physicians it was possible 
Lo establish the specJa\]Jzed concept classes and the 
~.tLe.rrl'~ ! of the concept ,;lass (.o-occurrence frnnl which 
tile information format could be defined. The nouns Jn 
tile lexicon are subcategor:ized by Lhei\]" membership in 
these c:lasses. Most classes are mapped Jntn the appre i- 
pr\]aLe s:hfl:~ because the names of the classes are the 
labels of the s\]oLs of the frames used by the expert 
system. Figure 5 shows the form of the formatted text: 
ANAMNE Z I S 
6ENETIKUS-FAKTOR FOK apa 
BETEGS\[:G iScllaemJ as szivbetegsdg 
KORELOZM\[~NY MI szorit6 fdSda\]om 
HOL me\]ikas 
MIKOll fizikai terheldsre 
GYAKORISAGA 
KEZEI ES-Et OZMI~NY nitrdt beta-blokkol6 
Flguee 5 
References

K@~in,L.-G.PE~szd9 'RZR @amnar' ~d<.P~o.of \]lwt.of LiI~\[. Nr.l., 
3t~4\]. (19I}5). 

NelJma\[t<ka,E.-H.J~iml- A.Lehto\]a 'ParsJ.rg an Inflectional Free 
hbrd @.~e.r I_anguage' P\[~sc. ECAI-.\[}4, Pisa, 167-\]76 (\]984). 

Pr6szdJ<y,G. 'ANAS~vMA: A PaFs:i.rg Stralegy arT\] 6Fam~ar for ~jgl.utim 
ating Lanc~la~' AbsLr. AI~A-84, Varna (1984). 

Pr6szdg,G.- Z.Kiss -L .T6U1 ~J~£ogica\] and YuqJ:onolog:Lcal 
• "" " O" - ." - i ..... ' /~lalys\].s f ll~azan Wb\[~l tones by Caq~lter \[hlpuLaLlorm~l 

Lir~is:L~. ar d \[Ymputer Lar~,~ Vol. 15., \]9-5--~-\]- 
Sage r,N. Na-lzma_____ll~rocess~ Addison-~bsley, Reading,, 
lvb,~s. (1981). 

\]~jiJ.,3. - 3.Naka~ma - M.Nagao 'Analysis 6racier of 3~qese. in 
U~ • P\[~)-'IEct ' Proc. CllI~L84 Stan£G~J, 267-274 (1984). 

Yarg,V. ~ T.I li~\]j.da : ~.-~i~\]a T'L~, of P blJristic tqqow\]c~J~e J n 
Olir~ \[.a~guq~e /~a\]ysis' Peer. OlINP=84,Star~o\[d,222-225 (1984). 
