HUMOR-BASED APPLICATIONS 
G~ibor Pr6sz6ky 1,2 
1 MORPHOLOGIC 
F6 u. 56-58. I/3 
H-1011 Budapest, Hungary 
E-mail: h6109pro@ella.hu 
Miki6s Pill 1 
20PKM Comp. Centre 
Honvdd u, 19. 
H-1055 Budapest ,Hungary 
E-mail: h6109pro@ella.hu 
I~fiszl6 Tihanyi 1,3 
3 INSTITUTE FOR LINGUISTICS 
Szinh,qz u. 5-9. 
H- 10 l 4 Budapest, Hungary 
E-mail: h1243tih@ella.hu 
INTRODUCTION 
There are several linguistic phenomena that can 
be processed by morphological tools of aggluti- 
native and other highly inflectional languages, 
while processing of the same features need syn- 
tactic parsers in case of other languages, like 
English. There are, however, only a few morpho- 
logical systems that are both fast enough and 
linguistically sound. I-lum0r, a reversible, string- 
based, unification approach is introduced in the 
paper that has been used for creating a variety of 
lingware applications, like spell-checking, hy- 
phenation, lemmatization, and of course, full 
morphological analysis and synthesis. Having 
discussed the philosophical and design decisions 
we show the above mentioned systems and then 
survey some competing approaches. 
DESIGN PHILOSOPHY OF HUMOR 
Several philosophical commitments regarding the 
NLP systems are summarized in Slocum (1988). 
ttumor has been designed according to the Slocum 
requirements. It is language independent, that is, 
it allows multilingual applications. Besides ag- 
glutinative languages (e.g. Hungarian, Turkish) 
and highly inflectional languages (e.g. Polish, 
Latin) it has been applied to languages of major 
economic and demographic significance (e.g. 
English, German, French), H,mor overcomes 
simple orthographic errors and mis-typings, thus 
it is a fault-tolerant system, The morphological 
analyzer version, for example, is able to analyze 
Hungarian texts from the 19th century when the 
orthographic system was not as uniform as 
nowadays. Word-forms are first "autocorrected" 
into the standard orthography and then analyzed 
properly. 
Example 1 : fault-tolerance in Humor 
>> hiv6 
hiv => hiv\[V\] q <3\[PART\] 
(calling) 
>> galyak 
galy => gaily\[N\] F ak\[}?L\] 
(.f e1~ c'e s ) 
Humor descriptions are reversible. It means that 
there is an oppo1Iunity to input a stem and sev- 
eral suffixes and the system generates every pos- 
sible word-form satisfying the request 
Example 2: reversibility 
Analysis: 
>> hAzaddal (~ith your house) 
hAz\[N\] + ad\[PERS-SG-2\]+daI\[\]NS\] 
(yon say) 
+ sz \[PERS-SG-2 \] 
(you say) 
asz\[PERS-SG-2\] 
>> mends z 
mend \[V\] 
>> mondas z 
molld IV\] 
Synthesis: 
>> hAz\[N\] + \[PERS-SG-2\] I \[INS\] 
hfizadda\] (with your house) 
>> trend\[V\] + \[PERS-SG-2\] 
raondsz, ll~ondasz (you say) 
The basic strategy of Humor is inherently suited to 
parallel execution. Search in tire main dictionary, 
secondary dictionaries and affix dictionaries can 
happen at the same time. What is more, a simul- 
taneous processing level (higher than morphol- 
ogy) based on the same strategy is under devel- 
opment. 
7270 
In real-world applications, number of linguistic 
rules is an important source of grammatical com- 
plexity. In the Humor strategy there is a single rule 
only that checks unifiability of feature graphs of 
subsequent substrmgs in the actual word-form. It is 
very simple and clear, based on surface-only 
analyses, no transformations are used; all the 
complexity of the system is hidden in the graphs 
describing morpho-graphemic behavior. 
Humor is ri.~orously tested on "real" end-users. 
Root dictionaries of the above mentioned la,- 
guagcs contain 25.000 -100.000 eutries. The Hun- 
garian version (90.000 stems) has been tested in 
every-day work since 1991 both by researdmrs of 
the Institute of" Linguistics of the Hungarian Acad- 
emy of Sciences (Prdszdky and Tihanyi 1992) and 
users of word-processors and Drl'p systems (Humor- 
based proofing tools have been licensed by Micro- 
solt. Lotus and other software developers). 
MORPIIOLOGICAI, PRO(:ESSES 
SUI'I'ORTI<D BY HUMOR 
The morphological analyzer is the kernel module 
of the system: ahnost all of the applications derived 
From Humor based on it. Humor has a guessing strat- 
egy that is based on orthographic, moqJho- 
phonological, morphological and lexical properties 
oF the words. It operates after the analysis module, 
mostly used in the sl)ellmg checkers based on Hu- 
mor and m the above mentioned 19th centmaj cor- 
pus application. 
5)¢nthesis is based on analysis, that is, all the pos- 
sible moq3hemic combinations built by the core 
synthesis module are filtered by the analyzer. 
Examlfle 3: synthesis steps 
>> mond\[VI ~ \[\]?P,E'3-,~ZF-2\] 
(I) Concrete morphemes instead 
of abstract morphs: 
(2) String concatenation: 
mort.de\] ,mondo\] , i~o~d6\], mo~ds z, 
111()I\] (~;-t,~l Z t i'tt()~Id<~ ,~1 Z 
(3) Analysis, one by olle: 
%monde\] , %mon(io\], %mond61, 
monds z, mondas z, %laoIld c~ s z 
(4) Filtering: 
~fLC)/ld£~ Z 
ICL()IIdL~ ~; Z 
For internal use we have developed a defaulting 
subsystem that is able to propose the most likely 
inflectional paradigm(s) for a base word. There are 
only a few moq)hologically open word classes in 
the languages we have studied Paradigms that are 
difficult to classify are generally closed; no new 
words of the language follow their morpho- 
graphemic patterns. The behavior of existing, pro- 
ductive paradigms is rather easy to describe algo- 
rithmically. 
The coding subsystem of Slocum (1988) is repre- 
sented by the so-called paradigm matrix of Humor 
systems. It is defined Ibr every possible allomorph: 
it gives infornmtion about the potential behavior of 
the stem allomotph before moqJhologically rele- 
vant affix families. 
COMI'ARISON WITII ()'FilER 
METIIOI)S 
There are only a few general, reversible morpho- 
logical systems that can be used for more than a 
single language. Besides the well-known two-level 
morphology (Koskenniemi 1983) and its modifica- 
tions (Katlttmen 1985, 1993) we mention the 
Nabu system (Slocum 1988). Molphological de- 
scription systems without lmge implementations 
(like the paradigmatic morphology of Calder 
(1989), or Paradigm Description Language of 
Anick and Artemieff(1992) are not listed here, be- 
cause their importance is mainly theoretical (at 
least, for the time being). Two-level morphology is 
a reversible, orthography-based system that has 
several advantages from a linguist's point of view. 
Namely, the morpho-phonenfic/graphemic rules 
can be tbrmalized in a general and very elegant 
way. It also has computational advantages, but the 
lexicons must contain entries with diacritics and 
other sophistications in order to produce the needed 
surface Yorms. Non-linguist users need an easy-to- 
extend dictionary rote which words can be inserted 
(ahnost) automatically. The lexical basis oF Humor 
contain surface characters only and no transforma- 
tions are applied. 
Compile time of a large Humor dictionary (o\[ 
90.000 entries) is 1 2 minutes on an average PC, 
that is another advantage (at least, for the linguist) 
if comparing it with the two-level systems' compil- 
ers. The result of the compilation is a compressed 
structure that can be used by any applications de- 
rived from Humor. The compression ratio is less 
than 20%. The size of the dictionary does not in- 
fluence the speed of the run-time system, because a 
special paging algorithm of our own is used. 
1271 
Example 4: Helyette, the monolingual thesaurus with morphological knowledge 
vlorphoLogic- Hclyette \[\]1 
Menu 
Input Output 
\[h,~zad~val \[n~z~te\[ed~vel \] 
R__oots Synonyms 
h~z 
Meanings 
h~z 
Example 5: MoBiDie, tile bilingual dictionary with morphological knowledge 
AorphoLogic- MoBiDi~ 
_Language Dictionary Entry _Clipboard _User Help 
\[nput-English Headwords found I d°tie  lldu  
Meaning~:Hunflafian Headword li~¢t 
illet(.~k durables ~ duration 
vfim dutiable 
b ictio_nafes 
duty a~sessment 
duty~lree 
Euro Commercial Paners 
7272 
HUMOR-BASED IMPLEMENTATIONS 
Humor systems have been implemented (at various 
depth) for English, German, French, Italian, Latin, 
Ancient Greek, Polish, Turkish, and it is fiflly im- 
plemented for Hungarian. The whole software 
package is written in standard C using C4-1 like 
objects. It runs on any platforms where C compiler 
can be found ~ . The Hungarian morphological ana- 
lyzer which is the lalgest and most precise imple- 
mentation needs 900 Kbytes disk space and around 
100 Kbytes of core memory. The stem dictionary 
contains more than 90.000 stems which cover all 
(approx. 70.000) lexemes of the Concise Explana- 
tol~v Dictionary of the Itungarian Language. Suf- 
lix dictionaries contain all the inflectional suffixes 
and the productive derivational nmrphemes of pres- 
ent-day Hungarian. With the help of these dictionar- 
ies Humor is able to analyze and/or generate around 
2.000.000.000 well-formed Hungarian word-forms. 
Its speed is between 50 and 100 words/s on an av- 
erage 40 MHz 386 machine. The whole system can 
be tuned 2 according to the speed requirements: the 
needed RAM size can be between 50 and 900 
Kbytes. 
There are several Hum0r subsystems with simplified 
output: lemmatizers, hyphenators, spelling checkers 
and correctors. They (called Hdys~Lsm, HslyessI and 
Hsly~s-e?, respectively) have been built into several 
word-processing and full-text retrieval systems' 
Hungarian versions (Word, Excel, AmiPro, Word- 
Perfect, Topic, etc.). ~ 
Besides the above well-known applications there are 
two new tools based on the same strategy, the re- 
flectional thesaurus called Hdy#t8 (Prdsz6ky and 
Tihanyi 1992) and the series ofintdligent bi-lingual 
dictionaries called NoBi0i0. Both are dictionaries 
with morphological knowledge: Hdysff0 is monolin- 
gual, while NoBil)i0 - as its name suggests -- bi- 
lingual. Having analyzed the input word the both 
systems look for the found stem in the main diction- 
ary. The inflectional thesaurus stores the reforma- 
tion encoded m the analyzed affixes and adds to the 
synonym word chosen by the user. The synthesis 
module of Humor starts to work now, and provides 
the user with the adequate inflected form of tim 
word in question. This procedure has a great impor- 
tance in case of highly inflectional languages. 
The synonym system of Hslysff8 contains 40.000 
headwords. The first version of the inflectional the- 
saums HdysH8 needs 1.6 Mbytes disk space and 
runs under MS-Windows. The size of the MoBiDic 
dictionary packages vary depending on the applied 
terminological collection. E.g. the Hungarian-- 
English Business Dictionary (Example 4) needs 1.8 
Mbytes space. 4 
Besides the above mentioned products, a Hungarian 
grammar checker (called HsIy6~6bb) and other syn- 
tax-based (and higher level ) mono- and multilin- 
gual applications derived also from the basic Hum0r 
algorithm are under development. 
REFERENCES 
Anick, P. and S. Artemieff (1992). A High-level 
Morphological Description Language Exploiting 
inflectional Paradigms. Proceedings of 
COLING-92, p. 67 -73. 
Calder, J. (1989). Paradiglnatic Morphology. Pro- 
ceedings of 4th Conference of EACL, p. 58-65. 
Karp, D., Y. Schabes, M. Zaidel, and D. Egedi 
(1992). A Freely Available Wide Coverage 
Morphological Analyzer for English. Proceed- 
ings ofCOLING-92, Vol. III. p. 950- 954. 
Koskenniemi, K. (1983). Two-level Morphology." 
A General Computational Model for Word- 
form Recognition and Production. Univ. of 
Helsinki, Dept. of Gen. Ling,, Publications 
No. 11. 
Pr6sz6ky, G. and L. Tihanyi (1992). A Fast Mor- 
phological Analyzer for Lemmatizing Corpora 
of Agglutinative Languages. In: Kiefer, F., G. 
Kiss, and J. Pajzs (Eds.) Papers in Computa- 
tional Lexicography -- COMPLEX !92, Lin- 
guistics Institute, Budapest, p. 265-278. 
Pr6sz6ky, G. and L. Tihanyi (1993). Helyette: In- 
flectional Thesaurus for Agglutinative Lan- 
guages. In: Proceedings of the 6th Conference 
of the EACL, p. 173. 
Slocum, J. (1983). Morphological Processing in the 
Nabu System. Proceedings of the 2nd Applied 
Natural Language Processing, p. 228-234. 
t /Jp to now, DOg, Windows, 0S/2, I hfix ~md M:ldllW~ 
ctwironlnCldS \]lave bccll tested. 
2 I ivcn by Ihc cnd-uscrs. 
3 For O\] '~M DulncIs flmrc is a wall-defined API to It.r~0t. 
MoBil)i~'s hu~guage ~mdlic aid not aH)lication ~mdlicpatis 
need not lm multiplied because vocabttlall.cs oft}le .'-;~urtc lall 
gtlages use a single conllnon tn,.phologicaI knowledge base. 
1273 
