<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2208"> <Title>HUMOR-BASED APPLICATIONS</Title> <Section position="4" start_page="0" end_page="7270" type="metho"> <SectionTitle> DESIGN PHILOSOPHY OF HUMOR </SectionTitle> <Paragraph position="0"> Several philosophical commitments regarding NLP systems are summarized in Slocum (1988).</Paragraph> <Paragraph position="1"> Humor has been designed according to the Slocum requirements. It is language independent, that is, it allows multilingual applications. Besides agglutinative languages (e.g. Hungarian, Turkish) and highly inflectional languages (e.g. Polish, Latin) it has been applied to languages of major economic and demographic significance (e.g.</Paragraph> <Paragraph position="2"> English, German, French). Humor overcomes simple orthographic errors and mistypings, so it is a fault-tolerant system. The morphological analyzer version, for example, is able to analyze Hungarian texts from the 19th century, when the orthographic system was not as uniform as it is today. Word-forms are first &quot;autocorrected&quot; into the standard orthography and then analyzed properly.</Paragraph> <Paragraph position="4"> Humor descriptions are reversible. This means that a stem and several suffixes can be given as input, and the system generates every possible word-form satisfying the request.</Paragraph> <Paragraph position="6"> mondsz, mondasz (you say) The basic strategy of Humor is inherently suited to parallel execution. Searches in the main dictionary, the secondary dictionaries and the affix dictionaries can run at the same time. What is more, a simultaneous processing level (higher than morphology) based on the same strategy is under development. In real-world applications, the number of linguistic rules is an important source of grammatical complexity. In the Humor strategy there is only a single rule, which checks the unifiability of the feature graphs of subsequent substrings in the actual word-form. 
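The single unifiability rule can be pictured with a minimal Python sketch. It assumes flat feature dictionaries in place of Humor's feature graphs; the toy lexicon, the feature names (cat, harmony) and the functions unify and analyze are hypothetical illustrations, not Humor's actual data or code.

```python
# Toy lexicon (hypothetical entries and feature names, not Humor's data).
# "left" is what a morph expects of its left neighbour,
# "right" is what it presents to its right neighbour.
LEXICON = {
    "mond": [{"left": {"cat": "start"},
              "right": {"cat": "verb", "harmony": "back"}}],
    "sz": [{"left": {"cat": "verb"},
            "right": {"cat": "done"}}],
    "asz": [{"left": {"cat": "verb", "harmony": "back"},
             "right": {"cat": "done"}}],
    "esz": [{"left": {"cat": "verb", "harmony": "front"},
             "right": {"cat": "done"}}],
}


def unify(a, b):
    """Merge two flat feature sets; return None if a shared feature clashes."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def analyze(word, state=None):
    """Return every segmentation of word whose adjacent morphs unify."""
    if state is None:
        state = {"cat": "start"}  # features presented by the word boundary
    if not word:
        return [[]]
    results = []
    for cut in range(1, len(word) + 1):
        morph = word[:cut]
        for entry in LEXICON.get(morph, []):
            # The single rule: neighbouring feature sets must be unifiable.
            if unify(state, entry["left"]) is None:
                continue
            for rest in analyze(word[cut:], entry["right"]):
                results.append([morph] + rest)
    return results


print(analyze("mondsz"))
print(analyze("mondasz"))
print(analyze("mondesz"))
```

Under these assumptions, "mondsz" and "mondasz" each receive one segmentation, while a form whose suffix clashes with the stem's features, such as the illustrative "mondesz", receives none. All language-specific knowledge sits in the lexicon entries; the checking code itself never changes.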
This rule is very simple and clear: it is based on surface-only analyses and uses no transformations; all the complexity of the system is hidden in the graphs describing morpho-graphemic behavior.</Paragraph> <Paragraph position="7"> Humor is rigorously tested on &quot;real&quot; end-users. Root dictionaries of the above-mentioned languages contain 25.000-100.000 entries. The Hungarian version (90.000 stems) has been tested in everyday work since 1991, both by researchers of the Institute of Linguistics of the Hungarian Academy of Sciences (Prószéky and Tihanyi 1992) and by users of word-processors and DTP systems (Humor-based proofing tools have been licensed by Microsoft, Lotus and other software developers).</Paragraph> </Section> <Section position="5" start_page="7270" end_page="7270" type="metho"> <SectionTitle> MORPHOLOGICAL PROCESSES SUPPORTED BY HUMOR </SectionTitle> <Paragraph position="0"> The morphological analyzer is the kernel module of the system: almost all of the applications derived from Humor are based on it. Humor has a guessing strategy that is based on orthographic, morphophonological, morphological and lexical properties of the words. It operates after the analysis module and is mostly used in the spelling checkers based on Humor and in the above-mentioned 19th century corpus application.</Paragraph> <Paragraph position="1"> Synthesis is based on analysis, that is, all the possible morphemic combinations built by the core synthesis module are filtered by the analyzer.</Paragraph> <Paragraph position="2"> mondsz, mondasz, %laoIld c~ s z (4) Filtering:</Paragraph> <Paragraph position="4"> For internal use we have developed a defaulting subsystem that is able to propose the most likely inflectional paradigm(s) for a base word. There are only a few morphologically open word classes in the languages we have studied. Paradigms that are difficult to classify are generally closed; no new words of the language follow their morpho-graphemic patterns. 
The behavior of existing, productive paradigms is rather easy to describe algorithmically. The coding subsystem of Slocum (1988) is represented by the so-called paradigm matrix of Humor systems. It is defined for every possible allomorph: it gives information about the potential behavior of the stem allomorph before morphologically relevant affix families.</Paragraph> </Section> <Section position="6" start_page="7270" end_page="7272" type="metho"> <SectionTitle> COMPARISON WITH OTHER METHODS </SectionTitle> <Paragraph position="0"> There are only a few general, reversible morphological systems that can be used for more than a single language. Besides the well-known two-level morphology (Koskenniemi 1983) and its modifications (Karttunen 1985, 1993) we mention the Nabu system (Slocum 1988). Morphological description systems without large implementations (like the paradigmatic morphology of Calder (1989), or the Paradigm Description Language of Anick and Artemieff (1992)) are not listed here, because their importance is mainly theoretical (at least, for the time being). Two-level morphology is a reversible, orthography-based system that has several advantages from a linguist's point of view.</Paragraph> <Paragraph position="1"> Namely, the morpho-phonemic/graphemic rules can be formalized in a general and very elegant way. It also has computational advantages, but the lexicons must contain entries with diacritics and other sophistications in order to produce the needed surface forms. Non-linguist users need an easy-to-extend dictionary into which words can be inserted (almost) automatically. The lexical basis of Humor contains surface characters only, and no transformations are applied.</Paragraph> <Paragraph position="2"> The compile time of a large Humor dictionary (of 90.000 entries) is 1-2 minutes on an average PC, which is another advantage (at least for the linguist) compared with the compilers of two-level systems. 
The result of the compilation is a compressed structure that can be used by any application derived from Humor. The compression ratio is less than 20%. The size of the dictionary does not influence the speed of the run-time system, because a special paging algorithm of our own is used.</Paragraph> </Section> <Section position="7" start_page="7272" end_page="7272" type="metho"> <SectionTitle> HUMOR-BASED IMPLEMENTATIONS </SectionTitle> <Paragraph position="0"> Humor systems have been implemented (at various depths) for English, German, French, Italian, Latin, Ancient Greek, Polish and Turkish, and Humor is fully implemented for Hungarian. The whole software package is written in standard C using C++-like objects. It runs on any platform for which a C compiler is available. The Hungarian morphological analyzer, which is the largest and most precise implementation, needs 900 Kbytes of disk space and around 100 Kbytes of core memory. The stem dictionary contains more than 90.000 stems, which cover all (approx. 70.000) lexemes of the Concise Explanatory Dictionary of the Hungarian Language. Suffix dictionaries contain all the inflectional suffixes and the productive derivational morphemes of present-day Hungarian. With the help of these dictionaries Humor is able to analyze and/or generate around 2.000.000.000 well-formed Hungarian word-forms.</Paragraph> <Paragraph position="1"> Its speed is between 50 and 100 words/s on an average 40 MHz 386 machine. The whole system can be tuned according to the speed requirements: the needed RAM size can be between 50 and 900 Kbytes.</Paragraph> <Paragraph position="2"> There are several Humor subsystems with simplified output: lemmatizers, hyphenators, spelling checkers and correctors. They (called HelyesLem, Helyesel and Helyes-e?, respectively) have been built into the Hungarian versions of several word-processing and full-text retrieval systems (Word, Excel, AmiPro, WordPerfect, Topic, etc.). 
Besides the above well-known applications there are two new tools based on the same strategy: the inflectional thesaurus called Helyette (Prószéky and Tihanyi 1992) and the series of intelligent bilingual dictionaries called MoBiDic. Both are dictionaries with morphological knowledge: Helyette is monolingual, while MoBiDic, as its name suggests, is bilingual. Having analyzed the input word, both systems look up the stem that was found in the main dictionary. The inflectional thesaurus stores the information encoded in the analyzed affixes and adds it to the synonym chosen by the user. The synthesis module of Humor then starts to work and provides the user with the appropriate inflected form of the word in question. This procedure is of great importance for highly inflectional languages.</Paragraph> <Paragraph position="3"> The synonym system of Helyette contains 40.000 headwords. The first version of the inflectional thesaurus Helyette needs 1.6 Mbytes of disk space and runs under MS-Windows. The size of the MoBiDic dictionary packages varies depending on the applied terminological collection. For example, the Hungarian-English Business Dictionary (Example 4) needs 1.8 Mbytes of space. Besides the above-mentioned products, a Hungarian grammar checker (called Helyesebb) and other syntax-based (and higher-level) mono- and multilingual applications, also derived from the basic Humor algorithm, are under development.</Paragraph> </Section> </Paper>