File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/91/e91-1053_abstr.xml
Size: 12,803 bytes
Last Modified: 2025-10-06 13:47:10
<?xml version="1.0" standalone="yes"?> <Paper uid="E91-1053"> <Title>institute for Knowledge Based Systems</Title> <Section position="2" start_page="0" end_page="0" type="abstr"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The desire to construct robust and portable natural language systems has led to research on how a core vocabulary for such systems can be defined. Stalistical methods and semantic criteria for doing this arc discussed and compared. Currcnlly it docs not seem possible to precisely define the notion of core vocabulary, but it is argued that workable criteria can nevertheless be \['o1.1110. l:inally it is emplmsized that the implementation of a core vt~cabulary must be seen as a long-range research prt~gram rather than as a short-term goal.</Paragraph> <Paragraph position="1"> Motiva!ion Rcasearch on natural language processing systems today strives for the construction of robust and portable systems) A system is robust, if it can handle a large variety of user inputs without giving up or producing unexpected results. A system is portable in the sense intended here, if it is not geared to a single subject domain, but can be ported with a reasonable effort to a vailely of subjccl domains. It is common understanding that there cxisls a t:t:llhal \[i'aglnt'.tlt of a language which I. is required for dealing with virtually any subject d(~main, anti 2. is invariant with respect to meaning and use accross subject donmins, it is of course a non-trivial empirical question whether such a cen{ral fragment really exists, and if so, to say what it is, but a number of researchers scenl tO share the asstllnption that it does (ef. e.g. Alshawi ct al. (19R8)). Any robust and portablc system would then have to handle this core fi'agmcnl.</Paragraph> <Paragraph position="2"> in this paper I am concerned with a second - related - assumption, namely that there exists a core vocabulary which is needed for handling any subject domain. 'llfis assumpti(m is also shared by many researchers, and it tmdcrlies the production of basic vocabularies for language learning such as Ochlcr (1980). l.Jsually the attthors claim that their word lists are I)ased on statistical investigations, but they also emphasize that. they did not slavishly stick to the statistics but used additional crilcria such as &quot;usage value&quot;, &quot;availability&quot;, &quot;familiarity&quot;, or &quot;lcamability&quot; without ever saying how these are I will address the following questions: I. llow can the intuilivc notion of core vocabulary be properly dciincd? lhose words which suffice to dcfinc all of the remaining vocabuh~ry of a language.</Paragraph> <Paragraph position="3"> The first two definitions call Re&quot; statistical methods which shall be discussed in the next gcction, and the third onc obviously requires a .~cmanlic approach which shall bc discusscd in ~cclion &quot;Soma,tic crilcria&quot;.</Paragraph> <Paragraph position="4"> Statistical methods l:rcqucncy counts have well established the basic propcrtics of the frequency distribution for text corpora. Thus in Kuccra and l;rancis (1967) we get coverage ligures like lifts lbr their complete corl~US of about I million tokens: The research deso'ibed here has been comlucted i. the eonh;xt of the I .I I,O(; project (I lerzog et al., 1986). II has profiled fi'om inlensive discussio.s with R. Maye,'. Mnch of the underlying statistical work on text corpora is dec Io U, Bandara and (L Walch from Ihe speech recognition project SPRIN(\] (Wothke et at., 1989). Our investigations ;ire based (.I (~erman, but for ease of refere.ce also some l!nglish examples are given. - 303 These figt,rcs vary only slightly with corlms size, and also for German similar values are observed.</Paragraph> <Paragraph position="5"> ! lowevcr, while coverage figures are rather stable with respect to the n most frequent words of a corpus, what are tile n most frequent words may vary widely with corpora or subcorpora. Two parameters rcsponsible for this variation are obvious: null I. Subject matter and 2. ' Communicative function.</Paragraph> <Paragraph position="6"> Thus in the &quot;Kultur&quot; section of a newspaper which we have analyzed we see that words like Musik, Theater, Regisseur, etc. occur with a drastically higher freqt, ency than in the other sections, which of course can be attributed to subject matter. 13ul personal pronouns, in particular 1st and 2nd person pronouns, also show a much higher frequency, and this can hardly be attrihutcd to subject matter, rather to different communicative functi(ms of feuillet(mistie writing and say economic news.</Paragraph> <Paragraph position="7"> All of tiffs relates of course to tile much discussed issue of what c,mstitutes a representatitve corpus for statistical linguistic analysis. Since specific subject matters and communicative functions vary in importance for different speakers of a language, it will be difficult if not inapossible to eliminate arbitrariness. Rather, a definition of representative corpus must take into account tile research goals pursued.</Paragraph> <Paragraph position="8"> For a natural language system which is supposed to analyze and generate texts, to engage in dialogues with users, and which is to acquire knowledge fi'om the analysis of definiti(ms and rules formulaled in natural language, one needs a corpus of texts where all these aspects are sufficicntly rcprcscntcd. We were able to draw upon a wtriety of corpora none of w\[fich would sh()w all the featt, res rcquircd, but the combination of them seems to be quite reasonable.</Paragraph> <Paragraph position="9"> We conlpared Ihe fi)llowing five word lists: I. Oehlcr (1980): (;rundwortschatz consisting of 2247 words, 2. Erk (1972): scientific texts from 34 disciplines, 1283 words with fl'cquency > 20, 3. Pregel/Rickhcit (1987): texts by primary school children, 593 words with frequency > 20, 4. SPRING-corpus of newspaper texts, 2733 most frequent words, 5. I)UI)EN (1989): definitions for words be- null ginning with a, 2693 words with frequency >4.</Paragraph> <Paragraph position="10"> Frona these, word lists IL were formed consisting of those words occurring in at least n of the original word lists (I < n < 5). The lengths of these lisls are B~: 5409, !12 : 2248, l.h: 1215, B4: 565, and B5 : 116.</Paragraph> <Paragraph position="11"> The size of /Is shows thai a really common core of a varicty of texts may be extremely small, the successive losening of rcstrictimls used here allows for a balanced extension of this very smaU core. The list//3 was chosen as the statistical core vocabulary serving as a base for applying semantic criteria, becat, se the overall core vocabulary was envisaged to have a size of approx. 1500 words. Inspection shows that many intuitively basic words and very few idiosyncratic words are contained due to the method of intersecting the word lists, l lence, !1~ seems quite reasonable. Semanlic criteria If one takes tim n most frequent words of any frequency count one will no doubt discover that these words will not exhibit a linguistic closure in the sense that natural scntcnces can be formed with all and only the words in the set. l:urther one will see that semantic relations will be incomplete. Thus one tinds in Oehler (1980) which is based on the old Kacding count that weiblich (female) occurs but not its antonym re&milch (male). For a core vocabulary to bc set up for a natural language system, 1 think, tmc must strive for lingt, istic closure, since otherwise, one cnds up with words one cannot use. This means that you cannot base the core w~cabulary on frequency counts alone.</Paragraph> <Paragraph position="12"> l~urthermore, one cannot expect that one will imve just the vocabulary needed to formulate delinitions for the words in the list chosen. To avoid circularity, one will have to accept that certai,i words cannot bc defined wilhin the vocabulary, but one will also have to accept that for some words less than complete definitions can be given. Because of this lack of delinability, a sere'retie core wwal,llary can only be understood as an approximativc notion geared towards &quot;the best cmc can do&quot;. What one can hope to do, is to define 1. taxonomic rdations, 2. &quot;selcctional restrictions&quot; or constraints on seunauflic compatibility, and 3. meaning rules of arbitrary complexity (including classical definitions).</Paragraph> <Paragraph position="13"> 1 propose to formulate all of these typcs of rules in natural language for B3 trying to stay within at least tile vocabulary of B, , to add lhe words used in the fommlations to the original set, and continue until one cannot think of further rules. I claim that one can achieve a fixed point from where on no new words are added to tim set, and that at this moment one has reached a rather good approximation to a semantic core vocabulary. null There is undoubtedly a relationship between frequency and semantic relevance: since taxonomic relations are often exemplified by anaphoric references, since semantic compatibility constraints lead to tile co-occurrence of ap- 304 propriate words, and since other more complex semantic relationslfips arc bound to be exhibited in the various threads of discourse, one has all reason to expect a certain amount of congruence between frequency counts and the semantic core vocabulary as defined above.</Paragraph> <Paragraph position="14"> The work on fimnulating taxonomic relati(,ns, semantic constraints and other meaning rules is underway, and since it will inw,lve all of the w)cabulary, linguistic closure will be achieved at the same time.</Paragraph> <Paragraph position="15"> As an example, take a taxonomic rule for Arm which is in Bs Jeder(B3) Arm ist Tei1(B4) eines KiSrpers(B3) (Every arm is part of a body.) The word Kreperteil (body part) is only available in Bj and was therefore not used, or instead o1' 7&quot;eil one could also have used Glied(B3, member), bu! then the rule would not have covered arms of machines or rivers. This highlights a big problem in the natural language formt, lation of meaning rules: how is ambiguity dealt with? Space does not permit a full discussion here, therefore suffice it to say that it is one of our research goals to formulate meaning rules which specify criteria for disambiguation. Linguistic description The preceding discussion has concentrated on how to establish a core vocabulary. Now a few brief remarks shall follow on how the words of the core vocabulary can bc linguistically described. null 'l'he morphology of I:mguagcs such as (\]crman is well understood and has been coded for an extendcd vocabulary in Ihc lexical data-base of the IJ:,X project (llamett ct al., 1986). This database also conlains dctailcd syntactic inforuaation, in pa~l.icular on government patterus. null It is the description tat' lhe semanlie (and pragmatic) properties of many words one wouhl obviously wanl h) include in a core vocabulary lhat will confront us with huge unsoivcd theoretical problems. Be it modal verbs or propositional attitudes, sentence adverbs or &quot;abstract&quot; nouns of various kinds, hwestigations on some individual words havc generated heaps of literature, for others it seems that people have not even dared to look at thcm. l)oes this make the enterprise of implementing a core w)cabulary a futile one? I think not. I think the implementation of a core w)cabulary should be seen as a long-range research goal for both computational and theoretical linguistics, and filrthermore that natural language systems provide a good environmcnt for doing experiments in semantics, because they encourage an integrated treatment of linguistic phenomena.</Paragraph> <Paragraph position="16"> Conclusions ()ur research on establishing a core vocabulary tor German in the framework of the I,I1,OG project Ires revealed that currently no absolute definition can be given, but ways have been shown how to arrive at a working dclinition with respect to the objectives of natural language systems. It has been shown that both statistical mclhods and semanlic criteria can, and I think, have to contribute to the establishment of a core vocabulary.</Paragraph> <Paragraph position="17"> The linguistic description and thus the implcmentation of a core vocabulary depends heavily on progress in theoretical linguistics, in particular in semantics and pragmatics, but 1 want to stress that h)cussing on a core w~cabulary is a fruitful way to direct linguistic research, which can be supported by the need for integraled treatments in natural language systems.</Paragraph> </Section> class="xml-element"></Paper>