File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/84/p84-1009_metho.xml
Size: 14,291 bytes
Last Modified: 2025-10-06 14:11:37
<?xml version="1.0" standalone="yes"?> <Paper uid="P84-1009"> <Title>APPLICATIONS OF A LEXICOGRAPHICAL DATA BASE FOR GERMAN</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> APPLICATIONS OF A LEXICOGRAPHICAL DATA BASE FOR GERMAN Wolfgang Teubert </SectionTitle> <Paragraph position="0"> Institut für deutsche Sprache Friedrich-Karl-Str. 12</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 6800 Mannheim 1, West Germany ABSTRACT </SectionTitle> <Paragraph position="0"> The Institut für deutsche Sprache has recently begun setting up a LExicographical DAta Base for German (LEDA). This data base is designed to improve efficiency in the collection, analysis, ordering and description of language material by facilitating access to textual samples within corpora and to word articles within machine readable dictionaries, and by providing a frame to store results of lexicographical research for further processing. LEDA thus consists of three components, the Text Bank, the Dictionary Bank and the Result Bank, and serves as a tool to support monolingual German dictionary projects at the Institute and elsewhere.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> I INTRODUCTORY REMARKS </SectionTitle> <Paragraph position="0"> Since the foundation of the Institut für deutsche Sprache in 1964, its research has been based on empirical findings; samples of language produced in spoken or written form were the main basis. To handle large quantities of text efficiently, it was necessary to use a computer, to assemble machine readable corpora and to develop programs for corpus analysis. 
An outline of the computational activities of the Institute is given in LDV-Info (1981 ff); the basic corpora are described in Teubert (1982).</Paragraph> <Paragraph position="1"> The present mainframe computer, which was installed in January 1983, is a Siemens 7.536 with a core storage of 2 megabytes, a number of tape and disc decks and, at the moment, 15 visual display units for interactive use.</Paragraph> <Paragraph position="2"> Whereas in former years most jobs were carried out in batch, the terminals now make it possible for the linguist to work interactively with the computer. It was therefore a logical step to devise the Lexicographical Data Base for German (LEDA) as a tool for the compilation of new dictionaries. Interactive use demands a different concept of programming, in which the lexicographer himself can choose from the menu of alternatives offered by the system and fix his own search parameters. Work on the Lexicographical Data Base was begun in 1981; a first version incorporating all three components is planned to be ready for use in 1986.</Paragraph> <Paragraph position="3"> What is the goal of LEDA? In any lexicographical project, once the concept for the new dictionary has been established, there are three major tasks where the computer can be employed: (i) For each lemma, textual samples have to be determined in the corpus which is the linguistic base of the dictionary.</Paragraph> <Paragraph position="4"> The text corpus and the programs to be applied to it will form one component of LEDA, namely the Text Bank.</Paragraph> <Paragraph position="5"> (ii) For each lemma, the lexicographer will want to compare corpus samples with the respective word articles of existing relevant dictionaries. For easy access, these dictionaries should be transformed into a machine readable corpus of integrated word articles. The word corpus and the associated retrieval programs will form the second component, i.e. 
the Dictionary Bank.</Paragraph> <Paragraph position="6"> (iii) Once the formal structure of the word articles in the new dictionary has been established, description of the lemmata within the framework of this structure can begin. A data base system will provide this frame so that homogeneous and interrelated descriptions can be carried out by each member of the dictionary team at all stages of the compilation. This component of LEDA we call the Result Bank.</Paragraph> </Section> <Section position="4" start_page="0" end_page="35" type="metho"> <SectionTitle> II TEXT BANK </SectionTitle> <Paragraph position="0"> Each dictionary project should make use of a text corpus assembled to the specific requirements of the particular lexicographical goal. Self-evident as this claim seems to be, most German monolingual dictionaries on the market have nonetheless been compiled without any corpus; this is apparently even the case for the new six-volume BROCKHAUS-WAHRIG, as has been pointed out by Wiegand/Kucera (1981 and 1982). For a general dictionary of contemporary German containing about 200 000 lemmata, the Homburger Thesen (1978) asked for a corpus of not less than 50 million words (tokens).</Paragraph> <Paragraph position="1"> To be used in the text bank, corpora will have to conform to the special codification or pre-editing requirements demanded by the interactive query system. At present, a number of machine readable corpora in unified codification are available at the Institute, including the Mannheim corpora of contemporary written language, the Freiburg corpus of spoken language and the East/West German newspaper corpus, totalling altogether about 7 million running words of text.</Paragraph> <Paragraph position="2"> Further corpora have been taken over from other research institutions, publishing houses and other sources. 
These texts had been coded in all kinds of different conventions, and programs had to (and still have to) be developed to transform them according to the Mannheim coding rules. Other texts to be included in the corpus of the text bank will be recorded by OCR, via terminal or by use of an optical scanner, if they are not available on machine readable data carriers. By the end of 1985 texts of a total length of 20 million words will be available from which any dictionary project can make its own selection.</Paragraph> <Paragraph position="3"> A special query system called REFER has been developed and is still being improved. For a detailed description of it, see Brückner (1982) and (1984). The purpose of this system is to ensure quick access to the data of the text bank, thus enabling the lexicographer to use the corpus interactively via the terminal.</Paragraph> <Paragraph position="4"> Unlike other query programs, REFER does not search for a word form (or a combination of graphemes) in the corpus itself, but in registers containing all the word forms.</Paragraph> <Paragraph position="5"> One register is arranged in the usual alphabetical way, the other is organized in reverse or a tergo to allow a search for suffixes or the terminal elements of compounds. All word forms in the registers are connected with the references to their actual occurrence in the corpus, which are then looked up directly. With REFER, it normally takes no more than three to five seconds for the search procedure to be completed, and all occurrences of the word form within an arbitrarily chosen context can be viewed on the screen. 
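The register lookup described above can be illustrated with a small Python sketch. It is not the REFER implementation, whose internals are documented elsewhere; the function names build_registers, with_prefix and forms_ending_with, the in-memory data structures and the sample sentence are all assumptions introduced here for illustration. The sketch keeps one alphabetical register and one a tergo register of reversed word forms, so that a suffix search becomes a prefix search on the reversed register.

```python
from bisect import bisect_left
from collections import defaultdict


def build_registers(tokens):
    """Map each word form to its token positions and build both registers."""
    postings = defaultdict(list)
    for position, form in enumerate(tokens):
        postings[form].append(position)
    forward = sorted(postings)                          # alphabetical register
    reverse = sorted(form[::-1] for form in postings)   # a tergo register
    return postings, forward, reverse


def with_prefix(register, prefix):
    """Return all entries of a sorted register that start with prefix."""
    matches = []
    # Entries sharing a prefix are contiguous in a sorted list,
    # starting at the insertion point of the prefix itself.
    for entry in register[bisect_left(register, prefix):]:
        if not entry.startswith(prefix):
            break
        matches.append(entry)
    return matches


def forms_ending_with(reverse, suffix):
    """Suffix search, done as a prefix search on the reversed register."""
    return sorted(entry[::-1] for entry in with_prefix(reverse, suffix[::-1]))


# Hypothetical sample corpus (lower-cased running text).
tokens = "die haustür und die autotür neben der tür".split()
postings, forward, reverse = build_registers(tokens)

print(with_prefix(forward, "d"))          # ['der', 'die']
print(forms_ending_with(reverse, "tür"))  # ['autotür', 'haustür', 'tür']
print(postings["die"])                    # [0, 3]
```

In this sketch the search cost depends on the number of distinct word forms in the registers rather than on the length of the running text, since the corpus itself is only touched through the stored references.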
Response behaviour does not depend on the size of the text bank.</Paragraph> <Paragraph position="6"> In addition, REFER features the following options: - The lexicographer can search for a word form, for word forms beginning or ending with a specified string of graphemes or for word forms containing a specified string of graphemes at any place.</Paragraph> <Paragraph position="7"> - The lexicographer can search for any combination of word forms and/or graphemic strings to occur within a single sentence of the corpus.</Paragraph> <Paragraph position="8"> - REFER is connected with a morphological generator supplying all inflected forms for the basic form, e.g. the infinitive (cf. fahren (inf.) --- fahre, fährst, fahrt, fährt, fuhr, fuhren, fuhrst, führe, f~, f-~st, gefahren). This will make it much easier for the lexicographer to state his query.</Paragraph> <Paragraph position="9"> - For all word forms, REFER will provide information on the relative and absolute frequency and the distribution over the texts of the corpus.</Paragraph> <Paragraph position="10"> - The lexicographer has a choice of options for the output. He can view the search item in the context of a full sentence, in the context of any number of sentences or in the form of a KWIC-Index, both on the screen and in print.</Paragraph> <Paragraph position="11"> - For each search procedure, the linguist can define his own subcorpus from the complete corpus.</Paragraph> <Paragraph position="12"> - Lemmatized registers are in preparation. They will be produced automatically using a complete dictionary of word forms with their morphological descriptions. These lemmatized registers not only reduce the search time, but also give the accurate frequency of a lemma, not just a word form, in the corpus.</Paragraph> <Paragraph position="13"> - Registers of word classes and morphological descriptions (e.g. 
listing references of all past participles) will be produced automatically by inverting the lemmatized registers. Thus the linguist can search for relevant grammatical constructions, like all verb complexes in the passive voice.</Paragraph> <Paragraph position="14"> - Another feature will permit searching for an element at a predetermined sentence position, like all finite verbs as the first words of a sentence or all nouns preceded by two adjectives.</Paragraph> <Paragraph position="15"> Thus the text bank is a tool for the lexicographer to gain information of the following kind: - Which word forms of a lemma are found in the corpus? Are there spelling or inflectional variations? - In which meanings and syntactical constructions is the lemma employed? - What collocations are there? What compounds is the lemma part of? - Is there evidence for idiomatic and phraseological usage? - What is the relative and absolute frequency of the lemma? Is there a characteristic distribution over different text types? - Which samples can best be used to demonstrate the meanings of the lemma? Preliminary versions of the text bank have been in use since 1982. Not only lexicographers but also grammarians employ this interactive system to obtain the textual samples they need. A steadily growing number of service requests, both from members of the Institute and from linguists at other institutions, are being fulfilled by the text bank.</Paragraph> </Section> <Section position="5" start_page="35" end_page="35" type="metho"> <SectionTitle> III DICTIONARY BANK </SectionTitle> <Paragraph position="0"> If access to the textual samples of a corpus is an indisputable prerequisite for successful dictionary compilation, consultation of other relevant dictionaries can facilitate the drawing up of lexical entries. 
It is virtually impossible to assemble a corpus so extensive and encompassing that it will suffice to describe the whole vocabulary of a language, even within the limits of the particular conception of any dictionary (unless it were a pure corpus dictionary). A dictionary of contemporary language should not let down its user if he is reading a text written in the early 19th century, even though such a text will contain words and meanings of words not found in a corpus of post-World War II texts. This holds even more for languages for special purposes; they cannot be described without recourse to technical dictionaries, collections of terminology and thesauri, because the more or less standardized meanings cannot be retrieved from their occurrences in texts.</Paragraph> <Paragraph position="1"> According to Nagao et al. (1982), &quot;dictionaries themselves are rich sources, as linguistic corpora. When dictionary data is stored in a data base system, the data can be examined by making cross references of various viewpoints. This leads to new discoveries of linguistic facts which are almost impossible to achieve in the conventional printed versions.&quot; A dictionary bank will therefore form one of the components of the Lexicographical Data Base.</Paragraph> <Paragraph position="2"> Since 1979 a team at the Bonn Institut für Kommunikationsforschung und Phonetik has been compiling a 'cumulative word data base for German', using 11 existing machine readable dictionaries of various kinds, including dictionaries assembled for Artificial Intelligence projects, machine translation systems and, for copyright reasons, only two general-purpose dictionaries. Programs have been developed to make up for the differences in the description of lemmata and to permit automatic cumulation. For further information regarding this project, see Hess/Brustkern/Lenders (1983) and Brustkern/Schulze (1983, 1983a). 
The cumulative word data base, which is due to be completed in 1984, will then be implemented in Mannheim and form the core of the dictionary bank of LEDA.</Paragraph> <Paragraph position="3"> In its final version, the dictionary bank will provide a fully integrated cumulation of the source dictionaries, down to the level of lexical entries, including statement of word class and morphosyntactical information. A complete integration within the microstructure of the lexical entry, however, seems neither possible nor even desirable. Automatic unification cannot be achieved on the level of semantic and pragmatic description. Here, the source for each information item has to be retrievable to assist the lexicographer in the evaluation. The dictionary bank will be a valuable tool not only for the lexicographer but also for the grammarian. Retrieval programs will make it possible to come up with a listing of all verbs with a dative and accusative complement, or of all nouns belonging to a particular inflectional class. Since the construction of the dictionary bank and the result bank will be related to each other, every time a new dictionary has been compiled in the result bank, it can be copied into the dictionary bank, making it a growing source of lexical knowledge. The dictionary bank can then be used as a master dictionary as defined by Wolfart (1979), from which derived printed versions for different purposes can be produced.</Paragraph> </Section> </Paper>