<?xml version="1.0" standalone="yes"?> <Paper uid="C92-4195"> <Title>BROAD COVERAGE AUTOMATIC MORPHOLOGICAL SEGMENTATION OF GERMAN WORDS</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 INTRODUCTION </SectionTitle> <Paragraph position="0"> IBM Scientific Center Heidelberg is developing a large vocabulary speech recognition system for German (Wothke et al. 1989). For each word of its reference vocabulary, the system needs two types of reference patterns: * prototypal acoustic reference patterns.</Paragraph> <Paragraph position="1"> * phonetic transcriptions of the main pronunciation variants of the word.</Paragraph> <Paragraph position="2"> Up to now, the transcriptions were generated for each orthographic word of the reference vocabulary by an automatic procedure with two drawbacks, which caused a large amount of manual revision of the generated transcriptions: * For each word only one transcription was generated. Our speech recognition system, however, needs at least the most significant pronunciation variants of each word.</Paragraph> <Paragraph position="3"> * The automatic procedure took into account only the letter context of each letter to determine its transcription. In German, however, the transcription of a letter very often also depends on its morphological context. Most of the transcription errors of the former system were a consequence of the fact that the system did not have any information about the morph structure of the words.</Paragraph> <Paragraph position="4"> To reduce the manual work necessary to revise the transcriptions, we are currently developing a system with the following new features: 1. An orthographic word is first segmented into its morphs.</Paragraph> <Paragraph position="5"> 2. 
In a second step, one or more phonetic transcriptions are produced for each segmentation of the word using letter-to-phone rules which can refer to the morph structure detected in the first step.</Paragraph> <Paragraph position="6"> The following paragraphs will deal with the first step. We will mainly restrict ourselves to the linguistic knowledge incorporated in our current morph segmentation system. The overall architecture of the segmentation system and details of the segmentation algorithm are described in Wothke/Schmidt (1991).</Paragraph> <Paragraph position="7"> A morphological segmentation procedure for German has to deal with the following basic features of German morphology: * Composition.</Paragraph> <Paragraph position="8"> * Derivation.</Paragraph> <Paragraph position="9"> * Inflexion.</Paragraph> <Paragraph position="10"> * Ambiguous morph structure: Some words can be segmented in several ways.</Paragraph> <Paragraph position="11"> * Reduction of consonant triples: If two lexical morphs are concatenated, where the first morph ends in a vocalic letter and two identical consonantal letters and the second morph starts with the same consonantal letter and a vocalic letter, then the result of the concatenation does not contain the consonantal letter three times but only twice. The inverse process, i.e. trebling of double consonants, has to be carried out when segmenting such words. Figure 1 shows the architecture of the morphological segmentation system. The interpreter for the segmentation has 5 main input files:</Paragraph> <Paragraph position="13"> * A morph dictionary containing information about the morph class(es) each morph belongs to.</Paragraph> <Paragraph position="14"> * A word syntax represented in the formalism of right linear regular grammars. 
It has to describe the set of those sequences of morph classes which underlie words.</Paragraph> <Paragraph position="15"> * A morph boundary table, where the user can specify the symbols used by the interpreter to mark the different kinds of morph boundaries. We specified that + is inserted before a prefix, = is inserted before a lexical morph, % is inserted before an infix, a derivational, or an inflexional suffix, is inserted before a Latin or Greek derivational suffix, ~ is inserted before a French or English derivational suffix.</Paragraph> <Paragraph position="16"> * A table of forbidden classes, where the user can enter the names of those morph classes which may not attract any of the three identical consonantal letters arising from consonant trebling (i.e. infix classes, suffix classes, and prefix classes).</Paragraph> <Paragraph position="17"> * A file containing the orthographic words to be segmented into morphs.</Paragraph> <Paragraph position="18"> The linguistic knowledge in the first 4 files exists in 2 representations: * An external representation which is created by the user of the system and which is human readable.</Paragraph> <Paragraph position="19"> * An internal representation which is automatically generated by a preprocessor from the external representation and which is more suitable for processing by the interpreter.</Paragraph> <Paragraph position="20"> The interpreter loads the internal representations of the 4 files and segments orthographic words according to the knowledge in the files. If a word is morphologically ambiguous, several segmentations are generated.</Paragraph> </Section> </Paper>