<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1023"> <Title>A Probabilistic Context-free Grammar for Disambiguation in Morphological Parsing</Title> <Section position="2" start_page="0" end_page="183" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> MORPA is a MORphological PArser developed for use in the text-to-speech conversion system for Dutch, SPRAAKMAKER [van Leeuwen and te Lindert, 1993]. An important step in text-to-speech conversion is the generation of the correct phonemic representation on the basis of the input text. As is well known, phonemic transcriptions cannot be derived directly from orthographic input in Dutch, as there is no one-to-one correspondence between graphemes and phonemes. *This work was carried out at the Phonetics Laboratory at Leiden University and supported by the Speech Technology Foundation, which is funded by the Netherlands Stimulation Project for Information Sciences, SPIN.</Paragraph> <Paragraph position="1"> Also, stress and the effects of most phonological rules are not reflected in orthography.</Paragraph> <Paragraph position="2"> A text-to-speech system therefore requires an intelligent method to convert the spelled words of the input sentence into a phonemic representation.</Paragraph> <Paragraph position="3"> As far as the pronunciation of words is concerned, it is impossible to list the entire vocabulary of the language, because language users have the ability to create new words and the vocabulary, as such, is indefinitely large. Daily newspapers, for instance, contain a large number of newly formed words every day. Not all of these survive in the long run, but some of them do. Consider the examples in (1): (1) golfoorlog 'gulf war'; drugsbaron 'drugs baron'; vredesmacht 'peacekeeping force'. Because it is infeasible to update the lexicon daily, this approach is not appropriate if the text-to-speech system is to convert unrestricted text. 
Assuming that newly created words will typically consist of already existing morphemes, and that new morphemes are added to the language only rarely, we can, however, use a lexicon in which all Dutch morphemes and their pronunciations are listed. Complex words, such as the ones in (1), then have to be decomposed into their constituent parts before their pronunciation can be looked up.</Paragraph> <Paragraph position="4"> Because the pronunciation of morphemes can be modified in certain contexts, the pronunciation of a word is not always the concatenation of the pronunciations of its morphemes. The text-to-speech system therefore also has to be provided with phonological rules which adjust the pronunciation of morphemes according to their context [Allen et al., 1987; Nunn and van Heuven, 1993].</Paragraph> <Paragraph position="5"> Dutch phonological rules are in several ways dependent on morphemic segmentation and word class assignment. As is shown in (2a), for example, the grapheme d is pronounced voiceless when it occurs stem-finally, but voiced when it occurs stem-initially. Final devoicing, the phonological rule which affects the pronunciation of the d, depends on syllable structure, and syllabification is sensitive to the morphological structure of a word: compound boundaries are also syllable boundaries. This has serious consequences in Dutch, as Dutch compounds are usually written as one word, i.e. without spaces or hyphens in between the parts. Example (2b) shows that the stress in compounds differs from the stress in monomorphemic words. In (2c) it is shown that the stress in (predicatively used) adjectival compounds differs from the stress in nominal compounds. (2) [examples garbled in extraction; surviving fragment: %n + recht, N] So, to be able to produce high-quality speech on unrestricted text, the text-to-speech system SPRAAKMAKER contains the morpheme lexicon-based morphological parser MORPA to recover the morphemic segmentation and word class of the input word. 
The module MORPHON [Nunn and van Heuven, 1993] applies phonological rules which derive the pronunciation of the word by making use of the morphological information. Also, the word class provided by MORPA feeds the module for sentence analysis which serves sentence prosody [Dirksen and Quené, 1993].</Paragraph> <Paragraph position="6"> Our method of morphological analysis relies on a morpheme lexicon. Assuming that Dutch word formation is concatenative, words or word parts are recognized by dividing the word into substrings that correspond to entries in the lexicon. The major problem this method poses is ambiguity, i.e. the generation of alternative segmentations and word class assignments for one input word, many of which are implausible. In a text-to-speech system, an incorrect analysis is unacceptable, because it may lead to a wrong pronunciation [Nunn and van Heuven, 1993]. In order to deal with ambiguity, MORPA has been provided with a probabilistic context-free grammar (PCFG), i.e. it combines a &quot;conventional&quot; context-free morphological grammar to filter out ungrammatical segmentations with a probability-based scoring function which determines the likelihood of each successful parse. [Glosses displaced from example (2): 'police sergeant', 'roof of foliage', 'evening hour']</Paragraph> <Paragraph position="8"> Then, so that the system generates the &quot;best&quot; analysis first, the remaining analyses are ordered along a scale of plausibility. In this paper, I will separately describe the rule-based disambiguation techniques and the probability-based scoring function. Illustrative performance data obtained from an evaluation will show that a probabilistic context-free grammar yields good results in morphological parsing.</Paragraph> </Section> </Paper>