<?xml version="1.0" standalone="yes"?> <Paper uid="P86-1022"> <Title>THE CONTRIBUTION OF PARSING TO PROSODIC PHRASING IN AN EXPERIMENTAL TEXT-TO-SPEECH SYSTEM</Title> <Section position="2" start_page="0" end_page="145" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> We describe an experimental text-to-speech system that uses a deterministic parser and prosody rules to generate phrase-level pitch and duration information for English input. This information is used to annotate the input sentence, which is then processed by the text-to-speech programs currently under development at Bell Labs. In constructing the system, our goal has been to test the hypotheses (i) that information available in the syntax tree, in particular grammatical functions such as subject-predicate and head-complement, is by itself useful in determining prosodic phrasing for synthetic speech, and (ii) that it is possible to use a syntactic parser that specifies grammatical functions to determine prosodic phrasing for synthetic speech.</Paragraph> <Paragraph position="1"> Although certain connections between syntax and prosody are well known (e.g. the influence of part of speech on stress in words like progress, or the setting off of parenthetical expressions), very little practical knowledge is available on which aspects of syntax might be connected to prosodic phrasing. In many studies, investigators have sought connections between constituent structure and prosody (e.g. Cooper and Paccia-Cooper 1980; Umeda 1982; Gee and Grosjean 1983) but, with the exception of Selkirk (1984), they tend to neglect the representation of grammatical functions in the syntax tree. Moreover, previous work has not been specific enough to provide the basis for a full system implementation. 
Based on our study of prosodic phrasing in recorded human speech, we decided to emphasize three aspects of structure that relate to phrasing: syntactic constituency, grammatical function, and constituent length. These findings,</Paragraph> <Paragraph position="2"> which we will discuss in detail, have been implemented as a collection of prosody rules in an experimental text-to-speech system.</Paragraph> <Paragraph position="3"> Two important features characterize our system.</Paragraph> <Paragraph position="4"> First, the input to our prosody system is a parse tree generated by a version of the deterministic parser Fidditch (Hindle 1983). The left-corner search strategy of this parser and, in particular, its determinism, give Fidditch the speed that makes online text-to-speech production feasible. 1 In building a parse tree, Fidditch identifies the core subject-verb-object relations but makes no attempt to represent adjunct or modifier relations. Thus relative clauses,</Paragraph> <Paragraph position="5"> adverbials, and other non-argument constituents have no specified position in the tree and no specified semantic role. Second, the rules in the prosody system build a prosody tree by referring both to the syntactic structure and to earlier stages of prosodic structure.</Paragraph> <Paragraph position="6"> The result is a hierarchical representation that supports the view, also proposed in Selkirk (1984),</Paragraph> <Paragraph position="7"> that grammatical function information is related to prosodic phrasing, but indirectly, through different levels of processing.</Paragraph> <Paragraph position="8"> Informal tests of the system show that it is capable of producing a significant improvement in the prosodic quality of the resulting synthesized speech. Our investigations of the system's problems, which we describe, have not revealed any serious counterexample to our basic approach. 
In many cases,</Paragraph> <Paragraph position="9"> it appears that problems with the current version can be resolved by taking our approach a step further, and including lexical information required by the parser as another factor in the determination of prosodic phrasing.</Paragraph> <Paragraph position="10"> TEXT-TO-SPEECH Most text-to-speech systems comprise two components: pronunciation rules and a speech synthesizer. Pronunciation rules convert the input text into a phonetic transcription; this information may also be supplemented by a dictionary that provides information about the part of speech, stress pattern,</Paragraph> <Paragraph position="11"> and phonetic makeup of particular words. The speech synthesizer then converts this phonetic transcription into a series of speech parameters which are subsequently processed to produce digitized speech. 1. With a grammar of about 600 rules and a lexicon of about 2400 words, Fidditch parses the 25 sample sentences of Robinson (1982), averaging 7 words per sentence and chosen for their structural diversity, at an average rate of .405 seconds per sentence on a Symbolics 3670. This rate is approximately proportional to the number of words in a sentence.</Paragraph> <Paragraph position="12"> While these systems tend to perform quite well on word pronunciation, they fall short when it comes to providing good prosody for complete sentences.</Paragraph> <Paragraph position="13"> Current text-to-speech systems have no access to the syntactic and semantic properties of a sentence that influence phrase-level prosody. Hence rules for sentence prosody, when they are provided at all, typically depend on superficial aspects of text (e.g.</Paragraph> <Paragraph position="14"> punctuation) and on heuristics that vary widely in sophistication. 
Although such techniques often add a more natural quality to the resulting synthetic speech, they can fail in important ways, for example, by ignoring the prosodic event between a lengthy subject and a predicate, so that there is no clear prosodic boundary between right and mark in "The characters on the right mark the salient features." 2 Several authors (e.g. Allen 1976; Elovitz et al.</Paragraph> <Paragraph position="15"> 1976; Luce et al. 1983) have suggested that prosodic differences between synthetic and natural speech are the primary, unaddressed factor leading to difficulties in the comprehension of fluent synthetic speech. The relation between phrase-level prosody and its sources, however, is so poorly understood that we have no good sense of the degree to which different levels of explanation--syntactic, semantic, or pragmatic--are applicable. We currently have reasonable tools for automatic syntactic analysis of a text, but there is nothing equivalently well-developed for semantic or pragmatic textual analysis. Thus an obvious goal is to explore the extent to which phrase-level prosody can be explained by the syntax tree and develop a detailed description of that relation. A further goal is to convert the resulting insights about this relation into a system that can work with a speech synthesizer. This allows us to test our description more adequately and perhaps also produce something that will further text-to-speech technology.</Paragraph> </Section> </Paper>