<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1203">
  <Title>A compact Representation of prosodically relevant Knowledge in a Speech Dialogue System</Title>
  <Section position="3" start_page="18" end_page="20" type="metho">
    <SectionTitle>
3 The Interface Protocol
</SectionTitle>
    <Paragraph position="0"> The goal of the interface protocol is to form a compact representation that contains all phonologically and prosodically relevant information which the generator can currently derive from its concept input or the syntactic structures generated from it. Both are relevant for phonological realisation, but they neither are nor directly contain the phonological knowledge itself, as the strategic generation, linguistic generation and final synthesis task are divided into different modules. The representation concerns categories that the prosodic construction makes use of rather than instructing it directly. 4 A phonologically oriented description suitable for generating proper sentence prosody differs in many aspects from the traditional syntactically oriented description normally produced by a sentence generator such as EFFENDI. The following section shows how the basic phonological specification can be derived from existing semantic and syntactic structures of an utterance in three main steps. For reasons of simplicity the treatment of incremental processing is postponed to the next section.</Paragraph>
    <Paragraph position="1"> 4In integrated systems, where conceptual construction, generation and synthesis have full mutual access to the relevant knowledge, there is no need for such an interface, and the linguistic grammar can directly incorporate the phonological features (cf. e.g. (Prevost and Steedman 1994)). However, apart for lack of flexiblility, integrated systems mostly must make use of the concept-to-speech synthesis ((Steedman 1996)), whereas the interface presented here can also be used with a text-to-speech synthesis.</Paragraph>
    <Section position="1" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
3.1 Phonological Categorization
</SectionTitle>
      <Paragraph position="0"> In classical grammars every word belongs to a category which describes how words of this category may be inflected and how they interact with other words in a sentence on both a syntactic and a semantic level. In formal computer grammars for parsers and generators, words are also assigned to categories. We will call these categories and all other phenomena connected with such grammars &amp;quot;syntactical.&amp;quot; The structures necessary to describe the prosodic behavior of a sentence or utterance may differ considerably from those necessary for classical grammars. In this paper we refer to all phenomena associated with prosodic or pronunciational behavior, as opposed to that described above, as being &amp;quot;phonological&amp;quot;. In this sense each word to be uttered has a phonological category associated with it. These categories tell the synthesizer something about the phonological function of each word in a sentence, in particular about the relative stress of the words to be uttered. These categories will often differ from the purely syntactic categories, which define the semantic and syntactic function of each word in a sentence. These categories will vary from language to language. In addition to the phonological category, one or more special attributes such as focus or emphasis (coming from the semantic generator input) may be optionally associated with each word.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.2 Phonological Segmentation
</SectionTitle>
      <Paragraph position="0"> In every language, spoken sentences are broken up into so-called &amp;quot;thought groups&amp;quot; or &amp;quot;breath groups&amp;quot; if they are more than a few words long. Also certain &amp;quot;atomic&amp;quot; groups such as &amp;quot;in the big room&amp;quot; are never broken up any further. The elements that constitute an atomic group are of course language dependent.</Paragraph>
      <Paragraph position="1"> These phonologically oriented atomic groups may or may not correspond to syntactic groups (i.e. subtrees) produced by the generator, but can be derived from the latter. Each atomic group also has a group category associated with it, which describes how the group interacts prosodically with others. Some of the group categories we initially propose for German are summarized in the following. Note that, e.g., &amp;quot;phonological&amp;quot; conjunctional phrases have no phrasal counterpart on the syntactic level:  words are contiguous or an isolated inflected verb in a subordinate clause</Paragraph>
    </Section>
    <Section position="3" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
3.3 Association of Atomic Groups to Each
Other
</SectionTitle>
      <Paragraph position="0"> Once the atomic groups have been determined, it is necessary to specify how these groups are logically connected to each other. In a phrase such as tfrom the manl\[in the room\] I wearing the coat\[ the second atomic group is logically connected to the first because &amp;quot;in the room&amp;quot; refers to man. Likewise the third group is also logically connected to the first group rather than its antecedent because &amp;quot;wearing the coat&amp;quot; also refers to man and not to room. This type of information can be derived from the original syntactic tree structure produced by the generator module. How such groups are connected to each other has a bearing on how the ultimate division into breath groups is determined by the synthesizer module.</Paragraph>
    </Section>
    <Section position="4" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
3.4 The Protocol
</SectionTitle>
      <Paragraph position="0"> This section describes the formal syntax of the interface protocol and illustrates it with an example.</Paragraph>
      <Paragraph position="1"> Each interface protocol describes a dialogue turn, which may consist of one or more sentences. In our example the turn consists of a single sentence. The protocol contains the following information: * the type of each sentence to be uttered, * a list of all words to be uttered along with their associated categories, special attributes if any, and the order in which they are to be uttered, * a specification of each atomic group along with its associated group category and * a description of all logical connections between atomic groups.</Paragraph>
      <Paragraph position="2"> The interface protocol for the sentence &amp;quot;Sie mbchten wissen, wann der Zug nach Ulm f'~ihrt&amp;quot; (literally: You would like to know, when the train to Ulm leaves.) looks like this:  Each sentence of an interface protocol consists of a specification of the sentence type, followed by a description of the atomic groups in the order they are to be uttered. The sentence-type descriptor is uniquely identified by the initial &amp;quot;$&amp;quot; and also serves to separate sentences from each other. Currently, the follwing types of sentences are distinguished:  Each atomic group is introduced by &amp;quot;**&amp;quot;, after which the individual words of the group along with their category and any optional attributes are listed. Word categories are enclosed in parentheses. Attributes, if present, are enclosed in square brackets, such as &amp;quot;\[focus\]&amp;quot; to indicate that the word in question forms the sentence focus. The last word/category pair is then followed by the group  category, which is uniquely identified by the preceding &amp;quot;#&amp;quot;. Finally a series of one or more pointers specifies other groups that are logically related. Each pointer is introduced by a &amp;quot;&gt;&amp;quot; followed by a signed number which specifies how many groups before (-) or after (+) the present group the connected group lies. These pointers are effectively double headed.</Paragraph>
      <Paragraph position="3"> In the example the first group points to the second (&gt;+1), and the second group points back to the first one (&gt;-i). This protocol is designed in such a way that all spacing between elements, as shown in the exmnple, is optional.</Paragraph>
      <Paragraph position="4"> Apart from the use in EFFENDI, the protocol is also used as synthesis input specification in the VERBMOBILS-project ((Wahlster 1993)), for the system utterances within the german clarification dialogue. null 4 The interface to the synthesis component in EFFENDI This section considers the question how the interface protocol can be used when the syntactic generator and the synthesis module interleave incrementally meaning that some words of the output are handed over to the synthesis module while others are still being generated at the same time. The problem for this processing mode is that the pieces handed over to the synthesis module cannot contain all prosodically relevant information as far as sentence parts that have not yet been generated are concerned. In sequential processing the complete protocol for an utterance is automatically computed and handed over to the synthesis module in one single step. In incremental processing the protocol must be handed over to the synthesis module in a piecemeal fashion. The question is therefore how the information handed over to the synthesis module can be reduced in favor of an early beginning of the articulation of a system answer.</Paragraph>
      <Paragraph position="5"> Since the protocol consists of a separation into breath groups, it seems to be reasonable to hand them over to the synthesis module as soon as they have been identified 6. In order to minimize the num~VERBMOBIL is a translation system that can assist a tkce-to-face dialogue between two non-native english speakers. The dialogue partners have the option to switch to their repective mother tongue and to activate VERBMOBIL to translate their utterances into english. This processing sometimes requires a clarification dialogue, e.g., if some background noises irritated the recognizer.</Paragraph>
      <Paragraph position="6"> ~Note, that the identification of the breath groups runs in parallel to the ongoing generation, so that the only missing information may be some pointers to breath groups that have not yet been generated.</Paragraph>
      <Paragraph position="7"> ber of missing pointers it is possible to impose a delay on one or more breath groups. This means that a breath group is handed over to the synthesis component if some of the following breath groups have already been identified by the generator.</Paragraph>
      <Paragraph position="8"> The most important problem in incremental generation is the necessity of repairs that have to be done if, e.g., a previously unknown word cannot be attached to the word order already articulated.</Paragraph>
      <Paragraph position="9"> Since already articulated words cannot be retracted, an extensive repetition of the concerned phrase is necessary to correct the already articulated but wrong formulation. E.g., if the noun phrase &amp;quot;the man&amp;quot; has been articulated and it is incrementally extended by an adjective &amp;quot;young&amp;quot;, the correction of the articulation consists of the repetition of the whole phrase &amp;quot;the young man&amp;quot;.</Paragraph>
      <Paragraph position="10"> In order to avoid such extensive repetitions, we developed a strategy called &amp;quot;afterthought syntax&amp;quot;. If words resulting from semantic information that was not available when the first words of a sentence were uttered can't be syntactically correctly attached to the words already articulated, then the syntactic ordering is (partly) disregarded, i.e. precedence is given to completeness of the semantic content and shortness of the utterance over syntactic correctness. In virtually all cases, the resulting utterance remains completely understandable. Technically this behaviour is implemented using elliptic generation. The (now complete) utterance is regenerated, and all parts of the utterance that have already been uttered are marked as ellipses, i.e. prohibited from being uttered again. However, rules are applied to ensure that repair elements receive a syntatic context if they need it, thus overriding that prohibition, if necessary: Sie m6chten wissen, w~n~ der Zug f~ihrt ...</Paragraph>
      <Paragraph position="11"> (You want-to know, when the train leaves ...) tier n~ichste Zug *. * (the next train ...) nach Ulm. (to Ulm.) The first elliptical resumption is caused by the previously unknown adjective &amp;quot;n~chste&amp;quot; which leads to the repetition of the complete noun phrase, while the second resumption is caused by the PP &amp;quot;nach Ulm&amp;quot; which, according to standard German syntax, would have to be placed before the verb in a subordinate clause.</Paragraph>
  </Section>
  <Section position="4" start_page="20" end_page="21" type="metho">
    <SectionTitle>
5 Future Work
</SectionTitle>
    <Paragraph position="0"> For the near future, we plan to implement a full interaction between the dialogue manager, the genera- null tor, and the speech synthesis module in incremental processing. We hope to gain practical experience in interleaved generation and synthesis. This is especially vital for finding an answer to the question, how articulation can be delayed in favor of an acceptable output quality in such a way that the overall reaction time of the system is only marginally increased.</Paragraph>
  </Section>
class="xml-element"></Paper>