<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1208">
  <Title>Looking for the presence of linguistic concepts in the prosody of spoken utterances</Title>
  <Section position="5" start_page="57" end_page="57" type="metho">
    <SectionTitle>
2.5 Results
</SectionTitle>
    <Paragraph position="0"> The reliability of the test results does not depend on the listener's ability to concentrate solely on the prosody as is the case when evaluating original utterantes; nonsense sentences or utterances consisting of nonsense words. The results can be based on a large number of stimuli rather than be restricted to the particularities of only a few, because there are no semantical limitations to generating more stimuli.</Paragraph>
  </Section>
  <Section position="6" start_page="57" end_page="59" type="metho">
    <SectionTitle>
3 Validation test series
</SectionTitle>
    <Paragraph position="0"> Several methods for speech delexicalisation can be found in the literature \[Kre82.`Pas93.`Leh79,Mer96, Oha79:Pij94;Sch84\]. The aim of all these manipulations is to render the lexical content of an utterance unintelligibl% while leaving the speech melody and temporal structure intact. We think that the ideal stimulus manipulation for prosodic perception tests should meet three main requirements: * it should clearly convey the primary prosodic functions (i.e. accentuation, phrasing and sentence modality~ * the detection of these phenomena should not require too much listening effort from the test subject * the manipulation procedure should be simple and quick We compared six methods of delexicalisation according to these criteria. Subjects had to complete four different tasks. They were questioned after each task which of the six different stimulus versions they found easiest for the task, most difficult for the task. most pleasant and least pleasant. Learning effects are negligible because the presentation order was changed for each subject.</Paragraph>
    <Section position="1" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
3.1 Stimuli manipulation
</SectionTitle>
      <Paragraph position="0"> All the stimuli referred to in this paper were digitally recorded in an anechoic chamber with 16kHz and 16bit. The following sex manipulation methods were compared:</Paragraph>
      <Paragraph position="2"> The extracted pitchmarks of the original signal were filled with an excitation signal proposed by the CCITT \[CIT89\], and also low-pass filtered.</Paragraph>
      <Paragraph position="3"> The original signal was low-pass filtered using a time variant filter with a cut-off frequency just above F0. At unvoiced segments within the signal the cut-off frequency was automatically set to zero.</Paragraph>
      <Paragraph position="4"> A combination of spectral inversion and filtering proposed by \[Kre82\]. After high-pass filtering at 600Hz, the signal is spectrally inverted.` then low-pass filtered at 4000Hz and then added to the residual of the original signal low-pass filtered at 200Hz. The resulting signal preserves the voiced / unvoiced distinction and is the most intelligible of the versions compared.</Paragraph>
      <Paragraph position="5"> The extracted pitchmarks of the original signal were filled with the Liljencrants-Fant model \[Fan85\] of glottal flow.</Paragraph>
      <Paragraph position="6"> A simple sawtooth signal was inserted into the extracted pitchlnarks.</Paragraph>
      <Paragraph position="7"> The pitchmarks were filled with a sinus with a first harmonic of 1/4 of the amplitude and a second harmonic of 1/16 of the amplitude.</Paragraph>
      <Paragraph position="8"> Other ways of rendering an utterance unintelligible, such as \[Pij94,Pag96\], were not included as we tried to keep the effort for stimuli manipulation as low as possible.</Paragraph>
    </Section>
    <Section position="2" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
3.2 Counting of syllables
</SectionTitle>
      <Paragraph position="0"> In the first test session 18 subjects were asked to count the number of syllables of 12 short sentences aurally presented in the different manipulated versions. The stinmli were chosen out of five different sentences (5-8 syllables of length) spoken by a fealale speaker and manipulated with the six different procedures described above. Out of these stimuli two sentences per version were used for syllable counting while the rest was used for the accent assigmnent task. As this was an open response task, there is no referential chance level as in the other tests. The resuits show that the syllable number of nearly 60% of all stimuli can be determined exactly with the proposed method, at least at sentence level (Pig. 1).</Paragraph>
      <Paragraph position="1"> In 86% of all cases, the correct number of syllables plus/minus one were detected.</Paragraph>
    </Section>
    <Section position="3" start_page="57" end_page="58" type="sub_section">
      <SectionTitle>
3.3 Phrase accent assignment
</SectionTitle>
      <Paragraph position="0"> The same subjects then listened to the other 18 sentences (six versions in three different sentences) to  dittbrently manipula.ted stimuli.</Paragraph>
      <Paragraph position="1"> assign a phrase accent to a syllable. Again presentation order differed from subject to subject. Now~ they could see a cursor moving along an oscillogram of the current phrase; where each syllable boundary was marked. This combination of aural and visual presentation was chosen to make sure that the subjects: ability to count syllables was not tested again. To avoid any influences of the visual amplitude differences between the syllables on the subject:s choice: the stinmli had been adjusted to have a more or less equal energy distribution over the whole phrase. We thus reduced the intonational information by the energy factor. The results appear to confirm that this is the least important factor \[Fry58\] within prosodic perception. In 73.4% of all cases the phrase accent was correctly assigned (Fig. 2). Some of the subjects reported that the possibility of relating the perceived accent to a visual cursor position helped a lot. Others, who seemed to have no problems with the syllable counting task; said that they were rather confused by the visualization.</Paragraph>
    </Section>
    <Section position="4" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
3.4 Recognition of phrase modality
</SectionTitle>
      <Paragraph position="0"> 16 subjects were presented with three phrases recorded from a male speaker and pronounced in three different modalities: terminal, progredient (i.e. continuation rise) and interrogative \[Son96a\]. Each subject hstened to 32 stimuli chosen randomly from these nine phrases manipulated by the six procedures and decided on one of the given modalities.</Paragraph>
      <Paragraph position="1"> The result was highly significant: 84% of the stimuli were correctly recognized (Fig. 3).</Paragraph>
    </Section>
    <Section position="5" start_page="58" end_page="58" type="sub_section">
      <SectionTitle>
3.5 Phrase boundary detection
</SectionTitle>
      <Paragraph position="0"> 12 subjects were asked to place two phrase boundaries in 20 manipulated stimuh with the additional help of visual presentation. Four different sentences (12-20 syllables) had been read by a female speaker; all containing two syntactically motivated prosodic boundaries. The visual signal contained markers at each visible syllable boundary which served as possible phrase boundary location. As there were 15 possible boundaries per sentence in the mean, chance level can be calculated as being around 6.6%. All stimuli were checked whether they contained a visually obvious pause at the boundaries. These pauses were manually eliminated. Even though this meant that the most important clue for boundary detection \[Leh79\] was eliminated the subjects managed a significantly correct detection in 66.6% of all stimuli (Fig. 4). One of the two boundaries was correctly placed in 90% of the cases.</Paragraph>
    </Section>
    <Section position="6" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
3.6 Choice of stimulus manipulation
</SectionTitle>
      <Paragraph position="0"> All four tasks yielded correct results. It was surprising that the error rate for the differently manipulated stimuli did not significantly differ, neither within a task nor over all. So the decision which manipulation procedure to prefer can only be  of the signal did you find a) easiest? b) most ditficult? c) most plea.saalt? d)/east pleasmat? based upon the subjective evaluation of the pleasantness. As the differences between the tasks are small enough, we compare the subjects' opininions over all tasks (Fig. 5). The least &amp;quot;easy&amp;quot; version was the one filtered at the fundamental frequency. The sinusoidal signal and the signal after the Liljencrants-Fant model were &amp;quot;not difficult&amp;quot;. &amp;quot;Most comfortable&amp;quot; was the CCITT excitation signal, the signal filtered at F0 and the sinoidal signal. The spectrally inverted signal and the sawtooth excitation signal were judged &amp;quot;least comfortable&amp;quot;. All these differences were significant (p&lt;0.05). All in all we conclude that the sinoidal signal is the most appropriate one (Fig. 6). Our findings confirmed the resuits about the pleasantness of manipulated signals in \[Kla97\].</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="59" end_page="100" type="metho">
    <SectionTitle>
4 Examples of tests carried out to detect prosodic concepts
</SectionTitle>
    <Paragraph position="0"> detect prosodic concepts The first two tests described here (emotions and syntactic structure) took place before the comparison of stimulus manipulation methods. Therefore they have been carried out using the sawtooth excitation signal. In the latter two tests (dialogue acts and given/new), the sinusoidal signal manipulation described in 2.3 was used.</Paragraph>
    <Section position="1" start_page="59" end_page="59" type="sub_section">
      <SectionTitle>
4.1 Emotions
</SectionTitle>
      <Paragraph position="0"> In a test aimed at identifying the emotional content (e.g. fear, joy, anger, disgust, sadness) from the prosodic properties only, speech signals that were resynthesized with a concatenative system yielded the same poor results as the delexicalized stimuli \[Heu96\]. Both stimuli gave results that were at chance level. It is obvious that in this case, where the naturalness of an utterance depends on features that are not readily controllable by time-domain synthesis system (e.g. aspiration, creaky voice etc.) a test procedure with resynthesized speech will not improve the results that have been obtained with the delexicalized stimuli, because all the parameters that are used for the resynthesis are present in the delexicalized stimuli.</Paragraph>
    </Section>
    <Section position="2" start_page="59" end_page="60" type="sub_section">
      <SectionTitle>
4.2 Syntactic structure
</SectionTitle>
      <Paragraph position="0"> To show that prosody transports information about the syntactic structure of a sentence, subjects were asked to assign one of several given syntactic structures to the presented delexicalized stimuli \[Son96b\]. The possible syntactic structures were represented by written sentences, one of which had the same syntactic structure as the stimulus. These sentences differed from the utterances that served as the source for the test stimuli (see Fig. 7). Asked to pick  stimuhs presented as excitation signal: &amp;quot;A~i:f der alten Theke steht der Eintopf.&amp;quot; ;answering sheet: Die kleine Katze lie.qt in der Truhe.</Paragraph>
      <Paragraph position="1"> In der Truhe lie.qt die kleine Katze.</Paragraph>
      <Paragraph position="2"> Die Katze lie.qt in der kleinen Truhe.</Paragraph>
      <Paragraph position="3"> In der kleinen Truhe liegt die Katze.</Paragraph>
      <Paragraph position="4"> out the sentence they were hearing, the subjects believed that what they heard was the written sentence, which shows that their decision was based solely on prosody. Stimuli of one male speaker were correctly classified in 80~ of all cases. A professional male speaker with very elaborate speaking style yielded 67~) of correct answers.</Paragraph>
    </Section>
    <Section position="3" start_page="60" end_page="100" type="sub_section">
      <SectionTitle>
4.3 Dialogue acts
</SectionTitle>
      <Paragraph position="0"> The motivation for this test was to decide whether different dialogue act types have a perceivable influence on the prosodic structure of an utterance.</Paragraph>
      <Paragraph position="1"> Within the VERBMOBIL project, dialogue act types from the domain of appointment scheduling dialogues are used \[Rei95\]. If these dialogue act types have specific prosodic forms, then the synthesis module should generate them accordingly.</Paragraph>
      <Paragraph position="2"> For a first approach we chose to evaluate the four dialogue act types:</Paragraph>
      <Paragraph position="4"> For each dialogue act type: eight sentence's were read by a male and a female speaker. For affirmation and negation~ only statements were chosen (length: 1-10 syllables), and four questions and four answers for suggestion and request (length: 6-14 syllables). The resulting 64 sentences were manipulated and randomly presented to ten subjects who had to assign one of the four dialogue act types to each sentence. Although each subject remarked that this was a pretty difficult task, their answers were significantly (p&lt;0.001) above chance level (Fig. 8).</Paragraph>
      <Paragraph position="5"> What seemed more difficult than relating the utterance to an abstract internal reference was the fact that the two speakers' utterances were presented in random order. They differed remarkably not only as to their fundamental frequency but also to their expressive strategies. Whereas the male speaker was more often thought to sound negating, the female speaker was mostly recognized as being requestive.</Paragraph>
      <Paragraph position="6"> Also, dialogue acts spoken by the female speaker were recognized significantly better as those spoken by the male. This indicates the degree to which the interpretation of a linguistic concept depends on the speaker's personality and should be taken into account whenever speaker adaptation of the synthetic output is desired. Perception tests should always take into account the subjects' comments on the completed task. This can yield very useful but often neglected extra information. The subject (no. 10 in Fig. 9) who scored better than the others explained his strategy. To distinguish between affirmation/negation on the one hand and suggestion/request on the other, he assumed that in the former, the focused part of the utterance lies at the very beginning of the utterance, whereas in the latter, the second half of the utterance should bear more focus. Whether this assumption can be generalized or not has to be investigated in further perception tests.</Paragraph>
      <Paragraph position="7">  af/in~a~on negation suggestion request dialogue act presented  for each prssented act. The line indicates chance level.</Paragraph>
    </Section>
    <Section position="4" start_page="100" end_page="100" type="sub_section">
      <SectionTitle>
4.4 Given/new
</SectionTitle>
      <Paragraph position="0"> As an extension of the phrase accent assignment test we tested the accuracy with which subjects perceive differently focussed parts within a delexicalized utterance. The stimuli consisted of eight sentences of a new/given structure and eight sentences of a given/new structure of different length. They were read by a female and a male speaker as possible answers to a question, then manipulated and presented in random order. The 'given' part was always a rephrasing of a part of the question. Ten subjects were given a short explanatory text with an example and then asked to decide in which order the different  for each subject. The line indicates chance level parts appeared witlfin the utterance and where the boundary between the two parts was located. The task was supported by an oscillogram of the stimulus containing four marks as possible boundary locations. As in Section 3.3.` the energy distribution over the whole sentence was smoothed. Some subjects claimed that the location task was easier than the order recognition task. The order recognition task was correctly completed in 78%, the boundary was correctly located in 62% (Fig. 10). Both tasks were significantly (p&lt;0.001) completed over chance level, yet some inter-subject differences were also significant. The subjects located the 'new I part significantly (p&lt;0.002) more often at the beginning of the sentence, which can be explained by intonational downstep.</Paragraph>
      <Paragraph position="1">  levd=50%) a~d the boundary location task (chance level=25%) for each speaker.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>