<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2160">
  <Title>THE PARSODY SYSTEM : AUTOMATIC PREDICTION OF PROSODIC BOUNDARIES FOR TEXT-TO-SPEECH</Title>
  <Section position="6" start_page="993" end_page="995" type="evalu">
    <SectionTitle>
EVALUATION
</SectionTitle>
    <Paragraph position="0"> Ultimately the success of a prosody component of a TTS system will be determined by perceptual tests on the naturalness, or the acceptability, of the synthesised speech. Such tests are subjective, as well as time-consuming and costly to perform, so a more objective point of reference is required.</Paragraph>
    <Paragraph position="1"> 4 The values on a boundary node can be seen in Figure 2. Values are computed by taking the sum of the phonological words at each node and adding one. A phonological word is one joined by a &quot;+&quot; symbol. In Figure 2, each terminal node has only one phonological word.</Paragraph>
    <Paragraph position="2"> Our approach compares the prosodic boundaries assigned by Parsody with data provided in Bachenko and Fitzpatrick's paper [2]. This data comprises a set of 35 sentences with the prosodic boundaries marked by hand 5. Of these sentences, 14 are the 'original' Gee and Grosjean sentences re-analysed by Bachenko and Fitzpatrick. Our evaluation was concerned only with the primary and secondary boundaries assigned by Bachenko and Fitzpatrick.</Paragraph>
    <Paragraph position="3"> Some points should be noted about these sentences. The Bachenko and Fitzpatrick sentences (excluding the 14 Gee and Grosjean sentences) have a fairly simple sentence structure, and should therefore be handled well by both systems (Parsody and Bachenko and Fitzpatrick's system). In our opinion they do not constitute a rigorous test for the prosodic component of a TTS system, but they are useful for evaluation nevertheless.</Paragraph>
    <Paragraph position="4"> The Gee and Grosjean sentences, however, have a complex sentence structure, although this is similar for each sentence. Experience would suggest that this is not a realistic sample of sentences from which to work. Bachenko and Fitzpatrick have converted these sentences to their notation. This results in each sentence having only one primary boundary, and all but one sentence having one secondary boundary.</Paragraph>
    <Paragraph position="5"> Furthermore, the primary boundary nearly always appears at the mid-point of the sentence. These results seem intuitively simple for such complex sentences, so in this evaluation the data-set &quot;Gee and Grosjean Reanalysed&quot; is a test against the Gee and Grosjean data, with the boundaries marked according to the normalisation algorithm employed by Parsody.</Paragraph>
    <Paragraph position="6">  clear from their description. A &quot;correct&quot; boundary is a perfect match with the human annotation. A &quot;close&quot; boundary is one where another boundary appears in its place (e.g. a secondary instead of a primary). &quot;Extra boundary&quot; refers to the number of boundaries produced by the system greater than the actual number of boundaries (a negative figure indicating that fewer boundaries of that type were produced).</Paragraph>
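    <Paragraph> The three tallies defined above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each annotation is a mapping from a word-gap index to a boundary type ("P" for primary, "S" for secondary); the function name and data layout are hypothetical and are not taken from the scoring program used in the paper.

```python
from collections import Counter


def compare_boundaries(reference, predicted):
    """Tally correct, close and extra boundaries per boundary type.

    Both arguments map a word-gap index to "P" (primary) or "S"
    (secondary); gaps without a boundary are simply absent.
    This layout is an assumption made for the sketch.
    """
    correct = Counter()
    close = Counter()
    for gap, ref_type in reference.items():
        pred_type = predicted.get(gap)
        if pred_type == ref_type:
            correct[ref_type] += 1   # perfect match with the annotation
        elif pred_type is not None:
            close[ref_type] += 1     # a boundary of the other type stands in
    actual = Counter(reference.values())
    produced = Counter(predicted.values())
    # Negative values mean fewer boundaries of that type were produced.
    extra = {t: produced[t] - actual[t] for t in ("P", "S")}
    return correct, close, extra


# Illustrative annotations only; not data from the paper.
reference = {3: "P", 6: "S", 9: "S"}
predicted = {3: "P", 6: "P", 11: "S"}
correct, close, extra = compare_boundaries(reference, predicted)
print(dict(correct), dict(close), extra)
```

    In this toy case the primary at gap 3 is correct, the secondary at gap 6 is close (a primary was produced in its place), and the system produced one primary too many and one secondary too few.
    </Paragraph>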
    <Paragraph position="7"> The &quot;scores&quot; presented provide a figure by which systems can be compared (with each other, or with human-annotated results). A score of 1 would indicate a perfect comparison of results. The figure includes both the successes and failures (including overgeneration) of the system. The overall score given is the mean of the primary and secondary scores.</Paragraph>
    <Paragraph position="8"> The scores are calculated according to the following formula.</Paragraph>
    <Paragraph position="10"> matched closely (i.e. a Primary marked by a Secondary, or vice versa)</Paragraph>
    <Paragraph position="12"> The CorrectBoundary result is multiplied by 2 as a weighting factor. Obviously it is better to have correct boundaries than close boundaries. Accordingly, the ActualBoundary score is also doubled to maintain the scale.</Paragraph>
    <Paragraph position="13"> Note that the smaller the overgeneration factor, the larger the amount of overgeneration (a score greater than 1 indicates undergeneration).</Paragraph>
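    <Paragraph> The formula itself has not survived in this copy of the text, so the following is only a sketch of the pieces that the prose describes, not the authors' formula. It assumes that the weighted match term is (2 x Correct + Close) / (2 x Actual), and that the overgeneration factor is the ratio of actual to produced boundaries; the second definition is a guess chosen because it shrinks with overgeneration and exceeds 1 under undergeneration. How the two terms are combined is not recoverable here and is deliberately left out. All function names are hypothetical.

```python
def weighted_match(correct, close, actual):
    """Weighted share of annotated boundaries that were matched.

    Correct boundaries count double, and the actual count is doubled
    as well so that a perfect result gives 1.0.
    """
    if actual == 0:
        raise ValueError("actual must be positive")
    return (2 * correct + close) / (2 * actual)


def overgeneration_factor(actual, produced):
    """ASSUMPTION: actual / produced, which is only one plausible reading.

    Values below 1 mean overgeneration, values above 1 undergeneration.
    """
    if produced == 0:
        raise ValueError("produced must be positive")
    return actual / produced


def overall_score(primary_score, secondary_score):
    """Mean of the primary and secondary scores, as stated in the text."""
    return (primary_score + secondary_score) / 2


# Illustrative counts only; they are not taken from the paper's tables.
print(weighted_match(correct=8, close=2, actual=10))
print(overgeneration_factor(actual=10, produced=8))
print(overall_score(0.7, 0.5))
```

    With every boundary correct and none missing, weighted_match returns 1.0, which agrees with the statement that a score of 1 indicates a perfect comparison.
    </Paragraph>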
    <Paragraph position="14"> The results reported show that the Parsody system compares favourably, under this analysis, with the Bachenko and Fitzpatrick system: for example, in Table 1 the overall score is 68% for Parsody and 49% for Bachenko and Fitzpatrick's system. What is encouraging is the better performance on the prediction of primary boundaries. The automatic scoring program also presents the results in a useful way. These results can be related to Bachenko and Fitzpatrick's own evaluation in [12], where they quote a figure of 80%, given the assumption that primary and secondary boundaries are basically similar from a comprehensibility, and acceptability, viewpoint on the synthesised speech. This score is given by summing the Correct Primary and Close Primary scores and dividing by the total number of Primary boundaries. Parsody scores 93% in this case (calculated from Table 1).</Paragraph>
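    <Paragraph> The combined figure described above is a simple ratio, sketched below. The function name is hypothetical and the counts are illustrative only; they are not the Table 1 values.

```python
def combined_primary_rate(correct_primary, close_primary, total_primary):
    """Share of annotated primary boundaries matched by any boundary.

    Correct and close primaries are treated alike, following the
    assumption that primary and secondary boundaries are basically
    similar from the listener's point of view.
    """
    if total_primary == 0:
        raise ValueError("total_primary must be positive")
    return (correct_primary + close_primary) / total_primary


# Illustrative counts only; they are not taken from Table 1.
rate = combined_primary_rate(correct_primary=5, close_primary=1, total_primary=8)
print(f"{rate:.0%}")
```

    Unlike the weighted score, this ratio ignores overgeneration, so the two figures should not be compared directly.
    </Paragraph>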
    <Paragraph position="15"> As regards our evaluation method proper, it is clear that the method requires improvement. Future methods should concentrate on punishing the incorrect placement of boundaries, especially those that affect the perception of the synthesised speech, a viewpoint that Bachenko and Fitzpatrick also seem to hold.</Paragraph>
    <Paragraph position="16"> This short article outlined the Parsody system, the essentials of which form a component of BT's Laureate Text-to-Speech system. Key features of the Parsody system include its ability to provide accurate parses robustly, which allows it to handle ill-formed input with ease. Parsody also provides a robust rule-based prosodic annotation facility, developed from algorithms presented in the literature, which have been extended for greater performance.</Paragraph>
    <Paragraph position="17"> Most of the problems with the Parsody system currently lie with the parser. Despite the high performance of the word tagger, the effect of wrongly tagging a word is large, since the prosody component uses this information to construct a prosody tree in a bottom-up fashion. To improve the tagging performance we are considering including word collocation statistics. Also, it would be desirable to increase the range of syntactic structures produced by the parser. To improve the parser performance we are looking at extending the minimal grammar, but in such a way that processing speed is maintained. Future versions of the parser may also include special disambiguation rules concentrating on words having multiple pronunciations. Topic and focus marking will also be introduced at some stage.</Paragraph>
    <Paragraph position="18"> We also hope to investigate the stochastic approach to prosodic marking. Future work will focus on assembling a suitable corpus; it is likely that the best prosodic marking procedure is a hybrid of the rule-based and stochastic approaches. As was mentioned earlier, the immediate goal with respect to prosodic marking has been the prediction of prosodic boundary location and of the boundary strengths. Future work will concentrate on the interpretation of boundary strengths, for example by investigating the correlation of our normalised (hence gradable) boundaries with acoustic phenomena at these boundaries.</Paragraph>
    <Paragraph position="19"> Finally, it is important to remember that the intended goal of text-to-speech systems is to synthesise unrestricted text input. Initial work has begun on extending the evaluation of the system to more 'normal' sentences. For example, work in BT's Natural Language Group includes automatic text summarisation; in tests on the summarisation of newspaper articles the length of sentences often exceeds 100 words. Our text-to-speech system must be able to handle such sentences efficiently at both the parsing and prosody stages. The lessons learnt from more difficult input such as this may serve to increase our understanding of the relationship between syntax and prosody.</Paragraph>
  </Section>
</Paper>