<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1094">
  <Title>Integrating Linguistic and Performance-Based Constraints for Assigning Phrase Breaks</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The computational model
</SectionTitle>
    <Paragraph position="0"> Our initial English model was developed within the framework of the LT TTT tokenization toolkit (Grover et al., 2000): this provides a modular and configurable pipeline architecture in which various components incrementally add XML markup to the input text stream. More details of the implementation can be found in (Atterer, 2002). In outline, the algorithm consists of two main steps, each of which is in turn broken down into two further steps:
Step 1 Assignment of φ-phrases
  1. Chunking
  2. Restructuring of chunks to build φ-phrases
Step 2 Bundling of φ-phrases into intonational phrases ("Insert Phrase Breaks")
  1. Insertion of breaks using punctuation
  2. Insertion of further breaks using balancing and length constraints
The first important step is to identify φ-phrases. Although we require some syntactic markup as input to constructing these, a full parse is not necessary. Instead, we carry out a shallow parse using a chunker. For English, we use Abney's Cass chunker.3 Cass builds syntactic structure incrementally, starting with a level of simple chunks and then building various levels of more complex phrases above them. Phrases of each level are constructed non-recursively out of constituents of the previous level. For this work we only use the lowest level of units, such as nx (noun chunk) and vx (verb chunk), as illustrated in (3).</Paragraph>
    <Paragraph position="1">  (3) &lt;nx&gt;Their presence&lt;/nx&gt; &lt;vx&gt;has enriched&lt;/vx&gt; &lt;nx&gt;this university&lt;/nx&gt; and &lt;nx&gt;this country&lt;/nx&gt;, and &lt;nx&gt;many&lt;/nx&gt; &lt;vx&gt;will return&lt;/vx&gt; &lt;nx&gt;home&lt;/nx&gt; &lt;inf&gt;to enhance&lt;/inf&gt;&lt;nx&gt;their own nations&lt;/nx&gt;.</Paragraph>
    <Paragraph position="2"> Abney's definition of chunk is very similar to Nespor and Vogel's notion of φ-phrase: "roughly speaking, a chunk is the non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head, but not including post-head constituents." (Abney, 1996). Chunks defined in this way map almost directly onto our φ-phrases, except that we also include in the φ-phrase any unchunked material on the left boundary of the chunk. For example, the sequence and &lt;nx&gt;this country&lt;/nx&gt; in (3) is converted into a single φ-phrase.
For the German version of the model, we used a chunker developed by Helmut Schmid (work in progress) and carried out some subsequent restructuring of the chunker's output. The four main modifications to the chunk structure are as follows (a code sketch of all four rules follows the list and its examples).
1 In German, as opposed to English, the auxiliary can be separated from the verb/verb group it belongs to. That is, a complement or modifier can split the verb chunk, and consequently the chunker builds two separate verb chunks. Since the auxiliary does not count as a lexical head, we delete the chunk boundary after it. This is illustrated by examples (4) and (5), where the deletion of the chunk boundary after the auxiliary hat results in the φ-phrase hat den Führungsstreit.</Paragraph>
    <Paragraph position="3"> (4) &lt;nx&gt; Der nordrhein-westfälische Ministerpräsident &lt;/nx&gt; &lt;nx&gt; Rau &lt;/nx&gt; &lt;vx&gt; hat &lt;/vx&gt; &lt;nx&gt; den Führungsstreit &lt;/nx&gt; &lt;px&gt; bei &lt;nx&gt; den Sozialdemokraten &lt;/nx&gt; &lt;/px&gt; &lt;vx&gt; kritisiert &lt;/vx&gt;.
(5) &lt;phi&gt; Der nordrhein-westfälische Ministerpräsident Rau &lt;/phi&gt; &lt;phi&gt; hat den Führungsstreit &lt;/phi&gt; &lt;phi&gt; bei den Sozialdemokraten kritisiert. &lt;/phi&gt;
2 Proper names, which are often output as separate chunks by the chunker, are attached to a preceding noun. In (5) the name Rau has been attached to the preceding noun chunk of (4).</Paragraph>
    <Paragraph position="4"> 3 Verb particles at the end of sentences are attached to the preceding chunk. Such verb particles are in fact part of verbs, but are sometimes separated from the verb stem, e.g. the particle auf from the verb aufgeben (to give up) in the sentence Er gab seinen Plan auf. (Lit: He gave his plan up.) In example (7) the particle ab is attached to the preceding chunk of (6).</Paragraph>
    <Paragraph position="5"> (6) &lt;nx&gt; Die weitere Entwicklung &lt;/nx&gt; &lt;px&gt; in &lt;nx&gt; den kommenden Jahren &lt;/nx&gt; &lt;/px&gt; &lt;vx&gt; hänge &lt;/vx&gt; &lt;px&gt; von &lt;nx&gt; den unternehmerischen Qualitäten &lt;/nx&gt; &lt;/px&gt; &lt;vx&gt; ab &lt;/vx&gt;.</Paragraph>
    <Paragraph position="6"> (7) &lt;phi&gt; Die weitere Entwicklung &lt;/phi&gt; &lt;phi&gt; in den kommenden Jahren &lt;/phi&gt; &lt;phi&gt; hänge &lt;/phi&gt; &lt;phi&gt; von den unternehmerischen Qualitäten ab. &lt;/phi&gt;</Paragraph>
    <Paragraph position="7"> 4 Phrase-final verb chunks which consist of only one word are also attached to the preceding material. This is also illustrated by (4) and (5), where the final verb chunk consisting only of the past participle kritisiert is included in the same φ-phrase as the preceding chunk.</Paragraph>
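    <Paragraph position="7a"> The following Python sketch illustrates both operations described above: attaching unchunked material to the left boundary of the following chunk, and the four German restructuring rules. It is a minimal sketch under simplifying assumptions: chunks are (tag, words) pairs in surface order, tag None marks unchunked material, and is_auxiliary and is_proper_name are crude placeholder predicates of our own, not the chunkers' real feature markup.

AUXILIARIES = {"hat", "ist", "wird", "haben", "sind", "werden"}  # illustrative only

def is_auxiliary(word):
    return word.lower() in AUXILIARIES

def is_proper_name(word):
    return word[:1].isupper()  # crude placeholder for real name detection

def restructure_german_chunks(chunks):
    """Apply modifications 1-4 to a list of (tag, words) chunks."""
    out = []
    for i, (tag, words) in enumerate(chunks):
        final = i == len(chunks) - 1
        merge = False
        if out:
            prev_tag, prev_words = out[-1]
            # Rule 1: delete the boundary after a bare auxiliary verb chunk,
            # so "hat" and "den Fuehrungsstreit" end up in one phrase
            if prev_tag == "vx" and all(is_auxiliary(w) for w in prev_words):
                merge = True
            # Rule 2: attach a proper-name chunk to a preceding noun chunk
            if tag == "nx" == prev_tag and all(is_proper_name(w) for w in words):
                merge = True
            # Rules 3 and 4: attach a phrase-final one-word verb chunk
            # (separated particle like "ab", or a lone participle) leftwards
            if final and tag == "vx" and len(words) == 1:
                merge = True
            if merge:
                out[-1] = (prev_tag, prev_words + words)
        if not merge:
            out.append((tag, words))
    return out

def chunks_to_phi_phrases(chunks):
    """Attach unchunked material to the left edge of the following chunk."""
    phrases, pending = [], []
    for tag, words in chunks:
        if tag is None:
            pending.extend(words)  # e.g. "and" before "this country" in (3)
        else:
            phrases.append(pending + words)
            pending = []
    if pending:  # trailing unchunked material joins the last phrase
        if phrases:
            phrases[-1].extend(pending)
        else:
            phrases.append(pending)
    return phrases

Run on the chunks of example (4), restructure_german_chunks followed by chunks_to_phi_phrases reproduces the three φ-phrases of example (5).
    </Paragraph>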
    <Paragraph position="8"> After identifying break-options in the form of φ-phrases, we have to bundle these constituents into intonational phrases. As mentioned before, there is observational evidence that utterances should be divided into intonational phrases of roughly equal length. Examining the Spoken English Corpus (SEC), Knowles et al. (1996a, p.111) found that speakers insert breaks after about five syllables in most cases, and that they almost never utter more than 15 syllables without a break.</Paragraph>
    <Paragraph position="9"> Our algorithm will thus contain a threshold parameter which sets an upper bound on the length of I-phrases. This value is used to calculate the optimum length of the I-phrases for particular sentences. Even though the threshold sets an upper bound, it is not a rigid one: an I-phrase can become longer in some cases. This is similar to cases in which a speaker would like to pause and maybe take a breath, but has to utter a few more words in order to complete a chunk.</Paragraph>
    <Paragraph position="10"> As we mentioned before, we envisage our system as forming one component of a TTS system, and therefore it is reasonable to expect punctuation in the input. This information provides a hard initial constraint on the formation of I-phrases; commas and periods always correspond to I-phrase boundaries. Once we have identified these I-phrase boundaries, the resulting segments are further sub-divided by applying the following procedure.</Paragraph>
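    <Paragraph position="10a"> A minimal sketch of this hard constraint, assuming φ-phrases arrive as word lists with punctuation attached to the final word (the real system operates on LT TTT's XML token stream, and the function name is ours):

def split_at_punctuation(phi_phrases):
    """phi_phrases: list of word lists; ',' and '.' sit on final words."""
    i_phrases, current = [], []
    for phrase in phi_phrases:
        current.append(phrase)
        if phrase[-1].endswith((",", ".")):
            i_phrases.append(current)  # comma/period: hard I-phrase boundary
            current = []
    if current:  # any material after the last punctuation mark
        i_phrases.append(current)
    return i_phrases
    </Paragraph>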
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Insert Phrase Breaks
</SectionTitle>
      <Paragraph position="0"> If the number of syllables ns in an intonational phrase is greater than threshold th, then (a) Calculate the number of desired breaks db = ns/th and the optimum length ol of each new intonational phrase ol = ns/(db + 1).</Paragraph>
      <Paragraph position="1"> (b) Determine the location of each new break by starting at the beginning of an intonational phrase, counting ol syllables forward, and carrying on until the end of the current φ-phrase. This is performed db times for the obligatory intonational phrase. (A Python sketch of this procedure is given below, after the examples.)</Paragraph>
      <Paragraph position="2"> So a threshold of 13, for instance, turns the structure shown in example (3) into the one shown in (8), where breaks are marked by '‖', and turns the structure in example (5) into the one shown in example (9).</Paragraph>
      <Paragraph position="3"> (8) Their presence has enriched this university ‖ and this country, ‖ and many will return home ‖ to enhance their own nations. ‖
(9) Der nordrhein-westfälische Ministerpräsident Rau ‖ hat den Führungsstreit bei den Sozialdemokraten kritisiert. ‖
We tried modifying the last step such that the algorithm could return to the beginning of the current φ-phrase if this was closer than the end. Interestingly, this produced slightly worse results. We believe the current algorithm is closer to what humans seem to do: reading on until they feel that a break is necessary, but not inserting the break until they have completed the current φ-phrase.</Paragraph>
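      <Paragraph position="3a"> The procedure above translates into a short Python sketch. The (text, syllable-count) pair representation is our assumption, as is rounding db down (the text writes db = ns/th without specifying the rounding); syllable counting is left to the surrounding system.

def insert_phrase_breaks(phi_phrases, threshold):
    """phi_phrases: list of (text, syllable_count) pairs making up one
    obligatory (punctuation-delimited) intonational phrase."""
    ns = sum(syl for _, syl in phi_phrases)
    if threshold >= ns:
        return [phi_phrases]             # short enough: no extra breaks
    db = ns // threshold                 # (a) desired breaks, rounded down
    ol = ns / (db + 1)                   # (a) optimum I-phrase length
    i_phrases, current, seen, target = [], [], 0, ol
    for phrase, syl in phi_phrases:
        current.append((phrase, syl))
        seen += syl
        # (b) count ol syllables forward, then carry on to the end of the
        # current phi-phrase: a break never splits a phi-phrase
        if seen >= target and db > len(i_phrases):
            i_phrases.append(current)
            current, target = [], target + ol
    if current:
        i_phrases.append(current)
    return i_phrases
      </Paragraph>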
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation Results
</SectionTitle>
    <Paragraph position="0"> We have already alluded to the fact that there are often several equally acceptable ways of assigning prosodic structure to a given stretch of text. Consequently, the very notion of evaluating a phrase-break model against a gold standard is problematic as long as the gold standard represents only one phrasing out of the space of all acceptable ones.</Paragraph>
    <Paragraph position="1"> Nevertheless, we have adopted the standard evaluation methodology in the absence of a more suitable alternative.</Paragraph>
    <Paragraph position="2"> The English model was evaluated using a test corpus of 8,605 words taken from the Spoken English Corpus (SEC) (Knowles et al., 1996b).4 Our test corpus comprises 6 randomly selected texts from 6 different genres.</Paragraph>
    <Paragraph position="2f"> 4 ...icame/lanspeks.html and consists of approximately 52k words of contemporary spoken British English drawn from various genres. The material is available in orthographic and prosodic transcription (including two levels of phrase breaks) and in two versions with grammatical tagging.</Paragraph>
    <Paragraph position="3"> We calculated recall and precision values. Recall is the percentage of breaks in the corpus that our model finds: recall = (B − D)/B × 100, where B is the total number of breaks in the test corpus and D is the number of deletion errors (breaks which the model does not assign, even though they are in the test corpus). Precision is the percentage of breaks assigned by the model which are correct according to the corpus: precision = (S − I)/S × 100, where S is the total number of breaks which our model assigns to the corpus and I is the number of insertion errors (breaks that the model assigns even though no break occurs in the test corpus). We also computed the F-score, the harmonic mean of the two: F = 2 × precision × recall / (precision + recall).</Paragraph>
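    <Paragraph position="3a"> In code, the three measures are a few lines; a minimal sketch with B, D, S and I as defined above:

def evaluate(B, D, S, I):
    """B: corpus breaks, D: deletion errors,
    S: model-assigned breaks, I: insertion errors."""
    recall = (B - D) / B * 100.0
    precision = (S - I) / S * 100.0
    f_score = 2.0 * precision * recall / (precision + recall)
    return recall, precision, f_score
    </Paragraph>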
    <Paragraph position="4"> The results for running the English version of the model with selected thresholds are shown in Table 1.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Table 1: Recall, precision and F-score for the English model at selected thresholds
</SectionTitle>
      <Paragraph position="2"> Increasing the threshold decreases the number of breaks that the model assigns: recall goes down, and precision goes up. Decreasing the threshold results in more overgeneration, with recall going up and precision going down. A threshold of 7 produced the best overall results; reducing the threshold below 5 or increasing it above 12 brings the overall F-score below 70. This does not hold for certain individual texts, however. One of the 6 texts we examined was the transcription of a public speech, and was thus presumably delivered differently from, for instance, a news broadcast (example (8) was taken from this speech). Its F-score for a threshold of 13 was 71, while its F-score for a threshold of 7 was only 68. Section 4 below contains further discussion of the role played by the threshold parameter in modelling performance.</Paragraph>
      <Paragraph position="3"> For comparison, the table also shows the results of two other approaches: a baseline model, run on our test data, which only assigns breaks at punctuation marks, and the Markov model for English of Taylor and Black (1998).5 It should be mentioned that Taylor and Black's model was trained on the SEC corpus, part of which is used for the evaluation here. It is thus optimized for this corpus and has the disadvantage of being less general than our model. Taylor and Black (1998, p.15) report that recall dropped from 79% to 73% when their model was tested on non-SEC data.</Paragraph>
      <Paragraph position="4"> Table 2 shows the results of evaluating our system against the more homogeneous corpus (14 sentences) of Gee and Grosjean, when restricted to predicting major breaks (intra-sentential and inter-sentential). For comparison, we also show the results reported by Bachenko and Fitzpatrick (1990) from running their rule-based model on the same corpus.6 The German version of the model was evaluated using 7,409 words of the news corpus of the Institute of Natural Language Processing (IMS), University of Stuttgart (Rapp, 1998). News broadcasts read by various speakers were hand-labelled with two levels of breaks (Mayer, 1995). For the evaluation we used all breaks, without distinguishing between the two levels. The results are shown in Table 3. As a comparison, we also show the baseline results using punctuation only, and results achieved by Schweitzer and Haase (2000) using rule-based approaches for German. The first set of results by Schweitzer and Haase was obtained with a robust stochastic parser and a head-lexicalized probabilistic context-free grammar, and the second set by [...].</Paragraph>
      <Paragraph position="5"> 5 Precision was calculated from the figures in Table 2 on p. 10 in their paper, assuming 1,404 breaks and 7,662 junctures as stated on p. 4 there.</Paragraph>
      <Paragraph position="6"> 6 These were calculated from the annotated sentences in their appendix, counting major intra-sentential and inter-sentential breaks. Sentences with parsing errors were treated as if no break had been assigned. A relatively high threshold was picked because we only tried to account for major breaks; lower thresholds would cause too many insertion errors.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Accounting for prosodic breaks at various speech rates
</SectionTitle>
    <Paragraph position="0"> When speakers talk faster they use fewer breaks per utterance, and when they talk more slowly they use more breaks (Trouvain and Grice, 1999). This is reminiscent of what our model does when we increase and decrease the threshold parameter, respectively. Intuitively, the algorithm was often able to predict acceptable break patterns across a range of threshold values, and the variation in threshold seemed to reflect what speakers do when varying their speech rate.</Paragraph>
    <Paragraph position="1"> In order to capture this effect in a more formal way we tried to evaluate the algorithm on a corpus which was recorded at three different speech rates (Trouvain and Grice, 1999). Three speakers (CZ, PS and AT) read a German text of 108 words 3 times slowly, 3 times at a normal rate, and 3 times at a fast rate.</Paragraph>
    <Paragraph position="2"> Trouvain and Grice show that reducing/increasing breaks is not the only prosodic correlate of changing speech rate; for example, speakers also reduce phone durations or pause durations. The extent to which increasing/decreasing the number of breaks correlates with speech rate varies both within and across speakers. One of the speakers, for instance, uses 23 breaks in her first slow version, 28 in her second slow version, and 26 in her third slow version. On average this was definitely more than she used in her normal versions (20, 20 and 24 respectively). To test our algorithm we only used the slow version with the largest number of breaks, the fast version with the smallest number of breaks, and one of the normal versions which was closest to the average of the normal versions. We did this for each speaker.</Paragraph>
    <Paragraph position="3"> We expected to see an effect of the slow versions being better modelled by low threshold parameters, and the fast versions by higher ones. It turned out, however, that the slow versions produced much lower recall/precision values than the faster versions. This was due to the fact that, when they produced their slow versions, the speakers tended to insert breaks at positions which do not correspond to our φ-phrase boundaries, such as immediately after sentence-initial temporal adverbial phrases (which are not marked by commas in German). To account for this we would have needed a tagger which distinguishes adverbials of time from other adverbials. Moreover, further changes in the rules for the restructuring of chunks might have been appropriate, such as preventing breaks before any phrase-final verb chunk up to a certain length.</Paragraph>
    <Paragraph position="4"> This expedient needs to be approached carefully, however, since when we are trying to model such a small corpus, there is a danger of 'overfitting' the rule set in a way which fails to generalize properly to more extensive corpora.</Paragraph>
    <Paragraph position="5"> For the time being, we decided to carry out the first step of the algorithm, namely the assignment of φ-phrases, manually, in order to test whether the heuristics are useful for modelling different speech rates. We assigned a final φ-phrase boundary at all those structural locations where we could find a phrase break in more than one of our 27 spoken versions of the text. This resulted in a structure which could in theory be found automatically if the necessary information were available (e.g. explicit annotation of adverbs of time).</Paragraph>
    <Paragraph position="6"> Running the heuristics on this φ-structure did indeed show some potential for imitating various speech rates. Table 4 shows recall/precision pairs for running the algorithm with the range of possible threshold values on a slow, normal and fast version by speaker CZ. The grey shading in the table shows the best values, i.e. where recall is greater than 90.0% and precision is greater than 80.0%.7 It does indeed appear that higher thresholds lead to a better model of fast speech rates, and lower thresholds are more appropriate for slow speech rates. The tables for the other two speakers (Table 5 and Table 6) show the same tendency. They also reflect the tendency of those two speakers to use the strategy of varying the number of breaks to a lesser extent than CZ when speeding up (cf. Trouvain and Grice (1999)).</Paragraph>
    <Paragraph position="7"> 7 The model has a general tendency to assign higher recall than precision values. Therefore we have to weight precision slightly lower than recall (approximately in a ratio of 8:9) to see the effect. For better readability we leave out the F-scores, which would also only show the effect if weights were included.</Paragraph>
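    <Paragraph position="8"> The paper gives no formula for this weighting; one standard way to encode a mild preference for recall over precision is the F-beta measure, sketched below. The choice beta = 9/8, mirroring the approximate 8:9 ratio in footnote 7, is our assumption, not the authors'.

def f_beta(precision, recall, beta=9.0 / 8.0):
    """F-beta: beta greater than 1 weights recall above precision."""
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
    </Paragraph>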
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Our heuristic can imitate the phrasing of various speech rates by modifying a single threshold parameter: a slow speech rate is imitated by decreasing it, and a fast rate by increasing it.</Paragraph>
    <Paragraph position="1"> However, the results are not yet fully satisfactory, because some of the steps of the overall procedure for assigning phrase breaks were manually corrected. It would be necessary to implement these additional changes in the chunker rules and examine whether they improve or degrade the overall performance; the latter might be the case if they are too genre-specific.</Paragraph>
    <Paragraph position="2"> As we noted earlier, a more general problem is that larger text corpora for the evaluation of different speech rates are not available. Another approach, which we would like to explore in future work, would be to feed the output of the model into a TTS system and measure human judgements of acceptability.</Paragraph>
  </Section>
</Paper>