<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1074">
  <Title>Predicting Intonational Boundaries Automatically from Text: The ATIS Domain</Title>
  <Section position="3" start_page="0" end_page="378" type="metho">
    <SectionTitle>
2 Inferring Phrasing from Text
</SectionTitle>
    <Paragraph position="0"> How the intonational phrasing of an utterance is related to aspects of the text uttered is potentially an important source of information for speech recognition: boundary locations identified in both the recognized text and the acoustic signal can constrain the set of allowable hypotheses, or moderate durational information at likely boundary locations. However, to date, syntactically-based prediction of intonational boundaries has met with limited success. While considerable work has been done on the relationship between particular syntactic configurations and intonational boundaries \[12, 2, 6, 9\], the prediction of boundaries in unrestricted and spontaneous speech has rarely been attempted \[1\]. 1 Predicting boundaries solely from information available automatically from text analysis presents a further challenge, which must also be addressed if predictions are to be useful in real spoken language systems.</Paragraph>
    <Paragraph position="1"> To address these issues, we experimented with the prediction of intonational boundaries from text analysis, using 298 utterances from 26 speakers in the Air Travel Information Service (ATIS) database for training and testing. 2 To prepare data for analysis, we labeled the speech prosodically by hand, noting location and type of intonational boundaries and presence or absence of pitch accents, using both the waveform and pitchtracks of each utterance. Although major and minor boundaries were distinguished in the labeling process, in the analysis presented below these are collapsed. Each data point in our analysis consists of a potential boundary location in an utterance, defined by a pair of adjacent words &lt;wi, wj&gt;. There are 3677 potential boundary locations &lt;wi, wj&gt; in the ATIS sample analyzed here.</Paragraph>
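The enumeration of candidate sites can be made concrete with a short sketch: every pair of adjacent words (wi, wj) in a transcription is one potential boundary location. The example sentence, function name, and representation are illustrative, not taken from the paper's actual data preparation.

```python
# Sketch: enumerate potential boundary sites from a transcribed
# utterance.  Every adjacent-word pair (wi, wj) is one candidate.
def boundary_sites(words):
    """Return the adjacent-word pairs that define potential boundaries."""
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

# An ATIS-style example sentence (illustrative)
utterance = "show me flights from boston to dallas".split()
sites = boundary_sites(utterance)  # 6 within-utterance candidate sites
```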
    <Paragraph position="2"> For each potential boundary site, we examine the predictive power of a number of textual features whose values can be determined from orthographic transcriptions of the ATIS sentences, as well as a number of phonological category features available from our hand-labeling, to see, first, how well boundary locations can be predicted automatically from text and, second, whether prediction using fuller information, currently available only via hand-labeling, can improve performance significantly.</Paragraph>
    <Paragraph position="3"> 1 Bachenko and Fitzpatrick classify 83.5-86.2% of boundaries correctly for a test set of 35 sentences; Ostendorf et al. report 80.3% correct prediction of boundaries only on a different 35-sentence test set. Altenberg models only major boundaries for a portion of his training data, 48 minutes of partly-read, partly spontaneous speech from a single speaker. 2 These sentences were selected from the 772-odd utterances in the original TI collection.</Paragraph>
    <Paragraph position="4"> Temporal variables used in the analysis include utterance and phrase duration, and distance of the potential boundary from various strategic points in the utterance. Although it is tempting to assume that phrase boundaries represent a purely intonational phenomenon, it is possible that processing constraints help govern their occurrence. So, for example, longer utterances may tend to include more boundaries. Accordingly, we measure the length of each utterance both in seconds and in words. The distance of the boundary site from the beginning and end of the utterance also appears likely to be correlated with boundary location. The tendency to end a phrase may also be affected by the position of the potential boundary site in the utterance. For example, positions very close to the beginning or end of an utterance may well be unlikely positions for intonational boundaries. We measure this variable too, both in seconds and in words. The importance of phrase length has also been proposed \[6, 2\] as a factor in boundary location. Simply put, it may be that consecutive phrases have roughly equal length. To test this, we calculate the elapsed distance from the last boundary to the potential boundary site, divided by the length of the last phrase encountered, both in time and words. To obtain this information from text analysis alone would require us to factor prior boundary predictions into subsequent predictions.</Paragraph>
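The temporal measures above can be sketched in a few lines. The timing representation (per-word start/end times in seconds) and all field names are assumptions for the illustration, not the paper's actual feature extraction.

```python
# Sketch of the temporal features described above: utterance length in
# seconds and words, distance of the site from the start and end of the
# utterance, and the ratio of current phrase length to previous phrase
# length.  Names and the timing format are illustrative assumptions.
def temporal_features(word_times, site_index, last_boundary_index, prev_phrase_len):
    """word_times: (start, end) in seconds for each word;
    site_index: index i of the potential boundary between wi and wj."""
    utt_start = word_times[0][0]
    utt_end = word_times[-1][1]
    site_time = word_times[site_index][1]       # end of wi
    cur_len = site_index - last_boundary_index  # current phrase length in words
    return {
        "utt_len_sec": utt_end - utt_start,
        "utt_len_words": len(word_times),
        "dist_from_start_sec": site_time - utt_start,
        "dist_from_end_sec": utt_end - site_time,
        "phrase_len_ratio": cur_len / prev_phrase_len,
    }
```

Note that computing `phrase_len_ratio` from text alone would require feeding earlier boundary predictions back in, which is why the analysis uses observed boundary locations for this measure.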
    <Paragraph position="5"> While this would be feasible, it is not straightforward in our current analysis strategy. To see whether this information is useful, therefore, we currently use observed boundary location. Syntactic constituency information is widely considered a major factor in phrasing \[6, 14, 11, 15\]. That is, some types of constituents may be more or less likely to be broken up into phrases, and some constituent boundaries may be more or less likely to coincide with intonational boundaries. To test the former, we examine the class of the lowest node in the parse tree to dominate both wi and wj, as determined by Hindle's parser, Fidditch \[7\]. To test the latter we determine the class of the highest node in the parse tree to dominate wi, but not wj, and similarly for wj but not wi. Word class is often used to predict boundary location, particularly in text-to-speech, where simple parsing into function/content word groupings generally controls the generation of phrase boundaries. To test the importance of word class, we examine part-of-speech in a window of four words surrounding each potential phrase break, using Church's part-of-speech tagger \[5\].</Paragraph>
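The four-word part-of-speech window can be sketched as follows, given a tagger's output for the utterance. The tag values, padding symbol, and function name are illustrative assumptions.

```python
# Sketch of the four-word POS window around a potential break:
# the tags of w(i-1), wi, wj, and w(j+1), padded at utterance edges.
def pos_window(tags, i, pad="NONE"):
    """Return the POS window around the break after word i."""
    padded = [pad, pad] + list(tags) + [pad, pad]
    j = i + 2  # position of word i in the padded list
    return tuple(padded[j - 1 : j + 3])

tags = ["VB", "PRP", "NNS", "IN", "NNP"]  # e.g. "show me flights from boston"
window = pos_window(tags, 2)  # break between "flights" and "from"
```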
    <Paragraph position="6"> Informal observation suggests that phrase boundaries are more likely to occur in some PITCH ACCENT contexts than in others. For example, phrase boundaries between words that are DEACCENTED seem to occur much less frequently than boundaries between two accented words. To test this, we look at the pitch accent values of wi and wj for each &lt;wi, wj&gt;, comparing observed values with predicted pitch accent information obtained from \[8\].</Paragraph>
    <Paragraph position="7"> Finally, in a multi-speaker database, an obvious variable to test is speaker identity. While for applications to speaker-independent recognition this variable would be uninstantiable, we nonetheless need to determine how important speaker idiosyncrasy may be in boundary location. Since we have found no significant increase in predictive power when this variable is used, results presented below are speaker-independent.</Paragraph>
  </Section>
  <Section position="4" start_page="378" end_page="379" type="metho">
    <SectionTitle>
3 Analysis and Results
</SectionTitle>
    <Paragraph position="0"> For statistical modeling, we employ Classification and Regression Tree (CART) analysis \[4\] to generate decision trees from sets of continuous and discrete variables. At each stage in growing the tree, CART determines which factor should govern the forking of two paths from that node. Furthermore, CART must decide which values of the factor to associate with each path. Ideally, splitting rules should choose the factor and value split which minimizes the prediction error rate. The rules in the implementation employed for this study \[13\] approximate optimality by choosing at each node the split which minimizes the prediction error rate on the training data. In this implementation, all these decisions are binary, based upon consideration of each possible binary partition of values of categorical variables and consideration of different cut-points for values of continuous variables.</Paragraph>
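For a continuous variable, the greedy split selection described above can be sketched as trying each cut-point and keeping the one with the lowest misclassification count on the training data. This is a minimal illustration of the idea, not the implementation used in the study.

```python
# Count items not belonging to the majority class on one side of a split.
def misclassified(side_labels):
    if not side_labels:
        return 0
    majority = max(set(side_labels), key=side_labels.count)
    return sum(1 for y in side_labels if y != majority)

# Greedily choose the cut-point minimizing training misclassification,
# as the CART splitting rules do for a continuous variable.
def best_cutpoint(values, labels):
    best = None
    for cut in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if cut >= x]
        right = [y for x, y in zip(values, labels) if x > cut]
        err = misclassified(left) + misclassified(right)
        if best is None or best[0] > err:
            best = (err, cut)
    return best  # (training error, chosen cut-point)
```

For categorical variables the analogous search runs over binary partitions of the value set rather than cut-points.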
    <Paragraph position="1"> Stopping rules terminate the splitting process at each internal node. To determine the best tree, this implementation uses two sets of stopping rules. The first set is extremely conservative, resulting in an overly large tree, which usually lacks the generality necessary to account for data outside of the training set. To compensate, the second rule set forms a sequence of subtrees. Each tree is grown on a sizable fraction (80%) of the training data and tested on the remaining portion. This step is repeated until the tree has been grown and tested on all of the data. The stopping rules thus have access to cross-validated error rates for each subtree. The subtree with the lowest rates then defines the stopping points for each path in the full tree. Results presented below all represent cross-validated data.</Paragraph>
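One way to realize the cross-validation scheme just described is a five-fold rotation: grow a tree on each 80% portion of the data and score it on the held-out 20%, accumulating error over all folds. Here `grow_tree` and `tree_error` are stand-ins for the CART routines, not functions from the implementation used in the paper.

```python
# Sketch of cross-validated error estimation: rotate an 80%/20%
# train/test split until every item has been held out once.
def crossval_error(data, grow_tree, tree_error, folds=5):
    n = len(data)
    total_errors = 0
    for k in range(folds):
        held_out = data[k::folds]  # every folds-th item starting at k
        training = [d for i, d in enumerate(data) if i % folds != k]
        tree = grow_tree(training)
        total_errors += tree_error(tree, held_out)
    return total_errors / n  # cross-validated error rate
```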
    <Paragraph position="2"> Prediction rules label the terminal nodes. For continuous variables, the rules calculate the mean of the data points classified together at that node. For categorical variables, the rules choose the class that occurs most frequently among the data points. The success of these rules can be measured through estimates of deviation. In this implementation, the deviation for continuous variables is the sum of the squared error for the observations. The deviation for categorical variables is simply the number of misclassified observations.</Paragraph>
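The terminal-node rules and their deviations reduce to a few lines; this sketch mirrors the description above (majority class with misclassification count for categorical responses, mean with sum of squared errors for continuous ones), with illustrative names.

```python
# Terminal-node rule for a categorical response: predict the majority
# class; deviation is the number of misclassified observations.
def categorical_rule(labels):
    prediction = max(set(labels), key=labels.count)
    deviation = sum(1 for y in labels if y != prediction)
    return prediction, deviation

# Terminal-node rule for a continuous response: predict the mean;
# deviation is the sum of squared errors.
def continuous_rule(values):
    mean = sum(values) / len(values)
    deviation = sum((v - mean) ** 2 for v in values)
    return mean, deviation
```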
    <Paragraph position="3"> In analyzing our data, we employ four different sets of variables. The first includes observed phonological information about pitch accent and prior boundary location, as well as automatically obtainable information. The success rate of boundary prediction from this set is quite high, with correct cross-validated classification of 3330 out of 3677 potential boundary sites -- an overall success rate of 90% (Figure 1). Furthermore, there are only five decision points in the tree. Thus, the tree represents a clean, simple model of phrase boundary prediction, assuming accurate phonological information.</Paragraph>
    <Paragraph position="4">  Turning to the tree itself, we see that the ratio of current phrase length to prior phrase length is very important in boundary location. This variable alone (assuming that the boundary site occurs before the end of the utterance) permits correct classification of 2403 out of 2556 potential boundary sites. Occurrence of a phrase boundary thus appears extremely unlikely in cases where its presence would result in a phrase less than half the length of the preceding phrase. The first and last decision points in the tree are the most trivial.</Paragraph>
    <Paragraph position="5"> The first split indicates that utterances virtually always end with a boundary -- rather unsurprising news. The last split shows the importance of distance from the beginning of the utterance in boundary location; boundaries are more likely to occur when more than 2.5 seconds have elapsed from the start of the utterance. 3 The third node in the tree indicates that noun phrases form a tightly bound intonational unit.</Paragraph>
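A hand-written sketch can paraphrase a few of the splits just described. This is only an illustration of how the learned rules compose, not the actual tree: it omits the noun-phrase and accent nodes, and the thresholds and ordering are taken loosely from the text.

```python
# Illustrative paraphrase of some of the described splits (assumed
# feature names; not the tree induced in the paper).
def predict(site):
    # first split: utterance-final sites virtually always take a boundary
    if site["utterance_final"]:
        return "boundary"
    # a boundary here would leave the current phrase under half the
    # length of the preceding phrase: classify as no-boundary
    if 0.5 > site["phrase_len_ratio"]:
        return "no_boundary"
    # boundaries become likelier after roughly 2.5 elapsed seconds
    if site["secs_from_start"] > 2.5:
        return "boundary"
    return "no_boundary"
```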
    <Paragraph position="6"> The fourth split in Figure 1 shows the role of accent context in determining phrase boundary location. If wi is not accented, then it is unlikely that a phrase boundary will occur after it.</Paragraph>
    <Paragraph position="7"> The importance of accent information in Figure 1 raises the question of whether or not automatically inferred accent information (via \[8\]) can substitute effectively for observed data. In fact, when predicted accent information is substituted, the success rate of the classification remains approximately the same, at 90%. However, the number of splits in the resultant tree increases -- and fails to include the accenting of wi as a factor in the classification! A look at the errors in accent prediction in this domain reveals that the majority occur when function words preceding a boundary are incorrectly predicted to be deaccented. This appears to be an idiosyncrasy of the corpus; such words generally occurred before relatively long pauses. Nevertheless, classification succeeds well in the absence of accent information, perhaps reflecting a high correlation between predictors of accent and predictors of phrase boundaries. For example, both pitch accent and boundary location are sensitive to location of prior intonational boundaries and part-of-speech context.</Paragraph>
    <Paragraph position="8"> In a third analysis, we eliminate the dynamic boundary percentage measure. The result remains nearly as good as before, with a success rate of 89%. This analysis reconfirms the usefulness of observed accent status of wi in boundary prediction. By itself (again assuming that the potential boundary site occurs before the end of the utterance), this factor accounts for 1590 out of 1638 potential boundary site classifications. This analysis also confirms the strength of the intonational ties among the components of noun phrases. In this tree, 536 out of 606 potential boundary sites receive final classification from this feature.</Paragraph>
    <Paragraph position="9"> We conclude our analysis by producing a classification tree that uses text-based information alone. For this analysis we use predicted accent values and omit information about prior boundary location. Figure 2 shows results of this analysis, with a successful classification of 90% of the data. In Figure 2, more variables are used to obtain a classification percentage similar to the previous classifications. Here, accent predictions are used trivially, to indicate sentence-final boundaries (ra='NA'), a function performed in Figure 1 by distance of the potential boundary site from the end of the utterance (et). 3 This fact may be idiosyncratic to our data, given the fact that we observed a trend towards initial hesitations.</Paragraph>
    <Paragraph position="10"> The second split in Figure 2 does rely upon temporal distance -- this time, distance of the boundary site from the beginning of the utterance. Together these measurements correctly predict 38.2% of the data. The classifier next uses a variable which has not appeared in earlier classifications -- the part-of-speech of wj. In Figure 2, in the majority of cases (88%) where wj is a function word other than 'to,' 'in,' or a conjunction (true for about half of potential boundary sites), a boundary does not occur. Part-of-speech of wi and type of constituent dominating wi but not wj are further used to classify these items. This portion of the classification is reminiscent of the notion of 'function word group' used commonly in assigning prosody in text-to-speech, in which phrases are defined, roughly, from one function word to the next. Overall rate of the utterance and type of utterance appear in the tree, in addition to part-of-speech and constituency information, and distance of the potential boundary site from the beginning and end of the utterance. In general, results of this first stage of analysis suggest -- encouragingly -- that there is considerable redundancy in the features predicting boundary location: when some features are unavailable, others can be used with similar rates of success.</Paragraph>
  </Section>
</Paper>