<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2165"> <Title>formance Computing and Communications in Healthcare (funded by the New York state Science and Technology Foundation under Grant</Title> <Section position="5" start_page="1003" end_page="1006" type="metho"> <SectionTitle> 4 Modeling Intonation </SectionTitle> <Paragraph position="0"> While previous research has identified some correlations between linguistic features and intonation, more knowledge is needed. The NLG component provides very rich syntactic and semantic information that has not previously been explored for intonation modeling. This includes, for example, the semantic role played by each semantic constituent. In developing a CTS, it is worth taking advantage of these features.</Paragraph> <Paragraph position="1"> Previous TTS research results cannot be implemented directly in our intonation generation component. Many features studied in TTS are not provided by FUF/SURGE. For example, the part-of-speech (POS) tags in FUF/SURGE are different from those used in TTS. Furthermore, it makes little sense to apply part-of-speech tagging to generated text instead of using the accurate POS provided by an NLG system. Finally, NLG provides information that is difficult to obtain accurately from full text (e.g., complete syntactic parses).</Paragraph> <Paragraph position="2"> These motivating factors led us to carry out a study consisting of a series of three experiments designed to answer the following questions:
* How do the different features produced by FUF/SURGE contribute to determining intonation?
* What is the minimal number of features needed to achieve the best accuracy for each of the four intonation features?</Paragraph> <Section position="1" start_page="1003" end_page="1003" type="sub_section"> <SectionTitle> 4.1 Tools and Data </SectionTitle> <Paragraph position="0"> In order to model intonational features automatically, features from FUF/SURGE and a speech corpus are provided as input to a machine learning tool called RIPPER (Cohen, 1995), which produces a set of classification rules based on the training examples. The performance of RIPPER is comparable to benchmark decision tree induction systems such as CART and C4.5. We also employ a statistical method based on a generalized linear model (Chambers and Hastie, 1992), provided in the S package, to select salient predictors for input to RIPPER.</Paragraph> <Paragraph position="1"> Figure 1 shows the input Functional Description (FD) for the sentence &quot;John is the teacher&quot;. After this FD is unified with the syntactic grammar, SURGE, the resulting FD includes hundreds of semantic, syntactic and lexical features. We extract the 13 features shown in Table 1, which previous research indicates are most closely related to intonation. We have chosen features which are applicable to most words in order to avoid unspecified values in the training data.</Paragraph> <Paragraph position="2"> For example, &quot;tense&quot; is not extracted simply because it applies only to verbs. Table 1 includes descriptions of each of the features used. These are divided into semantic, syntactic, and semi-syntactic/semantic features, the last describing the syntactic properties of semantic constituents. Finally, word position (NO.) and the actual word (LEX) are extracted directly from the linearized string.</Paragraph> <Paragraph position="3"> About 400 isolated sentences with wide coverage of various linguistic phenomena were created as test cases for FUF/SURGE when it was developed.
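As a rough illustration of this extraction step, the sketch below pulls a subset of the word-level features from a simplified, dictionary-style rendering of a unified FD. It is a minimal sketch only: the representation, field names and values are illustrative stand-ins, not actual FUF/SURGE output.

# Sketch: pulling word-level predictors from a simplified, already-unified FD.
# The nested-dictionary format below is a stand-in for the real FUF/SURGE
# structures; all field names and values are illustrative.

toy_fd = [
    # one entry per word of "John is the teacher", after linearization
    {"lex": "John",    "pos": "proper-noun", "synfun": "head",
     "sp": "identified", "gsp": "participant", "bb": "clause", "ba": "none"},
    {"lex": "is",      "pos": "verb",        "synfun": "main-verb",
     "sp": "process",    "gsp": "process",     "bb": "participant", "ba": "none"},
    {"lex": "the",     "pos": "article",     "synfun": "determiner",
     "sp": "identifier", "gsp": "participant", "bb": "participant", "ba": "none"},
    {"lex": "teacher", "pos": "common-noun", "synfun": "head",
     "sp": "identifier", "gsp": "participant", "bb": "none", "ba": "clause"},
]

def extract_features(words):
    """Return one training example (feature dictionary) per word."""
    examples = []
    for i, w in enumerate(words):
        examples.append({
            "NO": i + 1,            # position of the word in the sentence
            "LEX": w["lex"],        # lexical form
            "POS": w["pos"],        # POS taken from the generator, not a tagger
            "SYNFUN": w["synfun"],  # syntactic function of the word
            "SP": w["sp"],          # semantic role of the parent constituent
            "GSP": w["gsp"],        # generic semantic role
            "BB": w["bb"],          # semantic constituent boundary before the word
            "BA": w["ba"],          # semantic constituent boundary after the word
        })
    return examples

if __name__ == "__main__":
    for row in extract_features(toy_fd):
        print(row)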
We asked two male native speakers to read 258 sentences; each sentence could be repeated several times. The speech was recorded on a DAT in an office. The most fluent version of each sentence was kept. The resulting speech was transcribed by one author based on ToBI, with break index, pitch accent, phrase accent and boundary tone labeled, using the XWAVE speech analysis tool. The 13 features described in Table 1, as well as one intonation feature, are used as predictors for the response intonation feature. The final corpus contains 258 sentences for each speaker, including 119 noun phrases, 37 of which have embedded sentences, and 139 sentences. The average sentence/phrase length is 5.43 words. The baseline performance achieved by always guessing the majority class is 67.09% for break index, 54.10% for pitch accent, 66.23% for phrase accent and 79.37% for boundary tone, based on the speech corpus from one speaker.</Paragraph> <Paragraph position="4"> The baseline for boundary tone is relatively high because, for most cases in our training data, there is only one L% boundary tone, at the end of each sentence. The effect of speaker on intonation is briefly studied in Experiment 2. All other experiments used data from one speaker, with the above baselines.</Paragraph> <Paragraph position="5"> Table 1: Features extracted from FUF/SURGE (feature: description; example).
BB: The semantic constituent boundary before the word; e.g., participant boundaries or circumstance boundaries.
BA: The semantic constituent boundary after the word; e.g., participant boundaries or circumstance boundaries.
SEMFUN: The semantic feature of the word; e.g., the semantic feature of &quot;did&quot; in &quot;I did know him.&quot; is &quot;insistence&quot;.
SP: The semantic role played by the immediate parental semantic constituent of the word; e.g., the SP of &quot;teacher&quot; in &quot;John is the teacher&quot; is &quot;identifier&quot;.
GSP: The generic semantic role played by the immediate parental semantic constituent of the word; e.g., the GSP of &quot;teacher&quot; in &quot;John is the teacher&quot; is &quot;participant&quot;.
POS: The part of speech of the word; e.g., common noun, proper noun.
GPOS: The generic part of speech of the word; e.g., noun is the corresponding GPOS of both common noun and proper noun.
SYNFUN: The syntactic function of the word; e.g., the SYNFUN of &quot;teacher&quot; in &quot;the teacher&quot; is &quot;head&quot;.
SPPOS: The part of speech of the immediate parental semantic constituent of the word; e.g., the SPPOS of &quot;teacher&quot; is &quot;common noun&quot;.
SPGPOS: The generic part of speech of the immediate parental semantic constituent of the word; e.g., the SPGPOS of &quot;teacher&quot; in &quot;the teacher&quot; is &quot;noun phrase&quot;.
SPSYNFUN: The syntactic function of the immediate parental semantic constituent of the word; e.g., the SPSYNFUN of &quot;teacher&quot; in &quot;John is the teacher&quot; is &quot;subject complement&quot;.
NO.: The position of the word in a sentence; e.g., 1, 2, 3, 4.
LEX: The lexical form of the word; e.g., &quot;John&quot;, &quot;is&quot;, &quot;the&quot;, &quot;teacher&quot;.</Paragraph> </Section> <Section position="2" start_page="1003" end_page="1003" type="sub_section"> <SectionTitle> 4.2 Experiments </SectionTitle> <Paragraph position="0"> Our first set of experiments was designed as an initial test of how the features from FUF/SURGE contribute to intonation. We focused on how the newly available semantic features affect intonation. We were also interested in finding out whether the 13 selected features are redundant in making intonation decisions.</Paragraph> <Paragraph position="1"> We started from a simple model which includes only 3 factors: the type of semantic constituent boundary before (BB) and after (BA) the word, and the part of speech (POS). The semantic constituent boundary can take on 6 different values; for example, it can be a clause boundary, a boundary associated with a primary semantic role (e.g., a participant), or a boundary associated with a secondary semantic role (e.g., a type of modifier), among others. Our purpose in this experiment was to test how well the model can do with a limited number of parameters. Applying RIPPER to the simple model yielded rules that significantly improved performance over the baseline models. For example, the accuracy of the rules learned for break index increases to 87.37% from 67.09%; the average improvement on all 4 intonational features is 19.33%.</Paragraph> <Paragraph position="2"> Next, we ran two additional tests, one with additional syntactic features and another with additional semantic features. The results show that the two new models behave similarly on all intonational features; they both achieve some improvements over the simple model, and the new semantic model (containing the features SEMFUN, SP and GSP in addition to BB, BA and POS) also achieves some improvements over the syntactic model (containing GPOS, SYNFUN, SPPOS, SPGPOS and SPSYNFUN in addition to BB, BA and POS), but none of these improvements are statistically significant under a binomial test.</Paragraph> <Paragraph position="7"> Finally, we ran an experiment using all 13 features, plus one intonational feature. The performance achieved by using all predictors was a little worse than the semantic model but a little better than the simple model. Again, none of these changes are statistically significant.</Paragraph> <Paragraph position="8"> This experiment suggests that there is some redundancy among the features. All of the more complicated models failed to achieve significant improvements over the simple model, which has only three features. Thus, overall, we can conclude from this first set of experiments that FUF/SURGE features do improve performance over the baseline, but the experiments do not indicate conclusively which features are best for each of the 4 intonation models.</Paragraph> <Paragraph position="9"> Although RIPPER selects for its rules predictors which increase accuracy, it is not clear whether all the features appearing in the RIPPER rules are necessary. Our first experiment seems to suggest that irrelevant features can damage the performance of RIPPER, because the model with all features generally performs worse than the semantic model. Therefore, the purpose of the second experiment is to find the salient predictors and eliminate redundant and irrelevant ones.
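Before turning to that selection step, the comparison carried out in the first set of experiments can be sketched roughly as follows. RIPPER itself is not assumed to be available here: a scikit-learn decision tree stands in for it, the feature subsets are spelled out from Table 1, and the data loading is a placeholder, so this is an illustrative sketch rather than the actual experimental code.

# Sketch: comparing feature subsets, as in the first set of experiments.
# A decision tree stands in for RIPPER; dataset loading is a placeholder.
# (The paper's all-features run also adds one intonational feature as a
# predictor; that is omitted here.)
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

SIMPLE    = ["BB", "BA", "POS"]
SEMANTIC  = SIMPLE + ["SEMFUN", "SP", "GSP"]
SYNTACTIC = SIMPLE + ["GPOS", "SYNFUN", "SPPOS", "SPGPOS", "SPSYNFUN"]
ALL       = SEMANTIC + SYNTACTIC[3:] + ["NO", "LEX"]

def evaluate(examples, labels, feature_names):
    """Cross-validated accuracy of a rule-like classifier on one feature subset."""
    rows = [{name: ex[name] for name in feature_names} for ex in examples]
    model = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier())
    return cross_val_score(model, rows, labels, cv=5).mean()

# examples: per-word feature dicts; labels: e.g. the break index value per word.
# for name, subset in [("simple", SIMPLE), ("semantic", SEMANTIC),
#                      ("syntactic", SYNTACTIC), ("all", ALL)]:
#     print(name, evaluate(examples, labels, subset))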
The result of this study also helps us gain a better understanding of the relations between FUF/SURGE features and intonation. Since the response variables, such as break index and pitch accent, are categorical, a generalized linear model is appropriate. We mapped all intonation features into binary values, as required in this framework (e.g., pitch accent is mapped to either &quot;accent&quot; or &quot;deaccent&quot;). The resulting data are analyzed by the generalized linear model in a step-wise fashion. At each step, a predictor is selected and dropped based on how well the reduced model fits the data. For example, in the break index model, after GSP is dropped, the new model achieves the same performance as the initial model. This suggests that GSP is redundant for break index.</Paragraph> <Paragraph position="10"> Since the mapping process removes distinctions within the original categories, it is possible that the simplified model will not perform as well as the original model. To confirm that the simplified models still perform reasonably well, they are tested by letting RIPPER learn new rules based only on the selected predictors.</Paragraph> <Paragraph position="11"> Table 2 shows the performance of the new models versus the original models. As shown in the &quot;selected features&quot; and &quot;dropped features&quot; columns, almost half of the predictors are dropped (on average, 44.64% of the factors), and the new models achieve similar performance.</Paragraph> <Paragraph position="12"> For boundary tone, the accuracy of the rules learned from the new model is higher than that of the original model. For the other three models, the accuracy is slightly lower but very close to that of the old models. Another interesting observation is that the pitch accent model appears to be more complicated than the other models: twelve features are kept in this model, including syntactic, semantic and intonational features. The other three models are associated with fewer features. The boundary tone model appears to be the simplest, with only 4 features selected.</Paragraph> <Paragraph position="13"> A similar experiment was done for the data combined from the two speakers. An additional variable called &quot;speaker&quot; is added to the model. Again, the data are analyzed by the generalized linear model. The results show that &quot;speaker&quot; is consistently selected by the system as an important factor in all 4 models.</Paragraph> <Paragraph position="14"> This means that different speakers will result in different intonational models. As a result, we based our experiments on a single speaker instead of combining the data from both speakers into a single model. At this point, we carried out no other experiments to study speaker differences. The simplified model acquired from Experiment 2 was quite helpful in reducing the complexity of the remaining experiments, which were designed to take intra-sentential context into consideration. Intonation is affected not only by features of the word in isolation, but also by the surrounding words. For example, there are usually no adjacent intonational or intermediate phrase boundaries, so assigning one boundary affects where the next boundary can be assigned.
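As a rough sketch of how such context can be added to the training examples, the helper below copies each word-level feature from the i preceding and i following words onto the current word, giving a window of size 2i+1. The helper function, its key-naming scheme and the NA padding are assumptions made for illustration, building on the earlier feature-extraction sketch; they are not the system's actual encoding.

# Sketch: building context windows of size 2i+1 around each word.
# 'examples' is a list of per-word feature dictionaries, e.g. as produced
# by extract_features() above; key names and the padding value are illustrative.

def add_context(examples, i=2, pad="NA"):
    """Return new feature dicts that also carry the features of the i previous
    and i following words, suffixed with their relative offset."""
    windowed = []
    n = len(examples)
    for pos in range(n):
        row = dict(examples[pos])          # features of the current word
        for offset in range(1, i + 1):
            for name in examples[pos]:
                left, right = pos - offset, pos + offset
                # previous word at distance 'offset', or padding at the edge
                row[f"{name}-{offset}"] = examples[left][name] if left in range(n) else pad
                # following word at distance 'offset', or padding at the edge
                row[f"{name}+{offset}"] = examples[right][name] if right in range(n) else pad
        windowed.append(row)
    return windowed

# With i=2 this corresponds to the 5-word window setting used below.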
In order to account for this type of interaction, we extract features of words within a window of size 2i+1 for i=0,1,2,3; thus, for each experiment, the features of the i previous adjacent words, the i following adjacent words and the current word are extracted. Only the salient predictors selected in Experiment 2 are explored here.</Paragraph> <Paragraph position="15"> The results in Table 3 show that intra-sentential context appears to be important in improving the performance of the intonation models. The accuracies of the break index, phrase accent and boundary tone models, shown in the &quot;Accuracy&quot; columns, are around 90% after the window size is increased from 1 to 7. The accuracy of the pitch accent model is around 80%. Except for the boundary tone model, the best performance of the other three models improves significantly over the simple model, with p=0.0017 for the break index model and p=0 for both the pitch accent and phrase accent models. Similarly, they also improve significantly over the models without context information, with p=0.0135 for break index and p=0 for both phrase accent and pitch accent.</Paragraph> </Section> <Section position="4" start_page="1006" end_page="1007" type="sub_section"> <SectionTitle> 4.3 The Rules Learned </SectionTitle> <Paragraph position="0"> In this section we describe some typical rules learned with relatively high accuracy. The following is a 5-word window pitch accent rule: IF ACCENT1=NA and POS=adv THEN ACCENT=H* (12/0). This states that if the following word is de-accented and the current word's part of speech is &quot;adv&quot;, then the current word should be accented. It covers 12 positive examples and no negative examples in the training data.</Paragraph> <Paragraph position="1"> A break index rule with a 5-word window is: IF BB1=CB and SPPOS1=relative-pronoun THEN INDEX=3 (23/0). This rule tells us that if the boundary before the next word is a clause boundary and the next word's semantic parent's part of speech is relative pronoun, then there is an intermediate phrase boundary after the current word. This rule is supported by 23 examples in the training data and contradicted by none.</Paragraph> <Paragraph position="2"> Although the above 5-word window rules only involve words within a 3-word window, none of these rules reappears in the 3-word window rules. They are partially covered by other rules. For example, there is a similar pitch accent rule in the 3-word window model: IF POS=adv THEN ACCENT=H* (22/5). This indicates a strong interaction between rules learned earlier and rules learned later. Since RIPPER uses a local optimization strategy, the final results depend on the order in which classifiers are selected. If the data set is large enough, this problem can be alleviated.</Paragraph> </Section> </Section> <Section position="7" start_page="1007" end_page="1007" type="metho"> <SectionTitle> 5 Generation Architecture </SectionTitle> <Paragraph position="0"> The final rules learned in Experiment 3 include intonation features as predictors. In order to make use of these rules, the following procedure is applied twice in our generation component.</Paragraph> <Paragraph position="1"> First, intonation is modeled with FUF/SURGE features only. Although this model is not as good as the final model, it still accounts for the majority of the success, with more than 73% accuracy for all 4 intonation features.
Then, after all words have been assigned an initial value, the final rules learned in Experiment 3 are applied, and the refined results are used to generate an abstract intonation description represented in the Speech Integrating Markup Language (SIML) format (Pan and McKeown, 1997). This abstract description is then transformed into specific TTS control parameters.</Paragraph> <Paragraph position="2"> Our current corpus is very small, and expanding it with new sentences is necessary. Discourse, pragmatic and other semantic features will also be added to our future intonation model. Therefore, the rules implemented in the generation component must be continuously upgraded, and implementing a fixed set of rules is undesirable. As a result, our current generation component, shown in Figure 2, focuses on facilitating the updating of the intonation model. Two separate rule sets (with or without intonation features as predictors) are learned as before and stored in rulebase1 and rulebase2, respectively. A rule interpreter is designed to parse the rules in the rule bases. The interpreter extracts the features and values encoded in the rules and passes them to the intonation generator.</Paragraph> <Paragraph position="3"> The features extracted from FUF/SURGE are compared with the features from the rules.</Paragraph> <Paragraph position="4"> If all conditions of a rule match the features from FUF/SURGE, a word is assigned the classified value (the RHS of the rule). Otherwise, other rules are tried until the word is assigned a value. The rules are tried one by one in the order in which they were learned. After every word is tagged with all 4 intonation features, a converter transforms the abstract description into specific TTS control parameters.</Paragraph> </Section> </Paper>
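As a rough sketch of the matching procedure just described, the fragment below tries rules in the order they were learned and assigns the class of the first rule whose conditions all hold. The rule encoding, the key names (which follow the earlier windowed-feature sketch, with ACCENT+1 standing for the following word's accent) and the fallback value are illustrative assumptions, not the system's actual data structures.

# Sketch: applying an ordered rule base to per-word feature dictionaries.
# Rules are tried in the order they were learned; the first rule whose
# conditions all match assigns its class (the RHS). The rule format and the
# DEFAULT fallback are illustrative assumptions, not the paper's encoding.

DEFAULT = "NA"   # assumed fallback when no rule fires

rules = [
    # (conditions, class) pairs, e.g. the pitch accent rules from Section 4.3
    ({"ACCENT+1": "NA", "POS": "adv"}, "H*"),
    ({"POS": "adv"}, "H*"),
]

def apply_rules(word_features, rule_list, default=DEFAULT):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rule_list:
        if all(word_features.get(name) == value for name, value in conditions.items()):
            return label
    return default

def tag_sentence(examples, rule_list):
    """Assign an intonation value (here, pitch accent) to every word."""
    return [apply_rules(row, rule_list) for row in examples]

Trying the rules in the order they were learned mirrors the observation above that RIPPER's results depend on the order in which classifiers are selected.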