<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0716"> <Title>Generating Synthetic Speech Prosody with Lazy Learning in Tree Structures</Title> <Section position="3" start_page="0" end_page="87" type="metho"> <SectionTitle> 2 Tree Structures </SectionTitle> <Paragraph position="0"> So far we have considered two types of structures in this work: a simple syntactic structure and a performance structure (Gee and Grosjean, 1983). Comparing them in use should provide some insight into the usefulness and the limitations of the information included in each one.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Syntactic Structure </SectionTitle> <Paragraph position="0"> The syntactic structure considered is built exclusively from the syntactic parsing of the given sentences. This parsing, with the associated syntactic tags, constitutes the backbone of the structure. Below this structure lie the words of the sentence, with their part-of-speech tags. Additional levels of nodes can be added deeper in the structure to represent the syllables of each word, and the phonemes of each syllable.</Paragraph> <Paragraph position="1"> The syntactic structure corresponding to the sentence &quot;Hennessy will be a hard act to follow&quot; is presented in Figure 1 as an example (the syllable level has been omitted for clarity).</Paragraph> </Section> <Section position="2" start_page="0" end_page="87" type="sub_section"> <SectionTitle> 2.2 Performance Structure </SectionTitle> <Paragraph position="0"> The performance structure used in our approach is a combination of syntactic and phonological information. Its upper part is a binary tree where each node represents a break between the two parts of the sentence contained in the sub-trees of the node.
This binary structure defines a hierarchy: the closer to the root a node is, the more salient (or stronger) its break is.</Paragraph> <Paragraph position="1"> (Figure 1 caption: Syntactic structure for the sentence &quot;Hennessy will be a hard act to follow&quot;. Syntactic labels: S: simple declarative clause, NP: noun phrase, VP: verb phrase. Part-of-speech labels: NNP: proper noun, MD: modal, VB: base form verb, DT: determiner, JJ: adjective, NN: singular noun, TO: special label for &quot;to&quot;.) The lower part represents the phonological phrases into which the whole sentence is divided by the binary structure, and uses the same representation levels as the syntactic structure. The only difference comes from a simplification performed by joining the words into phonological words, each composed of one content word (noun, adjective, verb or adverb) and of the surrounding function words. Each phonological phrase is labeled with its main syntactic category, and no break is assumed to occur inside it.</Paragraph> <Paragraph position="2"> A possible performance structure for the same example, &quot;Hennessy will be a hard act to follow&quot;, is shown in Figure 2.</Paragraph> <Paragraph position="3"> (Figure 2 caption: Performance structure for the sentence &quot;Hennessy will be a hard act to follow&quot;. The syntactic and part-of-speech labels have the same meaning as in Figure 1. B1, B2 and B3 are the break-related nodes.)</Paragraph> <Paragraph position="4"> Unlike the syntactic structure, the performance structure requires a first prediction step for the break values. This prosodic information is known for the sentences in the corpus, but has to be predicted for new ones (to put our system in a full synthesis context where no prosodic value is available). The method currently used (Bachenko and Fitzpatrick, 1990) provides rules to infer a default phrasing for a sentence.
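The phonological-word grouping described in section 2.2 can be sketched as follows. This is only an illustration under our own assumptions: the content-tag subset and the choice to attach function words to the following content word (with trailing function words joining the last group) are simplifications we introduce, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's implementation): group each content
# word with the function words preceding it, forming phonological words.
CONTENT_TAGS = {"NN", "NNS", "NNP", "JJ", "VB", "VBD", "RB"}  # assumed subset

def phonological_words(tagged_words):
    groups, pending = [], []
    for word, tag in tagged_words:
        pending.append(word)
        if tag in CONTENT_TAGS:   # a content word closes the current group
            groups.append(pending)
            pending = []
    if pending:                   # trailing function words join the last group
        if groups:
            groups[-1].extend(pending)
        else:
            groups.append(pending)
    return groups
```

On the running example, this grouping yields five phonological words: "Hennessy", "will be", "a hard", "act", "to follow".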
Not only will the effects of this estimation have to be quantified, but we also plan to develop a more accurate method to predict this structure according to the characteristics of any corpus speaker.</Paragraph> </Section> </Section> <Section position="4" start_page="87" end_page="88" type="metho"> <SectionTitle> 3 Tree Metrics </SectionTitle> <Paragraph position="0"> Now that the tree structures are defined, we need the tools to predict the prosody. We have considered two similarity metrics to calculate the &quot;distance&quot; between two tree structures, inspired by Wagner and Fisher's editing distance (Wagner and Fisher, 1974).</Paragraph> <Section position="1" start_page="87" end_page="87" type="sub_section"> <SectionTitle> 3.1 Principles </SectionTitle> <Paragraph position="0"> By introducing a small set of elementary transformation operators on trees (insertion or deletion of a node, substitution of one node by another), it is possible to determine a set of specific operation sequences that transform any given tree into another one. Specifying a cost for each elementary operation (possibly a function of the node values) allows the cost of a whole transformation to be evaluated by adding up the operation costs in the sequence. The tree distance can therefore be defined as the cost of the sequence minimizing this sum.</Paragraph> </Section> <Section position="2" start_page="87" end_page="88" type="sub_section"> <SectionTitle> 3.2 Considered Metrics </SectionTitle> <Paragraph position="0"> Many metrics can be defined from this principle. The differences come from the conditions under which the operators may be applied. In our experiments, two metrics are tested. They both preserve the order of the nodes in the trees, an essential condition in our application.</Paragraph> <Paragraph position="1"> The first one (Selkow, 1977) allows substitutions only between nodes at the same depth level in the trees.
Moreover, the insertion or deletion of a node involves, respectively, the insertion or deletion of the whole subtree rooted at that node. These strict conditions should be able to locate very close structures.</Paragraph> <Paragraph position="2"> The other one (Zhang, 1995) allows the substitution of nodes wherever they are located inside the structures. It also allows the insertion or deletion of single nodes inside the structures.</Paragraph> <Paragraph position="3"> Compared to the previous metric, these less restrictive conditions should retrieve not only the very close structures, but also others which would not have been found.</Paragraph> <Paragraph position="4"> Moreover, these two algorithms also provide a mapping between the nodes of the trees. This mapping illustrates the operations which led to the final distance value: the parts of the trees which were inserted or deleted, and the ones which were substituted or unchanged.</Paragraph> </Section> <Section position="3" start_page="88" end_page="88" type="sub_section"> <SectionTitle> 3.3 Operation Costs </SectionTitle> <Paragraph position="0"> As explained in section 3.1, a tree is &quot;close&quot; to another one because of the definition of the operator costs. In this work, they have been set to allow comparison only between nodes of the same structural nature (break-related nodes together, syllable-related nodes together, and so on), and to represent the linguistic &quot;similarity&quot; between comparable elements (for instance, to specify that an adjective may be &quot;closer&quot; to a noun than to a determiner).</Paragraph> <Paragraph position="1"> These operation costs are currently set manually. Deciding on the scale of values to assign is not an easy task, and it requires some human expertise.
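The first metric of section 3.2 can be illustrated with a minimal Selkow-style recursive sketch: substitutions only between nodes at the same depth, and insertion or deletion always applied to a whole subtree. The unit costs below are our own simplification; the actual operation costs are richer, as just described.

```python
# Minimal sketch (our simplification) of a Selkow-style tree edit distance.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def subtree_cost(t):
    """Cost of inserting or deleting the whole subtree rooted at t (unit costs)."""
    return 1 + sum(subtree_cost(c) for c in t.children)

def selkow(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + subtree_cost(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + subtree_cost(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_cost(a.children[i - 1]),   # delete subtree
                d[i][j - 1] + subtree_cost(b.children[j - 1]),   # insert subtree
                d[i - 1][j - 1] + selkow(a.children[i - 1],
                                         b.children[j - 1], sub_cost),
            )
    return sub_cost(a.label, b.label) + d[m][n]
```

For instance, comparing S(NP, VP(MD, VB)) with S(NP, VP(VB)) yields a distance of 1, corresponding to the deletion of the MD subtree.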
One possibility would be to further automate the process for setting these values.</Paragraph> </Section> </Section> <Section position="5" start_page="88" end_page="89" type="metho"> <SectionTitle> 4 Prosody Prediction </SectionTitle> <Paragraph position="0"> The tree representations and the metrics can now be used to predict the prosody of a sentence.</Paragraph> <Section position="1" start_page="88" end_page="88" type="sub_section"> <SectionTitle> 4.1 Nearest Neighbour Prediction </SectionTitle> <Paragraph position="0"> The simple method that we used first is the nearest neighbour algorithm: given a new sentence, the closest match among the corpus of sentences of known prosody is retrieved and used to infer the prosody of the new sentence.</Paragraph> <Paragraph position="1"> The mapping obtained from the tree distance computation gives a simple way to know where to apply the prosody of one sentence onto the other, through the words linked by it.</Paragraph> <Paragraph position="2"> Unfortunately, this process may not be so easy. Ideally, each word of the new sentence would have a corresponding word in the other sentence. In practice, the two sentences may not be as close as desired, and some words may have been inserted or deleted.</Paragraph> <Paragraph position="3"> Deciding on the prosody for these unlinked parts is a problem.</Paragraph> </Section> <Section position="2" start_page="88" end_page="89" type="sub_section"> <SectionTitle> 4.2 Analogy-Based Prediction </SectionTitle> <Paragraph position="0"> A potential way to improve the prediction is based on analogy. The previous mapping between the two structures defines a tree transformation. The idea of this approach is to exploit the knowledge brought by other pairs of structures from the corpus sharing the same transformation.
This approach can be connected to the analogical framework defined by Pirrelli and Yvon, where inference processes are presented for symbolic and string values by means of two notions: the analogical proportion and the analogical transfer (Pirrelli and Yvon, 1999).</Paragraph> <Paragraph position="1"> Concerning our problem, and given three known tree structures T1, T2, T3 and a new one T', an analogical proportion would be expressed as: T1 is to T2 as T3 is to T' if and only if the set of operations transforming T1 into T2 is equivalent to the one transforming T3 into T', according to a specific tree metric. There are various levels at which this transformation equivalence can be defined. A strict identity would be, for instance, the insertion of the same structure at the same place, representing the same word (and having the same syntactic function in the sentence). A less strict equivalence could be the insertion of a different word having the same number of syllables. Weaker and weaker conditions can be set. As a consequence, these different possibilities have to be tested against the amount of diversity in the corpus to prove the efficiency of this equivalence.</Paragraph> <Paragraph position="2"> Next, the analogical transfer would be to apply to the phrase described by T3 the prosody transformation defined between T1 and T2, so as to get the prosody of the phrase of T'.
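The equivalence levels just described can be illustrated as predicates over edit scripts. This is only a sketch under our own assumptions: we represent a transformation as a list of (operation, word) pairs, and the toy syllable counts are invented for the example, not taken from the paper.

```python
# Sketch of two equivalence levels for tree transformations (analogical
# proportion). A transformation is assumed here to be the list of
# (operation, word) pairs produced by the tree metric's mapping.

def strict_equivalent(s1, s2):
    """Strict identity: same operations on the same words."""
    return s1 == s2

def loose_equivalent(s1, s2, syllable_count):
    """Looser level: same operations, words only need to match in
    their number of syllables."""
    return (len(s1) == len(s2) and
            all(op1 == op2 and syllable_count[w1] == syllable_count[w2]
                for (op1, w1), (op2, w2) in zip(s1, s2)))

def analogical_proportion(script_t1_t2, script_t3_tnew, equivalent):
    """T1 : T2 :: T3 : T' holds iff the two edit scripts are equivalent."""
    return equivalent(script_t1_t2, script_t3_tnew)
```

Under the strict level, inserting "hard" and inserting "tough" are different transformations; under the syllable-count level they are equivalent, since both words are monosyllabic.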
The formalization of this prosody transfer is still under development.</Paragraph> <Paragraph position="3"> From these two notions, the analogical inference would therefore be defined as: * firstly, to retrieve all analogical proportions involving T' and three known structures in the corpus; * secondly, to compute the analogical transfer for each 3-tuple of known structures, and to store its result in a set of possible outputs if the transfer succeeds.</Paragraph> <Paragraph position="4"> This analogical inference as described above may make the retrieval of every 3-tuple of known structures a long task, since a tree transformation can be defined between any pair of them. For very dissimilar structures, the set of operations would be very complex and hard to employ. A first way to improve this search is to keep the structure resulting from the nearest neighbour computation as T3. The transformation between T' and T3 should be one of the simplest (according to the operation costs; see section 3.3), and the search would then be limited to the retrieval of a pair (T1, T2) sharing an equivalent transformation. However, this is still time-consuming, and we are trying to define a general way to limit the search in such a tree structure space, for example based on tree indexing for efficiency (Daelemans et al., 1997).</Paragraph> </Section> </Section> <Section position="6" start_page="89" end_page="89" type="metho"> <SectionTitle> 5 First Results </SectionTitle> <Paragraph position="0"> Because the development of this approach is not yet complete, most experiments are still in progress. So far, they were run to find the closest match of held-out corpus sentences using the syntactic structure and the performance structure, for each of the distance metrics. We are using both the &quot;actual&quot; and estimated performance structures to quantify the effects of this estimation.
Cross-validation tests have been chosen to validate our method.</Paragraph> <Paragraph position="1"> These experiments are not all complete, but an initial analysis of the results does not seem to show many differences between the tree metrics considered. We believe that this is due to the small size of the corpus we are using. With only around 300 sentences, most structures are very different, so the majority of pairwise comparisons should be very distant. We are currently running experiments where the tree structures are generated at the phrase level. This strategy requires adapting the tree metrics to take into consideration the location of the phrases in the sentences (two similar structures should be preferred if they have the same location in their respective sentences).</Paragraph> </Section> </Paper>