<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0114">
  <Title>Broadcast Audio and Video Bimodal Corpus Exploitation and Application</Title>
  <Section position="4" start_page="0" end_page="102" type="metho">
    <SectionTitle>
2 Corpus Information
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="102" type="sub_section">
      <SectionTitle>
2.1 Corpus metadata
</SectionTitle>
      <Paragraph position="0"> First, we had to select radio and television programs to record. Since a broadcast bimodal corpus should represent the real-life use of spoken language on radio and television, the differences between radio and television, the differences between central and local stations, and the categories of programs must all be taken into account during collection. The following is the framework of the header information (metadata) of the collected broadcast audio &amp; video bimodal corpus (.wav and .mpeg files matched with .txt files):
No.: ...
Level: central, local, Hong Kong and Taiwan
Station: CCTV, CNR, Phoenix Television...</Paragraph>
      <Paragraph position="1"> Style: monologue, dialogue, multi-style
Register: (hypogyny of monologue) presentation, explanation, reading, talk; (hypogyny of dialogue) two-person talk show, three-person talk show, multi-person talk show
Content: news, literature, service
Audiences: women, children, the elderly...</Paragraph>
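As a sketch of how such a header record might be handled programmatically, the following Python dataclass mirrors the metadata fields listed above; the class name, field names, and sample values are our own illustrations, not part of the corpus tooling.

```python
from dataclasses import dataclass

@dataclass
class ProgramHeader:
    """One header (metadata) record of the bimodal corpus, following
    the fields listed above.  All names here are illustrative."""
    number: str
    level: str       # central, local, Hong Kong and Taiwan
    station: str     # CCTV, CNR, Phoenix Television, ...
    style: str       # monologue, dialogue, multi-style
    register: str    # e.g. presentation, two-person talk show, ...
    content: str     # news, literature, service
    audiences: str   # women, children, the elderly, ...

header = ProgramHeader(
    number="001", level="central", station="CCTV",
    style="dialogue", register="two-person talk show",
    content="news", audiences="women",
)
print(header.station)
```

A record like this would accompany each matched .wav/.mpeg/.txt triple.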
    </Section>
    <Section position="2" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
2.2 Corpus structure
</SectionTitle>
      <Paragraph position="0"> The purpose of building the broadcast spoken language corpus is to serve research on broadcast spoken language, especially contrastive studies of the prosodic features of different genres of broadcast language. Hence, the samples selected for the corpus mainly involve monologues, dialogues, or both. As the forms of radio and television programs grow more and more diverse, it is often difficult to decide whether a program is a monologue or a dialogue, because the two genres frequently co-occur within one program.</Paragraph>
      <Paragraph position="1"> Furthermore, such mixed programs make up an increasing share of radio and television programming, and consequently they are the most frequent kind in the corpus. Table 1 displays the structural framework of the broadcast audio and video bimodal corpus.</Paragraph>
      <Paragraph position="2"> Table 1: the structure of the broadcast bimodal corpus
Style: two-person talk show / interview; Example: Face to Face ... etc.</Paragraph>
      <Paragraph position="3"> Style: three-person talk show / interview</Paragraph>
    </Section>
    <Section position="3" start_page="102" end_page="102" type="sub_section">
      <SectionTitle>
2.3 Recording &amp; management information
</SectionTitle>
      <Paragraph position="0"> All data were recorded directly from radio and TV programs: a Pinnacle PCTV Pro card connects cable TV to our recording computers. The speech data are saved as 22 kHz, 16-bit Windows PCM waveform files; the video data are saved in MPEG or WMV format with Ulead VideoStudio in a post-processing step. Every program or program segment is composed of three parts: *.wav data, *.txt data, and *.mpeg/*.wmv data.</Paragraph>
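The recording format above (22 kHz, 16-bit Windows PCM) can be verified with a short script; this is a minimal sketch using Python's standard wave module, and the function name is ours.

```python
import wave

def check_corpus_wav(path):
    """Return True if the file matches the corpus recording format:
    22 kHz sample rate and 16-bit (2-byte) PCM samples."""
    with wave.open(path, "rb") as w:
        return w.getframerate() == 22050 and w.getsampwidth() == 2
```

A check like this could be run over every *.wav file before annotation begins.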
      <Paragraph position="1"> Zhao Shixia et al. (2000) pointed out that the structure of a speech corpus consists of synchronized objects (text files, wav files, and annotated prosodic files), arranged in deep hierarchies (recording environment) and labeled with speaker-attribute metadata. Accordingly, the managed objects of our broadcast bimodal corpus are integrated programs or program segments. All data are stored separately but have complex logical inter-relations, which can be obtained through the description of the programs.</Paragraph>
      <Paragraph position="2"> Figure 1 displays the logical structure of the broadcast bimodal corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="102" end_page="105" type="metho">
    <SectionTitle>
3 Annotation
</SectionTitle>
    <Paragraph position="0"> Why should we annotate a corpus? Annotation is the fundamental act of associating content with a region of a signal. Annotation quality and depth have a direct impact on the utility and possible applications of a corpus (Ding Xinshan 1998). The annotation of our corpus consists of transcription, segmental annotation, and prosodic annotation.</Paragraph>
    <Section position="1" start_page="102" end_page="103" type="sub_section">
      <SectionTitle>
3.1 Transcription and segmentation
</SectionTitle>
      <Paragraph position="0"> Transcription is primarily pinyin transcription of the Chinese characters. Tones are annotated &amp;quot;1&amp;quot;, &amp;quot;2&amp;quot;, &amp;quot;3&amp;quot;, and &amp;quot;4&amp;quot; after the syllable, and the neutral tone is labeled &amp;quot;0&amp;quot;. The final &amp;quot;ü&amp;quot; is annotated as &amp;quot;v&amp;quot;, and &amp;quot;üe&amp;quot; as &amp;quot;ue&amp;quot;; for example, &amp;quot;lü&amp;quot; is annotated as &amp;quot;lv3&amp;quot; and &amp;quot;nüe&amp;quot; as &amp;quot;nue4&amp;quot;.</Paragraph>
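The transcription convention just described can be sketched as a small function, assuming syllables are given with ü and the tone as a digit 0-4; the function name is hypothetical.

```python
def transcribe(syllable, tone):
    """Apply the corpus transcription convention: write the final
    u-umlaut as v (but üe becomes ue), then append the tone digit."""
    base = syllable.replace("üe", "ue").replace("ü", "v")
    return base + str(tone)

print(transcribe("lü", 3))   # lv3
print(transcribe("nüe", 4))  # nue4
```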
      <Paragraph position="1"> Within an utterance, connected speech differs greatly from isolated syllables, owing to the influence of co-articulation, semantics, and prosody. The purpose of segmental annotation is to annotate the altered phonemes in syllables within the utterance. For instance: the voicing of some plosives (e.g. b, d, g), or the labial influence on the alveolar nasal (e.g. the &amp;quot;-n&amp;quot; in &amp;quot;renmin&amp;quot;, affected by the initial of &amp;quot;min&amp;quot;, gradually changes into a labionasal, demonstrating the similarity between the alveolar nasal and the labionasal initial in the frequency spectrum). Where pauses are not apparent, the stop before plosives, especially affricates, often vanishes; this is called the inexistence of silence.</Paragraph>
      <Paragraph position="2"> For transcription and segmentation we used BSCA (Broadcasting Speech Corpus Annotator), which we designed ourselves (Hu Fengguo and Zou Yu 2005). An annotated example is shown in Figure 2.
Figure 2: BSCA, a tool for annotation</Paragraph>
    </Section>
    <Section position="2" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
3.2 Prosodic annotation tiers
</SectionTitle>
      <Paragraph position="0"> Prosodic annotation increases the utility of a speech corpus. An annotated speech corpus not only offers a database for the research and exploration of speech information but also enlarges our knowledge of speech and prosodic features through visual and scientific methods.</Paragraph>
      <Paragraph position="1"> Prosodic annotation is a categorical description of the prosodic features that carry linguistic functions; in other words, it annotates the changes of tone, the patterns of stress, and the prosodic structure with linguistic functions. The prosodic labeling conventions are a set of machine-readable codes for transcribing speech prosody.</Paragraph>
      <Paragraph position="2"> Based on the ToBI (Kim Silverman et al. 1992; John F. Pitrelli et al. 1994) and C-ToBI (Li Aijun 2002) conventions, and according to the practical needs of broadcast speech, the prosodic annotation mainly involves labeling the following parallel tiers: the break index tier, the stress index tier, and the intonation construction tier (Chen Yudong 2004, Zou Yu 2004).</Paragraph>
      <Paragraph position="3"> Based on Cao Jianfen's (1999, 2001) categories of prosodic hierarchy, combined with the practical needs of broadcast speech, we identified five break levels (0-4). Level 0 indicates silence or the default boundary between syllables inside a prosodic word. Level 1 stands for the boundaries of prosodic words, including short breaks with silent pause and breaks with filled pause; prosodic words are the fundamental prosodic units in broadcast speech, with simple prosodic words composed of 1~3 syllables and complex prosodic words normally containing 5~9 syllables, e.g. &amp;quot;Shang4hai3 he2zuo4 zu3zhi1&amp;quot; (the Shanghai Cooperation Organization). Level 2 designates the boundaries of prosodic phrases, most of which are apparent breaks with silent pause and a changed pitch pattern. Level 3 represents the boundaries of intonational phrases, i.e. sentence boundaries. Level 4 stands for the boundaries of intonation groups, such as an entire news item in a news broadcast or a talker turn in a dialogue.</Paragraph>
      <Paragraph position="4"> At indefinite boundaries, the code &amp;quot;-&amp;quot; is added after the number. The occurrence counts of break-tier labels are shown in Table 2. Stress is a significant prosodic feature. In training materials for broadcast announcers, emphasis is laid on labeling stress on the basis of the purpose of the utterance, the pattern and rhythm of stresses, and changes of emotion. Zhang Song's (1983) classification of nuclear stresses can serve as a guideline for broadcasting production and practice. However, his classification has some shortcomings, for instance the vague hierarchies between sentences and discourses, which get in the way of a formal, computational description of stress. Nevertheless, his theory on the judgment of primary and minor stresses (i.e. non-stress, minor stress, primary stress, etc.) has reference value for stress annotation, because distinguishing the hierarchies of stress is a crucial practical problem in annotation.</Paragraph>
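A minimal sketch of how the break-index labels above (levels 0-4, with an optional trailing minus for indefinite boundaries) might be represented and parsed; the names are our own, not part of the annotation tooling.

```python
# Break levels 0-4 as described above; an indefinite boundary carries
# a trailing "-" (e.g. "2-").  All names here are illustrative.
BREAK_LEVELS = {
    0: "syllable boundary inside a prosodic word",
    1: "prosodic word boundary",
    2: "prosodic phrase boundary",
    3: "intonational phrase (sentence) boundary",
    4: "intonation group boundary",
}

def parse_break_label(label):
    """Return (level, indefinite) for labels such as '2' or '2-'."""
    indefinite = label.endswith("-")
    level = int(label.rstrip("-"))
    return level, indefinite

print(parse_break_label("2-"))
```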
      <Paragraph position="5"> As to the hierarchies of stress, most experimental phonetics and speech processing researchers adopt Lin Maocan's (2001, 2002) classification of stress hierarchies or something similar; that is, the levels of stress in Chinese include prosodic word stress, prosodic phrase stress, and sentence stress (nuclear stress). In line with real-life broadcasting production, this paper identifies four categories of stress in broadcast speech: the rhythm unit, the cross rhythm unit, the clause, and the discourse. Among them, discourse stresses often occur at the place of an accented syllable but are relatively more important than the other sentence stresses. The labeling methods of all the ranks are listed as follows (Chen Yudong 2004): Table 3: the stress levels in the stress index tier. The other criteria for stress annotation (utterance purpose and emotion change), while perceptually important, are meta-linguistic or para-linguistic in character, and will therefore not be addressed in this paper.</Paragraph>
      <Paragraph position="6"> 3.2.3 Intonation construction tier In line with Shen Jiong's view of intonation (Shen Jiong 1994), we found that the intonation construction tier is an important component of discourse annotation (Chen Yudong 2004): it can display changes in sentence intonation structure. Annotating the intonation construction mainly means labeling the relationship of the other syllables to the nuclear stress, apart from the prehead, dissociation, etc. For example: Table 5: the labels of the intonation construction tier occurring in the 4-hour annotated corpus. A sentence can have one nuclear stress or multiple nuclear stresses.</Paragraph>
      <Paragraph position="7"> Single nuclear stress: the labels represent the fore-and-aft position of the nuclear stress, its steepness, and its length. Examples are listed as follows:</Paragraph>
      <Paragraph position="9"> Among the above examples, the long-nucleus splitting type &amp;quot;H-N-T-H-N'-T&amp;quot; is very similar to the multi-nuclear pattern &amp;quot;H-N1-T-H-N2-T&amp;quot;. However, &amp;quot;H-N-T-H-N'-T&amp;quot; differs from the multi-nuclear type in the grammatical unit it depends on.</Paragraph>
      <Paragraph position="10"> Multi-nuclear stress: the two or more nuclear stresses in a multi-nuclear sentence take the pattern of independent sentence intonation constructions, each with its own nucleus, preceded by a head and an optional prehead and followed by a tail. In other words, these relatively independent patterns already have the features of relatively independent intonation constructions, with the apparent &amp;quot;prehead, head, and nuclear ending&amp;quot; structure. This kind of nuclear stress often occurs in relatively long and complex constructions, whose intonation constructions can be labeled separately. A case in point is the contrastive sentence &amp;quot;zai4 wen3 ding4 de0 ji1 chu3 shang0, qu3 de2 bi3 jiao4 gao1 su4 de0 fa1 zhan3&amp;quot; (i.e. it achieved comparatively high-speed development on a stable basis), which can be annotated as &amp;quot;H-N1-T, H-N2-T&amp;quot;. For example: Figure 3: the contrastive sentence &amp;quot;zai4 wen3</Paragraph>
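As a rough sketch, labels such as H-N-T or H-N1-T, H-N2-T (H = head, N = nucleus, T = tail) can be checked mechanically for their nuclear structure; the function below is our illustration, not part of the annotation tooling.

```python
def count_nuclei(label):
    """Count nucleus tags (N, N1, N2, N', ...) in an
    intonation-construction label such as 'H-N1-T, H-N2-T'."""
    parts = [p.strip() for p in label.split(",")]
    return sum(1 for part in parts
               for tag in part.split("-") if tag.startswith("N"))

print(count_nuclei("H-N-T"))           # 1
print(count_nuclei("H-N1-T, H-N2-T"))  # 2
```

Under this counting, the long-nucleus splitting type and the multi-nuclear type both contain two nucleus tags, so the distinction between them must come from the grammatical unit, as noted above.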
    </Section>
    <Section position="3" start_page="104" end_page="105" type="sub_section">
      <SectionTitle>
3.3 Other items of annotation
</SectionTitle>
      <Paragraph position="0"> A spoken language corpus can carry additional annotation information. For example, turn-taking, paralinguistic and non-linguistic information (e.g. spots, background music, coughing, sobbing, and sneezing), and some hosts' accents (e.g. a Shanghai accent) can be annotated in a talk-show corpus. There are 82 occurrences of spots and 31 occurrences of background music in the 4-hour annotated data. Furthermore, some .wav and .mpeg files can be annotated together for discourse analysis.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="105" end_page="106" type="metho">
    <SectionTitle>
4 Distribution of annotated items
</SectionTitle>
    <Paragraph position="0"> We conducted a statistical analysis of some annotated items using 4 hours of annotated data in our corpus.</Paragraph>
    <Paragraph position="1"> The 20 most frequently occurring syllables (initials and finals) are given in Table 6, together with their mean durations and variances, calculated as shown below.</Paragraph>
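A sketch of the kind of per-syllable duration statistics reported in Table 6, computed here over hypothetical (syllable, duration-list) annotations with Python's statistics module; the sample values are invented for illustration.

```python
from statistics import mean, pvariance

# Hypothetical duration annotations in seconds, keyed by syllable.
durations = {
    "de0": [0.08, 0.10, 0.09],
    "shi4": [0.21, 0.19, 0.23],
}

for syllable, values in durations.items():
    print(syllable, round(mean(values), 3), round(pvariance(values), 5))
```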
    <Paragraph position="2"> Table 6: the mean duration and variance of the 20 most frequent syllables. The occurrence distributions of initials, finals, and tones were also calculated; these are shown in Tables 7, 8, and 9, respectively.</Paragraph>
    <Paragraph position="3"> We also measured the mean duration and F0 of each tone in the three speaking styles. To summarize, the mean duration of tones in the reading style is longer than in the presentation style, and that of the talk style is the shortest of the three. As for the F0 of each tone, the F0 and pitch range of the presentation style are high with large fluctuation, while those of the talk style are high with small fluctuation. However, the F0 of tone 3 in the presentation style is lower than in the reading and talk styles.</Paragraph>
  </Section>
  <Section position="7" start_page="106" end_page="106" type="metho">
    <SectionTitle>
5 Further study
</SectionTitle>
    <Paragraph position="0"> The broadcast audio and video bimodal corpus is a presentation-art-oriented corpus with radio and television news as its basis. This paper has probed the development and compilation of such a corpus. First, regarding collection: what sort of audio and video material can represent the features of radio and television spoken language? How can we auto-annotate the audio and video corpus? These are problems that continue to occupy us.</Paragraph>
    <Paragraph position="1"> Secondly, this corpus can be a platform for further research into non-accented or accented syllables, intonation construction, the prosodic functions of paragraphs and discourses, the emotions of speech, and genre styles.</Paragraph>
    <Paragraph position="2"> Finally, using the corpus we can statistically analyze the spectral and prosodic characteristics of the various speaking styles, such as presentation, reading, and talk. Speech in these styles could then be synthesized based on the analysis results; this, too, is work for the future.</Paragraph>
  </Section>
</Paper>