<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0114">
  <Title>Broadcast Audio and Video Bimodal Corpus Exploitation and Application</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Since 2002, we have been building the Media Language Corpus, which aims to provide language resources for researchers interested in broadcast and television media language, for teachers, and for researchers of presentation art. So far, we have established a 50-million-word text corpus comprising 40 million words of television program text and 10 million words of radio program text, with 10 million words annotated. This paper introduces a ten-hour segmented and prosodically labeled broadcast audio &amp; video bimodal corpus that we have recently built.</Paragraph>
    <Paragraph position="1"> Section 2 describes a method for selecting radio and television programs to record, based on the program features of radio and television stations. Recording conditions are proposed for obtaining a high-quality spoken language corpus. Section 3 is devoted to annotation methods.</Paragraph>
    <Paragraph position="2"> Section 4 shows the distribution of syllables, initials, finals, tones, and other units. Finally, Section 5 presents the conclusion and outlines our future work in this field.</Paragraph>
  </Section>
</Paper>