<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4023"> <Title>Feature Selection for Trainable Multilingual Broadcast News Segmentation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Data </SectionTitle> <Paragraph position="0"> The data for our segmentation research consists of a set of news broadcasts recorded directly from a satellite dish between September 2002 and February 2003.</Paragraph> <Paragraph position="1"> The data set contains roughly equal amounts (8-12 hours) of news broadcasts from seven sources in three languages: Aljazeera (Arabic), BBC America (UK English), China Central TV (Mandarin Chinese), CNN Headline News (US English), CNN International (US/UK English), Fox News (US English), and Newsworld International (US/UK English).</Paragraph> <Paragraph position="2"> Each broadcast was manually segmented with the labels &quot;story&quot; and &quot;commercial&quot; by one annotator and verified by a second, at least one of whom was a native speaker of the broadcast language. We found that a non-native speaker can produce a very good segmentation based solely on video and acoustic cues, but a native speaker is needed to verify story boundaries that depend on language knowledge, such as a single-shot video sequence of several stories read by a news anchor without pausing. The definition of &quot;story&quot; in our experiments corresponds to the Topic Detection and Tracking definition: a segment of a news broadcast with a coherent news focus, containing at least two independent, declarative clauses (LDC, 1999). Segments within broadcasts that briefly summarize several stories were not assigned a &quot;story&quot; label, nor were anchor introductions, signoffs, banter, or teasers for upcoming stories. 
Each individual story within a block of contiguous stories was labeled &quot;story.&quot; A sequence of contiguous commercials was annotated with a single &quot;commercial&quot; label and a single pair of boundaries for the entire block.</Paragraph> <Paragraph position="3"> Table 1 shows the details of our experimental data set.</Paragraph> <Paragraph position="4"> The first two columns show the broadcast source and the language. The next two columns show the total number of hours and the number of hours labeled &quot;story&quot; for each source. The percentage of broadcast time devoted to news stories varies widely by source, from 62% for CNN Headline News to 90% for CNN International. Similarly, the average story length varies widely, as shown in the final column of Table 1, from 52 seconds per story for CNN Headline News to 171 seconds per story for Fox News. These large differences are extremely important when modeling the distributions of stories (and commercials) within news broadcasts from various sources.</Paragraph> </Section> </Paper>