<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4023">
  <Title>Feature Selection for Trainable Multilingual Broadcast News Segmentation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Feature extraction
</SectionTitle>
    <Paragraph position="0"> In order to analyze audio and video events that are relevant to story segmentation, we encoded the news broadcasts described in Section 2 as MPEG files, then automatically processed the files using a range of media analysis software components. The software components repre- null hours, hours of stories, number of stories, average story length).</Paragraph>
    <Paragraph position="1"> sented state-of-the-art technology for a range of audio, language, image, and video processing applications.</Paragraph>
    <Paragraph position="2"> The audio and video analysis produced time-stamped metadata such as &amp;quot;face Chuck Roberts detected at time=2:38&amp;quot; and &amp;quot;speaker Bill Clinton identified between start=12:56 and end=16:28.&amp;quot; From the raw metadata we created a set of features that have previously been used in story segmentation work, as well as some novel features that have not been used in previous published work. The software components and resulting features are described in the following sections.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Audio and language processing
</SectionTitle>
      <Paragraph position="0"> A great deal of the information in a news broadcast is contained in the raw acoustic portion of the news signal.</Paragraph>
      <Paragraph position="1"> Much of the information is contained in the spoken audio, both in the characteristics of the human speech signal and in the sequence of words spoken. This information can also take the form of non-spoken audio events, such as music, background noise, or even periods of silence. We ran the following audio and language processing components on each of the data sources described in Section 2.</Paragraph>
      <Paragraph position="2"> Audio type classification segments and labels the audio signal based on a set of acoustic models: speech, music, breath, lip smack, and silence. Speaker identification models the speech-specific acoustic characteristics of the audio and seeks to identify speakers from a library of known speakers. Automatic speech recognition (ASR) provides an automatic transcript of the spoken words. Topic classification labels segments of the ASR output according to predefined categories. The audio processing software components listed above are described in detail in (Makhoul et al., 2000). Closed captioning is a human-generated transcript of the spoken words that is often embedded in a broadcast video signal. null Story segmentation features automatically extracted from audio and language processing components were: speech segment, music segment, breath, lip smack, silence segment, topic classification segment, closed captioning segment, speaker ID segment, and speaker ID change. In addition we analyzed the ASR word sequences in all broadcasts to automatically derive a set of source-dependent cue phrase n-gram features. To determine cue n-grams, we extracted all relatively frequent unigrams, bigrams, and trigrams from the training data and compared the likelihood of observing each n-gram near a story boundary vs. elsewhere in the data. Cue n-gram phrases were deemed to be those that were significantly more likely near the start of a story.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Video and image processing
</SectionTitle>
      <Paragraph position="0"> The majority of the bandwidth in a video broadcast signal is devoted to video content, and this content is a rich source of information about news stories. The composition of individual frames of the video can be analyzed to determine whether specific persons or items are shown, and the sequence of video frames can be analyzed to determine a pattern of image movement. We ran the following image and video processing components on each of the data sources described in Section 2.</Paragraph>
      <Paragraph position="1"> Face identification detects human faces in the image and compares the face to a library of known faces. Color screen detection analyzes the frame to determine if it is likely to be primarily a single shade, like black or blue.</Paragraph>
      <Paragraph position="2"> Logo detection searches the video frame for logos in a library of known logos. Shot change classification detects several categories of shot changes within a sequence of video frames.</Paragraph>
      <Paragraph position="3"> Story segmentation features automatically extracted from image and video processing components were: anchor face ID, blue screen detection, black screen detection, logo detection, fast scene cut detection, slow scene transition detection, gradual scene transient detection, and scene fade-to-black detection.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Feature analysis methodology
</SectionTitle>
      <Paragraph position="0"> Each feature in our experiments took the form of a binary response to a question that related the presence of raw time-stamped metadata within a window of time around each story and commercial boundary, e.g., &amp;quot;Did an anchor face detection occur within 5 seconds of a story boundary?&amp;quot; For processing components that produce metadata with an explicit duration (such as a speaker ID segment), we defined separate features for the start and end of the segment plus a feature for whether the metadata segment &amp;quot;persisted&amp;quot; throughout the time window around the boundary. For example, a speaker ID segment that begins at t=12 and ends at t=35 would result in a true value for the feature &amp;quot;Speaker ID segment persists,&amp;quot; for a time window of 5 seconds around a story boundary at t=20.</Paragraph>
      <Paragraph position="1"> For each binary feature, we calculated the maximum likelihood (ML) probability of observing the feature near a story boundary. For example, if there were 100 stories, and the anchor face detection feature was true for 50 of the stories, then p(anchor|story)=50/100 = 0.5.</Paragraph>
      <Paragraph position="2"> We similarly calculated the ML probabilities of an anchor face detection near a commercial boundary, outside of both story and commercial, and inside a story but outide the window of time near the boundary.</Paragraph>
      <Paragraph position="3"> Useful features for segmentation in general are those which occur primarily near only one type of boundary, which would result in a large relative magnitude difference between these four probabilities. Ideal features, f, for story segmentation would be those for which p(f|story) is much larger than the other values. For our experiments we identified features for which there was at least an order of magnitude spread in the observation probabilities across categories.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>