<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2097">
  <Title>Visual Information Based on Hidden Markov Models</Title>
  <Section position="3" start_page="0" end_page="755" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Recent years have seen a rapid increase in multimedia content with the continuing advance of information technology. To make the best use of this content, it is necessary to segment it into meaningful units and annotate them. Because manual annotation is extremely expensive and time-consuming, automatic annotation techniques are required.</Paragraph>
    <Paragraph position="1"> In the field of video analysis, there have been a number of studies on shot analysis for video retrieval or summarization (highlight extraction) using Hidden Markov Models (HMMs) (e.g., Chang et al., 2002; Nguyen et al., 2005; Q.Phung et al., 2005). These studies first segmented videos into shots, within which the camera motion is continuous, and extracted features such as color histograms and motion vectors. They then used HMMs to classify the shots into several classes (for baseball video, for example, pitch view, running overview, or audience view). To achieve high accuracy, these studies relied on hand-crafted domain-specific knowledge or trained the HMMs with manually labeled data. Therefore, they cannot be easily extended to new domains on a large scale. In addition, although linguistic information, such as narration, speech of characters, and commentary, is intuitively useful for shot analysis, many of the previous studies do not utilize it. Some studies did attempt to utilize linguistic information (Jasinschi et al., 2001; Babaguchi and Nitta, 2003), but they used only keywords.</Paragraph>
    <Paragraph position="2"> In the field of Natural Language Processing, Barzilay and Lee have recently proposed a probabilistic content model for representing topics and topic shifts (Barzilay and Lee, 2004). This content model is based on HMMs wherein a state corresponds to a topic and generates sentences relevant to that topic according to a state-specific language model. The language models are learned from raw texts via analysis of word distribution patterns.</Paragraph>
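The idea of a state-specific language model can be illustrated with a minimal sketch. It is not Barzilay and Lee's implementation: it assumes a unigram model with add-alpha smoothing for brevity, and the topic names, training sentences, and the helper names train_unigram, sentence_log_likelihood, and best_topic are all hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab_size, alpha=1.0):
    """Return a function giving the smoothed log P(word | state)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())

    def log_prob(word):
        # Add-alpha smoothing keeps unseen words from getting zero probability.
        return math.log((counts[word] + alpha) / (total + alpha * vocab_size))

    return log_prob


def sentence_log_likelihood(sentence, log_prob):
    """Log-likelihood of a sentence under one state's language model."""
    return sum(log_prob(w) for w in sentence.split())


# Toy training data (an assumption for illustration): sentences per topic.
topic_sentences = {
    "preparation": ["cut the onion", "peel the potato"],
    "boiling": ["boil the water", "boil the potato"],
}
vocab = {w for ss in topic_sentences.values() for s in ss for w in s.split()}
models = {t: train_unigram(ss, len(vocab)) for t, ss in topic_sentences.items()}


def best_topic(sentence):
    """Pick the state whose language model scores the sentence highest."""
    return max(models, key=lambda t: sentence_log_likelihood(sentence, models[t]))
```

In a full content model these per-state scores would serve as HMM emission probabilities rather than being compared in isolation.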
    <Paragraph position="3"> In this paper, we describe an unsupervised topic identification method that integrates linguistic and visual information using HMMs. Among the several types of videos, instruction videos (how-to videos) about sports, cooking, D.I.Y., and others are the most valuable; of these, we focus on cooking TV programs. In the example shown in Figure 1, preparation, sauteing, and dishing up are automatically labeled in sequence. Identified topics lead to video segmentation and can be utilized for video summarization.</Paragraph>
    <Paragraph position="4"> Inspired by Barzilay and Lee's work, we employ HMMs for topic identification, wherein a state corresponds to a topic, such as preparation or frying, and various features are observed, including visual and audio information as well as linguistic information (the instructor's utterances). This study takes a clause as the unit of analysis and the following eight topics as the set of states: preparation, sauteing, frying, baking, simmering, boiling, dishing up, and steaming.</Paragraph>
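Decoding a topic sequence over these eight states can be sketched with the standard Viterbi algorithm. This is a minimal illustration, not the paper's system: the start, transition, and emission probabilities are invented toy values, and each clause's linguistic, visual, and audio features are assumed to have already been combined into one emission log-probability per topic. The helper toy_emission is hypothetical.

```python
import math

TOPICS = ["preparation", "sauteing", "frying", "baking",
          "simmering", "boiling", "dishing up", "steaming"]


def viterbi(emission_logps, log_start, log_trans):
    """Most likely topic sequence for a list of clauses.

    emission_logps: one dict per clause, topic -> log P(features | topic).
    """
    best = {s: log_start[s] + emission_logps[0][s] for s in TOPICS}
    backptrs = []
    for em in emission_logps[1:]:
        new_best, ptr = {}, {}
        for s in TOPICS:
            # Best predecessor topic for ending in topic s at this clause.
            prev = max(TOPICS, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + em[s]
            ptr[s] = prev
        best = new_best
        backptrs.append(ptr)
    # Trace the best path backwards from the best final topic.
    state = max(TOPICS, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy parameters (assumptions): uniform start, and topics tend to persist.
log_start = {s: math.log(1.0 / len(TOPICS)) for s in TOPICS}
log_trans = {p: {s: math.log(0.65 if p == s else 0.05) for s in TOPICS}
             for p in TOPICS}


def toy_emission(likely_topic):
    """Hypothetical emission scores that favor one topic for a clause."""
    return {s: math.log(0.86 if s == likely_topic else 0.02) for s in TOPICS}


clauses = [toy_emission(t) for t in
           ["preparation", "preparation", "sauteing", "sauteing", "dishing up"]]
path = viterbi(clauses, log_start, log_trans)
```

With these toy values the decoded path follows the per-clause evidence, while a single isolated clause that disagrees with its neighbours is absorbed into the surrounding topic because the transition model rewards persistence.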
    <Paragraph position="5"> In Barzilay and Lee's model, although domain-specific word distributions can be learned from raw texts, the model cannot utilize discourse features, such as cue phrases and lexical chains. We incorporate domain-independent discourse features, such as cue phrases and noun/verb chaining, which indicate topic change or persistence, into the domain-specific word distribution.</Paragraph>
    <Paragraph position="6"> Our main claim is that visual and audio information can be utilized to achieve robust topic identification. As for visual information, we can utilize the background color distribution of the image. For example, frying and boiling are usually performed on a gas range, whereas preparation and dishing up are usually performed on a cutting board. This information can aid topic identification. As for audio information, silence can be utilized as a clue to a topic shift.</Paragraph>
  </Section>
</Paper>