<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2028">
  <Title>Extracting Salient Keywords from Instructional Videos Using Joint Text, Audio and Visual Cues</Title>
  <Section position="3" start_page="109" end_page="109" type="metho">
    <SectionTitle>
2 A Text-based Keyword Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
System
</SectionTitle>
      <Paragraph position="0"> This section describes the text-based keyword extraction system, GlossEx, which we developed in our earlier work (Park et al, 2002). GlossEx applies a hybrid method, which exploits both linguistic and statistical knowledge, to extract domain-specific keywords in a document collection. GlossEx has been successfully used in large-scale text analysis applications such as document authoring and indexing, back-of-book indexing, and contact center data analysis.</Paragraph>
      <Paragraph position="1"> An overall outline of the algorithm is given below.</Paragraph>
      <Paragraph position="2"> First, the algorithm identifies candidate glossary items by using syntactic grammars as well as a set of entity recognizers. To extract more cohesive and domain-specific glossary items, it then conducts pre-nominal modifier filtering and various glossary item normalization techniques such as associating abbreviations with their full forms, and misspellings or alternative spellings with their canonical spellings. Finally, the glossary items are ranked based on their confidence values.</Paragraph>
      <Paragraph position="3"> The confidence value of a term T,C(T), is defined as</Paragraph>
      <Paragraph position="5"> where TD and TC denote the term domain-specificity and term cohesion, respectively. a and b are two weights which sum up to 1. The domain specificity is further defined as</Paragraph>
      <Paragraph position="7"> the probability of word wi in a domain document collection, and pg(wi) is the probability of word wi in a general document collection. And the term cohesion is defined as</Paragraph>
      <Paragraph position="9"> where, f(T) is the frequency of term T, and f(wi) is the frequency of a component word wi.</Paragraph>
      <Paragraph position="10"> Finally, GlossEx normalizes the term confidence values to the range of [0,3.5]. Figure 1 shows the normalized distributions of keyword confidence values that we obtained from two instructional videos by analyzing their text transcripts with GlossEx. Superimposed on each plot is the probability density function (PDF) of a gamma distribution (Gamma(a,g)) whose two parameters are directly computed from the confidence values. As we can see, the gamma PDF fits very well with the data distribution. This observation has also been confirmed by other test videos.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="109" end_page="111" type="metho">
    <SectionTitle>
3 Salient Keyword Extraction for
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="109" end_page="109" type="sub_section">
      <SectionTitle>
Instructional Videos
</SectionTitle>
      <Paragraph position="0"> In this section, we elaborate on our approach for extracting salient keywords from instructional videos based on the exploitation of audiovisual and text cues.</Paragraph>
    </Section>
    <Section position="2" start_page="109" end_page="110" type="sub_section">
      <SectionTitle>
3.1 Characteristics of Instructional Videos
</SectionTitle>
      <Paragraph position="0"> Compared to general videos, professionally produced instructional videos are usually better structured, that is, they generally contain well organized topics and sub-topics due to education nature. In fact, there are certain types of production patterns that could be observed from these videos. For instance, at the very beginning section of the video, a host will usually give an overview of the main topics (as well as a list of sub-topics) that are to be discussed throughout the video. Then each individual topic or sub-topic is sequentially presented following a pre-designed order. When one topic is completed, some informational credit pages will be (optionally) displayed, followed by either some informational title pages showing the next topic, or a host introduction. A relatively long interval of music or silence that accompanies this transitional period could usually be observed in this case.</Paragraph>
      <Paragraph position="1"> To effectively deliver the topics or materials to an audience, the video producers usually apply the following types of content presentation forms: host narration, interviews and site reports, presentation slides and information bulletins, as well as assisted content that are related with the topic under discussion. For convenience, we call the last two types as informative text and linkage scene  in this work. Figure 2 shows the individual examples of video frames that contain narrator, informative text, and the linkage scene.</Paragraph>
    </Section>
    <Section position="3" start_page="110" end_page="110" type="sub_section">
      <SectionTitle>
3.2 AudioVisual Content Analysis
</SectionTitle>
      <Paragraph position="0"> This section describes our approach on mining the aforementioned content structure and patterns for instructional videos based on the analysis of both audio and visual information. Specifically, given an instructional video, we first apply an audio classification module to partition its audio track into homogeneous audio segments. Each segment is then tagged with one of the following five sound labels: speech, silence, music, environmental sound, and speech with music (Li and Dorai, 2004). The support vector machine technique is applied for this purpose.</Paragraph>
      <Paragraph position="1"> Meanwhile, a homogeneous video segmentation process is performed which partitions the video into a series of video segments in which each segment contains content in the same physical setting. Two groups of visual features are then extracted from each segment so as to further derive its content type. Specifically, features regarding the presence of human faces are first extracted using a face detector, and these are subsequently applied to determine if the segment contains a narrator.</Paragraph>
      <Paragraph position="2"> The other feature group contains features regarding detected text blobs and sentences from the video's text overlays. This information is mainly applied to determine if the segment contains informative text. Finally, we label segments that do not contain narrators or informative text as linkage scenes. These could be an outdoor landscape, a field demonstration or indoor classroom overview. More details on this part are presented in (Li and Dorai, 2005).</Paragraph>
      <Paragraph position="3"> The audio and visual analysis results are then integrated together to essentially assign a semantic audiovisual label to each video segment. Specifically, given a segment, we first identify its major audio type by finding the one that lasts the longest. Then, the audio and visual labels are integrated in a straightforward way to reveal its semantics. For instance, if the segment contains a narrator while its major audio type is music, it will be tagged as narrator with music playing. A total of fifteen possible constructs is thus generated, coming from the combination of three visual labels (narrator, informative text and linkage scene) and five sound labels (speech, silence, music, environmental sound, and speech with music).</Paragraph>
    </Section>
    <Section position="4" start_page="110" end_page="110" type="sub_section">
      <SectionTitle>
3.3 AudioVisual and Text Cues for Salient Keyword
Extraction
</SectionTitle>
      <Paragraph position="0"> Having acquired video content structure and segment content types, we now extract important audiovisual cues that imply the existence of salient keywords. Specifically, we observe that topic-specific keywords are more likely appearing in the following scenarios (a.k.a cue context): 1) the first N1 sentences of segments that contain narrator presentation (i.e. narrator with speech), or informative text with voice-over; 2) the first N2 sentences of a new speaker (i.e. after a speaker change); 3) the question sentence; 4) the first N2 sentences right after the question (i.e. the corresponding answer); and 5) the first N2 sentences following the segments that contain silence, or informative text with music. Specifically, the first 4 cues conform with our intuition that important content subjects are more likely to be mentioned at the beginning part of narration, presentation, answers, as well as in questions; while the last cue corresponds to the transitional period between topics. Here, N1 is a threshold which will be automatically adjusted for each segment during the process. Specifically, we set N1 to min(SS,3) where SS is the number of sentences that are overlapped with each segment. In contrast, N2 is fixed to 2 for this work as it is only associated with sentences.</Paragraph>
      <Paragraph position="1"> Note that currently we identify the speaker changes and question sentences by locating the signature characters (such as &amp;quot;&gt;&gt;&amp;quot; and &amp;quot;?&amp;quot;) in the transcript. However, when this information is unavailable, numerous existing techniques on speaker change detection and prosody analysis could be applied to accomplish the task (Chen et al., 1998).</Paragraph>
    </Section>
    <Section position="5" start_page="110" end_page="111" type="sub_section">
      <SectionTitle>
3.4 Keyword Salience Adjustment
</SectionTitle>
      <Paragraph position="0"> Now, given each keyword (K) obtained from GlossEx, we recalculate its salience by considering the following three factors: 1) its original confidence value assigned by GlossEx (CGlossEx(K)); 2) the frequency of the keyword occurring in the aforementioned cue context (Fcue(K)); and 3) the number of component words in the keyword (|K|). Specifically, we give more weight or incentive (I(K)) to keywords that are originally of high confidence, appear more frequently in cue contexts, and have multiple component words. Note that if keyword K does not appear in any cue contexts, its incentive value will be zero.</Paragraph>
      <Paragraph position="1"> Figure 3 shows the detailed incentive calculation steps.</Paragraph>
      <Paragraph position="2"> Here, mode and s denote the mode and standard deviation derived from the GlossEx 's confidence value distribution. MAX CONFIDENCE is the maximum confidence value used for normalization by GlossEx, which is set to 3.5 in this work. As we can see, the three aforementioned factors have been re-transformed into C(K), F(K) and L(K), respectively. Please also note that we  have re-adjusted the frequency of keyword K in the cue context if it is larger than 10. This intends to reduce the biased influence of a high frequency. Finally, we add a small value epsilon1 to |K |and Fcue respectively in order to avoid zero values for F(K) and L(K). Now, we have similar value scales for F(K) and L(K) ([1.09,2.xx]) and C(K) ([0,2.yy]), which is desirable.</Paragraph>
      <Paragraph position="3"> As the last step, we boost keyword K's original salience CGlossEx(K) by I(K).</Paragraph>
      <Paragraph position="4"> if (CGlossEx(K) &gt;= mode</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
class="xml-element"></Paper>