Annotation-Based Multimedia Summarization and Translation
Katashi Nagao
Dept. of Information Engineering
Nagoya University
and CREST, JST
Furo-cho, Chikusa-ku,
Nagoya 464-8603, Japan
nagao@nuie.nagoya-u.ac.jp
Shigeki Ohira
School of Science and
Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku,
Tokyo 169-8555, Japan
ohira@shirai.info.waseda.ac.jp
Mitsuhiro Yoneoka
Dept. of Computer Science
Tokyo Institute of Technology
2-12-1 O-okayama, Meguro-ku,
Tokyo 152-8552, Japan
yoneoka@img.cs.titech.ac.jp
Abstract
This paper presents techniques for multimedia
annotation and their application to video sum-
marization and translation. Our tool for anno-
tation allows users to easily create annotation
including voice transcripts, video scene descrip-
tions, and visual/auditory object descriptions.
The module for voice transcription is capable
of multilingual spoken language identiﬁcation
and recognition. A video scene description con-
sists of semi-automatically detected keyframes
of each scene in a video clip and time codes of
scenes. A visual object description is created
by tracking and interactive naming of people
and objects in video scenes. The text data in
themultimediaannotationaresyntacticallyand
semantically structuredusing linguistic annota-
tion. The proposed multimedia summarization
works upon a multimodal document that con-
sists of a video, keyframes of scenes, and tran-
scripts of the scenes. The multimedia transla-
tion automatically generates several versions of
multimedia content in diﬀerent languages.
1 Introduction
Multimedia content such as digital video is be-
coming a prevalent information source. Since
the volume of such content is growing to huge
numbers of hours, summarization is required to
eﬀectivelybrowsevideosegmentsinashorttime
withoutmissingsigniﬁcantcontent. Annotating
multimedia content with semantic information
such as scene/segment structuresandmetadata
about visual/auditory objects is necessary for
advanced multimedia content services. Since
natural language text such as a voice transcript
is highly manageable, speech and natural lan-
guage processing techniques have an essential
role in our multimedia annotation.
We have developed techniques for semi-
automatic video annotation integrating a mul-
tilingual voice transcription method, some
video analysis methods, and an interactive
visual/auditory annotation method. The
video analysis methods include automatic color
change detection, characterization of frames,
and scene recognition using similarity between
frame attributes.
Therearerelatedapproachestovideoannota-
tion. For example, MPEG-7 is an eﬀort within
the Moving Picture Experts Group (MPEG) of
ISO/IEC that is dealing with multimedia con-
tent description (MPEG, 2002). MPEG-7 can
describe indeces, notes, and so on, to retrieve
necessary parts of content speedily. However, it
takes a high cost to add these descriptions by
hands. The method of extracting them auto-
matically through the video/audio analysis is
vitally important. Our method can be inte-
grated into tools for authoring MPEG-7 data.
Thelinguistic descriptionscheme, whichwill be
a part of the amendment to MPEG-7, should
play a major role in this integration.
Using such annotation data, we have also de-
veloped a system for advanced multimedia pro-
cessing such as video summarization and trans-
lation. Ourvideo summaryis not justa shorter
version of the original video clip, but an in-
teractive multimedia presentation that shows
keyframes of important scenes and their tran-
scripts in Web pages and allow users to interac-
tively modify summary. The video summariza-
tion is customizable according to users’ favorite
size and keywords. When a user’s client device
isnotcapableofvideoplaying, oursystemtran-
forms video to a document that is the same as
a Web document in HTML format.
The multimedia annotation can make deliv-
ery of multimedia content to diﬀerent devices
very eﬀective. Dissemination of multimedia
content will be facilitated by annotation on the
usageofthecontent indiﬀerentpurposes,client
devices, and so forth. Also, it provides object-
level description of multimedia content which
allows a higher granularity of retrieval and pre-
sentation inwhichindividualregions, segments,
objects and events in image, audio and video
datacanbediﬀerentiallyaccesseddependingon
publisher and user preferences, network band-
width and client capabilities.
2 Multimedia Annotation
Multimedia annotation is an extension of doc-
ument annotation such as GDA (Global Docu-
ment Annotation) (Hasida, 2002). Since natu-
ral language text is more tractable and mean-
ingful than binary data of visual (image and
movingpicture)andauditory(soundandvoice)
content, we associate text withmultimedia con-
tent in several ways. Since most video clips
contain spoken narrations, our system converts
them into text and integrates them into video
annotation data. The text in the multimedia
annotation is linguistically annotated based on
GDA.
2.1 Multimedia Annotation Editor
We developed an authoring tool called multi-
media annotation editor capable of video scene
change detection, multilingual voice transcrip-
tion, syntactic and semantic analysis of tran-
scripts, and correlation of visual/auditory seg-
ments and text.
Figure 1: Multimedia Annotation Editor
An example screen of the editor is shown in
Figure 1. The editor screen consists of three
windows. One window(top) shows a video con-
tent, automatically detected keyframes in the
video, and an automatically generated voice
transcript. The second window (left bottom)
enablestheuserto editthetranscriptandmod-
ifyanautomatically analyzed linguistic markup
structure. The third window (right bottom)
shows graphically a linguistic structure of the
selected sentence in the second window.
The editor is capable of basic natural lan-
guage processing and interactive disambigua-
tion. The user can modify the results of the
automatically analyzed multimedia and linguis-
tic (syntactic and semantic) structures.
2.2 Linguistic Annotation
Linguistic annotation has been used to make
digitaldocumentsmachine-understandable,and
todevelopcontent-basedpresentation,retrieval,
question-answering, summarization, and trans-
lation systems with much higher quality than
is currently available. We have employed the
GDA tagset as a basic framework to describe
linguistic and semantic features of documents.
The GDA tagset is based on XML (Extensible
Markup Language) (W3C, 2002), and designed
to be as compatible as possible with TEI (TEI,
2002), CES (CES, 2002), and EAGLES (EA-
GLES, 2002).
An example of a GDA-tagged sentence fol-
lows:
<su><np opr="agt" sem="time0">Time</np>
<v sem="fly1">flies</v>
<adp opr="eg"><ad sem="like0">like</ad>
<np>an <n sem="arrow0">arrow</n></np>
</adp>.</su>
The <su> element is a sentential unit. The
other tags above, <n>, <np>, <v>, <ad> and
<adp> mean noun, noun phrase, verb, adnoun
or adverb (including preposition and postposi-
tion), and adnominal or adverbial phrase, re-
spectively.
The opr attribute encodes a relationship
in which the current element stands with
respect to the element that it semantically
depends on. Its value denotes a binary
relation, which may be a thematic role
such as agent, patient, recipient, etc., or a
rhetorical relation such as cause, concession,
etc. For instance, in the above sentence,
<np opr="agt" sem="time0">Time</np>
depends on the second element
<v sem="fly1">flies</v>. opr="agt"
means that Time has the agent role with
respect to the event denoted by flies. The sem
attribute encodes a word sense.
Linguistic annotation is generated by auto-
matic morphological analysis, interactive sen-
tence parsing, and word sense disambiguation
by selecting the most appropriate item in the
domain ontology. Some research issues on lin-
guistic annotation are related to how the anno-
tation cost can be reduced within some feasible
levels. We have been developing some machine-
guided annotation interfaces to simplify the an-
notation work. Machine learning mechanisms
also contribute to reducing the cost because
they can gradually increase the accuracy of au-
tomatic annotation.
In principle, the tag set does not depend on
language, but as a ﬁrst step we implemented a
semi-automatic tagging system for English and
Japanese.
2.3 Video Annotation
The linguistic annotation technique has an im-
portant role in multimedia annotation. Our
video annotation consists of creation of text
data related to video content, linguistic anno-
tation of the text data, automatic segmentation
of video, semi-automatic linking of video seg-
ments with corresponding text data, and inter-
active naming of people and objects in video
scenes.
To be more precise, video annotation is per-
formed through the following three steps.
First, for each video clip, the annotation sys-
tem creates the text corresponding to its con-
tent. We developed a method for creation of
voice transcripts using speech recognition en-
gines. It is called multilingual voice transcrip-
tion and described later.
Second, some video analysis techniques are
applied to characterization of visual segments
(i.e., scenes) and individual video frames. For
example, bydetecting signiﬁcant changes in the
color histogram of successive frames, frame se-
quences can be separated into scenes.
Also, by matching prepared templates to in-
dividual regions in the frame, the annotation
system identiﬁes objects. The user can specify
signiﬁcant objects in some scene in order to re-
duce the time to identify target objects and to
obtain a higher recognition accuracy. The user
can nameobjects ina frame simplybyselecting
words in the corresponding text.
Third,theuserrelates video segments to text
segments such as paragraphs, sentences, and
phrases, based on scene structures and object-
name correspondences. The system helps the
user select appropriate segments by prioritiz-
ing them based on the number of the detected
objects, camera motion, and the representative
frames.
2.4 Multilingual Voice Transcription
The multimedia annotation editor ﬁrst extracts
the audio data from a target video clip. Then,
the extracted audio data isdividedinto left and
right channels. If the average for the diﬀerence
of the audio signals of the two channels exceeds
a certain threshold, they are considerd diﬀerent
andtransferedtothemultilingualspeechidenti-
ﬁcation and recognition module. The output of
the module is a structured transcript contain-
ing time codes, word sequences, and language
information. It is described in XML format as
shown in Figure 2.
<transcript lang="en" channel="l">
<w in="20.264000" out="20.663000">Web grabber </w>
<w in="20.663000" out="21.072000">is a </w>
<w in="21.072000" out="21.611000">very simple </w>
<w in="21.611000" out="22.180000">utility </w>
<w in="22.180000" out="22.778000">that is </w>
<w in="22.778000" out="23.856000">attached to </w>
<w in="23.856000" out="24.215000">Netscape </w>
<w in="24.215000" out="24.934000">as a pull down menu </w>
<w in="24.934000" out="25.153000">and </w>
<w in="25.153000" out="25.462000">allows you </w>
<w in="25.462000" out="25.802000">take </w>
<w in="25.802000" out="26.191000">your Web content </w>
<w in="26.191000" out="27.039000">whether it’s a </w>
<w in="27.039000" out="27.538000">MPEG file </w>
...
</transcript>
Figure 2: Transcript Data
Ourmultilingual video transcriptor automat-
ically generates transcriptswith time codes and
provides their reusable data structure which al-
lows easy manual corretion. An example screen
of the mulitilingual voice transcriptor is shown
in Figure 3.
Left Channel
Right Channel
Figure 3: Multilingual Voice Transcriptor
2.4.1 Multilingual Speech Identification
and Recognition
The progress of speech recognition technol-
ogy makes it comparatively easy to transform
speech into text, but spoken language identi-
ﬁcation is needed for processing multilingual
speech, because speech recognition technology
assumes that the language used is known.
While researchers have been working on the
multilingual speech identiﬁcation, few applica-
tionsbasedonthistechnologyhasbeenactually
used except a telephony speech translation sys-
tem. In the case of the telephone translation
system, the information of the language used
is self-evident; at least, the speaker knows; so
therearelittleneedsandadvantagesofdevelop-
ing a multilingual speech identiﬁcation system.
On the other hand, speech data in video do
not always have the information about the lan-
guageused. Duetotherecentprogressofdigital
broadcasting and the signal compression tech-
nology, the information about the language is
expected to accompany the content in the fu-
ture. But most of the data available now do
nothave it, soalarge amountoflaborisneeded
to identify the language. Therefore, the multi-
lingual speech identiﬁcation has a large part to
play with unknown-language speech input.
Aprocessofmultilingualspeechidentiﬁcation
is shown in Figure 4. Our method determines
the language of input speech usinga simple dis-
criminant function based on relative scores ob-
tainedfrommultiplespeechrecognizersworking
in parallel (Ohira et al., 2001).
Figure 4: Conﬁguration of Spoken Language
Identiﬁcation Unit
Multiple speech recognition engines work si-
multaneously on the input speech. It is as-
sumed that each speech recognition engine has
thespeakerindependentmodel,andeachrecog-
nitionoutputwordhasascorewithinaconstant
range dependent on each engine.
When a speech comes, each recognition en-
gine outputs a word sequence with scores. The
discriminant unit calculates a value of a dis-
criminant function using the scores for every
language. The engine with the highest average
discriminant value is selected and the language
is determined by the engine, whose recognition
result is accepted as the transcript. If there is
no distinct diﬀerence between discriminant val-
ues, that is not higher than a certain threshold,
a judgment is entrusted to the user.
Our technique is simple, it uses the exist-
ing speech recognition engines tuned in each
language without a special model for language
identiﬁcation and acoustic features.
Combining the voice transcription and the
video image analysis, our tool enables users to
create and edit video annotation data semi-
automatically. The entire process is as shown
in Figure 5.
Figure 5: Multilingual Video Data Analysis
Our system drastically reduces the overhead
on the user who analyzes and manages a large
collection of video content. Furthermore, it
makesconventional naturallanguageprocessing
techniques applicable to multimedia processing.
2.5 Scene Detection and Visual Object
Tracking
As mentioned earlier, visual scene changes are
detected by searching for signiﬁcant changes in
the color histogram of successive frames. Then,
framesequencescanbedividedintoscenes. The
scene description consists of time codes of the
start and end frames, a keyframe (image data
in JPEG format) ﬁlename, a scene title, and
some text representing topics. Additionally,
when the user speciﬁes a particular object in a
frame by mouse-dragging a rectangular region,
an automatic object tracking is executed and
timecodesandmotiontrailsintheframe(series
of coordinates for interpolation of object move-
ment) are checked out. The user can name the
detected visualobjects interactively. Thevisual
objectdescriptionincludestheobjectname,the
relatedURL,timecodesandmotiontrailsinthe
frame.
Our multimedia annotation also contains de-
scriptionsonauditoryobjects invideo. Theau-
ditory objects can bedetected by acoustic anal-
ysisontheuserspeciﬁedsoundsequencevisual-
izedinwaveform. Anexamplescenedescription
inXML format isshowninFigure 6, andanex-
ample object description in Figure 7.
3 Multimedia Summarization and
Translation
Based on multimedia annotation, we have de-
veloped a system for multimedia (especially,
<scene>
<seg in="0.066733" out="11.945279"
keyframe="0.187643"/>
<seg in="11.945279" out="14.447781"
keyframe="12.004385"/>
<seg in="14.447781" out="18.685352"
keyframe="14.447781"/>
...
</scene>
Figure 6: Scene Description
<object>
<vobj begin="1.668335" end="4.671338" name="David"
description="anchor" img="o0000.jpg"
link="http://...">
<area time="1.668335" top="82" left="34"
width="156" height="145"/>
<area ... />
</vobj>
...
</object>
Figure 7: Object Description
video) summarization and translation. One of
the main functions of the system is to gener-
ate an interactive HTML (HyperText Markup
Language) document from multimedia content
with annotation data for interactive multime-
diapresentation, whichconsistsofanembedded
video player, hyperlinked keyframe images, and
linguistically-annotated transcripts. Our sum-
marization and translation techniques are ap-
plied to the generated document called a multi-
modal document.
There are some previous work on multime-
dia summarization such as Informedia (Smith
and Kanade, 1995) and CueVideo (Amir et al.,
1999). They create a video summary based on
automatically extracted features in video such
as scene changes, speech, text and human faces
in frames, and closed captions. They can pro-
cess video data without annotations. However,
currently, the accuracy of their summarization
is not for practical use because of the failure of
automaticvideoanalysis. Ourapproachtomul-
timediasummarizationattainssuﬃcientquality
foruseifthedatahasenoughsemanticinforma-
tion. As mentioned earlier, we have developed
a tool to help annotators to create multimedia
annotation data. Since our annotation data is
declarative, hence task-independent and versa-
tile, the annotations are worth creating if the
multimedia content will be frequently used in
diﬀerent applications such as automatic editing
and information extraction.
3.1 Multimodal Document
Video transformation is an initial process
of multimedia summarization and translation.
The transformation module retrieves the anno-
tationdataaccumulatedinanannotationrepos-
itory (XML database) and extracts necessary
information to generate a multimodal docu-
ment. The multimodal document consists of an
embedded video window, keyframes of scenes,
and transcipts aligned withthe scenes as shown
in Figure 8. The resulting document can be
summarized and translated by the modules ex-
plained later.
Figure 8: Multimodal Document
This operation is also beneﬁcial for people
with devices without video playing capabil-
ity. In this case, the system creates a simpli-
ﬁedversion ofmultimodal documentcontaining
only keyframe images of important scenes and
summarized transcripts related to the selected
scenes.
3.2 Video Summarization
The proposed video summarization is per-
formed as a by-product of text summariza-
tion. The text summarization is an appli-
cation of linguistic annotation. The method
is cohesion-based and employs spreading acti-
vation to calculate the importance values of
wordsand phrasesinthe document(Nagao and
Hasida, 1998).
Thus, the video summarization works in
terms of summarization of a transcript from
multimedia annotation data and extraction of
the video scene related to the summary. Since
a summarized transcript contains important
words and phrases, corresponding video se-
quences will produce a collection of signiﬁcant
scenes in the video. The summarization results
in a revised version of multimodal documemt
that contains keyframe images and summa-
rized transcripts of selected important scenes.
Keyframes of less important scenes are shown
in a smaller size. An example screen of a sum-
marized multimodal document is shown in Fig-
ure 9.
Figure 9: Summarized Multimodal Document
The vertical time bar in the middle of the
screenofmultimodaldocumentrepresentsscene
segments whose color indicates if the segment
is included in the summary or not. The
keyframe images are linked with their corre-
sponding scenes so that the user can see the
scene by just clicking its related image. The
user can also access information about objects
such as people in the keyframe by dragging a
rectangular region enclosing them. The infor-
mationappearsinexternalwindows. Inthecase
of auditory objects, the user can select them by
clicking any point in the time bar.
3.3 Video Translation
One type of our video translation is achieved
through the following procedure. First, tran-
scripts in the annotation data are translated
into diﬀerent languages for the user choice, and
then, the results are shown as subtitles syn-
chronized with the video. The video transla-
tion module invokes an annotation-based text
translation mechanism. Text translation is also
greatly improved by using linguistic annotation
(Watanabe et al., 2002).
The other type of translation is performed in
terms of synchronization of video playing and
speech synthesis of the translation results. This
translation makes another-language version of
the original video clip. If comments, notes, or
keywords are included in the annotation data
on visual/auditory objects, then they are also
translated and shown on a popup window.
In the case of bilingual broadcasting, since
our annotation system generates transcripts in
everyaudiochannel,multimodaldocumentscan
be coming from both channels. The user can
easily select a favorite multimodal document
created from one of the channels. We have also
developed a mechanism to change the language
to play depending on the user proﬁle that de-
scribes the user’s native language.
4 Concluding Remarks
We have developed a tool to create multime-
dia annotation data and a mechanism to ap-
plysuchdatatomultimediasummarizationand
translation. The main component of the anno-
tationtoolisamultilingualvoicetranscriptorto
generate transcriptsfrommultilingualspeechin
videoclips. Thetoolalsoextractssceneandob-
ject information semi-automatically, describes
the data in XML format, and associates the
data with content.
We also presented some advanced applica-
tions on multimedia content based on annota-
tion. We have implemented video-to-document
transformationthatgeneratesinteractive multi-
modal documents, video summarization using a
text summarization technique, and video trans-
lation.
Linguistic processing is an essential task in
those applications so that natural language
technologies are still very important in process-
ing multimedia content.
Ourfuturework includesa more eﬃcient and
ﬂexible retrieval of multimedia content for re-
quests in spoken and written natural language.
Theretrievalofspokendocumentshasalsobeen
evaluated in a subtask “SDR (Spoken Docu-
ment Retrieval) track” at TREC (Text RE-
trieval Conference) (TREC, 2002). Johnson
(Johnson, 2001) suggested from his group’s ex-
perience on TREC-9 that new challenges such
as use of non-lexical information derived di-
rectlyfromtheaudioandintegrationwithvideo
data are signiﬁcant works for the improvement
of retrieval performance and usefulness. We,
therefore, believe that our research has signiﬁ-
cant impacts and potetials on the content tech-
nology.

References

A. Amir, S. Srinivasan, D. Ponceleon, and
D. Petkovic. 1999. CueVideo: Automated index-
ing of video for searching and browsing. In Pro-
ceedings of SIGIR’99.

CES. 2002. Corpus Encoding Standard.
http://www.cs.vassar.edu/CES/.

EAGLES. 2002. EAGLES online.
http://www.ilc.pi.cnr.it/EAGLES/home.html.

Koiti Hasida. 2002. Global Document Annotation.
http://i-content.org/GDA/.

S. E. Johnson. 2001. Spoken document retrieval for
TREC-9atCambridgeUniversity. In Proceedings
of Text REtrieval Conference (TREC-9).

MPEG. 2002. MPEG-7 context and objectives.
http://drogo.cselt.stet.it/mpeg/standards/mpeg-
7/mpeg-7.htm.

Katashi Nagao and Kˆoiti Hasida. 1998. Automatic
text summarization based on the Global Doc-
ument Annotation. In Proceedings of the Sev-
enteenth International Conference on Computa-
tional Linguistics (COLING-98), pages 917–921.

Shigeki Ohira, Mitsuhiro Yoneoka, and Katashi Na-
gao. 2001. A multilingual video transcriptor and
annotation-based video transcoding. In Proceed-
ings of the Second International Workshop on
Content-Based Multimedia Indexing (CBMI-01).

Michael A. Smith and Takeo Kanade. 1995. Video
skimming for quick browsing based on audio and
image characterization. Technical Report CMU-
CS-95-186,SchoolofComputer Science, Carnegie
Mellon University.

TEI. 2002. Text Encoding Initiative.
http://www.uic.edu/orgs/tei/.

TREC. 2002. Text REtrieval Conference home
page. http://trec.nist.gov/.

W3C. 2002. Extensible Markup Language (XML).
http://www.w3.org/XML/.

Hideo Watanabe, Katashi Nagao, Michael C. Mc-
Cord, and Arendse Bernth. 2002. An annotation
system for enhancing quality of natural language
processing. In Proceedings of the Nineteenth In-
ternational Conference on Computational Lin-
guistics (COLING-2002).
