<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1017">
  <Title>The Automatic Generation of Formal Annotations in a Multimedia Indexing and Searching Environment</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 State of the Art
</SectionTitle>
    <Paragraph position="0"> MUMIS differs in many significant ways from existing technologies and already achieved or advanced projects3. Most closely related to the thematic focus of MUMIS are the HLT projects Pop-Eye [POP] and OLIVE [OLI]. Pop-Eye used subtitles to index video streams and offered time-stamped texts to satisfy a user query, on request displaying a storyboard or video fragment corresponding to the text hit. OLIVE used automatic speech recognition to generate transcriptions of the sound tracks of news reports, which were then indexed and used in ways similar to the Pop-Eye project; both projects used fuzzy matching IR algorithms to search and retrieve text, offering limited multilingual access to texts. Instead of using IR methods to index and search the transcriptions, MUMIS will create formal annotations to the information, and will fuse information annotations from different media sources. The fusion result is then used to direct retrieval, through interface techniques such as pop-up menus, keyword lists, and so on. Search takes the user direct to the storyboard and video clippings.</Paragraph>
    <Paragraph position="1"> The Informedia project at Carnegie-Mellon-University [INF] has a similar conceptual base-line to MUMIS. The innovative contribution of MUMIS is that it uses a variety of multilingual information sources and fuses them on the basis of formal domain-specific annotations. Where Informedia primarily focuses on special applications, MUMIS aims at the advancement and integratibility of HLT-enhanced modules to enable information filtering beyond the textual domain.</Paragraph>
    <Paragraph position="2"> Therefore, MUMIS can be seen as complementary to Informedia with extensions typical for Europe. null The THISL project [THI] is about spoken document retrieval, i.e., automatic speech recognition 3We are aware of more related on-going projects, at least within the IST program, but we can not compare those to MUMIS now, since we still lack first reports.</Paragraph>
    <Paragraph position="3"> is used to auto-transcribe news reports and then information retrieval is carried out on this information. One main focus of THISL is to improve speech recognition. Compared to MUMIS it lacks the strong language processing aspects, the fusion of multilingual sources, and the multimedia delivery. null Columbia university is running a project [COL] to use textual annotations of video streams to indicate moments of interest, in order to limit the scope of the video processing task which requires extreme CPU capacities. So the focus is on finding strategies to limit video processing. The University of Massachusetts (Amherst) is also running projects about video indexing [UMA], but these focus on the combination of text and images. Associated text is used to facilitate indexing of video content. Both projects are funded under the NSF Stimulate programme [NSF].</Paragraph>
    <Paragraph position="4"> Much work has been done on video and image processing (Virage [VIR], the EUROMEDIA project [EUR], Surfimage [SUR], the ISIS project [ISI], IBM's Media Miner, projects funded under the NSF Stimulate program [NSF], and many others). Although this technology in general is in its infancy, there is reliable technology to indicate, for example, scene changes using very low-level cues and to extract key frames at those instances to form a storyboard for easy video access. Some institutions are running projects to detect subtitles in the video scene and create a textual annotation.</Paragraph>
    <Paragraph position="5"> This task is very difficult, given a sequence of real scenes with moving backgrounds and so on. Even more ambitious tasks such as finding real patterns in real movies (tracing the course of the ball in a soccer match, for example) are still far from being achieved.4</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Formal Annotations for the Soccer Domain
</SectionTitle>
    <Paragraph position="0"> Domain Soccer has been chosen as the domain to test and apply the algorithms to be developed. There are a number of reasons for this choice: availability of people willing to help in analyzing user requirements, existence of many information sources in 4The URLs of the projects mentionned above are given in the bibliography at the end of this paper.</Paragraph>
    <Paragraph position="1"> several languages5, and great economic and public interest. The prototype will also be tested by TV professionals and sport journalists, who will report on its practicability for the creation and management of their programme and information material.</Paragraph>
    <Paragraph position="2"> The principles and methods derived from this domain can be applied to other as well. This has been shown already in the context of text-based Information Extraction (IE), for which methodologies for a fast adaptation to new domains have been developed (see the MUC conferences and (Neumann et al., 2000)). And generally speaking the use of IE for automatic annotation of multimedia document has the advantage of providing, besides the results of the (shallow) syntactic processing, accurate semantic (or content/conceptual) information (and thus potential annotation) for specific predefined domains, since a mapping from the linguistically analyzed relevant text parts can be mapped onto an unambiguous conceptual description6. Thus in a sense it can be assumed that IE is supporting the word sense disambiguation task.</Paragraph>
    <Paragraph position="3"> It is also commonly assumed (see among others (Cunningham, 1999)) that IE occupies an intermediate place between Information Retrieval (with few linguistic knowledge involved) and Text Understanding (involving the full deep linguistic analysis and being still not realized for the time being.). IE being robust but offering only a partial (but mostly accurate) syntactic and content analysis, it can be said that this language technology is actually filling the gap between available low-level annotated/indexed documents and corpora and the desirable full content annotation of those documents and corpora. This is the reason why MUMIS has chosen this technology for providing automatic annotation (at distinct linguistic and domain-specific levels) of multimedia material, allowing thus to add queryable &amp;quot;content information&amp;quot; to this material.7  knowledge management tasks, but we assume that the meaningful organization of domain-specific multimedia material proposed by the project can be adapted to the organization of</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The Multimedia Material in MUMIS
</SectionTitle>
    <Paragraph position="0"> The MUMIS project is about automatic indexing of videos of soccer matches with formal annotations and querying that information to get immediate access to interesting video fragments.</Paragraph>
    <Paragraph position="1"> For this purpose the project chose the European Football Championships 2000 in Belgium and the Netherlands as its main database. A major project goal is to merge the formal annotations extracted from textual and audio material (including the audio part of videos) on the EURO 2000 in three languages: English, German, Dutch. The material MUMIS has to process can be classified in the following way:  1. Reports from Newspapers (reports about specific games, general reports) which is classified as free texts (FrT) 2. Tickers, close captions, Action-Databases which are classified as semi-formal texts (SFT) 3. Formal descriptions about specific games which are classified as formal texts (FoT) 4. Audio material recorded from radio and TV broadcasts 5. Video material recorded from TV broadcasts 1-4 will be used for automatically generating  formal annotations in order to index 5. MUMIS is investigating the precise contribution of each source of information for the overall goal of the project.</Paragraph>
    <Paragraph position="2"> Since the information contained in formal texts can be considered as a database of true facts, they play an important role within MUMIS. But nevertheless they contain only few information about a game: the goals, the substitutions and some other few events (penalties, yellow and red cards). So there are only few time points available for indexing videos. Semi-formal texts (SFT), like live tickers on the web, are offering much more time points sequences, related with a higher diversity the distributed information of an enterprise and thus support the sharing and access to companies expertise and knowhow. null of events (goals scenes, fouls etc,) and seem to offer the best textual source for our purposes. Nevertheless the quality of the texts of online tickers is often quite poor. Free texts, like newspapers articles, have a high quality but the extraction of time points and their associated events in text is more difficult. Those texts also offer more background information which might be interesting for the users (age of the players, the clubs they are normally playing for, etc.). Figures 1 and 2 in section 8 show examples of (German) formal and semi-formal texts on one and the same game.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Processing Steps in MUMIS
5.1 Media Pre-Processing
</SectionTitle>
    <Paragraph position="0"> Media material has been delivered in various formats (AudioDAT, AudioCassettes, Hi-8 video cassettes, DV video cassettes etc) and qualities.</Paragraph>
    <Paragraph position="1"> All audio signals (also those which are part of the video recordings) are digitized and stored in an audio archive. Audio digitization is done with 20 kHz sample frequency, the format generated is according to the de-facto wav standard. For digitization any available tool can be used such as SoundForge.</Paragraph>
    <Paragraph position="2"> Video information (including the audio component) of selected games have been digitized into MPEG1 streams first. Later it will be encoded in MPEG2 streams. While the quality of MPEG1 is certainly not satisfying to the end-user, its bandwidth and CPU requirements are moderate for current computer and network technology. The mean bit rate for MPEG1 streams is about 1.5 Mbps. Current state-of-the-art computers can render MPEG1 streams in real time and many network connections (Intranet and even Internet) can support MPEG1. MPEG2 is specified for about 3 to 5 Mbps. Currently the top-end personal computers can render MPEG2, but MPEG2 is not yet supported for the most relevant player APIs such as JavaMediaFramework or Quicktime. When this support is given the MUMIS project will also offer MPEG2 quality.</Paragraph>
    <Paragraph position="3"> For all separate audio recordings as for example from radio stations it has to be checked whether the time base is synchronous to that one of the corresponding video recordings. In case of larger deviations a time base correction factor has to be estimated and stored for later use. Given that the annotations cannot be created with too high accuracy a certain time base deviation will be accepted. For part of the audio signals manual transcriptions have to be generated to train the speech recognizers. These transcripts will be delivered in XML-structured files.</Paragraph>
    <Paragraph position="4"> Since keyframes will be needed in the user interface, the MUMIS project will develop software that easily can generate such keyframes around a set of pre-defined time marks. Time marks will be the result of information extraction processes, since the corresponding formal annotations is referring to to specific moments in time. The software to be written has to extract the set of time marks from the XML-structured formal annotation file and extract a set of keyframes from the MPEG streams around those time marks. A set of keyframes will be extracted around the indicated moments in time, since the estimated times will not be exact and since the video scenes at such decisive moments are changing rapidly. There is a chance to miss the interesting scene by using keyframes and just see for example spectators. Taking a number of keyframes increases the chance to grab meaningful frames.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Multilingual Automatic Speech
Recognition
</SectionTitle>
      <Paragraph position="0"> Domain specific language models will be trained.</Paragraph>
      <Paragraph position="1"> The training can be bootstrapped from written reports of soccer matches, but substantial amounts of transcribed recordings of commentaries on matches are also required. Novel techniques will be developed to interpolate the base-line language models of the Automatic Speech Recognition (ASR) systems and the domain specific models. Moreover, techniques must be developed to adapt the vocabularies and the language models to reflect the specific conditions of a match (e.g., the names players have to be added to the vocabulary, with the proper bias in the language model). In addition, the acoustic models must be adapted to cope with the background noise present in most recordings.</Paragraph>
      <Paragraph position="2"> Automatic speech recognition of the sound tracks of television and (especially) radio programmes will make use of closed caption subtitle texts and information extracted from formal texts to help in finding interesting sequences and automatically transcribing them. Further, the domain lexicons will help with keyword and topic spotting. Around such text islands ASR will be used to transcribe the spoken soundtrack. The ASR system will then be enriched with lexica containing more keywords, to increase the number of sequence types that can be identified and automatically transcribed.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Multilingual Domain Lexicon Building
</SectionTitle>
      <Paragraph position="0"> All the collected textual data for the soccer domain are used for building the multilingual domain lexicons. This data can be in XML, HTML, plain text format, etc. A number of automatic processes are used for the lexicon building, first on a monolingual and secondly on a multilingual level. Manual browsing and editing is taking place, mainly in order to provide the semantic links to the terms, but also for the fine-tuning of the lexicon according to the domain knowledge.</Paragraph>
      <Paragraph position="1"> Domain lexicons are built for four languages, namely English, German, Dutch and Swedish. The lexicons will be delivered in a fully structured, XML-compliant, TMX-format (Translation Memory eXchange format). For more information about the TMX format see http://www.lisa.org/tmx/tmx.htm.</Paragraph>
      <Paragraph position="2"> We will also investigate how far EUROWORDNET resources (see http://www.hum.uva.nl/ ewn/) can be of use for the organization of the domain-specific terminology.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Building of Domain Ontology and Event
Table
</SectionTitle>
      <Paragraph position="0"> The project is currently building an ontology for the soccer domain, taking into consideration the requirements of the information extraction and merging components, as well as users requirements. The ontology will be delivered in an XML format8.</Paragraph>
      <Paragraph position="1"> 8There are still on-going discussions within the project consortium wrt the best possible encoding format for the domain ontology, the alternative being reduced probably to RDFS, OIL and IFF, see respectively, and among others, http://www.w3.org/TR/rdfschema/, http://www.oasis-open.org/cover/oil.html and http://www.ontologos.org/IFF/The%20IFF%20Language.</Paragraph>
      <Paragraph position="2"> html In parallel to building the ontology an event table is being described. It contains the major event types that can occur in soccer games and their attributes. This content of the table is matching with the content of the ontology. The event table is a flat structure and guides the information extraction processes to generate the formal event annotations. The formal event annotations build the basis for answering user queries. The event table is specified as an XML schema to constrain the possibilities of annotation to what has been agreed within the project consortium.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Generation of Formal Annotations
</SectionTitle>
      <Paragraph position="0"> The formal annotations are generated by the IE technology and are reflecting the typical output of IE systems, i.e.instantiated domain-specific templates or event tables. The slots to be filled by the systems are basically entities (player, teams etc.), relations (player of, opponents etc.) and events (goal, substitution etc.), which are all derived from the current version of the domain ontology and can be queried for in the online component of the MUMIS prototype. All the templates associated with an event are including a time slot to be filled if the corresponding information is available in a least one of the sources consulted during the IE procedure. This time information is necessary for the indexing of the video material.</Paragraph>
      <Paragraph position="1"> The IE systems are applying to distinct sources (FoT, FrT etc.) but they are not concerned with achieving consistency in the IE result on distinct sources about the same event (game): this is the task of the merging tools, described below.</Paragraph>
      <Paragraph position="2"> Since the distinct textual sources are differently structured, from &amp;quot;formal&amp;quot; to &amp;quot;free&amp;quot; texts, the IE systems involved have adopted a modular approach: regular expressions for the detection of Named Entities in the case of formal texts, full shallow parsing for the free texts. On the base of the factual information extracted from the formal texts, the IE systems are also building dynamic databases on certain entities (like name and age of the players, the clubs they are normally playing for, etc.) or certain metadata (final score), which can be used at the next level of processing.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.6 The Merging Tool
</SectionTitle>
      <Paragraph position="0"> The distinct formal annotations generated are passed to a merging component, which is responsible for avoiding both inconsistencies and redundancies in the annotations generated on one event (in our case a soccer game).</Paragraph>
      <Paragraph position="1"> In a sense one can consider this merging component as an extension of the so-called co-reference task of IE systems to a cross-document (and cross-lingual) reference resolution task. The database generated during the IE process will help here for operating reference resolution for more &amp;quot;verbose&amp;quot; types of texts, which in the context of soccer are quite &amp;quot;poetic&amp;quot; with respect to the naming of agents (the &amp;quot;Kaiser&amp;quot; for Beckenbauer, the &amp;quot;Bomber&amp;quot; for Mueller etc...), which would be quite difficult to achieve within the sole referential information available within the boundary of one document. The project will also investigate here the use of inferential mechanisms for supporting reference resolution. So for example, &amp;quot;knowing&amp;quot; from the formal texts the final score of a game and the names of the scorers, following formulation can be resolved form this kind of formulation in a free text (in any language): &amp;quot;With his decisive goal, the &amp;quot;Bomber&amp;quot; gave the victory to his team.&amp;quot;, whereas the special naming &amp;quot;Bomber&amp;quot; can be further added to the entry &amp;quot;Mueller&amp;quot; The merging tools used in MUMIS will also take into consideration some general representation of the domain-knowledge in order to filter out some annotations generated in the former phases.</Paragraph>
      <Paragraph position="2"> The use of general representations9 (like domain frames), combined with inference mechanisms, might also support a better sequential organization of some event templates in larger scenarios. It will also allow to induce some events which are not explicitly mentioned in the sources under consideration (or which the IE systems might not have detected).</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.7 User Interface Building
</SectionTitle>
      <Paragraph position="0"> The user first will interact with a web-portal to start a MUMIS query session. An applet will be 9Like for example the Type Description Language (TDL), a formalism supporting all kind of operations on (typed) features as well as multiple inheritance, see (Krieger and Schaefer, 1994).</Paragraph>
      <Paragraph position="1"> down-line loaded in case of showing the MUMIS demonstration. This applet mainly offers a query interface. The user then will enter a query that either refers to metadata, formal annotations, or both. The MUMIS on-line system will search for all formal annotations that meet the criteria of the query. In doing so it will find the appropriate meta-information and/or moments in some media recording. In case of meta-information it will simply offer the information in scrollable text widgets. This will be done in a structured way such that different type of information can easily be detected by the user. In case that scenes of games are the result of queries about formal annotations the user interface will first present selected video keyframes as thumbnails with a direct indication of the corresponding metadata.</Paragraph>
      <Paragraph position="2"> The user can then ask for more metadata about the corresponding game or for more media data. It has still to be decided within the project whether several layers of media data zooming in and out are useful to satisfy the user or whether the step directly to the corresponding video fragment is offered. All can be invoked by simple user interactions such as clicking on the presented screen object. Playing the media means playing the video and corresponding audio fragment in streaming mode requested from a media server.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Standards for Multimedia Content
</SectionTitle>
    <Paragraph position="0"> MUMIS is looking for a compliance with existing standards in the context of the processing of multimedia content on the computer and so will adhere to emerging standards such as MPEG4, which defines how different media objects will be decoded and integrated at the receiving station, and MPEG7, which is about defining standards for annotations which can be seen as multimedia objects. Further, MUMIS will also maintain awareness of international discussions and developments in the aerea of multimedia streaming (RTP, RTSP, JMF...), and will follow the discussions within the W3C consortium and the EBU which are also about standardizing descriptions of media content.</Paragraph>
  </Section>
class="xml-element"></Paper>