<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1508">
  <Title>for Multimedia Language Resources</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Metadata for Language Resources
</SectionTitle>
    <Paragraph position="0"> The idea of describing a whole document with the help of a few characteristic metadata elements is not new. Well-known corpora such as Childes [5] have used header information to describe the content, the speakers, the language spoken, and so on. The Text Encoding Initiative [6] and the CES group [7] have specified in detail the tag set with which a whole text document can be described. However, none of these early initiatives was meant to be a general standard for the description of multimedia language resources (MM LRs), or to allow the formation of the searchable and browsable space on the Internet that IMDI aims for.</Paragraph>
    <Paragraph position="1"> This is what recent initiatives in other domains such as Dublin Core (DC) [8] and MPEG7 [9] want to achieve: XML-based, machine-readable information about certain documents that is openly accessible on the net so that easy retrieval is possible. New initiatives by the W3C such as RDF [10] support these intentions.</Paragraph>
    <Paragraph position="2"> Within IMDI we have compiled an overview of the header and metadata elements used so far by the language resource community. This overview and the concrete needs within large European projects will be used to develop and test a first proposal as a step toward a standard that we hope will be widely accepted. Compliance with the standard has to guarantee that metadata descriptions created by different people at different locations adhere to the same syntax and to the same semantic definitions of the metadata elements included. The standard also has to offer possibilities for adding metadata elements defined by sub-communities, projects or even individuals. From other initiatives we know that these goals can only be achieved if the set of metadata elements is not too exhaustive. This does not mean that only limited information can be stored. For instance, the metadata description standard certainly includes an element for the name of the language spoken, but more elaborate information about that language can be made available in other data, reached via hyperlinks and perhaps conforming to more specialised schemas.</Paragraph>
    <Paragraph position="3"> IMDI is now entering a phase in which the metadata element categories and the metadata elements to be included, having been discussed with interested members of the MMLR community for about a year, have become stable.</Paragraph>
    <Paragraph position="4"> Two resource types were selected to start with: (1) multimedia corpora and (2) lexicons. The discussion about lexicon resources is at the moment less advanced than that concerning corpora. Within the IMDI initiative we started the search for a suitable set of metadata elements by trying to identify the characteristics of such resources that people such as researchers, developers, students, or even the general public would use to find exactly the resources they are looking for.</Paragraph>
    <Paragraph position="5"> Studying the creation process and constructing a structured metadata set as a reflection of an ontology of these resources proved very helpful. We know that resources themselves are not openly available, but at least the metadata description should inform the community about their existence, about intellectual property rights and about modes of usage.</Paragraph>
    <Paragraph position="6"> Two main categories of metadata can be identified: * Basic information on the content of the resource: the content language of the resource, and administrative information about the resource.</Paragraph>
    <Paragraph position="7"> * Resource descriptions that define the type and structure of the resource.</Paragraph>
    <Paragraph position="8"> A full listing of all IMDI elements is given in Appendix A, but for definitions and substructures we refer to [1]. The relevant elements for the resources themselves are:  For annotation units, multiple units may reside in one file. The relevant elements for characterising the resources in a way that is important to tools (a discussion that we will come to later) are:  o Reference to the resource itself o Size (of media file, if the tool has a limit) o Format (for media files somewhat simpler than for annotation units) o Type (for annotation units, the type of analysis result, e.g. morphology,</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Strategies for Metadata Standards
</SectionTitle>
    <Paragraph position="0"> The way IMDI has developed its metadata vocabulary can be described as bottom-up.</Paragraph>
    <Paragraph position="1"> IMDI chose to first understand the linguistic community's needs by making an overview of the metadata used by different projects and corpora, speaking with representatives of many institutions, and trying to distill from this a metadata set that focuses on retrieval aspects. For IMDI the needs of the creators are the start and end point, since the creators are also the major consumer group of language resources. So the question IMDI posed itself was &quot;how to enable resource discovery of useful language resources that can be used for certain studies etc&quot;. This approach leads to a metadata set whose terminology fits the domain and a vocabulary that is considerably richer than, for instance, the DC set.</Paragraph>
    <Paragraph position="2"> Interestingly enough, another initiative named OLAC [11], which wants to create metadata for language resources, has taken the DC set as a starting point. The OLAC approach can be called a top-down one and seems motivated by the wish to join the &quot;very important&quot; Open Archives Initiative (OAI) [12] without too much work in mapping between different metadata sets. OLAC wants to use a slightly more specialised version of the OAI metadata set, and because OAI uses Dublin Core as its default metadata set, the choice of an extended DC set for OLAC is understandable. Of course, the question remains whether this is enough to characterise language resources in a sufficiently specific manner.</Paragraph>
    <Paragraph position="3"> The discussion showed that both approaches are important, especially when the ontology of the domain is not very well understood. IMDI starts by analyzing the domain, which leads to a narrower and more specialised categorization scheme.</Paragraph>
    <Paragraph position="4"> DC, on the other hand, offers very broad categories, the semantics of which are often sloppily defined. Both approaches lead to specific inherent retrieval problems. We have to consider two views: (1) people from inside the domain searching for resources and (2) people from outside the domain. People from inside have intimate knowledge of the domain ontology and want more specific categories. People from outside need broader categories to ensure that the resources they are searching for fall within the larger &quot;hit list&quot;. The discussion about OLAC DC qualifiers led partly to the same discussions that were carried out in IMDI. This is not surprising, since OLAC somehow has to address the needs of the field and the participants at the meeting were mainly linguists. The OLAC top-down approach, which starts with a smaller vocabulary than IMDI, will have fewer problems when addressing interoperability with metadata sets such as DC. However, this advantage disappears once OLAC adds more specific elements and qualifiers to describe the domain accurately.</Paragraph>
  </Section>
</Paper>