<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1506"> <Title>The OLAC Metadata Set and Controlled Vocabularies</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> SIL International 7500 West Camp Wisdom Road Dallas, TX 75236, USA Gary Simons@sil.org Abstract </SectionTitle> <Paragraph position="0"> As language data and associated technologies proliferate and as the language resources community rapidly expands, it has become difficult to locate and reuse existing resources.</Paragraph> <Paragraph position="1"> Are there any lexical resources for such-and-such a language? What tool can work with transcripts in this particular format? What is a good format to use for linguistic data of this type? Questions like these dominate many mailing lists, since web search engines are an unreliable way to find language resources.</Paragraph> <Paragraph position="2"> This paper describes a new digital infrastructure for language resource discovery, based on the Open Archives Initiative, and called OLAC - the Open Language Archives Community.</Paragraph> <Paragraph position="3"> The OLAC Metadata Set and the associated controlled vocabularies facilitate consistent description and focussed searching. We report progress on the metadata set and controlled vocabularies, describing current issues and soliciting input from the language resources community.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Language technology and the linguistic sciences are confronted with a vast array of language resources, richly structured, large and diverse.</Paragraph> <Paragraph position="1"> Multiple communities depend on language resources, including linguists, engineers, teachers and actual speakers. Many individuals and institutions provide key pieces of the infrastructure, including archivists, software developers, and publishers. 
Today we have unprecedented opportunities to connect these communities to the language resources they need. First, inexpensive mass storage technology permits large resources to be stored in digital form, while the Extensible Markup Language (XML) and Unicode provide flexible ways to represent structured data and ensure its long-term survival. Second, digital publication - both on and off the world wide web - is the most practical and efficient means of sharing language resources. Finally, a standard resource description model, the Dublin Core Metadata Set, together with an interchange method provided by the Open Archives Initiative (OAI), makes it possible to construct a union catalog over multiple repositories and archives.</Paragraph> <Paragraph position="2"> In December 2000, an NSF-funded workshop on Web-Based Language Documentation and Description, held in Philadelphia, brought together a group of nearly 100 language software developers, linguists, and archivists who are responsible for creating language resources in North America, South America, Europe, Africa, the Middle East, Asia and Australia [http://www.ldc.upenn.edu/exploration/expl2000/]. The outcome of the workshop was the founding of the Open Language Archives Community (OLAC), an application of the OAI to digital archives of language resources, with the following purpose: OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources.</Paragraph> <Paragraph position="3"> This paper will describe the leading ideas that motivate OLAC, before focussing on the metadata set and the controlled vocabularies which implement part (ii) of OLAC's statement of purpose. 
Metadata elements of special interest to the language resources community include such things as language identification and language resource type. The corresponding controlled vocabularies ensure consistent description.</Paragraph> <Paragraph position="4"> For example, French language resources are specified using an official RFC-3066 designation (Alvestrand, 2001), instead of multiple distinct text strings like &quot;French&quot;, &quot;Francais&quot; and &quot;Français&quot;. A separate controlled vocabulary exists for resource type, and has items such as annotation/phonetic and description/grammar. Services for end-users can map controlled vocabularies onto convenient terminology for any target language. (A live demonstration accompanies this presentation.)</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Locating Data, Tools and Advice </SectionTitle> <Paragraph position="0"> We can observe that the individuals who use and create language resources are looking for three things: data, tools, and advice. By DATA we mean any information that documents or describes a language, such as a published monograph, a computer data file, or even a shoebox full of hand-written index cards.</Paragraph> <Paragraph position="1"> The information could range in content from unanalyzed sound recordings to fully transcribed and annotated texts to a complete descriptive grammar. By TOOLS we mean computational resources that facilitate creating, viewing, querying, or otherwise using language data.</Paragraph> <Paragraph position="2"> Tools include not just software programs, but also the digital resources that the programs depend on, such as fonts, stylesheets, and document type definitions. By ADVICE we mean any information about what data sources are reliable, what tools are appropriate in a given situation, what practices to follow when creating new data, and so forth. 
In the context of OLAC, the term language resource is broadly construed to include all three of these: data, tools and advice.</Paragraph> <Paragraph position="3"> Unfortunately, today's user does not have ready access to the resources that are needed. Figure 1 offers a diagrammatic view of the reality. Some archives (e.g. Archive 1) do have a site on the internet which the user is able to find, so the resources of that archive are accessible. Other archives (e.g. Archive 2) are on the internet, so the user could access them in theory, but the user has no idea they exist, so they are not accessible in practice. Still other archives (e.g. Archive 3) are not even on the internet. And there are potentially hundreds of archives (e.g. Archive n) that the user needs to know about. Tools and advice are out there as well, but are at many different sites.</Paragraph> <Paragraph position="4"> There are many other problems inherent in the current situation. For instance, the user may not be able to find all the existing data about the language of interest because different sites have called it by different names (low recall). The user may be swamped with irrelevant resources because search terms have important meanings in other domains (low precision). The user may not be able to use an accessible data file for lack of being able to match it with the right tools. The user may locate advice that seems relevant but have no basis for judging its merits.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Bridging the gap </SectionTitle> <Paragraph position="0"> 2.1.1 Why improved web-indexing is not enough. As the internet grows and web-indexing technologies improve, one might hope that a general-purpose search engine would be sufficient to bridge the gap between people and the resources they need, but this is a vain hope. The first reason is that many language resources, such as audio files and software, are not text-based. 
The second reason concerns language identification, the single most important property for describing language resources.</Paragraph> <Paragraph position="1"> If a language has a canonical name which is distinctive as a character string, then the user has a chance of finding any online resources with a search engine. However, the language may have multiple names, possibly due to the vagaries of Romanization, such as a language known variously as Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca (giving low recall). The language name may collide with a word which has other interpretations that are vastly more frequent, e.g. the language names Mango and Santa Cruz (giving low precision).</Paragraph> <Paragraph position="2"> The third reason why general-purpose search engines are inadequate is the simple fact that much of the material is not, and will not be, documented in free prose on the web. Either people will build systematic catalogues of their resources, or they won't do it at all. Of course, one can always export a back-end database as HTML and let the search engines index the materials. Indeed, encouraging people to document resources and make them accessible to search engines is part of our vision. However, despite the power of web search engines, there remain many instances where people still prefer to use more formal databases to house their data. This last point bears further consideration. The challenge is to build a system for &quot;bringing like things together and differentiating among them&quot; (Svenonius, 2000). There are two dominant storage and indexing paradigms, one exemplified by traditional databases and one exemplified by the web. In the case of language resources, the metadata is coherent enough to be stored in a formal database, but sufficiently distributed and dynamic that it is impractical to maintain it centrally. 
Language resources occupy the middle ground between the two paradigms, neither of which will serve adequately. A new framework is required that permits the best of both worlds, namely bottom-up, distributed initiatives, along with consistent, centralized finding aids. The Dublin Core (DC) and the Open Archives Initiative provide the framework we need to &quot;bridge the gap.&quot; The Dublin Core Metadata Initiative began in 1995 to develop conventions for resource discovery on the web [dublincore.org]. The Dublin Core metadata elements represent a broad, interdisciplinary consensus about the core set of elements that are likely to be widely useful to support resource discovery. The Dublin Core consists of 15 metadata elements, where each element is optional and repeatable: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights. This set can be used to describe resources that exist in digital or traditional formats.</Paragraph> <Paragraph position="3"> In &quot;Dublin Core Qualifiers&quot; (DCMI, 2000a) two kinds of qualifications are allowed: encoding schemes and refinements. An encoding scheme specifies a particular controlled vocabulary or notation for expressing the value of an element.</Paragraph> <Paragraph position="4"> The encoding scheme serves to aid a client system in interpreting the exact meaning of the element content. A refinement makes the meaning of the element more specific. For example, a Language element can be encoded using the conventions of RFC 3066 to unambiguously identify the language in which the resource is written (or spoken). 
A Subject element can be given a language refinement to restrict its interpretation to concern the language the resource is about.</Paragraph> <Paragraph position="5"> The Open Archives Initiative (OAI) was launched in October 1999 to provide a common framework across electronic preprint archives, and it has since been broadened to include digital repositories of scholarly materials regardless of their type [www.openarchives.org] (Lagoze and de Sompel, 2001).</Paragraph> <Paragraph position="6"> In the OAI infrastructure, each participating archive implements a repository - a network accessible server offering public access to archive holdings. The primary object in an OAI-conformant repository is called an item, having a unique identifier and being associated with one or more metadata records. Each metadata record describes an archive holding, which is any kind of primary resource such as a document, raw data, software, a recording, a physical artifact, a digital surrogate, and so forth. Each metadata record will usually contain a reference to an entry point for the holding, such as a URL or a physical location, as shown in Figure 2.</Paragraph> <Paragraph position="7"> To implement the OAI infrastructure, a participating archive must comply with two standards: the OAI shared metadata set (Dublin Core), which facilitates interoperability across all repositories participating in the OAI, and the OAI metadata harvesting protocol, which allows software services to query a repository using HTTP requests.</Paragraph> <Paragraph position="8"> OAI archives are called &quot;data providers,&quot; though they are strictly just metadata providers. Typically, data providers will also have a submission procedure, together with a long-term storage system, and a mechanism permitting users to obtain materials from the archive. 
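The HTTP querying of a repository mentioned above can be pictured with a minimal Python sketch. It only assembles the request URL a harvester might send and does not contact any server. The base URL and the helper name build_harvest_url are hypothetical; the verb ListRecords and the argument metadataPrefix with value oai_dc follow the published OAI harvesting protocol specification rather than anything stated in this paper, so treat them as assumptions here.

```python
from urllib.parse import urlencode

# Hypothetical repository base URL, used only for illustration.
BASE_URL = "http://archive.example.org/oai"


def build_harvest_url(base_url, verb, **arguments):
    """Return an HTTP GET URL for a single harvesting request.

    The verb names the protocol operation; the remaining keyword
    arguments are passed through as query parameters.
    """
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Ask the repository for its records in unqualified Dublin Core.
# ListRecords and oai_dc are assumed from the OAI protocol specification.
url = build_harvest_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

A service provider would issue one such request per data provider and merge the returned metadata records into its union catalog.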
An OAI &quot;service provider&quot; is a third party that provides end-user services (such as search functions over union catalogs) based on metadata harvested from one or more OAI data providers.</Paragraph> <Paragraph position="9"> Figure 3 illustrates a single service provider accessing three data providers (using the OAI metadata harvesting protocol). End-users only</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Data Providers </SectionTitle> <Paragraph position="0"> Over the past decade, the Linguist List has become the primary source of online information for the linguistics community, reaching out to over 13,000 subscribers worldwide, and having four complete mirror sites. The Linguist List will be augmenting its service by hosting the primary service provider for OLAC, and permitting end-users to browse distributed language resources at a single place.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Applying the OAI to language resources </SectionTitle> <Paragraph position="0"> The OAI infrastructure is a new invention; it has the bottom-up, distributed character of the web, while simultaneously having the efficient, structured nature of a centralized database. This combination is well-suited to the language resource community, where the available data is growing rapidly and where a large user-base is fairly consistent in how it describes its resource needs.</Paragraph> <Paragraph position="1"> The primary outcome of the Philadelphia workshop was the founding of the Open Language Archives Community, and with it the identification of an advisory board, alpha testers and member archives. 
Details of these groups are available from the OLAC site [www.language-archives.org].</Paragraph> <Paragraph position="3"> Recall that the OAI community is defined by the archives which comply with the OAI metadata harvesting protocol and that register with the OAI.</Paragraph> <Paragraph position="4"> Any compliant repository can register as an Open Archive, and the metadata provided by an Open Archive is open to the public. OAI data providers may support metadata standards in addition to the Dublin Core. Thus, a specialist community can define a metadata format which is specific to its domain. Service providers, data providers and users that employ this specialized metadata format constitute an OAI subcommunity. The workshop participants agreed unanimously that the OAI provides a significant piece of the infrastructure needed for the language resources community. In the same way that OLAC represents a specialized subcommunity with respect to the entire Open Archives community, there are specialized subcommunities within the scope of OLAC. For instance, the ISLE Meta Data Initiative is developing a detailed metadata scheme for corpora of recorded speech events and their associated descriptions (MPI ISLE Team, 2000). Similarly, the language data centers - the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA) - are using OLAC metadata as the basis of a joint catalog, and will add elements and vocabularies for their specialized needs (price, rights, and categories of membership and use). For archived language resources that are of this kind, such a metadata scheme would support a richer description. This specialized subcommunity can implement its own service provider that offers focused searching based on its own rich metadata set. 
At the same time, the data providers will be exposing OLAC and Dublin Core versions of the metadata, permitting the resources to be discovered by users of OLAC and OAI service providers.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Federation and integration of language resource archives </SectionTitle> <Paragraph position="0"> The OAI framework permits archives to interoperate. OAI archives support the Dublin Core metadata format and metadata harvesting protocol. OLAC archives additionally support the OLAC metadata format. Widespread adoption of these standards will permit language resource archives to be federated and integrated.</Paragraph> <Paragraph position="1"> First, a collection of archives which support the same metadata format can be federated, in the sense that a virtual meta-archive can collect all the information into a single place, and end-users can query multiple archives simultaneously. To demonstrate this, the Linguistic Data Consortium has harvested the catalogs of three language resource archives (LDC, ELRA, DFKI) and created a prototype service provider. A search for language=Bulgarian returns records from all three archives, as shown in Figure 4 (Bánik and Bird, 2001).</Paragraph> <Paragraph position="2"> Second, a collection of archives which support the same metadata format can be integrated, in the sense that relational joins can be performed across different archives. 
This permits queries such as: &quot;find all lexicon tools that understand a format for which Hungarian data is available.&quot;</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Core Metadata Set for Language Resources </SectionTitle> <Paragraph position="0"> The OLAC Metadata Set extends the Dublin Core set only to the minimum degree required to express basic properties of language resources which are useful as finding aids.</Paragraph> <Paragraph position="1"> All fifteen Dublin Core elements are used in the OLAC Metadata Set. In order to suit the specific needs of the language resources community, the elements have been qualified following principles articulated in &quot;Dublin Core Qualifiers&quot; (DCMI, 2000a) and exemplified in (DCMI, 2000b).</Paragraph> <Paragraph position="2"> This section describes some of the attributes, elements and controlled vocabularies of the OLAC Metadata Set. Before launching into this discussion, we first review some XML terminology and explain some aspects of the OLAC representation which follow directly from our choice of XML.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Aside: XML representation </SectionTitle> <Paragraph position="0"> The Extensible Markup Language (XML) is the universal format for structured documents and data on the Web [www.w3.org/XML]. The key building block of an XML document is the element. An element has a name, attributes and content. Here is an example of an element Language with attributes refine and code, and free-text content: We use XML schemas to specify the OLAC metadata format.</Paragraph> <Paragraph position="1"> XML schemas make it possible for element content and attribute values to be constrained according to the element name. However, XML schemas do not permit element content to be constrained on the basis of the attribute value. 
Accordingly, in implementing qualified Dublin Core using XML, we are limited to using one encoding scheme (or controlled vocabulary) per element.</Paragraph> <Paragraph position="2"> There are two cases we need to consider here. In the case where all refinements of an element employ the same encoding scheme, we use the element name as is and add a refine attribute with a fixed value. This documents that the particular encoding scheme has been used, and ensures that the element cannot be confused with a corresponding unqualified Dublin Core element (see the above example). In the case where different refinements of an element employ different encoding schemes, a unique element must be defined. Following (DCMI, 2000b), we define such elements by concatenating the Dublin Core element name and the refinement name with an intervening dot. An example is shown below: &lt;Format.encoding code=&quot;iso-8859-1&quot;/&gt;</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Attributes used in implementing the </SectionTitle> <Paragraph position="0"/> </Section> </Section> </Paper>