<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0605"> <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics Frontiers in Linguistic Annotation for Lower-Density Languages</Title> <Section position="5" start_page="29" end_page="29" type="metho"> <SectionTitle> 3 Linguistically Annotated Resources </SectionTitle> <Paragraph position="0"> While the scarcity of language resources for lower-density languages is apparent for all resource types (with the possible exception of monolingual text), it is particularly true of linguistically annotated texts. By annotated texts, we mean the following sorts of computational linguistic resources:
* Parallel text aligned with another language at the sentence level (and/or at finer levels of parallelism, including morpheme-level glossing)
* Text annotated for named entities at various levels of granularity
* Morphologically analyzed text (for nonisolating languages; at issue here is particularly inflectional morphology and, to a lesser degree of importance for most computational purposes, derivational morphology); also a morphological tag schema appropriate to the particular language
* Text marked for word boundaries (for those scripts which, like Thai, do not mark most word boundaries)
* POS tagged text, and a POS tag schema appropriate to the particular language
sources, such as Wordnet2
There are numerous dimensions for linguistically annotated resources, and a range of research projects have attempted to identify the core properties of interest. While concepts such as the Basic Language Resource Kit (BLARK; Krauwer, 2003; Mapelli and Choukri, 2003) have gained considerable currency in higher-density language resource creation projects, it is clear that the baseline requirements of such schemes are significantly more advanced than we can hope to meet for lower-density languages in the short to medium term. 
Notably, the concept of a reduced BLARK ('BLARKette') has recently gained some currency in various forums.</Paragraph> </Section> <Section position="6" start_page="29" end_page="30" type="metho"> <SectionTitle> 4 Key Questions </SectionTitle> <Paragraph position="0"> Given that the vast majority of the more than seven thousand languages documented in the Ethnologue (Gordon, 2005) fall into the class of lower-density languages, what should we do? Equally important, what can we realistically do? We pose three questions by which to frame the remainder of this paper.</Paragraph> <Paragraph position="1">
1. Status Indicators: How do we know where we are? How do we keep track of which languages are high-density or medium-density, and which are lower-density?
2. Increasing Available Resources: How can we (if we can at all) encourage the movement of languages up the scale from lower-density to medium-density or high-density?
3. Reducing Data Requirements: Given that some languages will always be relatively lower-density, can language processing applications be made smarter, so that they don't require largely unattainable resources in order to perform adequately?</Paragraph> </Section> <Section position="7" start_page="30" end_page="31" type="metho"> <SectionTitle> 5 Status Indicators </SectionTitle> <Paragraph position="0"> We have been deliberately vague up to this point about how many lower-density languages there are, or about the simpler question of how many high- and medium-density languages there are. Of course, one reason for this is that the boundary between low density and medium or high density is inherently vague. Another reason is that the situation is constantly changing; many Central and Eastern European languages which were lower-density languages a decade or so ago are now arguably medium density, if not high density. (The standard for high vs. low density changes, too; the bar is considerably higher now than it was ten years ago.) 
But the primary reason for being vague about how many - and which - languages are low density today is that no one is keeping track of what resources are available for most languages. So we simply have no idea which languages are low density, and more importantly (since we can guess that, in the absence of evidence to the contrary, a language is likely to be low density), we don't know which resource types most languages do or do not have.</Paragraph> <Paragraph position="1"> This lack of knowledge is not for lack of trying, although perhaps we have not been trying hard enough. The following are a few of the catalogs of information about languages and their resources that are available: * The Ethnologue3: This is the standard listing of the living languages of the world, but it contains little or no information about what resources exist for each language.</Paragraph> <Paragraph position="2"> distributed by each organization, and these include only a small number of languages.</Paragraph> <Paragraph position="3"> Naturally, the economically important languages constitute the majority of the holdings of the LDC and ELDA.</Paragraph> <Paragraph position="4"> * AILLA (Archive of the Indigenous Languages of Latin America6), and numerous other language archiving sites: Such sites maintain archives of linguistic data for languages, often with a specialization, such as the indigenous languages of a country or region.</Paragraph> <Paragraph position="5"> The linguistic data ranges from unannotated speech recordings to morphologically analyzed texts glossed at the morpheme level.</Paragraph> <Paragraph position="6"> * OLAC (Open Language Archives Community7): Given that many of the above resources (particularly those of the many language archives) are hard to find, OLAC is an attempt to be a meta-catalog (or aggregator) of such resources. It allows lookup of data by type, language, etc. for all data repositories that 'belong to' OLAC. 
In fact, all the above resources are listed in the OLAC union catalogue.</Paragraph> <Paragraph position="7"> * Web-based catalogs of additional resources: There is a huge number of additional websites which catalog information about languages, ranging from electronic and print dictionaries (e.g. yourDictionary8) to discussion groups about particular languages9.</Paragraph> <Paragraph position="8"> Most such sites do little vetting of the resources, and dead links abound. Nevertheless, such sites (or a simple search with an Internet search engine) can often turn up useful information (such as grammatical descriptions of minority languages). Very few of these websites are cataloged in OLAC, although recent efforts (Hughes et al., 2006a) are slowly addressing the inclusion of web-based low-density language resources in such indexes.</Paragraph> <Paragraph position="9"> None of the above catalogs is in any sense complete, and indeed the very notion of completeness is moot when it comes to cataloging Internet resources. But more to the point of this paper, it is difficult, if not impossible, to get a picture of the state of language resources in general. How many languages have sufficient bitext (and in what genre), for example, that one could put together a statistical machine translation system? What languages have morphological parsers (and for what languages is such a parser more or less irrelevant, because the language is relatively isolating)? Where can one find character encoding converters for the Ge'ez family of fonts for languages written in Ethiopic script? The answers to such questions are important for several reasons:</Paragraph> </Section> <Section position="8" start_page="31" end_page="31" type="metho"> <SectionTitle> 1. If there were a crisis that involved an arbitrary </SectionTitle> <Paragraph position="0"> language of the world, what resources could be deployed? 
An example of such a situation might be another tsunami near Indonesia, which could affect dozens, if not hundreds, of minority languages. (The December 26, 2004 tsunami was particularly felt in the Aceh province of Indonesia, where one of the main languages is Aceh, spoken by three million people. Aceh is a lower-density language.)
2. Which languages could, with a relatively small amount of effort, move from lower-density status to medium-density or high-density status? For example, where parallel text is harvestable, a relatively small amount of work might suffice to produce many applications, or other resources (e.g. by projecting syntactic annotation across languages). On the other hand, where the writing system of a language is in flux, or the language is politically oppressed, a great deal more effort might be necessary.</Paragraph> <Paragraph position="1"> 3. For which low-density languages might related languages provide the leverage needed to build at least first-draft resources? For example, one might think of using Turkish (arguably at least a medium-density language) as a sort of pivot language to build lexicons and morphological parsers for such low-density Turkic languages as Uzbek or Uyghur.</Paragraph> <Paragraph position="2"> 4. For which low-density languages are there extensive communities of speakers living in other countries, who might be better able to build language resources than speakers living in the perhaps less economically developed home countries? (Expatriate communities may also be motivated by a desire to maintain their language among younger speakers born abroad.)
5. Which languages would require more work (and funding) to build resources, but are still plausible candidates for short-term efforts?
To our knowledge, there is no general, ongoing effort to collect the sort of data that would make answers to these questions possible. 
A survey of text-based resources was done at the Linguistic Data Consortium several years ago (Strassel et al., 2003) for the three hundred or so languages having at least a million speakers (an arbitrary cutoff, to be sure, but necessary for the survey to have had at least some chance of success). It was remarkably successful, considering that it was done by two linguists who did not know the vast majority of the languages surveyed. The survey was funded long enough to 'finish' about 150 languages, but no subsequent update was ever done.</Paragraph> <Paragraph position="3"> A better model for such a survey might be an edited book: one or more computational linguists would serve as 'editors', responsible for the overall framework and for the training of other participants. Section 'editors' would be responsible for a language family, or for the languages of a geographic region or country. Individual language experts would receive a small amount of training to enable them to answer the survey questions for their language, and would then be paid to do the initial survey, plus periodic updates. The model provided by the Ethnologue (Gordon, 2005) may serve as a starting point, although the level of detail that would be useful in assessing language resource availability will make wholesale adoption unsuitable.</Paragraph> </Section> </Paper>