<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1514">
  <Title>Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Common Assumptions, Needs and Goals in Natural Language Studies
</SectionTitle>
    <Paragraph position="0"> Goals in Natural Language Studies Human language resources, expensive to create and maintain, are in increasing demand among a growing number of research communities. One solution to this expanding need is to reannotate and reuse language resources created for other purposes. The now classic example is that of the Switchboard-1 Corpus (ISBN: 1-58563-121-3), a collection of 2400 two-sided telephone conversations among 543 U.S. speakers, created by Texas Instruments in 1991. Although collected for speaker identification and topic spotting research, Switchboard has been widely used to support large vocabulary conversational speech recognition. It has been extensively corrected twice, once at Penn and NIST, and once at Mississippi State. Two excerpts have been published as test corpora for government-sponsored projects. At least 6 other annotations have been created at various times and more-or-less widely distributed among research sites: part-of-speech annotation (Penn); syntactic structure annotation (Penn); dysfluency annotation (Penn); partial phonetic transcription (independently at UCLA and at Berkeley); and discourse function annotation (Colorado). These annotations use different &amp;quot;editions&amp;quot; of the underlying corpus and have sometimes silently introduced their own corrections or modified the data format to suit their needs. Thus the Colorado discourse function annotation was based on phrase structures introduced by the Penn dysfluency annotation, which in turn was based on the Penn/NIST corrections, which in turn were based on the original TI transcriptions of the underlying (and largely unchanging) audio files. Switchboard and its derivatives remain in active use worldwide, and new derivatives continue to be produced, along with (published and unpublished) corrections of old ones. This worsens the already acute problem of establishing and maintaining coherent relations among the derivatives in common use today.</Paragraph>
    <Paragraph position="1"> The Switchboard-1 case is by no means isolated (Graff &amp; Bird 2000). The Topic Detection and Tracking Corpus, TDT-2 (ISBN: 1-58563-157-4) was created in 1998 by LDC and contains newswire and more than 600 hours of transcribed broadcast news from 8 English and 3 Chinese sources sampled daily over six months with annotations to indicate story boundaries and relevance of those stories to 100 randomly selected topics. Since its release, TDT-2 has been used as training, development-test and evaluation data in the TDT evaluations; the audio has been used in TREC SDR evaluations (Garofalo, Auzanne and Voorhees 2000), TDT text has been partially re-annotated for entity detection in the Automatic Content  al. 2000) and Audio-Visual Speech Recognition (Chalapati 2000).</Paragraph>
    <Paragraph position="2"> Switchboard and TDT are just two examples of a growing trend toward reannotation and reuse of language resources, a trend that is not limited to language engineering. Miller and Walker (2001) have demonstrated the value of the CALLHOME German corpus (ISBN: 158563-117-5), developed to support speech recognition research, for language teaching.</Paragraph>
    <Paragraph position="3"> Deckert &amp; Yaeger-Dror (2000) have used Switchboard to study regional syntactic variation in American English.</Paragraph>
    <Paragraph position="4"> Reannotation and reuse of linguistic data highlight the need for common infrastructure to support resource development across disciplines and specialties.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Overlaps between Human Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Technology and Other Linguistic
Research
</SectionTitle>
      <Paragraph position="0"> Many specialties in empirical linguistics and language engineering require large volumes of language data and tools for browsing and searching the data efficiently. The sections that follow provide examples of recent efforts to address emerging needs for language resources.</Paragraph>
      <Paragraph position="1"> Interlinear Texts and Linguistic Exploration Interlinear text is a product of linguistic fieldwork often in low-density languages. The physical appearance of interlinear text typically consists of a main text line annotated with linguistic transcriptions and analyses, such as morphological representations, glosses at various levels, part-of-speech tags, and a free translation at the sentence level. Fragments of these annotation lines are vertically aligned with the corresponding fragments of text. Phrasal translations and footnotes are often presented on other lines. Interlinear texts come in many forms and can be represented digitally in many ways, e.g. plain text with hard spacing, tables, special markup, and special-purpose data structures.</Paragraph>
      <Paragraph position="2"> There are various methods for linking to audio data and lexical entries, and for including footnotes and other marginalia. This diversity of form presents problems for general-purpose software for searching, exchanging, displaying and enriching interlinear texts. Nonetheless interlinear text is a precious resource with multiple uses in natural language processing. Its various components can be used in the development of lexical and morphological resources, can support tagging and parsing and can provide training material for machine translation. Maeda and Bird (2000, 2001) demonstrated a tool for creating interlinear text. A screenshot appears in Figure 1.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Toolkit
Sociolinguistic Annotation
</SectionTitle>
      <Paragraph position="0"> The quantitative analysis of linguistic variation begins with empirical observation and statistical description of linguistic behavior.</Paragraph>
      <Paragraph position="1"> Although general computer technology encourages the collection, annotation, analysis and discussion of linguistic behavior wholly within the digital domain, few tools exist to help the sociolinguist in this effort. The project on Data and Annotations for Sociolinguistics (DASL) is investigating best practices via a case study of well-documented sociolinguistic phenomena in several large speech corpora: TIMIT , Switchboard-1, CallHome and Hub-4.</Paragraph>
      <Paragraph position="2"> Researchers are currently annotating the corpora for t/d deletion, the process by which [t] and [d] sometimes fail to be realized under certain phonological, morphological and social conditions. The case study is also a means to address broader questions: How do the specified corpora compare with the interview data typically used in sociolinguistics? Will the study of corpus data reveal new patterns not evident in the more common studies conducted within the framework of the speech community? Can empirical research on language variation be organized on a large scale with teams of nonspecialist annotators? All of the data used in DASL were originally created to support human language technology development; the datasets are currently being reannotated to support empirical studies of linguistic variation. A custom annotation tool allows users to query each corpus for tokens of potential interest greatly reducing effort relative to traditional approaches. Annotators can read or listen to each token, access demographic data and encode their observations in formats compatible with other analytical software used in the community. The web-based interface in Figure 2 promotes multi-site annotation and the study of inter-annotator consistency (Cieri and Strassel, 2001).</Paragraph>
      <Paragraph position="3"> Authoring Resources and Tools for Language Learning Although current information technology encourages new approaches in computer assisted language learning and teaching, progress in this area is hampered by an inadequate supply of language resources. The SMART (Source Media Authoring Resources and Tools) pilot project is addressing this problem by providing appropriately licensed data and software resources for preparing language-learning material. The Linguistic Data Consortium, a partner in this effort, is contributing several of its large data sets including conversational and broadcast data in Arabic, English, French and German. The language resources overlap almost completely with those used in language engineering. SMART is building upon the distribution model established in LDC Online, a service that provides network-based access to hundreds of gigabytes of text and audio data and annotations. Audio data are available digitally in files corresponding to a conversation, broadcast or other linguistic event. To facilitate searching,</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
LDC Online includes, according to their
</SectionTitle>
    <Paragraph position="0"> grained access. For example, where a time-aligned trans cript of a conversation exists, users may extract, reformat and play any segment specified by the time stamps in the transcript.</Paragraph>
    <Paragraph position="1"> SMART is building upon this foundation by providing additional data resources, browsing and search customized to the needs of language teachers and additional output formats to accommodate courseware authoring tools available in the commercial market.</Paragraph>
    <Paragraph position="2"> SMART promises to benefit a wide range of language teachers and learners but only to the extent that its resources are readily available. The volume of SMART data exceeds that which can be easily transferred over a network. Even small video clips consume hundreds of megabits of bandwidth. Instead SMART data will be delivered via servers that maintain raw data and associated annotations, permit browsing and queries and allow the user to specify the format and granularity of the response. The user will have the option of downloading the data for local use or adding annotations that may be kept privately or made public via the annotation server. The technology of the annotation server coupled with the extensibility of annotation graphs described below will enables nearly unconstrained access to SMART data.</Paragraph>
    <Paragraph position="3"> These efforts to support interlinear text, sociolinguistic annotation and multimodal data in language teaching each require flexible access to signal data and associated annotations. The sections that follow describe an architecture that provides such access.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Development
</SectionTitle>
      <Paragraph position="0"> Storing and serving large amounts of annotated data via the web requires interoperable data representations and tools along with methods for handling external formats and protocols for querying and delivering annotations. Annotation graphs were presented by Bird and Liberman (1999) as a general purpose model for representing and manipulating annotations of time series data, regardless of their physical storage format. An annotation graph is a labeled, directed, acyclic graph with time offsets on some of its nodes.</Paragraph>
      <Paragraph position="1"> The formalism is illustrated below by application to the TIMIT Corpus (Garofolo et al, 1986). The original TIMIT word file contains starting and ending offsets (in 16KHz samples) and transcripts of each word in the audio file  A section of the corresponding annotation graph appears in Figure 3. Each node displays the node identifier and the time offset. The arcs are decorated with type and label information. Type W is for words and the type P is for  A large amount of annotation can be efficiently represented and indexed in this manner. This brings us to the question of converting (or loading) existing data into such a database. The LDC's catalog alone includes nearly 200 publications, where each typically has its own format (often more than one). The sheer quantity and diversity of the data presents a significant challenge to the conversion process. In addition, some corpora exist in multiple versions, or include uncorrected, corrected and re-corrected parts.</Paragraph>
      <Paragraph position="2"> The Annotation Graph Toolkit, version 1.0, contains a complete implementation of the annotation graph model, import filters for several formats, loading/storing data to an annotation server (MySQL), application programming interfaces in C++ and Tcl/tk, and example annotation tools for dialogue, ethology and interlinear text. The supported formats are: xlabel, TIMIT, BAS Partitur, Penn Treebank, Switchboard, LDC Callhome, CSV and AIF level 0. Future work will provide Python and Perl interfaces, more supported formats, a query language and interpreter, and a multi-channel transcription tool. All software is distributed under an open source license, and is available from http://www.ldc.upenn.edu/AG/.</Paragraph>
      <Paragraph position="3"> Given that the annotation data can be stored in a relational database, it can be queried directly in SQL. More convenient, a domain-specific query language will be developed (see Cassidy and Bird 2000 and the work cited there). Query expressions will be transmitted over the web in the form of a CGI request, and translated into SQL by the annotation server.</Paragraph>
      <Paragraph position="4"> The resulting annotation data will be returned in the form of an XML document. An example for the TIMIT database, using the language proposed by Cassidy and Bird (2000), will serve to illustrate: Find word arcs spanning a sequence of segments beginning with hv and containing ae:</Paragraph>
      <Paragraph position="6"> Executed on the above annotation data, this query would return the XML document in  Neither the query nor the returned document are intended for human consumption. A client-side annotation tool will initiate queries and display annotation content on behalf of an end- null This annotation tool and server are integrated using the model shown below. A simplified client-server model, working at the level of annotation files is already available with the current distribution of the Annotation Graph Toolkit. Significantly, a networked annotation tool is identical to a standalone version, except that the AG library fetches its data from a remote server instead of local disk.</Paragraph>
      <Paragraph position="7"> The annotation graph formalism, annotation servers and the emerging query language will provide basic infrastructure to store, process and deliver essentially arbitrary amounts and types of signal annotations for a wide variety of research and teaching tasks including those described above. This infrastructure will enable reuse of existing resources and coordinated development of new resources both within and across research communities working with annotated linguistic datasets.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Remaining Challenges to Language
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Resource Development
</SectionTitle>
      <Paragraph position="0"> We have described a process whereby annotated data in a variety of formats can be loaded into a central database server that interacts directly with annotation tools. The Annotation Graph Toolkit, version 1.0, is the first implementation of this architecture. As the toolkit undergoes future development, it will need to deal continually with conversion issues.</Paragraph>
      <Paragraph position="1"> Annotation data will continue to be created and manipulated by multiple tools and to be stored in incompatible file formats. Data will continue to be mapped between different formats so that appropriate tools can be used, and appropriately managed to keep inconsistencies from arising.</Paragraph>
      <Paragraph position="2"> There will still be times when we need to trace the provenance of a particular item, back through a history involving several formats.</Paragraph>
      <Paragraph position="3"> These will always be hard problems; the proposed infrastructure will address them but no infrastructure is likely to eliminate conversion, integrity and provenance issues.</Paragraph>
      <Paragraph position="4"> Annotation graphs focus on the problems of dealing with time series. They do not directly address paradigmatic data such as lexicons and demographic tables. One should note however, that time series data and paradigmatic data can be united efficiently. As already mentioned, annotation graphs may be stored trivially in relational tables, technology routinely used for paradigmatic data. In this way, conventional &amp;quot;joins&amp;quot; of relational table can convolve time-series annotations with paradigms (e.g. texts with dictionaries or utterances with speaker demographics).</Paragraph>
      <Paragraph position="5"> Through judicious compromises - such as one-time computer-assisted conversion of legacy annotation data and creating once-off interfaces to existing useful tools - and through the judicious combination of simple and well-supported formalisms and technologies as described above, we believe that the management problems can be substantially reduced in scale and severity.</Paragraph>
      <Paragraph position="6"> We can illustrate the advantages of AG with a example of the annotation of the Switchboard corpus for -t/d deletion. Switchboard contains two-channel audio of thousands of 5-minute conversations among pairs of speakers that have been transcribed with the transcripts time-aligned to the audio. A single utterance is written: 274.35 279.50 A.119 Uh, he, uh, carves out different figures in the, in the plants, giving the start and stop time of the utterance, channel, speaker ID and the transcript of the utterance. This can be converted trivially into AG format as above.</Paragraph>
      <Paragraph position="7">  The DASL tool concordances audio transcripts and identifies utterances in which the target phenomenon (eg. -t/d deletion) may occur. A line of the concordance file contains two IDs one to identify the utterance within the concordance, the other to link back to the original corpus. The &lt;annotate&gt; tags identify a potential environment for the phenomenon under study.</Paragraph>
      <Paragraph position="8"> &lt;sample id=&amp;quot;1&amp;quot; senid=&amp;quot;10194&amp;quot;&gt;uh he uh carves out &lt;annotate&gt; different figures &lt;/annotate&gt; in the in the p[lants]- plants shrubs &lt;/sample&gt; The link between the concordance and the original corpus is maintained through a table containing: Sentence_ID, File_ID, Start_Time, Stop_Time, Channel and Speaker.</Paragraph>
      <Paragraph position="9"> 10194 2141 274.35 279.50 A 1139 Speakers' demographic data appears in another table containing: Speaker_ID, Sex, Age, Region, Education_Level 1139, MALE, 50, NORTHERN, 2 The DASL interface embeds the concordance results in a template containing input fields for each parameter to be annotated (see Figure 2). The linguist's annotation of the utterance can be stored in AG formalism as in Figure 5. Note that although AGs provide an elegant and general solution to the annotation of time series data, they do not remove the need to deal with the ad hoc formats one may encounter in various corpora. Nor do they remove the need to track the relations among elements in time-series data and paradigmatic material.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>