<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1003"> <Title>Multi-Site Data Collection for a Spoken Language Corpus MADCOW *</Title> <Section position="4" start_page="0" end_page="2301" type="metho"> <SectionTitle> 2. Collecting the Data </SectionTitle> <Paragraph position="0"> Data collection procedures were not standardized across sites. We know that variation in these procedures can lead to vast differences in the resulting data. Though standardizing is often important and has played a crucial role in other areas of the DARPA SLS program, it is also difficult and costly. Spoken language understanding as a human-computer interface tool is a new technology, and the space of potential variations is enormous and largely unexplored. We therefore chose to sample various points in this space, and to document the differences. This decision may be revised as we learn from this experiment.</Paragraph> <Paragraph position="1"> We outline in this section those aspects of data collection shared by all systems, and provide a separate section for each data collection site to highlight the unique aspects of that site.</Paragraph> <Paragraph position="2"> The original data collected at TI and SRI used two human &quot;wizards.&quot; As the subject spoke a sentence, one person provided a fast transcription, while the other used NLParse 2 to generate an SQL query to access the database. At all sites subjects were led to believe they are talldng to a fully automated system. For data collected at SRI, this was true; all other sites used some automatic speech recognition and/or natural language understanding, with varying amounts of human transcription and error correction. AT&T used only audio outputs; all other sites used a computer screen to display tables of data and other information to the subject. The two standard microphones were the Sennheiser HMD- null able to the DARPA research community for the ATIS application You have only three days for job hunting, and you have arranged job interviews in two different cities! (The interview times will depend on your flight schedule.) Start from City-A and plan the flight and ground transportation itinerary to City-B and City-C, and back table-top microphone. Table 1 shows the total data collected by site, including training and test data 3.</Paragraph> <Paragraph position="3"> All sites used a set of air travel planning &quot;scenarios&quot; (problems) for subjects to solve; BBN supplemented these with problems more like general database query tasks. The scenarios varied greatly in complexity and in the number of queries required for a solution. For sites using a wizard, the wizard was constrained in behavior, and did not represent human-like capabilities, though the wizard's role varied from site to site. By agreement, one &quot;common scenario&quot; was designated, shown in Figure 1, and sites agreed to collect 10% of their data using this common scenario.</Paragraph> <Paragraph position="4"> All sites (except BBN) used a debriefing questionnaire which explained the nature of the experiment, unveiled the deception of the wizard, and elicited comments from the subject on the experience. All sites automatically generated logfiles documenting subject queries, system responses and time stamps for all key events. 
<Section position="1" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 2.1. BBN Data Collection </SectionTitle> <Paragraph position="0"> The BBN data collection setup employed an interactive subject and wizard interface based on X Windows.</Paragraph> <Paragraph position="1"> The subject's queries and answers were stacked on the color screen for later examination or other manipulation by the subject. The system also used BBN's real-time BYBLOS speech recognition system as the front end; the wizard had the choice of using the speech recognition output or correcting it. This choice allowed the wizard to give feedback (in the form of errorful speech recognition) to the subject that may have encouraged the subject to speak more clearly. Certainly there would be such feedback in a real system.</Paragraph> <Paragraph position="2"> The scenarios included not only trip planning scenarios, but also problem solving scenarios involving more general kinds of database access, e.g., finding the hub city for airline X. This was done to try to elicit a richer range of language use.</Paragraph> </Section> <Section position="2" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 2.2. CMU Data Collection </SectionTitle> <Paragraph position="0"> The Carnegie Mellon University (CMU) data collection system incorporated a working ATIS system \[15\] and a wizard. The subject sat at a computer that displayed a window containing system output, and another window that acted as an interface to the &quot;recognition&quot; system, which used a push-and-hold protocol to record speech.</Paragraph> <Paragraph position="1"> Two channels of data were recorded, using both the Sennheiser and the Crown microphones. An Ariel DM-N digitizer and a Symetrix 202 microphone pre-amplifier completed the equipment. The wizard, sitting two cubicles away in an open-plan lab, listened to the subject directly through headphones. A modified version of the CMU ATIS system was used to assist the wizard in database access. The wizard could paraphrase the subject's query or correct recognition errors before database access. Retrieved information was previewed by the wizard before being sent to the subject's display.</Paragraph> <Paragraph position="2"> The wizard also had available a set of standard &quot;error&quot; replies to be sent to the subject when appropriate (e.g., when the subject asked questions outside the domain).</Paragraph> <Paragraph position="3"> Subjects were recruited from the university environment; they ranged in age from 18 to 38, with a mean of 24 years. The subjects were introduced to the system by an experimenter who explained the procedure and sat with the subject during the first scenario. Standard air travel planning scenarios were used. The experimenter then left the enclosure, but was available if problems arose. Subjects completed as many scenarios as fit into an hour-long session. A maximum of six scenarios was available; an average of 4.6 were completed in the data collected to date.</Paragraph> </Section>
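The wizard's decision loop in the CMU setup described above (accept or correct the recognizer output, paraphrase the query, preview the retrieved table, or fall back to a canned error reply) can be summarized in a short sketch. The function names and reply texts below are hypothetical; the paper does not specify the actual software interface.

```python
CANNED_REPLIES = {
    "out_of_domain": "I'm sorry, I can't answer questions outside the air travel domain.",
    "system_error": "I'm sorry, there was a problem answering that question. Please try again.",
}

def wizard_turn(recognized_text, wizard_edit, run_atis, send_to_display):
    """One wizard-mediated turn, as a rough sketch of the CMU-style setup.

    recognized_text -- output of the speech recognizer for the subject's utterance
    wizard_edit     -- callable returning the wizard's corrected/paraphrased text,
                       or None if the wizard judges the request out of domain
    run_atis        -- callable mapping a text query to a retrieved table (or None)
    send_to_display -- callable that pushes a response to the subject's screen
    """
    # The wizard may accept the recognizer output or type a correction/paraphrase.
    query_text = wizard_edit(recognized_text)
    if query_text is None:
        send_to_display(CANNED_REPLIES["out_of_domain"])
        return

    table = run_atis(query_text)

    # Retrieved information is previewed by the wizard before being shown.
    if table is None:
        send_to_display(CANNED_REPLIES["system_error"])
    else:
        send_to_display(table)
```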
<Section position="3" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 2.3. MIT Data Collection </SectionTitle> <Paragraph position="0"> The MIT data collection paradigm emphasized interactive data collection and dialogue, using the MIT ATIS system \[10, 13\]. Data were collected by asking subjects to solve scenarios using the system; the experimenter sat in another room and transcribed a &quot;clean&quot; version of the subject's speech input. The transcriber eliminated hesitations, &quot;ums&quot; and false starts, but otherwise simply transmitted a transcription of what the subject said.</Paragraph> <Paragraph position="1"> The natural language component then translated the transcribed input into a database query and returned the display to the user. The MIT system produced several forms of output for the subject, including a summary of the question being answered (in both written and spoken form) and a reformatted tabular display without cryptic abbreviations. The system also supported a capability for system-initiated clarification dialogue to handle cases where the user underspecified a query. For example, if the user specified only a destination, the system would ask where the subject was departing from.</Paragraph> <Paragraph position="2"> Subjects were recruited mainly from MIT and consisted of undergraduates, graduate students and employees.</Paragraph> <Paragraph position="3"> Each subject was given a $10 gift certificate to a local store. A data collection session lasted approximately 45 minutes; it included an introduction by the experimenter (who also acted as transcriber); practice with the push-and-hold-to-talk mechanism; the solution of three or four scenarios (moving from simple scenarios to more complex ones involving booking a flight); and completion of a debriefing questionnaire. The data were collected in an office-noise environment using an Ariel Pro-Port A/D system connected to a Sun Sparcstation.</Paragraph> </Section> <Section position="4" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 2.4. AT&T Data Collection </SectionTitle> <Paragraph position="0"> The AT&T ATIS data were collected using a partially simulated, speech-in/speech-out spoken language system \[9\]. The natural language and database access components of the AT&T system were essentially identical to those of the MIT ATIS system \[10\]. The interface with the subject was designed to simulate an actual telephone-based dialogue: the system provided all information in the form of synthesized speech, as opposed to displaying information on a computer terminal. Speech data were captured simultaneously using (1) the Sennheiser microphone amplified by a Shure FP11 microphone-to-line amplifier, and (2) a standard carbon-button telephone handset (over local telephone lines). Digitization was performed by an Ariel Pro-Port A/D system.</Paragraph> <Paragraph position="1"> Before each recording session, the experimenter provided the subject with a brief verbal explanation of the task, a page of written instructions, a summary of the ATIS database domain, and a list of travel planning scenarios.</Paragraph> <Paragraph position="2"> The system initiated the dialogue at the beginning of the recording session, and responded after every utterance with information or with an error message.
The experimenter controlled recording from the keyboard, starting recording as soon as the system response ended, and stopping recording when the subject appeared to have completed a sentence. The experimenter then transcribed what the subject said, excluding false starts, and sent the transcription to the system, which automatically generated the synthesized response. A complete session lasted about an hour, including initial instruction, a two-part recording session with a five-minute break, and a debriefing questionnaire.</Paragraph> <Paragraph position="3"> Subjects for data collection were recruited from local civic organizations, and collection took place during working hours. As a result, 82 percent of the subjects were female, and subjects ranged in age from 29 to 77, with a median age of 55. In return for each subject's participation, a donation was made to the civic organization through which he or she was recruited.</Paragraph> </Section> <Section position="5" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 2.5. SRI Data Collection </SectionTitle> <Paragraph position="0"> The SRI data collection system used SRI's SLS system; there was no wizard in the loop. The basic characteristics of the DECIPHER speech recognition component are described in \[4\], \[6\], and the basic characteristics of the natural language understanding component are described in \[3\]. Two channels of data were recorded, using both the Sennheiser and the Crown microphones. Subjects clicked a mouse button to talk, and the system decided when the utterance was complete. The data were collected in an office-noise environment, using a Sonitech Spirit-30 DSP board for A/D connected to a Sun Sparcstation. Subjects were recruited from other groups at SRI, from a nearby university, and from a volunteer organization.</Paragraph> <Paragraph position="1"> They were given a brief overview of the system and its capabilities, and were then asked to solve one or several air travel planning scenarios. The interface allowed the user to move to the context of a previous question.</Paragraph> <Paragraph position="2"> Some subjects used the real-time hardware version of the DECIPHER system \[5\], \[16\]; others used the software version of the system. Other parameters that were varied included: instructions to subjects regarding what they should do when the system made errors, the interface to the context-setting mechanism, and the number of scenarios and sessions. See \[14\] for details on the interface and the conditions that were varied from subject to subject.</Paragraph> </Section> </Section> <Section position="5" start_page="2301" end_page="2301" type="metho"> <SectionTitle> 3. Distributing the Data </SectionTitle> <Paragraph position="0"> During the MADCOW collection effort, NIST was primarily responsible for two steps in the data pipeline: (1) quality control and distribution of &quot;initial&quot; unannotated data received from the collection sites; and (2) quality control and distribution of annotated data from the SRI annotators.</Paragraph> <Section position="1" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 3.1. Distribution of Initial Data </SectionTitle> <Paragraph position="0"> Initial (unannotated) data were received on 8mm tar-formatted tapes from the collection sites, logged into the file &quot;madcow-tapes.log&quot;, and placed in queue for distribution.
The initial data consisted of a .log file for each subject-scenario, and .wav (NIST-headered speech waveform with standard header fields) and .sro (speech recognition detailed transcription) files for each utterance. The 8mm tapes were downloaded and the initial data and filename/directory structure were verified for format compliance using a suite of shell verification programs. Non-compliant data were either fixed at NIST or returned to the site for correction, depending on the degree and number of problems. Twenty percent of the utterances from each collection site were then set aside as potential test data. The remaining data for training were assigned an initial release ID (date) and the textual non-waveform data were then made available to the collection and annotation sites via anonymous ftp. The tape log file, &quot;madcow-tapes.log&quot;, was updated with the release date. A cumulative lexicon in the file &quot;lexicon.doc.&lt;DATE&gt;&quot; was also updated with each new release. During the peak of data collection activity, these releases occurred at weekly intervals. When enough waveforms (.wav) had accumulated to fill a CD-ROM (630 MB), the waveforms were premastered on an ISO-9660 8mm tape which was then sent to MIT for &quot;one-off&quot; (recordable) CD-ROM production. Upon receipt of each CD-ROM from MIT, the initial release ID(s) of the data on the CD-ROM were recorded in the file &quot;madcow-waves.log&quot;, and the CD-ROMs were shipped overnight to the MADCOW sites.</Paragraph> </Section> <Section position="2" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 3.2. Distribution of Annotated Data </SectionTitle> <Paragraph position="0"> Annotated data from SRI were downloaded at NIST via ftp. The data were organized by initial release date in the standard ATIS file and directory structure and contained files for the query categorization (.cat), the wizard input to NLParse (.win), the SQL for the minimal answer (.sql), the SQL for the maximal answer (.sq2, generated from the minimal SQL), and the corresponding minimal and maximal reference answers (.ref, .rf2).</Paragraph> <Paragraph position="1"> The .cat, .ref, and .rf2 files in the release were verified for format compliance using a suite of verification programs.</Paragraph> <Paragraph position="2"> A classification summary was then generated for the release and the data made available to the MADCOW sites via anonymous ftp. The &quot;madcow-answers.log&quot; file was updated with the release date.</Paragraph> </Section> <Section position="3" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 3.3. Data Distribution Summary </SectionTitle> <Paragraph position="0"> Table 2 shows a summary by site and class of the annotated MADCOW data distributed by NIST as of December 20, 1991.</Paragraph> </Section> <Section position="4" start_page="2301" end_page="2301" type="sub_section"> <SectionTitle> 3.4. Common Documentation </SectionTitle> <Paragraph position="0"> To facilitate common data exchange, MADCOW developed a set of documents which specify the formats for each common file type.</Paragraph> <Paragraph position="2"> In addition, documentation was developed to specify directory and filename structures, as well as file contents.</Paragraph> <Paragraph position="3"> To ensure conformity, NIST created and distributed format verification software for each file type and for directory/filename structures. The specification documents and verification software are maintained for public distribution in NIST's anonymous ftp directory. NIST also maintains documentation for the transcription conventions, logfile formats, categorization principles, and principles of database interpretation in the same directory.</Paragraph>
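As a rough illustration of the kind of checking and partitioning described in this section, the sketch below validates ATIS-style filenames against the expected extensions and sets aside 20% of each site's utterances as potential test data. The directory layout, naming pattern, and random split are assumptions for illustration; they do not reproduce NIST's actual verification suite.

```python
import random
from pathlib import Path
from collections import defaultdict

# File types named in the paper: one .log per subject-scenario,
# .wav and .sro per utterance, plus the annotation files.
KNOWN_EXTENSIONS = {".log", ".wav", ".sro", ".cat", ".win", ".sql", ".sq2", ".ref", ".rf2"}

def verify_release(root):
    """Return the files whose extension is not one of the expected types."""
    return [p for p in Path(root).rglob("*") if p.is_file() and p.suffix not in KNOWN_EXTENSIONS]

def split_test_pool(utterance_ids_by_site, fraction=0.20, seed=0):
    """Set aside `fraction` of each site's utterances as potential test data."""
    rng = random.Random(seed)
    test, train = defaultdict(list), defaultdict(list)
    for site, ids in utterance_ids_by_site.items():
        ids = list(ids)
        rng.shuffle(ids)
        cut = int(len(ids) * fraction)
        test[site], train[site] = ids[:cut], ids[cut:]
    return test, train
```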
<Paragraph position="4"> To track the flow of data through the distribution &quot;pipeline&quot; during data collection, NIST maintained and published the data flow logs and documentation modifications in weekly electronic mail reports to MADCOW.</Paragraph> </Section> </Section> <Section position="6" start_page="2301" end_page="2301" type="metho"> <SectionTitle> 4. The Evaluation Paradigm </SectionTitle> <Paragraph position="0"> The diversity of data collection paradigms was a concern for MADCOW. To control for potential effects introduced by this diversity, it was agreed that test sets would consist of comparable amounts of data from each site (regardless of the amount of training material available from that site). In addition, benchmark test results would be displayed in an N * M matrix form (for the N systems under test from the M data collecting sites). For the February 1992 tests, the number of collecting sites (M) was 5. This format was intended to indicate whether data from one collecting site were &quot;outliers&quot; and whether a site performed particularly well on locally collected data.</Paragraph> <Paragraph position="1"> The February 1992 evaluation required sites to generate answers for data presented in units consisting of a &quot;subject-scenario&quot;. The utterances from the scenario were presented in sequence, with no annotation as to the class of the utterances. For scoring purposes, as in previous ATIS benchmark tests \[7\], test queries were grouped into several classes on the basis of annotations.</Paragraph> <Paragraph position="2"> Results for the context-independent sentences (Class A) and context-dependent sentences (Class D) were computed and tabulated separately, along with an overall score (A + D). Class X queries (&quot;unanswerable&quot; queries) were not included in the NL or SLS tests, but were included in the SPREC tests (since valid .lsn transcriptions existed for these utterances). The matrix tabulations reported % correct, % incorrect, and % weighted error, defined as \[2 * (%False) + (%No_Answer)\].</Paragraph> <Paragraph position="3"> The February 1992 results also reflected a new method of computing answer &quot;correctness&quot; using both a minimal and a maximal database reference answer. The objective was to guard against &quot;overgeneration&quot;: getting answers correct by including all possible facts about a given flight or fare, rather than by understanding what specific information was requested. This method (proposed by R. Moore and implemented by Moore and E. Jackson of SRI) specified the maximum relevant information for any query, and required that the correct answer contain at least the minimal correct information, and no more than the maximum. This method was first used during the October 1991 &quot;dry run&quot; and was adopted as the standard scoring procedure by the DARPA Spoken Language Coordinating Committee.</Paragraph>
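A worked sketch of this scoring rule follows: an answer counts as correct only if it contains everything in the minimal reference answer and nothing outside the maximal reference answer, and the weighted error combines false answers and non-answers as defined above. Treating answers as sets of result tuples is a simplifying assumption for illustration, not the official comparator.

```python
def answer_correct(answer, minimal_ref, maximal_ref):
    """True if the answer covers the minimal reference and stays within the maximal one.

    All three arguments are modeled here as sets of result tuples (an assumption;
    the real comparator works on the tabular reference answers in .ref/.rf2 files).
    """
    return minimal_ref <= answer <= maximal_ref

def weighted_error(num_correct, num_false, num_no_answer):
    """Weighted error as defined in the evaluation: 2 * (%False) + (%No_Answer)."""
    total = num_correct + num_false + num_no_answer
    pct_false = 100.0 * num_false / total
    pct_no_answer = 100.0 * num_no_answer / total
    return 2 * pct_false + pct_no_answer

# Example: 80 correct, 15 false, 5 unanswered gives 2*15 + 5 = 35% weighted error.
assert weighted_error(80, 15, 5) == 35.0
```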
<Paragraph position="4"> Three types of performance assessment tests were computed on the ATIS MADCOW benchmark test data: SPeech RECognition (SPREC), Natural Language (NL), and Spoken Language Systems (SLS) tests. Details of these tests, and a summary of &quot;official&quot; reported results, are to be found elsewhere in these Proceedings \[8\].</Paragraph> </Section> <Section position="7" start_page="2301" end_page="2301" type="metho"> <SectionTitle> 5. Annotation </SectionTitle> <Paragraph position="0"> The goal of annotation was to classify utterances and provide database reference answers for the subjects' queries in the ATIS domain. These reference answers were used by the system developers and by NIST to evaluate the responses of the MADCOW natural language and spoken language systems.</Paragraph> <Paragraph position="1"> The annotators began with the transcribed .sro files and determined the possible interpretations of each utterance, classifying each as class A (context-independent), class D (context-dependent), or class X (unevaluable). Those utterances which were evaluable (class A or D) were translated into an English-like form (.win, for wizard input) that could be interpreted by NLParse, a menu-driven program that converts English-like sentences into database queries expressed in SQL. Annotation decisions about how to translate the .sros were guided by the &quot;Principles of Interpretation&quot; (see the next section). After the .sro form of an utterance was classified and translated, the work was checked thoroughly by another annotator and by various checking programs. NLParse was then run to generate an SQL form in a .sql file. Finally, a series of batch programs was run on each .sql file to produce the minimal and maximal reference answers (.ref and .rf2 files) for the corresponding utterance. Figure 3 shows the annotation files created for a sample ATIS dialogue. Each line in italics identifies the file; the .sro file is the input; the .cat, .win, .ref and .rf2 files are created during the annotation procedure. Sentence #1 is class A, and has as its minimal reference answer the set of flight IDs for flights meeting the constraints. The maximal answer contains all of the columns used in the .sql query to constrain the answer; the answer is too large to be displayed here. The .sro for sentence #2 ends with a truncation (marked by a tilde, ~), which causes it to be classified as X (unevaluable). Thus no .win, .ref or .rf2 files are generated. Sentence #3 is a context-dependent utterance, due to the anaphoric expression &quot;that flight&quot;. It depends on #2, but since #2 is class X, #3 is also classified as X, following the principle that anything that depends on a class X (unevaluable) sentence must itself be unevaluable. Finally, sentence #4 is a yes-no question, which may have two answers: either YES or the set of entities satisfying the constraints. This sentence is also context-dependent, since it refers to flight US 732 between Pittsburgh and Boston. (Flight 732 may go to other cities; thus context is needed to establish the segment of interest.) Its categorization lists two interpretations (interp#1: yes/no context-dep:Q1; interp#2: wh-ques context-dep:Q1), and its .win file reads: &quot;List food services served on flights from Pittsburgh and to Boston and flying on 9/4/91 and whose airline code is US and whose flight number is 732&quot;. The minimal reference answer to the question about meals is defined to be the triple (meal, number, class). The maximal answer can include any information used in the .sql to generate the minimal answer.</Paragraph> </Section>
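The class-X propagation rule just illustrated (an utterance that depends on an unevaluable utterance is itself unevaluable) lends itself to a small sketch. The dictionary representation of utterance classes and dependencies below is an assumption for illustration, not the annotators' actual tooling.

```python
# Classes used in the ATIS evaluation:
#   "A" = context-independent, "D" = context-dependent, "X" = unevaluable.

def propagate_class_x(classes, depends_on):
    """Reclassify utterances as X when they depend on an X utterance.

    classes    -- dict mapping utterance id -> "A", "D", or "X"
    depends_on -- dict mapping a context-dependent utterance id to the id it depends on

    Follows the principle that anything depending on a class X sentence
    must itself be unevaluable.
    """
    result = dict(classes)
    changed = True
    while changed:  # iterate to handle chains of dependencies
        changed = False
        for utt, antecedent in depends_on.items():
            if result.get(antecedent) == "X" and result[utt] != "X":
                result[utt] = "X"
                changed = True
    return result

# Example mirroring Figure 3: #2 is truncated (class X), #3 depends on #2, #4 on #1.
classes = {"1": "A", "2": "X", "3": "D", "4": "D"}
depends_on = {"3": "2", "4": "1"}
assert propagate_class_x(classes, depends_on)["3"] == "X"
```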
<Section position="8" start_page="2301" end_page="2301" type="metho"> <SectionTitle> 6. Principles of Interpretation </SectionTitle> <Paragraph position="0"> In order to carry out an objective evaluation, it was necessary to be able to say whether an answer was right or wrong. In turn, deciding on the right answer often depended on how particular words and constructions in the query were interpreted. Thus, it was recognized early on in the development of the ATIS common task that it would be necessary to agree on specific definitions for certain vague expressions. In addition, given the current database, there was often more than one reasonable way of relating particular queries to the database. To ensure objectivity in the evaluation, decisions about how to interpret queries had to be documented in such a way that all participants in the evaluation had access to them. The Principles of Interpretation document describes the interpretation of queries with respect to the ATIS database. This document was used both by system developers to train their systems and by the annotators for developing reference answers.</Paragraph> <Paragraph position="1"> Examples of decisions in the Principles of Interpretation include: the meaning of terms like &quot;early morning&quot;, the classification of a snack as a meal for the purposes of answering questions about meals, and the meaning of constructions such as &quot;between X and Y&quot;, defined for ATIS to mean &quot;from X to Y&quot;.</Paragraph> <Paragraph position="2"> A subgroup on the Principles of Interpretation was formed to discuss and make decisions on new issues of interpretation as they arose during data collection and annotation. A representative from each site served on this subgroup. This ensured that all sites were notified when changes or additions occurred in the Principles, and allowed each site to have input into the decision process.</Paragraph> <Paragraph position="3"> It was important to make careful decisions, because any revision could cause previously annotated data to become inconsistent with the revised Principles of Interpretation. On the other hand, in many cases there was no one &quot;correct&quot; way of interpreting something, for example, the classification of a snack as a meal. In cases like this, the main goal was to make sure that all participants understood the chosen interpretation.</Paragraph> <Paragraph position="4"> It was agreed that reference answers should emphasize literal understanding of an utterance, rather than a cooperative answer, which might cause more information to be included than was actually requested. However, to support systems used for demonstrations and for data collection as well as for evaluation, answers needed to be minimally cooperative, since otherwise demonstration systems would have to answer differently from evaluation systems. Thus the main criterion was how well the proposed interpretation reflected understanding of the query, with some consideration for providing a cooperative answer.</Paragraph> </Section> </Paper>