<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1003"> <Title>The MATE Markup Framework</Title> <Section position="3" start_page="19" end_page="21" type="metho"> <SectionTitle> 2 The MATE Approach </SectionTitle> <Paragraph position="0"> This section first briefly describes the creation of the MATE markup framework and a set of example best practice coding schemes in accordance with the markup framework. Then it describes how a toolbox (the MATE Workbench) has been implemented to support the markup framework by enabling annotation on the basis of any coding scheme expressed according to the framework.</Paragraph> <Section position="1" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 2.1 Theory </SectionTitle> <Paragraph position="0"> The theoretical objectives of MATE were to specify a standard markup framework and to identify or, when necessary, develop a series of best practice coding schemes for implementation in the MATE Workbench. To these ends, we began by collecting information on a large number of existing annotation schemes for the levels addressed in the project, i.e. prosody, (morpho-)syntax, co-reference, dialogue acts, communication problems, and cross-level issues.</Paragraph> <Paragraph position="1"> Cross-level issues are issues which relate to more than one annotation level. Thus, for instance, prosody may provide clues for a variety of phenomena in semantics and discourse. The resulting report (Klein et al., 1998) describes more than 60 coding schemes, giving details per scheme on its coding book, the number of annotators who have worked with it, the number of annotated dialogues/segments/utterances, evaluation results, the underlying task, a list of annotated phenomena, and the markup language used. Annotation examples are provided as well.</Paragraph> <Paragraph position="2"> We found that the amount of pre-existing work varies enormously from level to level. 
There was, moreover, considerable variation in the quality of the descriptions of the individual coding schemes we analysed. Some did not include a coding book, others did not provide appropriate examples, some had never been properly evaluated, etc. The differences in description made it extremely difficult to compare coding schemes even for the same annotation level, and constituted a rather confused and incomplete basis for the creation of standard re-usable tools within, as well as across, levels.</Paragraph> <Paragraph position="3"> The collected information formed the starting point for the development of the MATE markup framework, which is a proposal for a standard for the definition and representation of markup for spoken dialogue corpora (Dybkjær et al., 1998). Analysis of the collected information on existing coding schemes, covering both the information which came with the schemes and the information which was found missing, provided input to our proposal for a minimal set of information items which should be provided for a coding scheme to make it generally comprehensible and re-usable by others. For instance, a prescriptive coding procedure was included among the information items in the MATE markup framework despite the fact that most existing coding schemes did not come with this information. This list of information items, which we call a coding module, is the core concept of the MATE markup framework and extends and formalises the concept of a coding scheme. The ten entries which constitute a coding module are shown in Figure 4. Roughly speaking, a coding module includes or describes everything that is needed in order to perform a certain kind of markup of spoken language corpora. 
A coding module prescribes what constitutes a coding, including the representation of markup and the relations to other codings.</Paragraph> <Paragraph position="4"> Thus, the MATE coding module is a proposal for a standardised description of coding schemes.</Paragraph> <Paragraph position="5"> The above-mentioned five annotation levels and the issues to do with cross-level annotation were selected for consideration in MATE because they pose very different markup problems. If a common framework can be established and shown to work for those levels and across them, it would seem likely that the framework will work for other levels as well.</Paragraph> <Paragraph position="6"> For each annotation level, one or more existing coding schemes were selected to form the basis of the best practice coding modules implemented in the MATE Workbench (Mengel et al., 2000). Common to the selected coding schemes is that these are among the most widely used coding schemes for their respective levels in current practice, each having been used by several annotators and for the annotation of many dialogues. Since all MATE best practice coding schemes are expressed in terms of coding modules, they should contain sufficient information for use by other annotators. Their uniform description in terms of coding modules makes it easy for the annotator to work on multiple coding schemes and/or levels, and to compare schemes since these all contain the same categories of information. The use of coding modules also facilitates use of the same set of software tools and enables the same interface look-and-feel independently of level.</Paragraph> </Section> <Section position="2" start_page="20" end_page="21" type="sub_section"> <SectionTitle> 2.2 Tooling </SectionTitle> <Paragraph position="0"> The engineering objective of MATE has been to specify and implement a generic annotation tool in support of the markup framework and the selected best practice coding schemes. 
Several existing annotation tools were reviewed early on to gather input for the MATE Workbench specification (Isard et al., 1998). Building on this specification, the MATE markup framework and the selected coding schemes, a Java-based workbench has been implemented (Isard et al., 2000) which includes the following major functionalities: The MATE best practice coding modules are included as working examples of the state of the art. Users can add new coding modules via the easy-to-use interface of the MATE coding module editor.</Paragraph> <Paragraph position="1"> An audio tool enables listening to speech files and having sound files displayed as a waveform. For each coding module, a default style sheet defines how output to the user is presented visually. Phenomena of interest in the corpus may be shown in, e.g., a certain colour or in boldface. Users can modify style sheets and define new ones.</Paragraph> <Paragraph position="2"> The workbench enables information extraction of any kind from annotated corpora. Query results are shown as sets of references to the queried corpora. Extraction of statistical information from corpora, such as the number of marked-up nouns, is also supported.</Paragraph> <Paragraph position="3"> Computation of important reliability measures, such as kappa values, is enabled.</Paragraph> <Paragraph position="4"> Import of files from XLabels and BAS Partitur to XML format is supported in order to demonstrate the usefulness of importing widely used annotation formats for further work in the Workbench. Similarly, a converter from Transcriber format (http://www.etca.fr/CTA/gip/Projets/Transcriber/) to MATE format enables transcriptions made using Transcriber to be annotated using the MATE Workbench.</Paragraph> <Paragraph position="5"> Other converters can easily be added. Export to file formats other than XML can be achieved by using style sheets. 
For example, information extracted by the query tool may be exported to HTML to serve as input to a browser.</Paragraph> <Paragraph position="6"> On-line help is available at any time.</Paragraph> <Paragraph position="7"> The first release of the MATE Workbench appeared in November 1999 and was made available to the more than 80 members of the MATE Advisory Panel from across the world. Since then, several improved versions have appeared and in May 2000 access to executable versions of the Workbench was made public. The MATE Workbench is now publicly available both in an executable version and as open source software at http://mate.nis.sdu.dk. The Workbench is still being improved by the MATE consortium, so new versions will continue to appear. A discussion forum has recently been set up at the MATE web site where colleagues are invited to ask questions and provide information from their experience with the Workbench, including the new tools they have added to the MATE Workbench to enhance its functionality. We have no exact figures on how many users are now using the Workbench, but we know that it is already being used in, and is being considered for use in, several European and national research projects.</Paragraph> </Section> </Section> <Section position="4" start_page="21" end_page="24" type="metho"> <SectionTitle> 3 The MATE Markup Framework </SectionTitle> <Paragraph position="0"> The MATE markup framework is a conceptual model which basically prescribes (i) how files are structured, for instance to enable multi-level annotation, (ii) how tag sets are represented in terms of elements and attributes, and (iii) how to provide essential information on markup, semantics, coding purpose etc.</Paragraph> <Section position="1" start_page="21" end_page="24" type="sub_section"> <SectionTitle> 3.1 Files, elements and attributes </SectionTitle> <Paragraph position="0"> When a coding module has been applied to a corpus, the result is a coding file. 
The coding file has a header which documents the coding context, such as who annotated the file, when, and the experience of the annotator, and a body which lists the coded elements. Figure 1 shows an example of how annotated communication problems are displayed to the user in the MATE Workbench. Figure 2 shows an excerpt of the internal representation of the file.</Paragraph> <Paragraph position="1"> Figure 1 caption (partial): ... communication problems (top left-hand panel). Guidelines for cooperative dialogue behaviour are shown in the top right-hand panel. Communication problems are categorised as types of violations of the cooperativity guidelines. Violation types are shown in the bottom right-hand panel. Notes may be added as part of the annotation. Notes are shown in the bottom left-hand panel. Figure 2 caption (partial): ... representation of the annotated dialogue shown in Figure 1. The tags will be explained in Section 3.1.1 below.</Paragraph> <Paragraph position="2"> As shown in Figure 2, the annotated file representation is simply a list of references to the transcription file. The underlying file structure idea is depicted in Figure 3 which shows how coding files (bottom layer) refer to a transcription file and possibly to other coding files, cf. entry 5 in the coding module in Figure 4. A transcription (which is also regarded as a coding file) refers to a resource file listing the raw data resources behind the corpus, such as sound files and log files. The resource file includes a description of the corpus: purpose of the dialogues, dialogue participants, experimenters, recording conditions, etc. A basic, sequential timeline representation of the spoken language data is defined. The timeline may be expressed as real time, e.g. 
in milliseconds, or as numbers indicating, e.g., [...] the files may refer to each other.</Paragraph> <Paragraph position="3"> Given a coding purpose, such as to identify all communication problems in a particular corpus, and a coding module, the actual coding consists in using syntactic markup to encode the relevant phenomena found in the data. A coding is defined in terms of a tag set. The tag set is conceptually specified by, and presented to, the user in terms of elements and attributes, cf. entry 6 in the coding module in Figure 4. Importantly, workbench users can use this markup directly without having to know about complex formal standards, such as SGML, XML or TEI.</Paragraph> <Paragraph position="4"> The basic markup primitive is the element (a term inherited from TEI and SGML) which represents a phenomenon such as a particular phoneme, word, utterance, dialogue act, or communication problem. Elements have attributes and relations to each other both within the current coding module and across coding modules. Considering a coding module M, the markup specification language is described as: * E1 ... En: The non-empty list of tag elements. * For each element Ei, the following properties may be defined: 1. Ni: The name of Ei.</Paragraph> <Paragraph position="5"> Example: &lt;u&gt; 2. Ei may contain a list of elements Ej from M.</Paragraph> <Paragraph position="6"> Example: &lt;u&gt; may contain &lt;t&gt;: &lt;u&gt;&lt;t&gt;Example&lt;/t&gt;&lt;/u&gt; 3. Ei has ni attributes Aij, where j = 1 .. ni. Example: &lt;u&gt; has attributes who and id, among others.</Paragraph> <Paragraph position="7"> 4. Ei may refer to elements in coding module Mj, implying that M references Mj.</Paragraph> <Paragraph position="8"> Example: a dialogue act coding may refer to phonetic or syntactic cues.</Paragraph> <Paragraph position="9"> A concrete example is the coding module for communication problems which, i.a., has the element &lt;comprob&gt;, cf. the XML representation in Figure 2. 
&lt;comprob&gt; has, i.a., the attributes id, uref and vtype. uref is a reference to an utterance in the transcription coding; vtype is a reference to a type of violation of a guideline in the violation type coding. Due to the inflexibility of XML, this logical structure has to be represented slightly differently internally in the workbench. Thus, the uref corresponds to the first href in [...] Attributes are assigned values during coding. For each attribute Aij the type of its values must be defined. There are standard attributes, user-defined attributes, and hidden attributes, as follows.</Paragraph> <Paragraph position="10"> Standard attributes are attributes prescribed by MATE.</Paragraph> <Paragraph position="11"> * id [mandatory]: ID. The element id is composed of the element name and a machine-generated number.</Paragraph> <Paragraph position="12"> Example: id=r~123 Time start and end are optional. Elements must have time information, possibly indirectly by referencing other elements (in the same coding module or in other modules) which have time information.</Paragraph> <Paragraph position="13"> * TimeStart [optional]: TIME. Start of event. * TimeEnd [optional]: TIME. End of event. User-defined attributes are used to parametrise and extend the semantics of the elements they belong to. For instance, who is an attribute of element &lt;u&gt; designating by whom the utterance is spoken. There will be many user-defined attributes (and elements), cf., e.g., the uref and vtype mentioned above.</Paragraph> <Paragraph position="14"> Hidden attributes are attributes which the user will neither define nor see but which are used for internal representation purposes. An example is the following of coding elements which may refer to utterances in a transcription but which depend on the technical programming choice of the underlying, non-user related representation: [...] predefined attribute value types (attributes are typed) which are supported by the workbench. 
By convention, types are written in capitals. The included standard types are: * TIME: in milliseconds, as a sequence of numbers, or as named points on the timeline. Values are numbers or identifiers, and the declaration of the timeline states how to interpret them.</Paragraph> <Paragraph position="15"> Example: time=123200 dur=1280 (these are derived values, with time = TimeStart, and dur = TimeEnd - TimeStart).</Paragraph> <Paragraph position="16"> * HREF[MODULE, ELEMENTLIST]: Here MODULE is the name of another coding module, and ELEMENTLIST is a list of names of elements from MODULE. When applied as concrete attribute values, two parameters must be specified: - The name of the referenced coding file which is an application of the declared MODULE coding module.</Paragraph> <Paragraph position="17"> - The id of the element occurrence that is referred to.</Paragraph> <Paragraph position="18"> The values of this attribute are of the form: &quot;CodeFileName#ElementId&quot;.</Paragraph> <Paragraph position="19"> Example: The declaration OccursIn: HREF[transcription, u] allows an attribute used as, e.g., OccursIn=&quot;base#u_123&quot;, where base is a coding file using the transcription module and u_123 is the value of the id attribute of a u element in that file.</Paragraph> <Paragraph position="20"> Example: For the declaration who: HREF[transcription, participant] an actual occurrence may look like who=&quot;#participant2&quot; where the omitted coding file name by convention generically means the current coding file.</Paragraph> <Paragraph position="21"> The concept of hyper-references together with parameters referencing coding modules (see point 5 in Figure 4) is what enables coding modules to handle cross-level markup.</Paragraph> <Paragraph position="22"> The user may be allowed to extend the set, but never to change or delete values from the set. * TEXT: Any text not containing &quot; 
(which is used to delimit the attribute value).</Paragraph> <Paragraph position="23"> Example: The declaration desc: TEXT allows uses such as: &lt;event desc=&quot;Door is slammed&quot;&gt;.</Paragraph> </Section> <Section position="2" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 3.2 Coding modules </SectionTitle> <Paragraph position="0"> In order for a coding module and the dialogues annotated using it to be usable and understandable by people other than its creator, some key information must be provided. The MATE coding module, which is the central part of the markup framework, serves to capture this information. A coding module consists of the ten items shown in Figure 4.</Paragraph> <Paragraph position="1"> 1. Name of the module.</Paragraph> <Paragraph position="2"> 2. Coding purpose of the module.</Paragraph> <Paragraph position="3"> 3. Coding level.</Paragraph> <Paragraph position="4"> 4. The type of data source scoped by the module.</Paragraph> <Paragraph position="5"> 5. References to other modules, if any. For transcriptions, the reference is to a resource. 6. A declaration of the markup elements and their attributes. An element is a feature, or type of phenomenon, in the corpus for which a tag is being defined.</Paragraph> <Paragraph position="6"> 7. A supplementary informal description of the elements and their attributes, including: a. Purpose of the element, its attributes, and their values.</Paragraph> <Paragraph position="7"> b. Informal semantics describing how to interpret the element and attribute values. c. Example of each element and attribute. 8. An example of the use of the elements and their attributes.</Paragraph> <Paragraph position="8"> 9. A coding procedure.</Paragraph> <Paragraph position="9"> 10. Creation notes.</Paragraph> <Paragraph position="10"> Some coding module items have a formal role, i.e. they can be interpreted and used by the MATE workbench. 
Thus, items (1) and (5) specify the coding module as a named parametrised module or class which builds on certain other predefined modules (no cycles allowed). Item (6) specifies the markup to be used in the coding. All elements, attribute names, and ids have a name space restricted to their module and its coding files, but are publicly referable by prefixing the name of the coding module or coding file in which they occur. Other items provide directives and explanations to users. Thus, (2), (3) and (4) elaborate on the module itself, (7) and (8) elaborate on the markup, and (9) recommends coding procedure and quality measures. (10) provides information about the creation of the coding module, such as by whom and when. In the following, we show an abbreviated version of a coding module for communication problems to illustrate the 10 coding module entries.</Paragraph> <Paragraph position="11"> Name: Communication_problems.</Paragraph> <Paragraph position="12"> Coding purpose: Records the different ways in which generic and specific guidelines are violated in a given corpus. A communication problems coding file refers to a problem type coding file as well as to a transcription. Coding level: Communication problems.</Paragraph> <Paragraph position="13"> Data sources: Spoken human-machine dialogue corpora.</Paragraph> <Paragraph position="14"> Module references: Module Basic orthographic transcription; Module Violation types.</Paragraph> </Section> </Section> <Section position="5" start_page="24" end_page="26" type="metho"> <SectionTitle> ATTRIBUTES </SectionTitle> <Paragraph position="0"> wref: REFERENCE(Basic_orthographic_transcription, (w,w)+) uref: REFERENCE(Basic_orthographic_transcription, u+) Description: In order to annotate communication problems produced by inadequate system utterance design we use the element comprob. It refers to some kind of violation of one of the guidelines listed in Figure 1, top right-hand panel. 
The comprob element may be used to mark up any part of the dialogue which caused, or might cause, a communication problem. Thus, comprob may be used to annotate one or more words, an entire utterance, or even several utterances in which an actual or potential communication problem was detected. The comprob element has five attributes in addition to the automatically added id.</Paragraph> <Paragraph position="1"> The attribute vtype is mandatory. vtype is a reference to a description of a guideline violation in a file which contains the different kinds of violations of the individual guidelines. Either wref or uref must be indicated. Both these attributes refer to an orthographic transcription. wref delimits the word(s) which caused or might cause a communication problem, and uref refers to one or more entire utterances which caused or might cause a problem.</Paragraph> <Paragraph position="2"> We stop the illustration here due to space limitations. The full description is available in (Mengel et al. 2000).</Paragraph> <Paragraph position="3"> Example: In the following snippet of a transcription from the Sundial corpus: &lt;u id=&quot;S1:7-1-sun&quot; who=&quot;S&quot;&gt;flight information british airways good day can I help you&lt;/u&gt; communication problems are marked up as follows: &lt;comprob id=&quot;Y&quot; vtype=&quot;Sundial_problems#SG4-1&quot; uref=&quot;Sundial#S1:7-1-sun&quot;/&gt; We do not exemplify note here and do not show the violation type coding file due to space limitations. However, note that once a coding module is specified in the MATE workbench, the user does not have to bother about the markup shown in the example above. The user just selects the utterance to mark up and then clicks on the violation type palette, or, in case it is a new type, clicks on the violated cooperativity guideline which means that a new violation type is added and text can be entered to describe it, cf. 
Figure 1.</Paragraph> <Paragraph position="4"> Coding procedure: We recommend using the same coding procedure for markup of communication problems as for violation types since the two actions are tightly connected. As a minimum, the following procedure should be followed: 1. Encode by coders 1 and 2.</Paragraph> <Paragraph position="5"> 2. Check and merge codings (performed by coders 1 and 2 until consensus).</Paragraph> <Paragraph position="6"> Creation notes: Authors: Hans Dybkjær and Laila Dybkjær. Version: 1 (25 November 1998), 2 (19 June 1999).</Paragraph> <Paragraph position="7"> Comments: For guidance on how to identify communication problems and for a collection of examples the reader is invited to look at (Dybkjær 1999).</Paragraph> <Paragraph position="8"> Literature: (Bernsen et al. 1998).</Paragraph> <Paragraph position="9"> The MATE Workbench allows its users to specify a coding module via a coding module editor. A screen shot of the coding module editor is shown in Figure 5.</Paragraph> </Section> <Section position="6" start_page="26" end_page="26" type="metho"> <SectionTitle> 4 Early Experience and Future Work </SectionTitle> <Paragraph position="0"> The MATE markup framework has been well received for its transparency and flexibility by the colleagues on the MATE Advisory Panel.</Paragraph> <Paragraph position="1"> The framework has been used to ensure a common description of coding modules at the MATE coding levels and has turned out to work well for all these levels. We therefore conclude that the framework is likely to work for other annotation levels not addressed by MATE. The use of a common representation and a common information structure in all coding modules at the same level as well as across levels facilitates within-level comparison, creation and use of new coding modules, and working at multiple levels.</Paragraph> <Paragraph position="2"> On the tools side, the markup framework has not been fully exploited as intended, i.e. 
as an intermediate layer between the user interface and the internal representation. This means that the user interface for adding new coding modules, in particular for the declaration of markup, and for defining new visualisations is still sub-optimal from a usability point of view. The coding module editor, which is used for adding new coding modules, represents a major step forward compared to requiring users to write DTDs. The coding module editor automatically generates a DTD from the specified markup declaration. However, the XML format used for the underlying file representation has not been hidden completely from the editor's interface. Peculiarities and lack of flexibility in XML have been allowed to influence the way in which users must specify elements and attributes, making the process less logical and flexible than it could have been. It is high on our wish list to repair this shortcoming. As regards coding visualisation, XSLT-like style sheets are used to define how codings are displayed to the user. Writing style sheets, however, is cumbersome and definitely not something users should be asked to do to define how codings based on a new coding module should be displayed. We either need a style sheet editor comparable to the coding module editor as regards ease of use, or, alternatively, a completely new interface concept should be implemented to replace the style sheets and enable users to easily define new visualisations. It is high on our wish list to better exploit the markup framework in the Workbench implementation in order to achieve a better user interface.</Paragraph> <Paragraph position="3"> Other frameworks have been proposed but to our knowledge the MATE markup framework is still the most comprehensive framework available.</Paragraph> <Paragraph position="4"> An example is the annotation framework recently proposed by Bird and Liberman (1999) which is based on annotation graphs. 
These are now being used in the ATLAS project (Bird et al., 2000) and in the Transcriber tool (Geoffrois et al., 2000). The annotation graphs serve as an intermediate representation layer in agreement with the argument above for having an intermediate layer of representation between the user interface and the internal representation.</Paragraph> <Paragraph position="5"> Whilst Bird and Liberman do not consider coding modules or discuss the interface from a usability point of view, they present detailed considerations concerning time line representation and time line reference. The two frameworks may, indeed, turn out to complement each other nicely.</Paragraph> </Section> </Paper>