<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0302"> <Title>The MATE Annotation Workbench: User Requirements</Title> <Section position="5" start_page="12" end_page="12" type="metho"> <SectionTitle> 3 A Step Forward, but not Magic </SectionTitle> <Paragraph position="0"> The technological solution to the flexible representation of overlapping tag hierarchies is to use the XML markup language \[11\]. XML can be used to describe any sort of coding, and the coding structure can be described in a Document Type Definition (DTD), which specifies which tags are possible and where they can occur. It is not possible for XML tags to overlap within a document, but one can create overlapping hierarchies of markup in the same corpus by using &quot;hyperlinks&quot;, which serve not just to point to elements in a different hierarchy of markup but also to include them for structural purposes \[10\]. For example, if dialogue moves are made up of words, and sentences are made up of the same words, as in Figure 1, then both can be linked to the same copy of the words, providing a way of relating the two structures. An example XML representation of a short dialogue fragment is given in Figure 2.</Paragraph> <Paragraph position="1"> Of course, XML is designed to be machine-readable, and should rarely (if ever) be inspected directly. The MATE project will use &quot;stylesheets&quot; based on the XSL language \[12\], which allow one to express operations on an XML corpus which can be used to typeset it for the human reader, choosing both what annotations to make visible and how they should be displayed. Using MATE stylesheets, it will also be possible to specify the link between user actions and modifications to the XML which is being displayed, and thus implement coding interfaces. Stylesheets are written as a sequence of rules which match the input and give instructions about what to do for the output. 
A simple example of such a rule is:</Paragraph> <Paragraph position="3"> There will usually be several rules in a stylesheet. The 'match' attribute tells us which elements this rule will apply to. In this case, the rule will match &lt;dummy&gt; elements in the input document. In the MATE stylesheet language, matching is done using the MATE query language; query construction is supported in the workbench by means of a graphical user interface. Line (2) and the following lines give the template of the rule. This is a description of the elements which will be created in the output document. In this case &lt;result&gt; is a literal element, and the result of this rule will be that all &lt;dummy&gt; elements will be converted into &lt;result&gt;A dummy.&lt;/result&gt; elements.</Paragraph> <Paragraph position="4"> Rules in a stylesheet are applied in order of occurrence starting with the top-level element in the document. There is a mechanism for top-down left-to-right traversal of the input document hierarchy, and a default rule for unmatched elements.</Paragraph> <Paragraph position="5"> Using XML and XSL allows a flexible solution to the problem of technological support, but this solution is not magical. For instance, XML and XSL do not remove the need to understand how tags relate to each other; they simply make it easier to specify a good machine-readable representation of complex tag relationships and to display these relationships for the human reader. To see what benefits this technology will bring, it is necessary to analyse the capabilities and requirements of different types of potential workbench users separately.</Paragraph> </Section> <Section position="6" start_page="12" end_page="14" type="metho"> <SectionTitle> 4 User Types </SectionTitle> <Paragraph position="0"> At least for sites involved in large-scale coding exercises, data coders are typically the cheapest labour source available. 
They do not wish to know anything about how the coding interface works or even how different sets of tags relate to each other. Their needs are fairly simple: an intuitive coding interface so that they can concentrate on the code distinctions, documentation of how to use the interface - although, human nature being what it is, we find that no amount of written material will replace good verbal instruction - and the coding instructions nearby, preferably on-line. The MATE Workbench will encourage fulfilment of these needs by providing guidelines and slots for documentation within the coding modules which define the coding task, through the example coding schemes which come with the workbench, and by providing sufficient coding actions for a wide range of interface designs. Thus the main benefit to coders is simply a side effect of having good support for the implementation of coding schemes via well-defined coding modules. Another potential benefit for longer-term coders is the ability to reconfigure the user interface which controls interaction with the workbench components to suit personal preferences, for instance, by rearranging the menus and buttons.</Paragraph> <Paragraph position="1"> There are several possible types of consumers of existing coded data. Some users might wish to check the relationships in the data for things which are statistically aberrant, mining the data pre-theoretically for whatever stands out. Others might wish to export some part of the data in order to train on the relationships which are present in it, in the hopes that some theoretically-motivated relationship will improve performance in, for instance, a spoken dialogue system. Straight theoreticians might wish to inspect particular relationships in order to test specific research hypotheses. 
Whatever the reason for interest in the corpus, consumers are united in their need to ask questions of the corpus, looking for places which match a specific form, and to display the results. Using the coded data well requires the mathematical capacity to understand the kind of structural information represented graphically in Figure 1, since otherwise the questions which the user asks will be meaningless. The more theoretically motivated the user, the more important this requirement is. Of course, this is true for all work with complex tag sets, not just those represented within MATE. In addition, where new kinds of display are required to match specific explorations of the data, new stylesheets will be required. The main benefits of the workbench for coding consumers are (a) the possibility of combining many different kinds of annotation on one data source, (b) a well-specified query language for exploring the relationships among the tags, and (c) methods for exporting different cuts on the data to other packages for further theoretical or statistical analysis.</Paragraph> <Paragraph position="2"> Many people wish to design their own coding schemes, either to improve on the reliability or suitability of an existing scheme or in order to test a particular research question. These coding developers may hire type 1 coders, but they are quite likely to do their coding themselves. This group of users has the hardest job. Designing a complete corpus requires the mathematical capacity not just for understanding structures such as that represented in Figure 1, but also for constructing new ones and mapping these into the sorts of file structures represented in Figure 2. This is true whether or not the corpus is represented in XML but is sometimes hidden away as something which only the software developer truly understands. 
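As a minimal sketch of the kind of structure involved, and not the MATE file format itself, the following Python fragment models two hierarchies (a sentence and a dialogue move) which point by identifier at one shared copy of the words. All names in it (Word, Span, resolve, shared_words) and the sample data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Word:
    id: str
    text: str


@dataclass(frozen=True)
class Span:
    """One unit in one hierarchy; it refers to shared words by id."""
    id: str
    label: str
    word_ids: Tuple[str, ...]


def resolve(span: Span, words: Dict[str, Word]) -> List[str]:
    """Follow the links from a span to the text of the words it covers."""
    return [words[word_id].text for word_id in span.word_ids]


def shared_words(a: Span, b: Span) -> List[str]:
    """Ids of the words covered by both spans, in the order used by a."""
    covered_by_b = set(b.word_ids)
    return [word_id for word_id in a.word_ids if word_id in covered_by_b]


# Hypothetical sample data: one copy of the words, two structures over it.
WORDS = {w.id: w for w in (Word("w1", "okay"), Word("w2", "go"), Word("w3", "left"))}
SENTENCE = Span("s1", "sentence", ("w1", "w2", "w3"))
MOVE = Span("m1", "instruct", ("w2", "w3"))
```

Because neither structure stores the words itself, a further coding level can be added as one more set of spans over the same identifiers without touching the existing ones. 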
One need not understand the relationships among all the tags on a corpus in order to install a new coding level, but one must at least be able to hook a new tag set into some part of the existing structure. Although this may seem onerous for the user, in reality most of the requirements are the same as they were before; users who wish to do something new have to understand what it is they are trying to do. The only additional requirement is that instead of developing their own ad hoc data representations and mappings between the data, the screen, and user actions, users need to understand how data is represented in XML and how to write stylesheets expressing these mappings. The benefit of XML and XSL for this activity is that they are likely to be better-structured and more flexible than anything coding designers will come up with for themselves, especially if they are not experienced software developers, and that there are many existing tools for support, with more on the way.</Paragraph> </Section> <Section position="7" start_page="14" end_page="14" type="metho"> <SectionTitle> 5 Our Solution: Pre-Implemented </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> Schemes Plus Development Tools </SectionTitle> <Paragraph position="0"> It is difficult to address all the user types at once, and since so little support currently exists for coding designers, it will be impossible to get the facilities for them right the first time. The fact that there are many users who do not necessarily want to implement their own coding schemes leads us to a staged solution. The MATE Workbench will be distributed with coding tools and basic display capabilities for a range of coding schemes at various levels of annotation from prosody and morphosyntax to dialogue acts and communication problems. These schemes are being chosen (or, in some cases, developed), based on an extensive review \[4\]. 
In addition to simply being practical and reliable schemes to represent their levels of coding, they are being chosen to represent as wide a spread of coding types as possible in order to test the workbench design.</Paragraph> <Paragraph position="1"> Current coding tools have also been reviewed \[2\] in order to inform us both about good design for tools for the chosen schemes and about the range of capabilities needed in the workbench. These schemes and the tools implemented for them can be used to allow users to develop a sense of the workbench capabilities, and for users who do not wish to implement new coding modules, they may be all that is required. The introduction of new schemes will be supported by tools for authoring XML corpora and stylesheets, and by the use of the existing scheme implementations as examples to modify.</Paragraph> </Section> </Section> <Section position="8" start_page="14" end_page="14" type="metho"> <SectionTitle> 6 Basic Functionalities </SectionTitle> <Paragraph position="0"> Keeping in mind our description of the basic user types, these are the basic functions which the MATE workbench will provide.</Paragraph> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.1 Display </SectionTitle> <Paragraph position="0"> Given a coded corpus, the display function will show on screen a human-readable version of the data.</Paragraph> <Paragraph position="1"> This display will be produced from the data using an XSL stylesheet, although there is no reason for the user to know that. Display options will include the size and placement of windows, text colour, font, and size, and text layout such as lists and tables. 
The display may include a speech waveform if one is associated with the dialogue, and user actions (such as clicking on an area of text) may be associated with further display information.</Paragraph> <Paragraph position="2"> Some users believe that our flexible approach to document typesetting means that conceptualising complex relationships within a data set will become easier - that is, that using the workbench will help to clarify their thinking about what to look for. In a sense, it will, because the ability to write specialist display stylesheets will allow the user to create views which abstract away to the right sections of the data. However, stylesheets do not enforce the creation of a usable data display. In particular, it is just as possible to overload a display with too much information using this technique as using any other (and perhaps more tempting, because the stylesheets make this easier to do). The basic limitation on conceptualising data relationships is human, and not a product of coding technologies.</Paragraph> </Section> <Section position="2" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.2 Query </SectionTitle> <Paragraph position="0"> Given a coded corpus, the query function will allow the user to construct a query which will match some part of the data set, and then will either extract that part (which can then either be exported or sent via a stylesheet for display) or count the number of matches, for performing frequency analyses. The MATE query language \[6, 7, 3\] contains constructs which allow the expression of either hierarchical or temporal relationships among a set of tags. This means that structural constraints can be given naturally (such as asking for all response moves within dialogues of a particular type, or verbs within relative clauses), but that cross-hierarchy constraints can also be expressed (such as asking for all disfluencies which occur during a particular type of intonational phrase). 
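To make the idea of a cross-hierarchy constraint concrete, here is a small Python sketch. It is not the MATE query language; it only assumes that every tag carries start and end times, and the names Tag, during and query, along with the sample labels and times, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tag:
    kind: str     # which hierarchy the tag belongs to, e.g. "phrase"
    label: str    # the value assigned by the coder
    start: float  # start time in seconds
    end: float    # end time in seconds


def during(inner: Tag, outer: Tag) -> bool:
    """True when inner lies wholly within the time span of outer."""
    return inner.start >= outer.start and outer.end >= inner.end


def query(tags: List[Tag], inner_kind: str, outer_kind: str, outer_label: str) -> List[Tag]:
    """All tags of inner_kind which occur during an outer_kind tag with outer_label."""
    outers = [t for t in tags if t.kind == outer_kind and t.label == outer_label]
    return [t for t in tags
            if t.kind == inner_kind and any(during(t, o) for o in outers)]


# Hypothetical sample data: two intonational phrases and two disfluencies.
TAGS = [
    Tag("phrase", "rising", 0.0, 2.0),
    Tag("phrase", "falling", 2.0, 4.0),
    Tag("disfluency", "filled-pause", 0.5, 0.8),
    Tag("disfluency", "repair", 2.5, 3.0),
]
MATCHES = query(TAGS, "disfluency", "phrase", "rising")
```

Here MATCHES holds the extracted tags, and taking its length gives the count which a frequency analysis would use. 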
The MATE workbench includes a point-and-click query formulation support tool; alternatively, queries can be typed at a command line.</Paragraph> </Section> <Section position="3" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.3 Coding </SectionTitle> <Paragraph position="0"> Coding tools will allow the user to add an annotation corresponding to a particular coding scheme.</Paragraph> <Paragraph position="1"> Coding interfaces will be specified by means of stylesheets. Typical coding actions might include using the mouse to specify a location, to sweep areas of text, or to bring up a text window or menu by which tagging details can be entered.</Paragraph> </Section> <Section position="4" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.4 Transcription </SectionTitle> <Paragraph position="0"> The transcription process should be highly individualised for a particular corpus depending on how recordings were obtained and for what purposes the corpus has been collected. Getting the process wrong can add months and great expense to a project. Good transcription requires software, such as spelling checkers, which would be difficult to provide within a workbench. In addition, even if good transcription tools were supported by the workbench, many projects would not be able to use them because they contract out transcription work to secretarial agencies which are only willing to quote for the work based on the model of audiotyping using standard word processing packages. As a result, we would expect most users to wish to do their transcription elsewhere, using other software, and to transfer their transcripts into the workbench when they are complete. On the other hand, many users experimenting with the system will wish to input small amounts of data so that they can test out the coding schemes and the workbench on new materials. 
In the first instance, we intend to provide a very simple transcription facility which will suffice for this purpose but which one would not wish to use for large-scale transcription, at least not without a great deal of thought about the alternatives.</Paragraph> </Section> <Section position="5" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.5 Import </SectionTitle> <Paragraph position="0"> The more existing corpora the workbench will work with, the more useful it will be when it is introduced.</Paragraph> <Paragraph position="1"> Unfortunately, current corpora are in a wide range of formats, many of which bear little relationship to XML. We intend to supply two conversion tools with the workbench which will handle conversion from BAS-Partitur and Entropic xwaves xlabel files into XML. Corpora produced to EAGLES recommendations \[1\] require minimal conversions. Users of other formats must support their own conversion processes, the software for which can be installed in local copies of the workbench.</Paragraph> </Section> <Section position="6" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 6.6 Export </SectionTitle> <Paragraph position="0"> Just as users of existing corpora may wish to import data, they may wish to export codings into another format, for instance, so that they can apply existing automatic annotation techniques to them. As with importation, possible export formats are too numerous and varied for us to implement converters for them all. Users who supply their own converters will again be able to install them into local copies of the workbench. Printing and PostScript output will be available as a function closely allied to display. Stylesheets can be used to produce other output formats which give specific views of the data; for instance, it is possible to use them to construct HTML or tabular information for input into spreadsheets or statistical software. 
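As a rough illustration of what tabular export amounts to, the Python sketch below flattens a few coded units into comma-separated text which a spreadsheet or statistics package could read. It uses the standard csv module in place of a stylesheet, so it shows the target format and not the MATE mechanism; the function name to_csv, the column names and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, List


def to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Render the chosen columns of each coded unit as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Keep only the requested columns; missing values become empty cells.
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()


# Hypothetical sample data: two dialogue moves with their codings.
ROWS = [
    {"id": "m1", "speaker": "A", "move": "instruct"},
    {"id": "m2", "speaker": "B", "move": "acknowledge"},
]
TABLE = to_csv(ROWS, ["id", "speaker", "move"])
```

Choosing the column list at call time is what gives a specific view of the data: the same rows can be exported with fewer or differently ordered columns for different analyses. 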
We are still considering how best to export information for visualisation of complete data sets, as required for data mining techniques.</Paragraph> </Section> </Section> <Section position="9" start_page="14" end_page="14" type="metho"> <SectionTitle> 7 Tools for Developers </SectionTitle> <Paragraph position="0"> There are two basic functions which will be required for corpus and coding scheme developers: adding a new coding level, and creating or editing a stylesheet. For adding a new coding module, we intend to support good practice by creating a template for storing information about each type of coding which leaves space for describing working practice, who coded each file, exactly what form of the coding manual was used, and so on. We are still considering what sorts of tools will best facilitate the DTD editing and stylesheet creation essential for new coding schemes, but here the workbench may itself be of service. XSL is an XML language, and current developments in XML suggest that DTDs will soon be written in XML itself using &quot;XML schemata&quot; \[13, 14\], so that DTD and stylesheet editors could be written quickly using the workbench. Editors written in this way could abstract away from the syntactic details which users find so difficult to deal with, leaving them to concentrate on the structure of the corpus additions.</Paragraph> </Section> </Paper>