File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/c02-2018_abstr.xml
Size: 4,847 bytes
Last Modified: 2025-10-06 13:42:23
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2018"> <Title>An XML-based document suite</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We report on the current state of development of a document suite and its applications. This collection of tools for the flexible and robust processing of documents in German is based on the use of XML as a unifying formalism for encoding input and output data as well as process information. It is organized in modules with limited responsibilities that can easily be combined into pipelines to solve complex tasks. Strong emphasis is laid on a number of techniques for dealing with the lexical and conceptual gaps that are typical when starting a new application. Introduction We have designed and implemented the XDOC document suite as a workbench for the flexible processing of electronically available documents in German. We have decided to exploit XML (Bray et al., 1998) and its accompanying formalisms (e.g. XSLT (Site, 2002b)) and tools (e.g. xt (Clark, 2002)) as a unifying framework.</Paragraph> <Paragraph position="1"> All modules in the XDOC system expect XML documents as input and deliver their results in XML format. XML, like its precursor SGML, offers a formalism for annotating pieces of (natural language) text. More precisely: if a text is seen, as a simple first approximation, as a sequence of characters (alphabetic and white-space characters), then XML allows arbitrary markup to be associated with arbitrary subsequences of contiguous characters. Many linguistic units of interest are represented by strings of contiguous characters (e.g. words, phrases, clauses, etc.). It is therefore a straightforward idea to use XML to encode information about such a substring, interpreted as a meaningful linguistic unit, and to associate this information directly with the occurrence of the unit in the text.
The basic idea is further backed by XML's requirement that elements be properly nested. This is fully consistent with standard linguistic practice: complex structures are built up from simpler structures covering substrings of the full string in a nested way.</Paragraph> <Paragraph position="2"> The end users of our applications are domain experts (e.g. medical doctors, engineers, ...). They are interested in getting their problems solved, but they are typically neither interested nor trained in computational linguistics. Therefore the barrier to overcome before they can use a computational linguistics or text technology system should be as low as possible.</Paragraph> <Paragraph position="3"> This experience has consequences for the design of the document suite. The work in the XDOC project is guided by the following design principles, which have been abstracted from a number of experiments and applications with &quot;realistic&quot; documents (among others: emails, abstracts of scientific papers, technical documentation, ...): • The tools shall be usable for 'realistic' documents.</Paragraph> <Paragraph position="4"> One aspect of 'realistic' documents is that they typically contain domain-specific tokens that are not directly covered by classical lexical categories (like noun, verb, ...). Those tokens are nevertheless often essential for the user of the document (e.g. an enzyme descriptor like EC 4.1.1.17 for a biochemist).</Paragraph> <Paragraph position="5"> • The tools shall be as robust as possible.</Paragraph> <Paragraph position="6"> In general it cannot be expected that lexicon information is available for all tokens in a document.</Paragraph> <Paragraph position="7"> This holds not only for most tokens of 'nonlexical' types (such as telephone numbers, enzyme names, material codes, ...); even for lexical types there will always be 'lexical gaps'.
This may be caused either by neologisms or simply by starting to process documents from a new application domain with a new sublanguage. In the latter case lexical items will typically be missing from the lexicon ('lexical gap') and phrasal structures may be covered inadequately, or not at all, by the grammar.</Paragraph> <Paragraph position="8"> • The tools shall be usable independently but shall allow for flexible combination and interoperability.</Paragraph> <Paragraph position="9"> • The tools shall be usable not only by developers but also by domain experts without linguistic training. Here again XML and XSLT play a major role: XSL stylesheets can be exploited to allow different presentations of internal data and results for different target groups; for end users the internals are in many cases not helpful, whereas developers will need them for debugging. The tools in the XDOC document suite can be grouped according to their function:</Paragraph> </Section> </Paper>