<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2714"> <Title>Middleware for Creating and Combining Multi-dimensional NLP Markup</Title> <Section position="3" start_page="0" end_page="82" type="metho"> <SectionTitle> 2 Middleware Architecture </SectionTitle> <Paragraph position="0"> Fig. 1 gives a schematic overview of the middleware server, which sits between applications (above) and external NLP components (below). When a new application session is started in Heart of Gold, it takes a configuration specifying the NLP components to start for the session. Each component is started according to its own parameterized configuration.</Paragraph> <Paragraph position="1"> The client can send texts to the middleware, and the NLP components are then queried in a numerically defined processing order ('Depth' in Fig. 4). The shallowest components (e.g. the tokenizer) are assigned low numbers and are started first. The output of each component must be XML markup.</Paragraph> <Paragraph position="2"> By default, each component gets the output of the previous component as input, but it can also request other annotations as input via its configuration. Components may produce multiple output annotations (e.g. in different formats). Thus, the component dependency structure in general forms a graph.</Paragraph> <Paragraph position="3"> [...] and Thomas Klöcker for their implementation work, Bernd Kiefer and the co-authors in the cited papers for fruitful cooperation, and the reviewers for valuable comments. This work has been supported by a grant from the German Federal Ministry of Education and Research (FKZ 01IWC02).</Paragraph> <Section position="1" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 2.1 Session and multi-dimensional annotation management </SectionTitle> <Paragraph position="0"> The resulting multi-dimensional annotations are stored in a per-session markup storage (Fig.
2) that groups all annotations for an input query (a sentence or text) in annotation collections. The markup storage can also be made persistent by saving it to XML files or to an XML database.</Paragraph> <Paragraph position="1"> Annotations can be accessed uniquely in XPath expressions via a URI of the form hog://sid/acid/aid, where sid is a session ID, acid is an annotation collection ID and aid is an annotation identifier, typically the name of the producing component. Structured metadata such as configuration and processing parameters (e.g. processing time and date, language ID) are always stored within the annotation markup as the first daughter element of the root.</Paragraph> </Section> <Section position="2" start_page="81" end_page="81" type="sub_section"> <SectionTitle> Fig. 2 </SectionTitle> <Paragraph position="0"> [Fig. 2: per-session markup storage. Recoverable labels: Session; Annotation collection (1 per input text); Standoff annotations (computed by modules/components).]</Paragraph> </Section> <Section position="3" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 2.2 XML standoff markup as first-class citizen </SectionTitle> <Paragraph position="0"> Unlike other NLP architectures (e.g. GATE (Cunningham et al., 2002)), Heart of Gold treats XML standoff annotations (Thompson and McKelvie, 1997) as first-class citizens and natively supports XML (and only XML) markup of any kind.</Paragraph> <Paragraph position="1"> Moreover, Heart of Gold does not prescribe specific DTDs or Schemata for annotations, provided that the markup is well-formed. In this sense, it is a completely open framework that may, however, be constrained by the requirements of the components actually configured. The advantage of this openness is easy integration of new components.
Mappings need only be defined for the immediately dependent annotations (see next section), which in practical applications is far from an n-to-n mapping.</Paragraph> <Paragraph position="2"> However, the fact that the middleware does not impose a specific DTD or Schema does not mean that there are no minimal requirements.</Paragraph> <Paragraph position="3"> Linking between different standoff annotations is only possible on the basis of a least common entity, which we propose to be the character spans in the original text. We additionally propose the use of the XML ID/IDREF mechanism to facilitate efficient integration and combination of multi-dimensional markup.</Paragraph> <Paragraph position="4"> Finally, depending on the scenario, specific common, standardized markup formats are appropriate; examples are RMRS (Copestake, 2003) for deep-shallow integration in Section 3 and the XML-encoded typed feature structure markup generated by SProUT (Drożdżyński et al., 2004).</Paragraph> <Paragraph position="5"> 2.3 XSLT as 'glue' and query language. We propose, and Heart of Gold relies heavily on, the use of XSLT for combining and integrating multi-dimensional XML markup. The general idea has already been presented in (Schäfer, 2003), but the developments and experiences since then have encouraged us to proceed in that direction, and Heart of Gold can be considered a successful, more elaborate proof of concept. The idea is related to the open markup format framework presented above: XSLT can be used to transform XML to other XML formats, or to combine and query annotations. In particular, XSLT stylesheets may resolve conflicts resulting from multi-dimensional markup, choose among alternative readings, follow standoff links, or decide which markup source to give higher preference.</Paragraph> <Paragraph position="6"> Carletta et al. (2003), for example,
propose the NXT Search query language, which extends XPath by adding query variables, regular expressions, quantification and special support for querying temporal and structural relations. Their main argument against standard XPath is that it is impossible to constrain both structural and temporal relations within a single XPath query. Our argument is that XSLT can complement XPath where XPath alone is not powerful enough, while still being a standardized language. Further advantages we see in the XSLT approach are portability and efficiency (in contrast to 'proprietary' and slow XPath extensions like NXT); in addition, its (currently employed) 1.0 version has a quite simple syntax. XSLT can be conceived as a declarative specification language as long as an XML tree structure [...] the input structure). However, XSLT is Turing-capable and therefore suited, in principle, to solving any markup integration or query problem. Finally, extensions like the upcoming XSLT/XPath 2.0 version or efficiency gains through XSLTC (translet compilation) can be adopted on the fly and at no cost without giving up compatibility. Technically, the built-in Heart of Gold XSLT processor could easily be replaced or complemented by an XQuery processor. However, for the combination and transformation of NLP markup, we see no advantage of XQuery over XSLT.</Paragraph> <Paragraph position="7"> Heart of Gold comes with a built-in XSL transformation service, and module adapters (Section 2.4) can easily implement transformation support by including a few lines of code. Stylesheets can also be generated automatically in Heart of Gold, provided a systematic description of the transformation input format is available. An example is the mapping from named entity grammar output type definitions in scenario 1 below.</Paragraph> <Paragraph position="8"> Stylesheets are also employed to visualize the linguistic markup, e.g. by transforming RMRS to HTML (Fig.
3) or LaTeX.</Paragraph> </Section> <Section position="4" start_page="82" end_page="82" type="sub_section"> <SectionTitle> 2.4 Integrated NLP components </SectionTitle> <Paragraph position="0"> NLP components are integrated through adapters called modules (Java-based, run as subprocesses, or accessed via XML-RPC). Modules are also responsible for generating XML standoff output if the component does not support it natively (e.g. TnT, Chunkie).</Paragraph> <Paragraph position="1"> Various shallow and deep NLP components have already been integrated, cf. Fig. 4.</Paragraph> </Section> <Section position="5" start_page="82" end_page="82" type="sub_section"> <SectionTitle> Component | Type | Depth | Languages </SectionTitle> <Paragraph position="0"> JTok | tokenizer | 10 | de, en, it, ...</Paragraph> <Paragraph position="1"> ChaSen | Jap. tagger | 10 | ja
TnT | stat. tagger | 20 | de, en
Chunkie | stat. chunker | 30 | de, en
ChunkieRmrs | chunk RMRS | 35 | de, en
LingPipe | stat. NER | 40 | en, es, ...</Paragraph> </Section> </Section> <Section position="4" start_page="82" end_page="83" type="metho"> <SectionTitle> 3 Scenario 1: Deep-Shallow Integration </SectionTitle> <Paragraph position="0"> The idea of hybrid deep-shallow integration is to provide robust linguistic analyses through multi-dimensional NLP markup created by shallow and deep components, e.g. those listed in Fig. 4. Robustness is achieved in two ways: (1) various shallow components perform preprocessing and partial statistical disambiguation (e.g.
PoS tagging of unknown words, named entity recognition) that can be used by a deep parser by means of a so-called XML input chart (multi-dimensional markup combined through XSLT into a single XML document in a format convenient for the parser).</Paragraph> <Paragraph position="1"> (2) The output of shallow components is transformed through XSLT into partial semantic representations in RMRS syntax (Copestake, 2003), which are potentially more fine-grained and structured than what the deep parser can digest as preprocessing input (mainly PoS/NE type and span information via the XML input chart). This allows for (a) a fallback to the shallow representation in case deep parsing fails (e.g. due to ungrammatical input), and (b) combination with the RMRS generated by deep parsing, or with fragments of it in case deep parsing fails.</Paragraph> <Paragraph position="2"> First application scenarios have been investigated successfully in the DEEPTHOUGHT project (Uszkoreit et al., 2004). A further application (hybrid question analysis) is presented in (Frank et al., 2006). Recently, linking to ontology instances and concepts has been added (Schäfer, 2006).</Paragraph> </Section> <Section position="5" start_page="83" end_page="83" type="metho"> <SectionTitle> 4 Scenario 2: Shallow Cascades </SectionTitle> <Paragraph position="0"> The second scenario is described in detail in (Frank et al., 2004). A robust, partial semantic representation is generated from a shallow chunker's output and morphological analysis (English and German) by means of a processing cascade consisting of four SProUT grammar instances with four interleaved XSLT transformations. The cascade is defined using the declarative system description language SDL (Krieger, 2003). An SDL architecture description is compiled into a Java class, which is integrated in Heart of Gold as a sub-architecture module (Fig. 5). The scenario is also a good example of XSLT-based annotation integration.
Chunker analysis results are included in the RMRS to be built through an XSLT stylesheet using the XPath expression document($uri)/chunkie/chunks/chunk[@cstart=$beginspan and @cend=$endspan], where $uri is a variable containing an annotation identifier of the form hog://sid/acid/aid as explained in Section 2.1.</Paragraph> </Section> </Paper>
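<!--
The span-based lookup in Section 4 can be illustrated outside XSLT. The following is a minimal Python sketch under stated assumptions, not Heart of Gold code. The STORE dictionary, the function names resolve and chunks_at, the sample IDs s1 and c1, and the type attribute are hypothetical and serve illustration only. Only the hog://sid/acid/aid form, the chunkie/chunks/chunk path and the cstart/cend attributes are taken from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-in for the per-session markup storage,
# keyed by session ID, then annotation collection ID, then annotation ID.
STORE = {
    "s1": {
        "c1": {
            "Chunkie": (
                "<chunkie><chunks>"
                "<chunk cstart='0' cend='7' type='NP'/>"
                "<chunk cstart='8' cend='13' type='VP'/>"
                "</chunks></chunkie>"
            )
        }
    }
}


def resolve(uri):
    """Split a hog://sid/acid/aid URI and return the parsed annotation root."""
    prefix = "hog://"
    if not uri.startswith(prefix):
        raise ValueError("not a hog:// URI: " + uri)
    sid, acid, aid = uri[len(prefix):].split("/")
    return ET.fromstring(STORE[sid][acid][aid])


def chunks_at(uri, beginspan, endspan):
    """Return the chunk elements whose character span equals the given span."""
    root = resolve(uri)  # root is the <chunkie> element
    # Chained attribute predicates act as a conjunction, mirroring the
    # 'and' of the XPath 1.0 expression quoted in the text. ElementTree
    # compares attribute values as strings, not as numbers.
    path = "./chunks/chunk[@cstart='%d'][@cend='%d']" % (beginspan, endspan)
    return root.findall(path)


matches = chunks_at("hog://s1/c1/Chunkie", 8, 13)
print([m.get("type") for m in matches])  # prints ['VP']
```

Character spans serve as the shared key here, in line with the least common entity proposed in Section 2.2: any other standoff annotation that records the same cstart/cend values could be joined to these chunks in the same way.
-->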