<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0807"> <Title>Current Issues in Software Engineering for Natural Language Processing</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Reuse </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The need for reuse </SectionTitle> <Paragraph position="0"> In NLP, the global amount of reuse is low, and currently the activities of the community at large focus on reuse of data resources (via annotation standards or data repositories like LDC and ELRA). On the software side, despite similar efforts (Declerck et al., 2000), the reuse rate is low, partially because the difficulty of integration is high (and often underestimated), for instance because developers use different implementation languages, deprecated environments or diverse paradigms. In particular, &quot;Far too often developers of language engineering components do not put enough effort in designing and defining the API.&quot; (Gambäck and Olsson, 2000). Thus, re-implementation and integration cause major productivity loss.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Properties that lead to reuse </SectionTitle> <Paragraph position="0"> How can this productivity loss be avoided? Researchers should build their prototypes around sound Application Programming Interfaces (APIs); all input/output should be separated from the core functionality. Not only will the workings of the algorithms become clearer; reusability will also increase, since most applications make different assumptions about data formats. Potential sloppiness (e.g. lack of error handling) caused by time pressure can then be confined to the prototype application shell without impairing the core code.
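As a minimal illustration of this separation (all names hypothetical, not taken from any toolkit discussed here), the core algorithm can be a pure function over plain data structures, while format assumptions and error handling stay in a thin application shell:

```python
def chunk(tokens):
    """Core functionality: group tokens into toy 'chunks' at punctuation.
    Takes and returns plain Python data -- no file formats, no I/O."""
    chunks, current = [], []
    for tok in tokens:
        if tok in {".", ",", ";"}:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    return chunks

def main(path):
    """Application shell: I/O, format assumptions and error handling live
    here, so sloppiness under time pressure cannot impair the core code."""
    try:
        with open(path, encoding="utf-8") as f:
            tokens = f.read().split()
    except OSError as err:
        raise SystemExit("cannot read {}: {}".format(path, err))
    for c in chunk(tokens):
        print(" ".join(c))
```

Because `chunk` makes no assumptions about data formats, a reuser can call it directly from an application with entirely different input conventions.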
[Table: tools of Edinburgh's Language Technology Group (LTG). MUC-7: hybrid MUC-7 Named Entity Recognizer based on maximum entropy classification and DFSTs; ltchunk: DFST-based English chunk parser; ltpos: HMM-based English POS tagger; ltstop: maximum entropy-based English sentence splitter; lttok: DFST-based tokenizer for English text; LT TTT: suite of XML/SGML-aware tools for building Deterministic Finite-State Transducers (DFSTs); fsgmatch: DFST construction toolkit; sgdelmarkup: removes SGML markup from text; sgtr: SGML replacement tool; sgsed: SGML stream editor; LT XML: LTG's XML API.]</Paragraph> <Paragraph position="1"> The main principle behind good design is to dissect the problem domain into a set of highly cohesive components that interact in a loosely coupled fashion (Sommerville, 2001).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Barriers to reuse </SectionTitle> <Paragraph position="0"> Reuse of software components can be blocked by several factors, including the lack of knowledge of existing components, lack of trust in component quality, a mismatch between component properties and project requirements, unacceptable licensing policies or patent/cost issues. Political issues include the investment needed to build and package reusable components, for which there may be no budget. Technical issues include software-platform incompatibilities and dependencies, installation difficulties, lack of documentation or support, and inconsistencies with other modules.</Paragraph> <Paragraph position="1"> For NLP components specifically, formalisms might not be linguistically compatible. Components might differ in language coverage, accuracy and efficiency. With linguistic components, black-box integration is particularly tricky: if the technique used internally is unknown, the component might break down when the domain changes (domain-specific rules/training). 
A further problem is posed by the fact that different paradigms perform sub-tasks on different levels (e.g. disambiguation). Case sensitivity/case awareness can also be problematic.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Code reuse: toolkits </SectionTitle> <Paragraph position="0"> The Edinburgh Language Technology Group's SGML-aware NLP tools (Mikheev et al., 1999) comprise a set of programs that rely on the common LT XML API to annotate text using cascading (deterministic) Finite-State Transducers. The tools are typically used in a sequential UNIX pipeline (Figure 2, top). An integrated query language allows selective processing of parts of the XML/SGML document instance tree.</Paragraph> <Paragraph position="1"> A major advantage of the LTG pipeline toolkit approach over frameworks (described below) is the maximal decoupling of its components (communication only by means of data exchange in a &quot;fat XML pipe&quot;), so no toolkit-specific &quot;glue&quot; code needs to be developed and developers can work in their programming language of choice. 
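The &quot;fat XML pipe&quot; style can be sketched as follows (a hypothetical illustration, not the actual LTG tools): each stage reads an XML document from standard input, adds its own annotations, and writes XML to standard output, so stages written in any language compose via an ordinary shell pipe.

```python
import sys
import xml.etree.ElementTree as ET

def tokenize(doc):
    """Wrap each whitespace-separated word of the TEXT element in a W element."""
    body = doc.find("TEXT")
    words = (body.text or "").split()
    body.text = None
    for word in words:
        w = ET.SubElement(body, "W")
        w.text = word
    return doc

def tag(doc):
    """Toy stand-in for a POS tagger: mark capitalized tokens on existing
    W elements, without needing to know how they were produced."""
    for w in doc.iter("W"):
        w.set("pos", "NNP" if w.text[0].isupper() else "UNK")
    return doc

def run_stage(stage):
    # Each stage re-parses and re-serializes the whole XML stream --
    # the price paid for full decoupling of the components.
    doc = ET.fromstring(sys.stdin.read())
    sys.stdout.write(ET.tostring(stage(doc), encoding="unicode"))
```

Two such programs, one calling `run_stage(tokenize)` and one calling `run_stage(tag)`, would then be chained in the shell exactly like the LTG pipeline sketched in the text.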
A disadvantage is that repeated XML parsing between components may be too time-consuming in a production scenario.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Code and design reuse: frameworks </SectionTitle> <Paragraph position="0"> A framework is a collection of pre-defined services that embody a given organization, within which the user can extend the functionality provided; frameworks impose certain organizational principles on the developer (Griffel, 1998).</Paragraph> <Paragraph position="1"> The General Architecture for Text Engineering (GATE) is a theory-neutral framework for the management and integration of NLP components and the documents on which they operate (Cunningham et al., 1996; Cunningham, 2000; Bontcheva et al., 2002; Cunningham et al., 2002; Maynard et al., forthcoming). GATE 2 is compliant with the TIPSTER architecture (Grishman, 1995), contains the example IE system ANNIE, and is freely available including source code (written in Java, which, due to the underlying use of Unicode, also makes it open to all languages).</Paragraph> <Paragraph position="2"> A data type for annotating text spans is provided, which allows for generic visualization and editing components and a graphical plug-and-play development environment.</Paragraph> <Paragraph position="3"> Zajac et al. (1997) present Corelli, another TIPSTER-compliant architecture implemented in Java (see Basili et al. (1999) for a comparison). The WHITEBOARD project (Crysmann et al., 2002) uses monotonic XML annotation to integrate deep and shallow processing (Figure 2, middle). Finally, the closest coupling takes place in architectures where most or all components are allowed to talk to each other, such as the German Verbmobil speech translation system (Görz et al., 1996). 
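Schematically (a generic illustration only, not GATE's or TIPSTER's actual API), a framework owns the document model and the control loop, while user components merely extend a fixed interface instead of calling each other directly:

```python
from abc import ABC, abstractmethod

class Document:
    """Framework-owned document model: text plus stand-off span annotations."""
    def __init__(self, text):
        self.text = text
        self.annotations = []   # (start, end, label) spans over the text

class Component(ABC):
    """The organizational principle imposed on the developer: every
    component must fit this one interface."""
    @abstractmethod
    def process(self, doc):
        ...

class UppercaseTagger(Component):
    """Toy component: annotate capitalized words as a stand-in for a
    real linguistic processor."""
    def process(self, doc):
        pos = 0
        for word in doc.text.split():
            start = doc.text.index(word, pos)
            if word[0].isupper():
                doc.annotations.append((start, start + len(word), "Cap"))
            pos = start + len(word)

def run_pipeline(components, doc):
    # The framework, not the components, controls execution order.
    for c in components:
        c.process(doc)
    return doc
```

The steep learning curve discussed in Section 2.6 comes precisely from such fixed interfaces: each component must be wrapped to fit them before it can be plugged in.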
ALEP, the Advanced Language Engineering Platform (Simpkins and Groenendijk, 1994; Bredenkamp et al., 1997), is an early framework that focused on multilinguality. It offers an HPSG-like, typed AVM-based unification formalism (and parsers for it) as well as some infrastructural support. In the LS-GRAM project, it has been used to build analyzers for nine languages. However, it has been criticized for being &quot;too committed to a particular approach to linguistic analysis and representation&quot; (Cunningham et al., 1997). ALEP's Text Handling component (Declerck, 1997) uses a particular SGML-based annotation that can be enriched with user-defined tags. Some standard components are provided, and rules allow the mapping of SGML tags to AVMs (&quot;lifting&quot;). SRI's Open Agent Architecture (OAA) (Martin et al., 1999; Cheyer and Martin, 2001) is a software platform that offers a library for distributed agent implementation with bindings for several programming languages (C/C++, Java, LISP, PROLOG etc.). Agents request services from service agents via facilitation, a coordinating procedure of transparent delegation, whereby facilitators can take into account strategic knowledge provided by requesting agents in order to distribute and optimize goal completion. Control is specified in a PROLOG-like Interagent Communication Language (ICL), which contains, but separates, declarative and procedural knowledge (what to do and how to do it).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.6 Discussion </SectionTitle> <Paragraph position="0"> Framework or Toolkit? The disadvantage of frameworks is that any such infrastructure is bound to have a steep learning curve (how to write wrapper/glue code, how to understand control flow), and developers are often reluctant to adopt existing frameworks. 
Using one framework often excludes using another (due to the inherited &quot;design dogma&quot;).</Paragraph> <Paragraph position="1"> Toolkits, on the other hand, are typically smaller and easier to adopt than frameworks and allow more freedom with respect to architectural choices; the flip side, of course, is that toolkits offer less guidance and less reuse of architecture and infrastructure. See</Paragraph> </Section> </Section> </Paper>