<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0806">
  <Title>Blueprint for a High Performance NLP Infrastructure</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Existing Systems
</SectionTitle>
    <Paragraph position="0"> There are a number of generalised NLP systems in the literature. Many provide graphical user interfaces (GUI) for manual annotation (e.g. General Architecture for Text Engineering (GATE) (Cunningham et al., 1997) and the Alembic Workbench (Day et al., 1997)) as well as NLP tools and resources that can be manipulated from the GUI. For instance, GATE currently provides a POS tagger, named entity recogniser and gazetteer and ontology editors (Cunningham et al., 2002). GATE goes beyond earlier systems by using a component-based infrastructure (Cunningham, 2000) which the GUI is built on top of. This allows components to be highly configurable and simplifies the addition of new components to the system.</Paragraph>
    <Paragraph position="1"> A number of stand-alone tools have also been developed. For example, the suite of LT tools (Mikheev et al., 1999; Grover et al., 2000) perform tokenization, tagging and chunking on XML marked-up text directly. These tools also store their configuration state, e.g. the transduction rules used in LT CHUNK, in XML configuration files. This gives a greater flexibility but the tradeoff is that these tools can run very slowly. Other tools have been designed around particular techniques, such as finite state machines (Karttunen et al., 1997; Mohri et al., 1998). However, the source code for these tools is not freely available, so they cannot be extended.</Paragraph>
    <Paragraph position="2"> Efficiency has not been a focus for NLP research in general. However, it will be increasingly important as techniques become more complex and corpus sizes grow.</Paragraph>
    <Paragraph position="3"> An example of this is the estimation of maximum entropy models, from simple iterative estimation algorithms used by Ratnaparkhi (1998) that converge very slowly, to complex techniques from the optimisation literature that converge much more rapidly (Malouf, 2002). Other attempts to address efficiency include the fast Transformation Based Learning (TBL) Toolkit (Ngai and Florian, 2001) which dramatically speeds up training TBL systems, and the translation of TBL rules into finite state machines for very fast tagging (Roche and Schabes, 1997).</Paragraph>
    <Paragraph position="4"> The TNT POS tagger (Brants, 2000) has also been designed to train and run very quickly, tagging between 30,000 and 60,000 words per second.</Paragraph>
    <Paragraph position="5"> The Weka package (Witten and Frank, 1999) provides a common framework for several existing machine learning methods including decision trees and support vector machines. This library has been very popular because it allows researchers to experiment with different methods without having to modify code or reformat data.</Paragraph>
    <Paragraph position="6"> Finally, the Natural Language Toolkit (NLTK) is a package of NLP components implemented in Python (Loper and Bird, 2002). Python scripting is extremely simple to learn, read and write, and so using the existing components and designing new components is simple.</Paragraph>
  </Section>
class="xml-element"></Paper>