<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0806">
  <Title>Blueprint for a High Performance NLP Infrastructure</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Performance Requirements
</SectionTitle>
    <Paragraph position="0"> As discussed earlier, there are two main requirements of the system that are covered by &amp;quot;high performance&amp;quot;: speed and state of the art accuracy. Efficiency is required both in training and processing. Efficient training is required because the amount of data available for training will increase significantly. Also, advanced methods often require many training iterations, for example active learning (Dagan and Engelson, 1995) and co-training (Blum and Mitchell, 1998). Processing text needs to be extremely efficient since many new applications will require very large quantities of text to be processed or many smaller quantities of text to be processed very quickly.</Paragraph>
    <Paragraph position="1"> State of the art accuracy is also important, particularly on complex systems since the error is accumulated from each component in the system. There is a speed/accuracy tradeoff that is rarely addressed in the literature. For instance, reducing the beam search width used for tagging can increase the speed without significantly reducing accuracy. Finally, the most accurate systems are often very computationally intensive so a tradeoff may need to be made here. For example, the state of the art POS tagger is an ensemble of individual taggers (van Halteren et al., 2001), each of which must process the text separately. Sophisticated modelling may also give improved accuracy at the cost of training and processing time.</Paragraph>
    <Paragraph position="2"> The space efficiency of the components is important since complex NLP systems will require many different NLP components to be executing at the same time. Also, language processors many eventually be implemented for relatively low-specification devices such as PDAs. This means that special attention will need to be paid to the data-structures used in the component implementation.</Paragraph>
    <Paragraph position="3"> The infrastructure should allow most data to be stored on disk (as a configuration option since we must tradeoff speed for space). Accuracy, speed and compactness are the main execution goals. These goals are achieved by implementing the infrastructure in C/C++, and profiling and optimising the algorithms and data-structures used.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Design Requirements
</SectionTitle>
    <Paragraph position="0"> The remaining requirements relate to the overall and component level design of the system. Following the Generative Programming paradigm, the individual components of the system must be elementary and highly configurable. This ensures minimal redundancy between components and makes them easier to understand, implement, test and debug. It also ensures components are maximally composable and extensible. This is particularly important in NLP because of the high redundancy across tasks and approaches.</Paragraph>
    <Paragraph position="1"> Machine learning methods should be interchangeable: Transformation-based learning (TBL) (Brill, 1993) and Memory-based learning (MBL) (Daelemans et al., 2002) have been applied to many different problems, so a single interchangeable component should be used to represent each method. We will base these components on the design of Weka (Witten and Frank, 1999).</Paragraph>
    <Paragraph position="2"> Representations should be reusable: for example, named entity classification can be considered as a sequence tagging task or a bag-of-words text classification task. The same beam-search sequence tagging component should be able to be used for POS tagging, chunking and named entity classification. Feature extraction components should be reusable since many NLP components share features, for instance, most sequence taggers use the previously assigned tags. We will use an object-oriented hierarchy of methods, representations and features to allow components to be easily interchanged. This hierarchy will be developed by analysing the range of methods, representations and features in the literature.</Paragraph>
    <Paragraph position="3"> High levels of configurability are also very important. Firstly, without high levels of configurability, new systems are not easy to construct by composing existing components, so reinventing the wheel becomes inevitable. Secondly, different languages and tasks show a very wide variation in the methods, representations, and features that are most successful. For instance, a truly multilingual tagger should be able to tag a sequence from left to right or right to left. Finally, this flexibility will allow for research into new tasks and languages to be undertaken with minimal coding.</Paragraph>
    <Paragraph position="4"> Ease of use is a very important criteria for an infrastructure and high quality documentation and examples are necessary to make sense of the vast array of components in the system. Preconfigured standard components (e.g. an English POS tagger) will be supplied with the infrastructure. More importantly, a Python scripting language interface and a graphical user interface will be built on top of the infrastructure. This will allow components to be configured and composed without expertise in C++.</Paragraph>
    <Paragraph position="5"> The user interface will generate code to produce stand-alone components in C++ or Python. Since the Python components will not need to be compiled, they can be distributed immediately.</Paragraph>
    <Paragraph position="6"> One common difficulty with working on text is the range of file formats and encodings that text can be stored in. The infrastructure will provide components to read/write files in many of these formats including HTML files, text files of varying standard formats, email folders, Postscript, Portable Document Format, Rich Text Format and Microsoft Word files. The infrastructure will also read XML and SGML marked-up files, with and without DTDs and XML Schemas, and provide an XPath/XSLT query interface to select particular subtrees for processing. All of these reading/writing components will use existing open source software. It will also eventually provide components to manipulate groups of files: such as iterate through directories, crawl web pages, get files from ftp, extract files from zip and tar archives. The system will provide full support to standard character sets (e.g.</Paragraph>
    <Paragraph position="7"> Unicode) and encodings (e.g. UTF-8 and UTF-16).</Paragraph>
    <Paragraph position="8"> Finally, the infrastructure will provide standard implementations, feature sets and configuration options which means that if the configuration of the components is published, it will be possible for anyone to reproduce published results. This is important because there are many small design decisions that can contribute to the accuracy of a system that are not typically reported in the literature.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Component Groups
</SectionTitle>
    <Paragraph position="0"> When completed the infrastructure will provide highly configurable components grouped into these broad areas: file processing reading from directories, archives, compressed files, sockets, HTTP and newsgroups; text processing reading/writing marked-up corpora, HTML, emails, standard document formats and text file formats used to represent annotated corpora.</Paragraph>
    <Paragraph position="1"> lexical processing tokenization, word segmentation and morphological analysis; feature extraction extracting lexical and annotation features from the current context in sequences, bag of words from segments of text data-structures and algorithms efficient lexical representations, lexicons, tagsets and statistics; Viterbi, beam-search and n-best sequence taggers, parsing algorithms; machine learning methods statistical models: Na&amp;quot;ive Bayes, Maximum Entropy, Conditional Random Fields; and other methods: Decision Trees and Lists, TBL and MBL; resources APIs to WordNet (Fellbaum, 1998), Google and other lexical resources such as gazetteers, ontologies and machine readable dictionaries; existing tools integrating existing open source components and providing interfaces to existing tools that are only distributed as executables.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Implementation
</SectionTitle>
    <Paragraph position="0"> The infrastructure will be implemented in C/C++. Templates will be used heavily to provide generality without significantly impacting on efficiency. However, because templates are a static facility we will also provide dynamic versions (using inheritance), which will be slower but accessible from scripting languages and user interfaces. To provide the required configurability in the static version of the code we will use policy templates (Alexandrescu, 2001), and for the dynamic version we will use configuration classes.</Paragraph>
    <Paragraph position="1"> A key aspect of increasing the efficiency of the system will be using a common text and annotation representation throughout the infrastructure. This means that we do not need to save data to disk, and load it back into memory between each step in the process, which will provide a significant performance increase. Further, we can use techniques for making string matching and other text processing very fast such as making only one copy of each lexical item or annotation in memory. We can also load a lexicon into memory that is shared between all of the components, reducing the memory use.</Paragraph>
    <Paragraph position="2"> The implementation has been inspired by experience in extracting information from very large corpora (Curran and Moens, 2002) and performing experiments on maximum entropy sequence tagging (Curran and Clark, 2003; Clark et al., 2003). We have already implemented a POS tagger, chunker, CCG supertagger and named entity recogniser using the infrastructure. These tools currently train in less than 10 minutes on the standard training materials and tag faster than TNT, the fastest existing POS tagger. These tools use a highly optimised GIS implementation and provide sophisticated Gaussian smoothing (Chen and Rosenfeld, 1999). We expect even faster training times when we move to conjugate gradient methods.</Paragraph>
    <Paragraph position="3"> The next step of the process will be to add different statistical models and machine learning methods. We first plan to add a simple Na&amp;quot;ive Bayes model to the system.</Paragraph>
    <Paragraph position="4"> This will allow us to factor out the maximum entropy specific parts of the system and produce a general component for statistical modelling. We will then implement other machine learning methods and tasks.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Interfaces
</SectionTitle>
    <Paragraph position="0"> Although C++ is extremely efficient, it is not suitable for rapidly gluing components together to form new tools.</Paragraph>
    <Paragraph position="1"> To overcome this problem we have implemented an interface to the infrastructure in the Python scripting language. Python has a number of advantages over other options, such as Java and Perl. Python is very easy to learn, read and write, and allows commands to be entered interactively into the interpreter, making it ideal for experimentation. It has already been used to implement a framework for teaching NLP (Loper and Bird, 2002).</Paragraph>
    <Paragraph position="2"> Using the Boost.Python C++ library (Abrahams, 2003), it is possible to reflect most of the components directly into Python with a minimal amount of coding.</Paragraph>
    <Paragraph position="3"> The Boost.Python library also allows the C++ code to access new classes written in Python that are derived from the C++ classes. This means that new and extended components can be written in Python (although they will be considerably slower). The Python interface allows the components to be dynamically composed, configured and extended in any operating system environment without the need for a compiler. Finally, since Python can produce stand-alone executables directly, it will be possible to create distributable code that does not require the entire infrastructure or Python interpreter to be installed.</Paragraph>
    <Paragraph position="4"> The basic Python reflection has already been implemented and used for large scale experiments with POS tagging, using pyMPI (a message passing interface library for Python) to coordinate experiments across a cluster of over 100 machines (Curran and Clark, 2003; Clark et al., 2003). An example of using the Python tagger interface is shown in Figure 1.</Paragraph>
    <Paragraph position="5"> On top of the Python interface we plan to implement a GUI interface for composing and configuring components. This will be implemented in wxPython which is a platform independent GUI library that uses the native windowing environment under Windows, MacOS and most versions of Unix. The wxPython interface will generate C++ and Python code that composes and configures the components. Using the infrastructure, Python and wxPython it will be possible to generate new GUI applications that use NLP technology.</Paragraph>
    <Paragraph position="6"> Because C++ compilers are now fairly standards compliant, and Python and wxPython are available for most architectures, the infrastructure will be highly portable.</Paragraph>
    <Paragraph position="7"> Further, we eventually plan to implement interfaces to other languages (in particular Java using the Java Native Interface (JNI) and Perl using the XS interface).</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 Web services
</SectionTitle>
    <Paragraph position="0"> The final interface we intend to implement is a collection of web services for NLP. A web service provides a remote procedure that can be called using XML based encodings (XMLRPC or SOAP) of function names, arguments and results transmitted via internet protocols such as HTTP. Systems can automatically discover and communicate with web services that provide the functionality they require by querying databases of standardised descriptions of services with WSDL and UDDI. This standardisation of remote procedures is very exciting from a software engineering viewpoint since it allows systems to be totally distributed. There have already been several attempts to develop distributed NLP systems for dialogue systems (Bayer et al., 2001) and speech recognition (Hacioglu and Pellom, 2003). Web services will allow components developed by different researchers in different locations to be composed to build larger systems.</Paragraph>
    <Paragraph position="1"> Because web services are of great commercial interest they are already being supported strongly by many programming languages. For instance, web services can be accessed with very little code in Java, Python, Perl, C, C++ and Prolog. This allows us to provide NLP services to many systems that we could not otherwise support using a single interface definition. Since the service arguments and results are primarily text and XML, the web service interface will be reasonably efficient for small  quantities of text (e.g. a single document). The second advantage they have is that there is no startup costs when tagger loads up, which means local copies of the web service could be run to reduce tagging latency. Finally, web services will allow developers of resources such as gazetteers to provide the most up to date resources each time their functionality is required.</Paragraph>
    <Paragraph position="2"> We are currently in the process of implementing a POS tagging web service using the gSOAP library, which will translate our C infrastructure binding into web service wrapper code and produce the necessary XML service description files.</Paragraph>
  </Section>
class="xml-element"></Paper>