<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0109"> <Title>NLTK: The Natural Language Toolkit</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Design Criteria </SectionTitle> <Paragraph position="0"> Several criteria were considered in the design and implementation of the toolkit. These design criteria are listed in the order of their importance. It was also important to decide what goals the toolkit would not attempt to accomplish; we therefore include an explicit set of non-requirements, which the toolkit is not expected to satisfy.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Requirements </SectionTitle> <Paragraph position="0"> Ease of Use. The primary purpose of the toolkit is to allow students to concentrate on building natural language processing (NLP) systems. The more time students must spend learning to use the toolkit, the less useful it is.</Paragraph> <Paragraph position="1"> Consistency. The toolkit should use consistent data structures and interfaces.</Paragraph> <Paragraph position="2"> Extensibility. The toolkit should easily accommodate new components, whether those components replicate or extend the toolkit's existing functionality. The toolkit should be structured in such a way that it is obvious where new extensions would fit into the toolkit's infrastructure.</Paragraph> <Paragraph position="3"> Documentation. The toolkit, its data structures, and its implementation all need to be carefully and thoroughly documented. All nomenclature must be carefully chosen and consistently used.</Paragraph> <Paragraph position="4"> Simplicity. The toolkit should structure the complexities of building NLP systems, not hide them. Therefore, each class defined by the toolkit should be simple enough that a student could implement it by the time they finish an introductory course in computational linguistics. Modularity. The interaction between different components of the toolkit should be kept to a minimum, using simple, well-defined interfaces. In particular, it should be possible to complete individual projects using small parts of the toolkit, without worrying about how they interact with the rest of the toolkit. This allows students to learn how to use the toolkit incrementally throughout a course. Modularity also makes it easier to change and extend the toolkit.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Non-Requirements </SectionTitle> <Paragraph position="0"> Comprehensiveness. The toolkit is not intended to provide a comprehensive set of tools. Indeed, there should be a wide variety of ways in which students can extend the toolkit.</Paragraph> <Paragraph position="1"> Efficiency. The toolkit does not need to be highly optimized for runtime performance. However, it should be efficient enough that students can use their NLP systems to perform real tasks.</Paragraph> <Paragraph position="2"> Cleverness. Clear designs and implementations are far preferable to ingenious yet indecipherable ones.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Modules </SectionTitle> <Paragraph position="0"> The toolkit is implemented as a collection of independent modules, each of which defines a specific data structure or task.</Paragraph> <Paragraph position="1"> A set of core modules defines basic data types and processing systems that are used throughout the toolkit. The token module provides basic classes for processing individual elements of text, such as words or sentences.</Paragraph> <Paragraph position="2"> The tree module defines data structures for representing tree structures over text, such as syntax trees and morphological trees. The probability module implements classes that encode frequency distributions and probability distributions, including a variety of statistical smoothing techniques.</Paragraph>
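<Paragraph> To give a concrete sense of what such a probability module encapsulates, the following is a purely illustrative, self-contained Python sketch of a frequency distribution with maximum-likelihood and additive (Laplace) smoothing estimates; the class and method names are invented for illustration and are not the toolkit's own API.
from collections import Counter

class FreqDistSketch:
    """Minimal frequency distribution: count outcomes, estimate probabilities."""

    def __init__(self, samples=()):
        self.counts = Counter(samples)
        self.total = sum(self.counts.values())

    def freq(self, sample):
        """Maximum-likelihood estimate: count(sample) / total."""
        return self.counts[sample] / self.total if self.total else 0.0

    def laplace(self, sample, vocab_size, gamma=1.0):
        """Additive (Laplace) smoothing over an assumed vocabulary of vocab_size outcomes."""
        return (self.counts[sample] + gamma) / (self.total + gamma * vocab_size)

# Distribution of part-of-speech tags in a toy corpus.
fd = FreqDistSketch(["DT", "NN", "VBD", "DT", "JJ", "NN", "NN"])
print(fd.freq("NN"))                    # 3/7 by maximum likelihood
print(fd.laplace("RB", vocab_size=40))  # an unseen tag still receives some probability mass
</Paragraph>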
<Paragraph position="3"> The remaining modules define data structures and interfaces for performing specific NLP tasks. This list of modules will grow over time, as we add new tasks and algorithms to the toolkit.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Parsing Modules </SectionTitle> <Paragraph position="0"> The parser module defines a high-level interface for producing trees that represent the structures of texts. The chunkparser module defines a sub-interface for parsers that identify non-overlapping linguistic groups (such as base noun phrases) in unrestricted text.</Paragraph> <Paragraph position="1"> Four modules provide implementations for these abstract interfaces. The srparser module implements a simple shift-reduce parser. The chartparser module defines a flexible parser that uses a chart to record hypotheses about syntactic constituents. The pcfgparser module provides a variety of different parsers for probabilistic grammars.</Paragraph> <Paragraph position="2"> The rechunkparser module defines a transformational, regular-expression-based implementation of the chunk parser interface.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Tagging Modules </SectionTitle> <Paragraph position="0"> The tagger module defines a standard interface for augmenting each token of a text with supplementary information, such as its part of speech or its WordNet synset tag, and provides several different implementations for this interface.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Finite State Automata </SectionTitle> <Paragraph position="0"> The fsa module defines a data type for encoding finite state automata, and an interface for creating automata from regular expressions.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Type Checking </SectionTitle> <Paragraph position="0"> Debugging time is an important factor in the toolkit's ease of use. To reduce the amount of time students must spend debugging their code, we provide a type checking module, which can be used to ensure that functions are given valid arguments. The type checking module is used by all of the basic data types and processing classes.</Paragraph> <Paragraph position="1"> Since type checking is done explicitly, it can slow the toolkit down. However, when efficiency is an issue, type checking can easily be turned off; and when type checking is disabled, there is no performance penalty.</Paragraph>
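<Paragraph> As an illustration of how switchable argument checking can avoid any cost once disabled, here is a hypothetical Python sketch (not the toolkit's actual type checking module): the decorator only wraps functions while checking is enabled, so disabling it leaves the original functions untouched.
TYPE_CHECKING = True  # hypothetical global switch; set to False to disable all checks

def expects(*types):
    """Decorator verifying positional argument types while checking is enabled.
    When TYPE_CHECKING is False, the original function is returned unwrapped,
    so calls incur no checking overhead at all."""
    def decorate(func):
        if not TYPE_CHECKING:
            return func
        def checked(*args, **kwargs):
            for i, (arg, typ) in enumerate(zip(args, types)):
                if not isinstance(arg, typ):
                    raise TypeError(f"{func.__name__}() argument {i} must be "
                                    f"{typ.__name__}, not {type(arg).__name__}")
            return func(*args, **kwargs)
        return checked
    return decorate

@expects(str, str)
def tokenize(text, sep=" "):
    """Split a text into word tokens."""
    return text.split(sep)

print(tokenize("the cat sat on the mat"))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
# tokenize(42) would raise a readable TypeError while checking is enabled.
</Paragraph>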
<Paragraph position="2"> Visualization Visualization modules define graphical interfaces for viewing and manipulating data structures, and graphical tools for experimenting with NLP tasks. The draw.tree module provides a simple graphical interface for displaying tree structures. The draw.tree edit module provides an interface for building and modifying tree structures.</Paragraph> <Paragraph position="3"> The draw.plot graph module can be used to graph mathematical functions. The draw.fsa module provides a graphical tool for displaying and simulating finite state automata. The draw.chart module provides an interactive graphical tool for experimenting with chart parsers.</Paragraph> <Paragraph position="4"> The visualization modules provide interfaces for interaction and experimentation; they do not directly implement NLP data structures or tasks. Simplicity of implementation is therefore less of an issue for the visualization modules than it is for the rest of the toolkit.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Text Classification </SectionTitle> <Paragraph position="0"> The classifier module defines a standard interface for classifying texts into categories.</Paragraph> <Paragraph position="1"> This interface is currently implemented by two modules. The classifier.naivebayes module defines a text classifier based on the Naive Bayes assumption. The classifier.maxent module defines the maximum entropy model for text classification, and implements two algorithms for training the model: Generalized Iterative Scaling and Improved Iterative Scaling.</Paragraph> <Paragraph position="2"> The classifier.feature module provides a standard encoding for the information that is used to make decisions for a particular classification task. This standard encoding allows students to experiment with the differences between different text classification algorithms, using identical feature sets.</Paragraph> <Paragraph position="3"> The classifier.featureselection module defines a standard interface for choosing which features are relevant for a particular classification task. Good feature selection can significantly improve classification performance.</Paragraph> </Section> </Section>
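<Paragraph> To make the role of a shared feature encoding concrete, here is a small, self-contained Python sketch of a Naive Bayes text classifier built over a bag-of-words encoding; the names (bag_of_words, NaiveBayesSketch) are invented for illustration and do not correspond to the toolkit's actual classifier interfaces.
import math
from collections import defaultdict

def bag_of_words(text):
    """A shared feature encoding: map a text to a set of word features.
    Using one encoding for every classifier makes algorithms directly comparable."""
    return set(text.lower().split())

class NaiveBayesSketch:
    """Naive Bayes over boolean word features, with add-one smoothing (illustration only)."""

    def train(self, labeled_texts):
        self.label_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for f in bag_of_words(text):
                self.feature_counts[label][f] += 1

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / total)   # log prior P(label)
            for f in bag_of_words(text):      # log P(feature | label), add-one smoothed
                score += math.log((self.feature_counts[label][f] + 1) / (count + 2))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayesSketch()
nb.train([("the striker scored a goal", "sports"),
          ("shares fell on the market", "finance"),
          ("the keeper saved the penalty", "sports")])
print(nb.classify("a late goal won the match"))   # most likely "sports"
</Paragraph>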
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Documentation </SectionTitle> <Paragraph position="0"> The toolkit is accompanied by extensive documentation that explains the toolkit, and describes how to use and extend it. This documentation is divided into three primary categories: Tutorials teach students how to use the toolkit, in the context of performing specific tasks. Each tutorial focuses on a single domain, such as tagging, probabilistic systems, or text classification. The tutorials include a high-level discussion that explains and motivates the domain, followed by a detailed walk-through that uses examples to show how NLTK can be used to perform specific tasks.</Paragraph> <Paragraph position="1"> Reference Documentation provides precise definitions for every module, interface, class, method, function, and variable in the toolkit. It is automatically extracted from docstring comments in the Python source code, using Epydoc (Loper, 2002).</Paragraph> <Paragraph position="2"> Technical Reports explain and justify the toolkit's design and implementation. They are used by the developers of the toolkit to guide and document the toolkit's construction. Students can also consult these reports if they would like further information about how the toolkit is designed, and why it is designed that way.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Uses of NLTK </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Assignments </SectionTitle> <Paragraph position="0"> NLTK can be used to create student assignments of varying difficulty and scope. In the simplest assignments, students experiment with an existing module. The wide variety of existing modules provides many opportunities for creating these simple assignments. Once students become more familiar with the toolkit, they can be asked to make minor changes or extensions to an existing module. A more challenging task is to develop a new module. Here, NLTK provides some useful starting points: predefined interfaces and data structures, and existing modules that implement the same interface.</Paragraph> <Paragraph position="1"> As an example of a moderately difficult assignment, we asked students to construct a chunk parser that correctly identifies base noun phrase chunks in a given text, by defining a cascade of transformational chunking rules. The NLTK rechunkparser module provides a variety of regular-expression based rule types, which the students can instantiate to construct complete rules.</Paragraph> <Paragraph position="2"> For example, ChunkRule('<NN.*>') builds chunks from sequences of consecutive nouns; ChinkRule('<VB.>') excises verbs from existing chunks; SplitRule('<NN>', '<DT>') splits any existing chunk that contains a singular noun followed by a determiner into two pieces; and MergeRule('<JJ>', '<JJ>') combines two adjacent chunks where the first chunk ends and the second chunk starts with adjectives.</Paragraph>
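<Paragraph> The rule constructors above are the toolkit's; the following self-contained Python sketch merely mimics the effect of a chunk rule followed by a chink rule on a bracketed tag string, so the brace notation and helper functions here are invented for illustration and are not the rechunkparser implementation.
import re

def apply_chunk_rule(tag_string, tag_pattern):
    """A ChunkRule-style operation (illustration only): bracket every maximal run
    of tags matching tag_pattern that lies outside any existing {...} chunk."""
    def bracket(segment):
        return re.sub(rf"(?:{tag_pattern})+",
                      lambda m: "{" + m.group(0) + "}", segment)
    parts = re.split(r"(\{[^}]*\})", tag_string)          # keep existing chunks intact
    return "".join(p if p.startswith("{") else bracket(p) for p in parts)

def apply_chink_rule(tag_string, tag_pattern):
    """A ChinkRule-style operation (illustration only): excise tags matching
    tag_pattern from inside chunks, splitting each chunk around them."""
    def excise(chunk):
        inner = re.sub(rf"(?:{tag_pattern})+",
                       lambda m: "}" + m.group(0) + "{", chunk[1:-1])
        return ("{" + inner + "}").replace("{}", "")
    parts = re.split(r"(\{[^}]*\})", tag_string)
    return "".join(excise(p) if p.startswith("{") else p for p in parts)

# A two-rule cascade over the tag sequence of a toy sentence:
tags = "<DT><JJ><NN><VBD><DT><NN>"
tags = apply_chunk_rule(tags, r"<[A-Z]+>")     # first chunk the whole sequence
tags = apply_chink_rule(tags, r"<VB[A-Z]*>")   # then excise the verb
print(tags)   # {<DT><JJ><NN>}<VBD>{<DT><NN>}
</Paragraph>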
<Paragraph position="3"> The chunking tutorial motivates chunk parsing, describes each rule type, and provides all the necessary code for the assignment. The provided code is responsible for loading the chunked, part-of-speech tagged text using an existing tokenizer, creating an unchunked version of the text, applying the chunk rules to the unchunked text, and scoring the result. Students focus only on the NLP task: providing a rule set with the best coverage.</Paragraph> <Paragraph position="4"> In the remainder of this section we reproduce some of the cascades created by the students.</Paragraph> <Paragraph position="5"> The first example illustrates a combination of several rule types. The next example illustrates a brute-force statistical approach. The student calculated how often each part-of-speech tag was included in a noun phrase. They then constructed chunks from any sequence of tags that occurred in a noun phrase more than 50% of the time.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Class demonstrations </SectionTitle> <Paragraph position="0"> NLTK provides graphical tools that can be used in class demonstrations to help explain basic NLP concepts and algorithms. These interactive tools can be used to display relevant data structures and to show the step-by-step execution of algorithms. Both data structures and control flow can be easily modified during the demonstration, in response to questions from the class. Since these graphical tools are included with the toolkit, they can also be used by students.</Paragraph> <Paragraph position="1"> This allows students to experiment at home with the algorithms that they have seen presented in class.</Paragraph> <Paragraph position="2"> Example: The Chart Parsing Tool The chart parsing tool is an example of a graphical tool provided by NLTK. This tool can be used to explain the basic concepts behind chart parsing, and to show how the algorithm works. Chart parsing is a flexible parsing algorithm that uses a data structure called a chart to record hypotheses about syntactic constituents.</Paragraph> <Paragraph position="3"> Each hypothesis is represented by a single edge on the chart. A set of rules determines when new edges can be added to the chart. This set of rules controls the overall behavior of the parser (e.g., whether it parses top-down or bottom-up).</Paragraph> <Paragraph position="4"> The chart parsing tool demonstrates the process of parsing a single sentence, with a given grammar and lexicon. Its display is divided into three sections: the bottom section displays the chart; the middle section displays the sentence; and the top section displays the partial syntax tree corresponding to the selected edge. Buttons along the bottom of the window are used to control the execution of the algorithm. The main display window for the chart parsing tool is shown in Figure 1.</Paragraph> <Paragraph position="5"> This tool can be used to explain several different aspects of chart parsing. First, it can be used to explain the basic chart data structure, and to show how edges can represent hypotheses about syntactic constituents. It can then be used to demonstrate and explain the individual rules that the chart parser uses to create new edges. Finally, it can be used to show how these individual rules combine to find a complete parse for a given sentence.</Paragraph> <Paragraph position="6"> To reduce the overhead of setting up demonstrations during lecture, the user can define a list of preset charts. The tool can then be reset to any one of these charts at any time.</Paragraph> <Paragraph position="7"> The chart parsing tool allows for flexible control of the parsing algorithm. At each step of the algorithm, the user can select which rule or strategy they wish to apply. This allows the user to experiment with mixing different strategies (e.g., top-down and bottom-up). The user can exercise fine-grained control over the algorithm by selecting which edge they wish to apply a rule to. This flexibility allows lecturers to use the tool to respond to a wide variety of questions, and allows students to experiment with different variations on the chart parsing algorithm.</Paragraph> </Section>
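<Paragraph> For readers unfamiliar with the data structure, here is a minimal, self-contained Python sketch of a chart edge and the fundamental rule that combines edges; the class and function names are invented for illustration and are not the chartparser module's actual classes.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """One hypothesis on the chart: lhs -> found . expected, spanning [start, end)."""
    start: int
    end: int
    lhs: str
    found: tuple      # constituents recognised so far
    expected: tuple   # constituents still needed

    def is_complete(self):
        return not self.expected

def fundamental_rule(incomplete, complete):
    """If an incomplete edge is waiting for exactly the category that an adjacent
    complete edge provides, combine the two edges into a new, longer edge."""
    if (not incomplete.is_complete() and complete.is_complete()
            and incomplete.end == complete.start
            and incomplete.expected[0] == complete.lhs):
        return Edge(incomplete.start, complete.end, incomplete.lhs,
                    incomplete.found + (complete.lhs,), incomplete.expected[1:])
    return None

# "the dog": an NP hypothesis waiting for an N, plus a completed N edge.
np_edge = Edge(0, 1, "NP", ("Det",), ("N",))
n_edge = Edge(1, 2, "N", ("dog",), ())
print(fundamental_rule(np_edge, n_edge))
# Edge(start=0, end=2, lhs='NP', found=('Det', 'N'), expected=())
</Paragraph>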
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Advanced Projects </SectionTitle> <Paragraph position="0"> NLTK provides students with a flexible framework for advanced projects. Typical projects involve the development of entirely new functionality for a previously unsupported NLP task, or the development of a complete system out of existing and new modules.</Paragraph> <Paragraph position="1"> The toolkit's broad coverage allows students to explore a wide variety of topics. In our introductory computational linguistics course, topics for student projects included text generation, word sense disambiguation, collocation analysis, and morphological analysis.</Paragraph> <Paragraph position="2"> NLTK eliminates the tedious infrastructure-building that is typically associated with advanced student projects by providing students with the basic data structures, tools, and interfaces that they need. This allows the students to concentrate on the problems that interest them.</Paragraph> <Paragraph position="3"> The collaborative, open-source nature of the toolkit can provide students with a sense that their projects are meaningful contributions, and not just exercises. Several of the students in our course have expressed interest in incorporating their projects into the toolkit.</Paragraph> <Paragraph position="4"> Finally, many of the modules included in the toolkit provide students with good examples of what projects should look like, with well thought-out interfaces, clean code structure, and thorough documentation.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Example: Probabilistic Parsing </SectionTitle> <Paragraph position="0"> The probabilistic parsing module was created as a class project for a statistical NLP course.</Paragraph> <Paragraph position="1"> The toolkit provided the basic data types and interfaces for parsing. The project extended these, adding a new probabilistic parsing interface, and using subclasses to create a probabilistic version of the context-free grammar data structure. These new components were used in conjunction with several existing components, such as the chart data structure, to define two implementations of the probabilistic parsing interface. Finally, a tutorial was written that explained the basic motivations and concepts behind probabilistic parsing, and described the new interfaces, data structures, and parsers.</Paragraph> </Section> </Section>
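<Paragraph> A minimal Python sketch of the subclassing idea described above, with invented class names rather than the project's actual code: a probabilistic production extends a plain context-free production by attaching a probability, and a derivation's probability is the product of its rules' probabilities.
import math

class Production:
    """A context-free production, e.g. NP -> Det N."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, tuple(rhs)

    def __repr__(self):
        return f"{self.lhs} -> {' '.join(self.rhs)}"

class PCFGProduction(Production):
    """Subclass adding a probability, so existing CFG-based code still applies."""
    def __init__(self, lhs, rhs, prob):
        super().__init__(lhs, rhs)
        self.prob = prob

    def __repr__(self):
        return f"{super().__repr__()} [{self.prob}]"

def tree_log_prob(productions):
    """Log probability of a derivation: sum of the logs of its rules' probabilities."""
    return sum(math.log(p.prob) for p in productions)

derivation = [PCFGProduction("S", ["NP", "VP"], 1.0),
              PCFGProduction("NP", ["Det", "N"], 0.6),
              PCFGProduction("VP", ["V"], 0.3)]
print(derivation[1])                         # NP -> Det N [0.6]
print(math.exp(tree_log_prob(derivation)))   # 0.18 (= 1.0 * 0.6 * 0.3)
</Paragraph>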
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Evaluation </SectionTitle> <Paragraph position="0"> We used NLTK as a basis for the assignments and student projects in CIS-530, an introductory computational linguistics class taught at the University of Pennsylvania. CIS-530 is a graduate level class, although some advanced undergraduates were also enrolled. Most students had a background in either computer science or linguistics (and occasionally both). Students were required to complete five assignments, two exams, and a final project. All class materials are available from the course website http://www.cis.upenn.edu/~cis530/.</Paragraph> <Paragraph position="1"> The experience of using NLTK was very positive, both for us and for the students. The students liked the fact that they could do interesting projects from the outset. They also liked being able to run everything on their computer at home. The students found the extensive documentation very helpful for learning to use the toolkit. They found the interfaces defined by NLTK intuitive, and appreciated the ease with which they could combine different components to create complete NLP systems.</Paragraph> <Paragraph position="2"> We did encounter a few difficulties during the semester. One problem was finding large clean corpora that the students could use for their assignments. Several of the students needed assistance finding suitable corpora for their final projects. Another issue was the fact that we were actively developing NLTK during the semester; some modules were only completed one or two weeks before the students used them. As a result, students who worked at home needed to download new versions of the toolkit several times throughout the semester.</Paragraph> <Paragraph position="3"> Luckily, Python has extensive support for installation scripts, which made these upgrades simple. The students encountered a couple of bugs in the toolkit, but none were serious, and all were quickly corrected.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Other Approaches </SectionTitle> <Paragraph position="0"> The computational component of computational linguistics courses takes many forms. In this section we briefly review a selection of approaches, classified according to the (original) target audience. Linguistics Students. Various books introduce programming or computing to linguists. These are elementary on the computational side, providing a gentle introduction to students having no prior experience in computer science.</Paragraph> <Paragraph position="1"> Examples of such books are Using Computers in Linguistics (Lawler and Dry, 1998) and Programming for Linguistics: Java Technology for Language Researchers (Hammond, 2002).</Paragraph> <Paragraph position="2"> Grammar Developers. Infrastructure for grammar development has a long history in unification-based (or constraint-based) grammar frameworks, from DCG (Pereira and Warren, 1980) to HPSG (Pollard and Sag, 1994). Recent work includes (Copestake, 2000; Baldridge et al., 2002a). A concurrent development has been the finite state toolkits, such as the Xerox toolkit (Beesley and Karttunen, 2002). This work has found widespread pedagogical application.</Paragraph> <Paragraph position="3"> Other Researchers and Developers.</Paragraph> <Paragraph position="4"> A variety of toolkits have been created for research or R&D purposes. Examples include the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997), the EMU Speech Database System (Harrington and Cassidy, 1999), the General Architecture for Text Engineering (Bontcheva et al., 2002), the Maxent Package for Maximum Entropy Models (Baldridge et al., 2002b), and the Annotation Graph Toolkit (Maeda et al., 2002). Although not originally motivated by pedagogical needs, all of these toolkits have pedagogical applications and many have already been used in teaching.</Paragraph> </Section> </Paper>