<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3031"> <Title>NLTK: The Natural Language Toolkit</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Design </SectionTitle> <Paragraph position="0"> NLTK is implemented as a large collection of minimally interdependent modules, organized into a shallow hierarchy. A set of core modules defines basic data types that are used throughout the toolkit. The remaining modules are task modules, each devoted to an individual natural language processing task. For example, the nltk.parser module encompasses the task of parsing, or deriving the syntactic structure of a sentence; and the nltk.tokenizer module is devoted to the task of tokenizing, or dividing a text into its constituent parts.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Tokens and other core data types </SectionTitle> <Paragraph position="0"> To maximize interoperability between modules, we use a single class to encode information about natural language texts - the Token class. Each Token instance represents a unit of text such as a word, sentence, or document, and is defined by a (partial) mapping from property names to values. For example, the TEXT property is used to encode a token's text content: >>> from nltk.token import * >>> tok = Token(TEXT='Hello world!') In a similar fashion, other language processing tasks such as word-sense disambiguation, chunking, and parsing all add properties to the Token data structure. In general, language processing tasks are formulated as annotations and transformations involving Tokens. In particular, each processing task takes a token and extends it to include new information. These modifications are typically monotonic; new information is added but existing information is not deleted or modified. Thus, tokens serve as a blackboard, where information about a piece of text is collated.
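To make the blackboard idea concrete, the following is a minimal, self-contained sketch. It is not the actual NLTK API: the dict-based Token stand-in and the tokenize and tag helpers below are hypothetical, written only to illustrate how successive tasks monotonically add properties to a token.

```python
# Illustrative sketch only: a hypothetical stand-in for NLTK's Token
# class, modeling the blackboard architecture described above.

class Token(dict):
    """A token: a (partial) mapping from property names to values."""

def tokenize(doc):
    # A tokenizing task reads the TEXT property and adds SUBTOKENS.
    doc['SUBTOKENS'] = [Token(TEXT=w) for w in doc['TEXT'].split()]
    return doc

def tag(doc):
    # A tagging task adds a TAG property to each subtoken; existing
    # properties are left untouched, so the change is monotonic.
    for tok in doc['SUBTOKENS']:
        tok['TAG'] = 'NN' if tok['TEXT'].islower() else 'NNP'
    return doc

doc = Token(TEXT='NLTK tokenizes text')
tag(tokenize(doc))
print(doc['SUBTOKENS'][0]['TAG'])  # prints: NNP
```

After both tasks run, the document token still carries its original TEXT alongside the SUBTOKENS added by tokenizing and the TAG values added by tagging, which is the flexibility the blackboard approach buys over a pipeline.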
This architecture contrasts with the more typical pipeline architecture, where each processing task discards its input information and passes only its own output to the next task. We chose the blackboard approach over the pipeline approach because it allows more flexibility when combining tasks into a single system.</Paragraph> <Paragraph position="1"> In addition to the Token class and its derivatives, NLTK defines a variety of other data types. For instance, the probability module defines classes for probability distributions and statistical smoothing techniques; and the cfg module defines classes for encoding context-free grammars and probabilistic context-free grammars.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The corpus module </SectionTitle> <Paragraph position="0"> Many language processing tasks must be developed and tested using annotated data sets or corpora.</Paragraph> <Paragraph position="1"> Several such corpora are distributed with NLTK, as listed in Table 1. The corpus module defines classes for reading and processing many of these corpora. The following code fragment illustrates how the Brown Corpus is accessed.</Paragraph> <Paragraph position="2"> >>> from nltk.corpus import brown >>> brown.groups() ['skill and hobbies', 'popular lore', ...]</Paragraph> <Paragraph position="4"> A selection of 5% of the Penn Treebank corpus is included with NLTK, and it is accessed as follows: >>> from nltk.corpus import treebank</Paragraph> <Paragraph position="6"/> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Processing modules </SectionTitle> <Paragraph position="0"> Each language processing algorithm is implemented as a class. For example, the ChartParser and RecursiveDescentParser classes each define a single algorithm for parsing a text. We implement language processing algorithms using classes instead of functions for three reasons.
First, all algorithm-specific options can be passed to the constructor, allowing a consistent interface for applying the algorithms. Second, a number of algorithms need to have their state initialized before they can be used. For example, the NthOrderTagger class must be initialized by training on a tagged corpus before it can be used. Third, subclassing can be used to create specialized versions of a given algorithm. Each processing module defines an interface for its task. Interface classes are distinguished by naming them with a trailing capital &quot;I,&quot; such as ParserI. Each interface defines a single action method which performs the task defined by the interface. For example, the ParserI interface defines the parse method and the TokenizerI interface defines the tokenize method. When appropriate, an interface defines extended action methods, which provide variations on the basic action method. For example, the ParserI interface defines the parse_n method, which finds at most n parses for a given sentence; and the TokenizerI interface defines the xtokenize method, which outputs an iterator over subtokens instead of a list of subtokens.</Paragraph> <Paragraph position="1"> NLTK includes the following modules: cfg, corpus, draw (cfg, chart, corpus, featurestruct, fsa, graph, plot, rdparser, srparser, tree), eval, featurestruct, parser (chart, chunk, probabilistic), probability, sense, set, stemmer (porter), tagger, test, token, tokenizer, tree, and util. Please see the online documentation for details.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Documentation </SectionTitle> <Paragraph position="0"> Three different types of documentation are available. Tutorials explain how to use the toolkit, with detailed worked examples. The API documentation describes every module, interface, class, method, function, and variable in the toolkit.
Technical reports explain and justify the toolkit's design and implementation. All are available from http://nltk.sf.net/docs.html.</Paragraph> </Section> </Section></Paper>