<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0104">
  <Title>A Core-Tools Statistical NLP Course</Title>
  <Section position="4" start_page="23" end_page="23" type="metho">
    <SectionTitle>
3 Topics
</SectionTitle>
    <Paragraph position="0"> The topics covered in the course are shown in figure 1. The first week of the course was essentially a history lesson about symbolic approaches NLP, both to show their strengths (a full, unified pipeline including predicate logic semantic interpretations, while we still don't have a good notion of probabilistic interpretation) and their weaknesses (many interpretations arise from just a few rules, ambiguity poorly handled). From there, I discussed statistical approaches to problems of increasing complexity, spending a large amount of time on tree and sequence models.</Paragraph>
    <Paragraph position="1"> As mentioned above, I organized the lectures around linguistic topics rather than mathematical methods. However, given the degree to which the course focused on such foundational methods, this order was perhaps a mistake. For example, it meant that simple word alignment models like IBM models 1 and 2 (Brown et al., 1990) and the HMM model (Vogel et al., 1996) came many weeks after HMMs were introduced in the context of part-of-speech tagging. I also separated unsupervised learning into its own sub-sequence, where I now wish I had presented the unsupervised approaches to each task along with the supervised ones.</Paragraph>
    <Paragraph position="2"> I assigned readings from Jurafsky and Martin (2000) and Manning and Sch&amp;quot;utze (1999) for the first half of the course, but the second half was almost entirely based on papers from the research literature. This reflected both increasing sophistication on the part of the students and insufficient coverage of the latter topics in the textbooks.</Paragraph>
  </Section>
  <Section position="5" start_page="23" end_page="26" type="metho">
    <SectionTitle>
4 Assignments
</SectionTitle>
    <Paragraph position="0"> The key component which characterized this course was the assignments. Each assignment is described below. They are available for use by other instructors. While licensing issues with the data make it impossible to put the entirety of the assignment materials on the web, some materials will be linked from http://www.cs.berkeley.edu/~klein, and the rest can be obtained by emailing me.</Paragraph>
    <Section position="1" start_page="23" end_page="23" type="sub_section">
      <SectionTitle>
4.1 Assignment Principles
</SectionTitle>
      <Paragraph position="0"> The assignments were all in Java. In all cases, I supplied a large amount of scaffolding code which read in the appropriate data files, constructed a placeholder baseline system, and tested that baseline. The students therefore always began with a running end-to-end pipeline, using standard corpora, evaluated in standard ways. They then swapped out the baseline placeholder for increasingly sophisticated implementations. When possible, assignments also had a toy &amp;quot;miniTest&amp;quot; mode where rather than reading in real corpora, a small toy corpus was loaded to facilitate debugging. Assignments were graded entirely on the basis of write-ups.</Paragraph>
    </Section>
    <Section position="2" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
4.2 Assignment 1: Language Modeling
</SectionTitle>
      <Paragraph position="0"> In the first assignment, students built n-gram language models using WSJ data. Their language models were evaluated in three ways by  the support harness. First, perplexity on held-out WSJ text was calculated. In this evaluation, reserving the correct mass for unknown words was important. Second, their language models were used to rescore n-best speech lists (supplied by Brian Roark, see Roark (2001)). Finally, random sentences were generatively sampled from their models, giving students concrete feedback on how their models did (or did not) capture information about English. The support code intially provided an unsmoothed unigram model to get students started. They were then asked to build several more complex language models, including at least one higher-order interpolated model, and at least one model using Good-Turing or held-out smoothing. Beyond these requirements, students were encouraged to acheive the best possible word error rate and perplexity figures by whatever means they chose.1 They were also asked to identify ways in which their language models missed important trends of En1After each assignment, I presented in class an honors list, consisting of the students who won on any measure or who had simply built something clever. I initially worried about how these honors announcements would be received, but students really seemed to enjoy hearing what their peers were doing, and most students made the honors list at some point in the term.</Paragraph>
      <Paragraph position="1"> glish and to suggest solutions.</Paragraph>
      <Paragraph position="2"> As a second part to assignment 1, students trained class-conditional n-gram models (at the character level) to do the proper name identification task from Smarr and Manning (2002) (whose data we used). In this task, proper name strings are to be mapped to one of {drug, company, movie, person, location}. This turns out to be a fairly easy task since the different categories have markedly different character distributions.2 In the future, I will move this part of assignment 1 and the matching part of assignment 2 into a new, joint assignment.</Paragraph>
    </Section>
    <Section position="3" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
4.3 Assignment 2: Maximum Entropy /
POS Tagging
</SectionTitle>
      <Paragraph position="0"> In assignment 2, students first built a general maximum entropy model for multiclass classification. The support code provided a crippled maxent classifier which always returned the uniform distribution over labels (by ignoring the features of the input datum). Students replaced the crippled bits and got a correct classifier run2This assignment could equally well have been done as a language identification task, but the proper name data was convenient and led to fun error analysis, since in good systems the errors are mostly places named after people, movies with place names as titles, and so on.</Paragraph>
      <Paragraph position="1">  ning, first on a small toy problem and then on the proper-name identification problem from assignment 1. The support code provided optimization code (an L-BFGS optimizer) and feature indexing machinery, so students only wrote code to calculate the maxent objective function and its derivatives.</Paragraph>
      <Paragraph position="2"> The original intention of assignment 2 was that students then use this maxent classifier as a building block of a maxent part-of-speech tagger like that of Ratnaparkhi (1996). The support code supplied a most-frequent-tag baseline tagger and a greedy lattice decoder. The students first improved the local scoring function (keeping the greedy decoder) using either an HMM or maxent model for each timeslice. Once this was complete, they upgraded the greedy decoder to a Viterbi decoder. Since students were, in practice, generally only willing to wait about 20 minutes for an experiment to run, most chose to discard their maxent classifiers and build generative HMM taggers. About half of the students' final taggers exceeded 96% per-word tagging accuracy, which I found very impressive. Students were only required to build a trigram tagger of some kind. However, many chose to have smoothed HMMs with complex emission models like Brants (2000), while others built maxent taggers.</Paragraph>
      <Paragraph position="3"> Because of the slowness of maxent taggers' training, I will just ask students to build HMM taggers next time. Moreover, with the relation between the two parts of this assignment gone, I will separate out the proper-name classification part into its own assignment.</Paragraph>
    </Section>
    <Section position="4" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
4.4 Assignment 3: Parsing
</SectionTitle>
      <Paragraph position="0"> In assignment 3, students wrote a probabilistic chart parser. The support code read in and normalized Penn Treebank trees using the standard data splits, handled binarization of n-ary rules, and calculated ParsEval numbers over the development or test sets. A baseline left-branching parser was provided. Students wrote an agenda-based uniform-cost parser essentially from scratch. Once the parser parsed correctly with the supplied treebank grammar, students experimented with horizontal and vertical markovization (see Klein and Manning (2003)) to improve parsing accuracy. Students were then free to experiment with speed-ups to the parser, more complex annotation schemes, and so on. Most students' parsers ran at reasonable speeds (around a minute for 40 word sentences) and got final F1 measures over 82%, which is substantially higher than an unannotated tree-bank grammar will produce. While this assignment would appear to be more work than the others, it actually got the least overload-related complaints of all the assignments.</Paragraph>
      <Paragraph position="1"> In the future, I may instead have students implement an array-based CKY parser (Kasami, 1965), since a better understanding of CKY would have been more useful than knowing about agenda-based methods for later parts of the course. Moreover, several students wanted to experiment with induction methods which required summing parsers instead of Viterbi parsers.</Paragraph>
    </Section>
    <Section position="5" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
4.5 Assignment 4: Word Alignment
</SectionTitle>
      <Paragraph position="0"> In assignment 4, students built word alignment systems using the Canadian Hansards training data and evaluation alignments from the 2003 (and now 2005) shared task in the NAACL workshop on parallel texts. The support code provided a monotone baseline aligner and evaluation/display code which graphically printed gold alignments superimposed over guessed alignments. Students first built a heuristic aligner (Dice, mutual information-based, or whatever they could invent) and then built IBM model 1 and 2 aligners. They then had a choice of either scaling up the system to learn from larger training sets or implementing the HMM alignment model.</Paragraph>
    </Section>
    <Section position="6" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
4.6 Assignment Observations
</SectionTitle>
      <Paragraph position="0"> For all the assignments, I stressed that the students should spend a substantial amount of time doing error analysis. However, most didn't, except for in assignment 2, where the support code printed out every error their taggers made, by default. For this assignment, students actually provided very good error analysis. In the future, I will increase the amount of verbose er- null ror output to encourage better error analysis for the other assignments - it seemed like students were reluctant to write code to display errors, but were happy to look at errors as they scrolled by.3 A very important question raised by an anonymous reviewer was how effectively implementing tried-and-true methods feeds into new research. For students who will not be doing NLP research but want to know how the basic methods work (realistically, this is most of the audience), the experience of having implemented several &amp;quot;classic&amp;quot; approaches to core tools is certainly appropriate. However, even for students who intend to do NLP research, this hands-on tour of established methods has already shown itself to be very valuable. These students can pick up any paper on any of these tasks, and they have a very concrete idea about what the data sets look like, why people do things they way they do, and what kinds of error types and rates one can expect from a given tool. That's experience that can take a long time to acquire otherwise - it certainly took me a while. Moreover, I've had several students from the class start research projects with me, and, in each case, those projects have been in some way bridged by the course assignments. This methodology also means that all of the students working with me have a shared implementation background, which has facilitated ad hoc collaborations on research projects.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML