<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2004"> <Title>Building a Large Annotated Corpus of English: The Penn Treebank</Title> <Section position="2" start_page="0" end_page="315" type="metho"> <SectionTitle> 2. Part-of-Speech Tagging 2.1 A Simplified POS Tagset for English </SectionTitle> <Paragraph position="0"> The POS tagsets used to annotate large corpora in the past have traditionally been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags (Francis 1964; Francis and Ku~era 1982) and allows the formation of compound tags; thus, the contraction I'm is tagged as PPSS+BEM (PPSS for &quot;non-third person nominative personal pronoun&quot; and BEM for &quot;am, 'm&quot;. 2 Subsequent projects have tended to elaborate the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. 3 The rationale behind developing such large, richly articulated tagsets is to approach &quot;the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour&quot; (Garside, Leech, and Sampson 1987, p. 167).</Paragraph> <Paragraph position="1"> 2.1.1 Recoverability. Like the tagsets just mentioned, the Penn Treebank tagset is based on that of the Brown Corpus. However, the stochastic orientation of the Penn Tree-bank and the resulting concern with sparse data led us to modify the Brown Corpus tagset by paring it down considerably. A key strategy in reducing the tagset was to eliminate redundancy by taking into account both lexical and syntactic information.</Paragraph> <Paragraph position="2"> Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical redundancy. For instance, the Brown Corpus distinguishes five different forms for main verbs: the base form is tagged VB, and forms with overt endings are indicated by appending D for past tense, G for present participle/gerund, N for past participle, and Z for third person singular present. Exactly the same paradigm is recognized for have, but have (regardless of whether it is used as an auxiliary or a main verb) is assigned its own base tag HV. The Brown Corpus further distinguishes three forms of do--the base form (DO), the past tense (DOD), and the third person singular present (DOZ), 4 and eight forms of be--the five forms distinguished for regular verbs as well as the irregular forms am (BEM), are (BER), and was (BEDZ). By contrast, since the distinctions between the forms of VB on the one hand and the forms of BE, DO, and HV on the other are lexically recoverable, they are eliminated in the Penn Treebank, as shown in Table 1. 5 Mitchell P. Marcus et al. Building a Large Annotated Corpus of English Table 1 Elimination of lexically recoverable distinctions.</Paragraph> <Paragraph position="3"> sing/VB be/VB do/VB have/VB sings/VBZ is/VBZ does/VBZ has/VBZ sang/VBD was/VBD did/VBD had/VBD singing/VBG being/VBG doing/VBG having/VBG sung/VBN been/VBN done/VBN had/VBN A second example of lexical recoverability concerns those words that can precede articles in noun phrases. The Brown Corpus assigns a separate tag to pre-qualifiers (quite, rather, such), pre-quantifiers (all, half, many, nary) and both. The Penn Treebank, on the other hand, assigns all of these words to a single category PDT (predeterminer). 
Further examples of lexically recoverable categories are the Brown Corpus categories PPL (singular reflexive pronoun) and PPLS (plural reflexive pronoun), which we collapse with PRP (personal pronoun), and the Brown Corpus category RN (nominal adverb), which we collapse with RB (adverb).</Paragraph> <Paragraph position="4"> Beyond reducing lexically recoverable distinctions, we also eliminated certain POS distinctions that are recoverable with reference to syntactic structure. For instance, the Penn Treebank tagset does not distinguish subject pronouns from object pronouns even in cases where the distinction is not recoverable from the pronoun's form, as with you, since the distinction is recoverable on the basis of the pronoun's position in the parse tree in the parsed version of the corpus. Similarly, the Penn Treebank tagset conflates subordinating conjunctions with prepositions, tagging both categories as IN. The distinction between the two categories is not lost, however, since subordinating conjunctions can be recovered as those instances of IN that precede clauses, whereas prepositions are those instances of IN that precede noun phrases or prepositional phrases. We would like to emphasize that the lexical and syntactic recoverability inherent in the POS-tagged version of the Penn Treebank corpus allows end users to employ a much richer tagset than the small one described in Section 2.2 if the need arises.</Paragraph> <Paragraph position="5"> 2.1.2 Consistency. As noted above, one reason for eliminating a POS tag such as RN (nominal adverb) is its lexical recoverability. Another important reason for doing so is consistency. For instance, in the Brown Corpus, the deictic adverbs there and now are always tagged RB (adverb), whereas their counterparts here and then are inconsistently tagged as RB (adverb) or RN (nominal adverb) even in identical syntactic contexts, such as after a preposition. It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies.</Paragraph> <Paragraph position="6"> 2.1.3 Syntactic Function. A further difference between the Penn Treebank and the Brown Corpus concerns the significance accorded to syntactic context. In the Brown Corpus, words tend to be tagged independently of their syntactic function. 6 For instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas in the corresponding plural phrase the ones, ones is always tagged as NNS (plural common noun), despite the parallel function of one and ones as heads of the noun phrase. By contrast, since one of the main roles of the tagged version of the Penn Treebank corpus is to serve as the basis for a bracketed version of the corpus, we encode a word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN (singular common noun) rather than as CD (cardinal number) when it is the head of a noun phrase. 6 An important exception is there, which the Brown Corpus tags as EX (existential there) when it is used as a formal subject and as RB (adverb) when it is used as a locative adverb. In the case of there, we did not pursue our strategy of tagset reduction to its logical conclusion, which would have implied tagging existential there as NN (common noun).
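Returning to the IN example of Section 2.1.1, the sketch below shows how a user of the parsed corpus might split IN back into prepositions and subordinating conjunctions by inspecting the constituent that follows each IN, exactly as described above. It is an illustrative example only, not a tool distributed with the Treebank; the toy nested-list tree representation, the refined labels IN-PREP and IN-SCONJ, and the function names are assumptions made for the sake of the example.
# A minimal sketch, not from the Treebank tools: recover the preposition vs.
# subordinating-conjunction distinction from parse structure. Trees are toy
# nested lists of the form [label, child1, child2, ...]; leaves are
# [POS, word]. An IN followed by a clause (S or SBAR) is treated as a
# subordinating conjunction; an IN followed by an NP or PP as a preposition.

def label(node):
    return node[0]

def refine_IN(tree, refined=None):
    """Walk a toy parse tree and relabel IN leaves as IN-PREP or IN-SCONJ."""
    if refined is None:
        refined = []
    children = tree[1:]
    for i, child in enumerate(children):
        is_leaf = len(child) == 2 and isinstance(child[1], str)
        if is_leaf:
            if label(child) == "IN":
                following = children[i + 1] if i + 1 < len(children) else None
                if following is not None and label(following) in ("S", "SBAR"):
                    refined.append((child[1], "IN-SCONJ"))
                elif following is not None and label(following) in ("NP", "PP"):
                    refined.append((child[1], "IN-PREP"))
                else:
                    refined.append((child[1], "IN"))
        else:
            refine_IN(child, refined)
    return refined

# "after the meeting" (prepositional) vs. "after the meeting ended" (clausal)
pp = ["PP", ["IN", "after"], ["NP", ["DT", "the"], ["NN", "meeting"]]]
sbar = ["SBAR", ["IN", "after"],
        ["S", ["NP", ["DT", "the"], ["NN", "meeting"]], ["VP", ["VBD", "ended"]]]]
print(refine_IN(pp))    # [('after', 'IN-PREP')]
print(refine_IN(sbar))  # [('after', 'IN-SCONJ')]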
Similarly, while the Brown Corpus tags both as ABX (pre-quantifier, double conjunction), regardless of whether it functions as a prenominal modifier (both the boys), a postnominal modifier (the boys both), the head of a noun phrase (both of the boys) or part of a complex coordinating conjunction (both boys and girls), the Penn Treebank tags both differently in each of these syntactic contexts--as PDT (predeterminer), RB (adverb), NNS (plural common noun) and CC (coordinating conjunction), respectively.</Paragraph> <Paragraph position="7"> There is one case in which our concern with tagging by syntactic function has led us to bifurcate Brown Corpus categories rather than to collapse them: namely, in the case of the uninflected form of verbs. Whereas the Brown Corpus tags the bare form of a verb as VB regardless of whether it occurs in a tensed clause, the Penn Treebank tagset distinguishes VB (infinitive or imperative) from VBP (non-third person singular present tense).</Paragraph> <Paragraph position="8"> 2.1.4 Indeterminacy. A final difference between the Penn Treebank tagset and all other tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with reference to the linguistic context. So, for instance, in Katharine Hepburn's witty line Grant can be outspoken--but not by anyone I know, the presence of the by-phrase forces us to consider outspoken as the past participle of a transitive derivative of speak--outspeak--rather than as the adjective outspoken. However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one. In principle, annotators can tag a word with any number of tags, but in practice, multiple tags are restricted to a small number of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).</Paragraph> <Section position="1" start_page="315" end_page="315" type="sub_section"> <SectionTitle> 2.2 The POS Tagset </SectionTitle> <Paragraph position="0"> The Penn Treebank tagset is given in Table 2. It contains 36 POS tags and 12 other tags (for punctuation and currency symbols). A detailed description of the guidelines governing the use of the tagset is available in Santorini (1990). 7</Paragraph> </Section> <Section position="2" start_page="315" end_page="315" type="sub_section"> <SectionTitle> 2.3 The POS Tagging Process </SectionTitle> <Paragraph position="0"> The tagged version of the Penn Treebank corpus is produced in two stages, using a combination of automatic POS assignment and manual correction.</Paragraph> </Section> </Section> <Section position="3" start_page="315" end_page="319" type="metho"> <SectionTitle> 7 In versions of the tagged corpus distributed before November 1992, singular proper nouns, plural </SectionTitle> <Paragraph position="0"> proper nouns, and personal pronouns were tagged as &quot;NP,&quot; &quot;NPS,&quot; and &quot;PP,&quot; respectively.
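For users still working with tagged files distributed before November 1992, a renaming pass along the following lines brings the old proper-noun and personal-pronoun tags into line with the current ones. This is a hypothetical convenience sketch, not part of the Treebank distribution; the function name and the word/TAG text format it assumes are illustrative.
# Hypothetical helper, not part of the Treebank distribution: rename the
# pre-November-1992 tags NP, NPS, and PP to the current NNP, NNPS, and PRP.
OLD_TO_CURRENT = {"NP": "NNP", "NPS": "NNPS", "PP": "PRP"}

def modernize_line(tagged_line):
    """Rename old POS tags in a line of word/TAG tokens."""
    out = []
    for token in tagged_line.split():
        if "/" not in token:
            out.append(token)
            continue
        word, _, tag = token.rpartition("/")
        out.append(word + "/" + OLD_TO_CURRENT.get(tag, tag))
    return " ".join(out)

print(modernize_line("Mr./NP Vinken/NP is/VBZ chairman/NN of/IN Elsevier/NP"))
# Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP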
The current tags &quot;NNP,&quot; &quot;NNPS,&quot; and &quot;PRP&quot; were introduced in order to avoid confusion with the syntactic tags &quot;NP&quot; (noun phrase) and &quot;PP&quot; (prepositional phrase) (see Table 3).</Paragraph> <Paragraph position="1"> Table 2
The Penn Treebank POS tagset.
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol (mathematical or scientific)
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
37. # Pound sign
38. $ Dollar sign
39. . Sentence-final punctuation
40. , Comma
41. : Colon, semi-colon
42. ( Left bracket character
43. ) Right bracket character
44. &quot; Straight double quote
45. ` Left open single quote
46. `` Left open double quote
47. ' Right close single quote
48. '' Right close double quote
2.3.1 Automated Stage. During the early stages of the Penn Treebank project, the initial automatic POS assignment was provided by PARTS (Church 1988), a stochastic algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The output of PARTS was automatically tokenized 8 and the tags assigned by PARTS were automatically mapped onto the Penn Treebank tagset. This mapping introduces about 4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS tagset does not. 9 A sample of the resulting tagged text, which has an error rate of 7-9%, is shown in Figure 1.</Paragraph> <Paragraph position="2"> More recently, the automatic POS assignment is provided by a cascade of stochastic and rule-driven taggers developed on the basis of our early experience. Since these taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error rates of 2-6%.</Paragraph> <Paragraph position="3"> 2.3.2 Manual Correction Stage. The output of the automated stage is given to annotators to correct. The annotators use a mouse-based package written in GNU Emacs Lisp, which is embedded within the GNU Emacs editor (Lewis et al. 1990). 8 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm. Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their component morphemes, and each morpheme is tagged separately.
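The splitting of contractions and genitives can be pictured with a rough sketch like the one below. It is not the Treebank's own tokenizer, and the inventory of clitics in the regular expression is illustrative rather than exhaustive.
import re

# Rough sketch only (not the Treebank's tokenizer): separate a final clitic
# such as n't, 's, or 'm from its host so each morpheme can be tagged.
CLITIC = re.compile(r"(?i)(n't|'s|'m|'re|'ll|'ve|'d)$")

def split_clitics(token):
    """Return the token as a list of morphemes, splitting a final clitic if present."""
    m = CLITIC.search(token)
    if m and m.start() > 0:   # leave a bare clitic like 's alone
        return [token[:m.start()], m.group(1)]
    return [token]

print(split_clitics("I'm"))      # ['I', "'m"]
print(split_clitics("John's"))   # ['John', "'s"]
print(split_clitics("can't"))    # ['ca', "n't"]
print(split_clitics("treebank")) # ['treebank']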
Thus, children's is tagged &quot;children/NNS 's/POS,&quot; and won't is tagged &quot;wo-/MD n't/RB.&quot; 9 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives from non-third person singular present tense forms of verbs, nor prepositions from particles in cases like run up a hill and run up a bill.</Paragraph> <Paragraph position="6"> Figure 2
Sample tagged text--after correction.</Paragraph> <Paragraph position="7"> The package allows annotators to correct POS assignment errors by positioning the cursor on an incorrectly tagged word and then entering the desired correct tag (or sequence of multiple tags). The annotators' input is automatically checked against the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair separated by an asterisk. Appending the new tag rather than replacing the old tag allows us to easily identify recurring errors at the automatic POS assignment stage.</Paragraph> <Paragraph position="8"> We believe that the confusion matrices that can be extracted from this information should also prove useful in designing better automatic taggers in the future. The result of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution version of the tagged corpus, any incorrect tags assigned at the first, automatic stage are removed.</Paragraph> <Paragraph position="9"> The learning curve for the POS tagging task takes under a month (at 15 hours a week), and annotation speeds after a month exceed 3,000 words per hour.</Paragraph> <Paragraph position="10"> 3. Two Modes of Annotation--An Experiment
To determine how to maximize the speed, inter-annotator consistency, and accuracy of POS tagging, we performed an experiment at the very beginning of the project to compare two alternative modes of annotation. In the first annotation mode (&quot;tagging&quot;), annotators tagged unannotated text entirely by hand; in the second mode (&quot;correcting&quot;), they verified and corrected the output of PARTS, modified as described above. This experiment showed that manual tagging took about twice as long as correcting, with about twice the inter-annotator disagreement rate and an error rate that was about 50% higher.</Paragraph> <Paragraph position="11"> Four annotators, all with graduate training in linguistics, participated in the experiment. All completed a training sequence consisting of 15 hours of correcting followed by 6 hours of tagging. The training material was selected from a variety of nonfiction genres in the Brown Corpus. All the annotators were familiar with GNU Emacs at the outset of the experiment. Eight 2,000-word samples were selected from the Brown Corpus, two each from four different genres (two fiction, two nonfiction), none of which any of the annotators had encountered in training. The texts for the correction task were automatically tagged as described in Section 2.3. Each annotator first manually tagged four texts and then corrected four automatically tagged texts. Each annotator completed the four genres in a different permutation.</Paragraph> <Paragraph position="12"> A repeated measures analysis of annotation speed with annotator identity, genre, and annotation mode (tagging vs.
correcting) as classification variables showed a significant annotation mode effect (p = .05). No other effects or interactions were significant. The average speed for correcting was more than twice as fast as the average speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per 1,000 words were 22 vs. 42 minutes.) A simple measure of tagging consistency is inter-annotator disagreement rate, the rate at which annotators disagree with one another over the tagging of lexical tokens, expressed as the raw number of such disagreements as a percentage of the number of words in a given text sample. For a given text and n annotators, there are n(n-1)/2 disagreement ratios (one for each possible pair of annotators). Mean inter-annotator disagreement was 7.2% for the tagging task and 4.1% for the correcting task (with medians 7.2% and 3.6%, respectively). Upon examination, a disproportionate amount of disagreement in the correcting case was found to be caused by one text that contained many instances of a cover symbol for chemical and other formulas. In the absence of an explicit guideline for tagging this case, the annotators had made different decisions on what part of speech this cover symbol represented. When this text is excluded from consideration, mean inter-annotator disagreement for the correcting task drops to 3.5%, with the median unchanged at 3.6%.</Paragraph> <Paragraph position="13"> Consistency, while desirable, tells us nothing about the validity of the annotators' corrections. We therefore compared each annotator's output not only with the output of each of the others, but also with a benchmark version of the eight texts. This benchmark version was derived from the tagged Brown Corpus by (1) mapping the original Brown Corpus tags onto the Penn Treebank tagset and (2) carefully handcorrecting the revised version in accordance with the tagging conventions in force at the time of the experiment. Accuracy was then computed as the rate of disagreement between each annotator's results and the benchmark version. The mean accuracy was 5.4% for the tagging task (median 5.7%) and 4.0% for the correcting task (median 3.4%). Excluding the same text as above gives a revised mean accuracy for the correcting task of 3.4%, with the median unchanged.</Paragraph> <Paragraph position="14"> We obtained a further measure of the annotators' accuracy by comparing their error rates to the rates at which the raw output of Church's PARTS program--appropriately modified to conform to the Penn Treebank tagset--disagreed with the benchmark version. The mean disagreement rate between PARTS and the benchmark version was 9.6%, while the corrected version had a mean disagreement rate of 5.4%, as noted above. 10 The annotators were thus reducing the error rate by about 4.2%.</Paragraph> </Section> </Paper>