<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-4004">
  <Title>Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging</Title>
  <Section position="2" start_page="0" end_page="544" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> It has recently become clear that automatically extracting linguistic information from a sample text corpus can be an extremely powerful method of overcoming the linguistic knowledge acquisition bottleneck inhibiting the creation of robust and accurate natural language processing systems. A number of part-of-speech taggers are readily available and widely used, all trained and retrainable on text corpora (Church 1988; Cutting et al. 1992; Brill 1992; Weischedel et al. 1993). Endemic structural ambiguity, which can lead to such difficulties as trying to cope with the many thousands of possible parses that a grammar can assign to a sentence, can be greatly reduced by adding empirically derived probabilities to grammar rules (Fujisaki et al. 1989; Sharman, Jelinek, and Mercer 1990; Black et al. 1993) and by computing statistical measures of lexical association (Hindle and Rooth 1993). Word-sense disambiguation, a problem that once seemed out of reach for systems without a great deal of handcrafted linguistic and world knowledge, can now in some cases be done with high accuracy when all information is derived automatically from corpora (Brown, Lai, and Mercer 1991; Yarowsky 1992; Gale, Church, and Yarowsky 1992; Bruce and Wiebe 1994). An effort has recently been undertaken to create automated machine translation systems in which the linguistic information needed for translation is extracted automatically from aligned corpora (Brown et al. 1990). These are just a few of the many recent applications of corpus-based techniques in natural language processing.</Paragraph>
    <Paragraph position="1">  * Department of Computer Science, Baltimore, MD 21218-2694. E-mail: brill@cs.jhu.edu.</Paragraph>
    <Paragraph position="2"> (c) 1995 Association for Computational Linguistics  Computational Linguistics Volume 21, Number 4 Along with great research advances, the infrastructure is in place for this line of research to grow even stronger, with on-line corpora, the grist of the corpus-based natural language processing grindstone, getting bigger and better and becoming more readily available. There are a number of efforts worldwide to manually annotate large corpora with linguistic information, including parts of speech, phrase structure and predicate-argument structure (e.g., the Penn Treebank and the British National Corpus (Marcus, Santorini, and Marcinkiewicz 1993; Leech, Garside, and Bryant 1994)). A vast amount of on-line text is now available, and much more will become available in the future. Useful tools, such as large aligned corpora (e.g., the aligned Hansards (Gale and Church 1991)) and semantic word hierarchies (e.g., Wordnet (Miller 1990)), have also recently become available.</Paragraph>
    <Paragraph position="3"> Corpus-based methods are often able to succeed while ignoring the true complexities of language, banking on the fact that complex linguistic phenomena can often be indirectly observed through simple epiphenomena. For example, one could accurately assign a part-of-speech tag to the word race in (1-3) without any reference to phrase structure or constituent movement: One would only have to realize that, usually, a word one or two words to the right of a modal is a verb and not a noun. An exception to this generalization arises when the word is also one word to the right of a determiner.</Paragraph>
    <Paragraph position="4">  He will not race/VERB the car.</Paragraph>
    <Paragraph position="5"> When will the race/NOUN end? It is an exciting discovery that simple stochastic n-gram taggers can obtain very high rates of tagging accuracy simply by observing fixed-length word sequences, without recourse to the underlying linguistic structure. However, in order to make progress in corpus-based natural language processing, we must become better aware of just what cues to linguistic structure are being captured and where these approximations to the true underlying phenomena fail. With many of the current corpus-based approaches to natural language processing, this is a nearly impossible task. Consider the part-of-speech tagging example above. In a stochastic n-gram tagger, the information about words that follow modals would be hidden deeply in the thousands or tens of thousands of contextual probabilities (P(Tagi I Zagi-lZagi-2) ) and the result of multiplying different combinations of these probabilities together.</Paragraph>
    <Paragraph position="6"> Below, we describe a new approach to corpus-based natural language processing, called transformation-based error-driven learning. This algorithm has been applied to a number of natural language problems, including part-of-speech tagging, prepositional phrase attachment disambiguation, and syntactic parsing (Brill 1992; Brill 1993a; Brill 1993b; Brill and Resnik 1994; Brill 1994). We have also recently begun exploring the use of this technique for letter-to-sound generation and for building pronunciation networks for speech recognition. In this approach, the learned linguistic information is represented in a concise and easily understood form. This property should make transformation-based learning a useful tool for further exploring linguistic modeling and attempting to discover ways of more tightly coupling the underlying linguistic systems and our approximating models.</Paragraph>
  </Section>
class="xml-element"></Paper>