<?xml version="1.0" standalone="yes"?> <Paper uid="E87-1010"> <Title>Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text</Title>
<Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> PATTERN RECOGNITION APPLIED TO THE ACQUISITION OF A GRAMMATICAL CLASSIFICATION SYSTEM FROM UNRESTRICTED ENGLISH TEXT </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="2" start_page="0" end_page="56" type="metho"> <SectionTitle> ABSTRACT </SectionTitle>
<Paragraph position="0"> Within computational linguistics, the use of statistical pattern matching is generally restricted to speech processing.</Paragraph>
<Paragraph position="1"> We have attempted to apply statistical techniques to discover a grammatical classification system from a Corpus of 'raw' English text. A discovery procedure is simpler for a simpler language model; we assume a first-order Markov model, which (surprisingly) is shown elsewhere to be sufficient for practical applications. The extraction of the parameters of a standard Markov model is theoretically straightforward; however, the huge size of the standard model for a Natural Language renders it incomputable in reasonable time. We have explored various constrained models to reduce computation, which have yielded results of varying success.</Paragraph>
<Paragraph position="2"> Pattern recognition and NLP
In the area of language-related computational research, there is a perceived dichotomy between, on the one hand, &quot;Natural Language&quot; research dealing principally with syntactic and other analysis of typed text, and on the other hand, &quot;Speech Processing&quot; research dealing with synthesis, recognition, and understanding of speech signals. This distinction is not based merely on a difference of input and/or output media, but seems also to correlate with noticeable differences in the assumptions and techniques used in research.</Paragraph>
<Paragraph position="3"> One example is in the use of statistical pattern recognition techniques: these are used in a wide variety of computer-based research areas, and many speech researchers take it for granted that such methods are part of their stock in trade. In contrast, statistical pattern recognition is hardly ever even considered as a technique to be used in &quot;Natural Language&quot; text analysis. One reason for this is that speech researchers deal with &quot;real&quot;, &quot;unrestricted&quot; data (speech samples), whereas much NLP research deals with highly restricted language data, such as examples intuited by theoreticians, or simplified English as allowed by a dialogue system, such as a Natural Language Database Query system.</Paragraph>
<Paragraph position="4"> Chomsky (57) did much to discredit the use of representative text samples or Corpora in syntactic research; he dismissed both statistics and semantics as being of no use to syntacticians: &quot;Despite the undeniable interest and importance of semantic and statistical studies of language, they appear to have no direct relevance to the problem of determining or characterizing the set of grammatical utterances&quot; (Chomsky 57 p.17). Subsequent research in Computational Linguistics has shown that Semantics is far more relevant and important than Chomsky gave credit for.</Paragraph>
<Paragraph position="5"> Phenomenal advances in computer power and capabilities mean that we can now try statistical pattern recognition techniques which would have been incomputable in Chomsky's early days.
Therefore, we felt that the case for Corpus-based statistical Pattern Recognition techniques should be reopened. Specifically, we have investigated the possibility of using Pattern Recognition techniques for the acquisition of a grammatical classification system from Unrestricted English text.</Paragraph>
<Section position="1" start_page="0" end_page="56" type="sub_section"> <SectionTitle> Corpus Linguistics </SectionTitle>
<Paragraph position="0"> A Corpus of English text samples can constitute a definitive source of data in the description of linguistic constructs or structures. Computational linguists may use their intuitions about the English language to devise a grammar of English (or of some part of the English language), and then cite example sentences from the Corpus as evidence for their grammar (or counter-evidence against someone else's grammar). Going one stage further, computational linguists may use data from a Corpus as a source of inspiration at the earlier stage of devising the rules of the grammar, relying as little as possible on intuitions about English grammatical structures (see, for example, (Leech, Garside & Atwell 83a)). With appropriate software tools to extract relevant sentences from the computerised Corpus, the process of providing evidence for (or against) a particular grammar might in theory be largely mechanised. Another way to use data from a Corpus for inspiration is to manually draw parse-trees on top of example sentences taken from the Corpus, without explicitly formulating a corresponding Context-Free or other rewrite-rule grammar.</Paragraph>
<Paragraph position="1"> These trees could then be used as a set of examples for a grammar-rule extraction program, since every subtree of mother and immediate daughters corresponds to a phrase-structure rewrite rule; such an experiment is described by Atwell (forthcoming b).</Paragraph>
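To make the subtree-to-rule correspondence concrete, here is a small Python sketch that reads rewrite rules off parse-trees encoded as nested tuples. The tree encoding, the toy sentence and the tag labels are assumptions made for illustration; this is not the extraction program described by Atwell (forthcoming b).

    # Toy sketch: given a parse-tree encoded as nested tuples
    # (label, daughter, daughter, ...), collect the rewrite rule formed by
    # each mother node and her immediate daughters.
    from collections import Counter

    def extract_rules(tree, rules):
        """Recursively collect mother -> daughters rewrite rules from one tree."""
        label, *children = tree
        # Immediate daughters are the labels of tuple children, or the words
        # themselves for lexical rules such as AT -> the.
        daughters = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        rules[(label, daughters)] += 1
        for child in children:
            if isinstance(child, tuple):
                extract_rules(child, rules)

    # A toy tree drawn over "the cat sat".
    tree = ("S",
            ("NP", ("AT", "the"), ("NN", "cat")),
            ("VP", ("VBD", "sat")))

    rules = Counter()
    extract_rules(tree, rules)
    for (mother, daughters), freq in rules.items():
        print(mother, "->", " ".join(daughters), " (seen", freq, "time(s))")

Run over a whole collection of such manually drawn trees, the counter would accumulate rule frequencies as well as the rule inventory itself.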
<Paragraph position="2"> However, the linguists must still use their expertise in theoretical linguistics to devise the rules for the grammar and the grammatical categories used in these rules. To completely automate the process of devising a grammar for English (or some other language), the computer system would have to &quot;know&quot; about theories of grammar, how to choose an appropriate model (e.g. context-free rules, Generalized Phrase Structure Grammar, transition network, or Markov process), and how to go about devising a set of rules in the chosen formalism which actually produces the set of sentences in the Corpus (and doesn't produce (too many) other sentences).</Paragraph>
<Paragraph position="3"> Chomsky (1957), in discussing the goals of linguistic theory, considered the possibility of a discovery procedure for grammars, that is, a mechanical method for constructing a grammar, given a corpus of utterances. His conclusion was: &quot;I think it is very questionable that this goal is attainable in any interesting way&quot;. Since then, linguists have proposed various different grammatical formalisms or models for the description of natural languages, and there has been no general consensus amongst expert linguists as to the 'best' model. If even human experts can't agree on this issue, Chomsky was probably right in thinking it unreasonable to expect a machine, even an 'intelligent' expert system, to be able to choose which theory or model to start from.</Paragraph>
<Paragraph position="4"> Constrained discovery procedures
However, it may still be possible to devise a discovery procedure if we constrain the computer system to a specific grammatical model. The problem is simplified further if we constrain the input to the discovery procedure to carefully chosen example sentences (and possibly counter-example non-sentences). This is the approach used, for example, by Berwick (85); his system extracted grammar rules in a formalism based on that of Marcus's PARSIFAL (Marcus 80) from fairly simple example sentences, and managed to acquire &quot;approximately 70% of the parsing rules originally hand-written for [Marcus's] parser&quot;. Unfortunately, it is not at all clear that such a system could be generalised to deal with Unrestricted English text, including deviant, idiomatic and even ill-formed sentences found in a Corpus of 'real' language data. This is the kind of problem best suited to statistical pattern matching methods.</Paragraph>
<Paragraph position="5"> The plausibility of a truly general discovery procedure, capable of working with unrestricted input, increases if we can use a very simple model to describe the language in question. Chomsky believed that English could only be described by a phrase structure grammar augmented with transformations, and clearly a discovery procedure for devising Transformational Generative grammars from a Corpus would have to be extremely complex and 'clever'.</Paragraph>
<Paragraph position="6"> More recently, (Gazdar et al 85) and others have argued that a less powerful mechanism such as a variant of phrase structure grammar is sufficient to describe English syntax. A discovery procedure for phrase structure grammars would be simpler than one for TG grammars because phrase structure grammars are simpler (more constrained) than TG grammars.</Paragraph> </Section> </Section>
<Section position="3" start_page="56" end_page="57" type="metho"> <SectionTitle> CLAWS </SectionTitle>
<Paragraph position="0"> For the more limited task of assigning part-of-speech labels to words, (Leech, Garside & Atwell 83b), (Atwell 83) and (Atwell, Leech & Garside 84) showed that an even simpler model, a first-order Markov model, will suffice.</Paragraph>
<Paragraph position="1"> This model was used by CLAWS, the Constituent-Likelihood Automatic Word-tagging System, to assign grammatical wordclass (part-of-speech) markers to words in the LOB Corpus. The LOB Corpus is a collection of 500 British English text samples, each of just over 2000 words, totalling over a million words in all; it is available in several formats (with or without word-tags associated with each word) from the Norwegian Computing Centre for the Humanities, Bergen University (see (Johansson et al 78), (Johansson et al 86)). The Markovian CLAWS was able to assign the correct tag to c. 96% of words in the LOB Corpus, leaving only a small residual of problematic constructs to be analysed manually (see (Atwell 81, 82)).
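To illustrate how a first-order Markov model can disambiguate word-tags, here is a minimal Python sketch of a Viterbi-style best-path search over word-tag and tag-pair frequencies. The tags, frequencies and scoring used below are invented for illustration and are not the actual CLAWS resources or constituent-likelihood formulae (for those, see (Atwell 83)).

    # Minimal first-order Markov disambiguation: pick the tag sequence that
    # maximises the product of tag-pair frequencies and word-tag frequencies.
    def viterbi(words, word_tags, tag_pair_freq, start_tag="^"):
        """Choose the most likely tag sequence under a first-order Markov model."""
        # paths maps each candidate tag for the current word to
        # (score, best tag sequence ending in that tag)
        paths = {start_tag: (1.0, [])}
        for word in words:
            new_paths = {}
            for tag, tag_freq in word_tags[word].items():
                new_paths[tag] = max(
                    (score * tag_pair_freq.get((prev, tag), 1e-6) * tag_freq,
                     seq + [tag])
                    for prev, (score, seq) in paths.items()
                )
            paths = new_paths
        return max(paths.values())[1]

    # Toy resources: relative word-tag frequencies and tag-pair frequencies.
    word_tags = {"the": {"AT": 1.0},
                 "ships": {"NNS": 0.7, "VBZ": 0.3},
                 "sail": {"VB": 0.6, "NN": 0.4}}
    tag_pair_freq = {("^", "AT"): 0.5, ("AT", "NNS"): 0.4, ("AT", "NN"): 0.4,
                     ("NNS", "VB"): 0.3, ("NNS", "NN"): 0.05,
                     ("VBZ", "VB"): 0.01, ("VBZ", "NN"): 0.05}

    print(viterbi(["the", "ships", "sail"], word_tags, tag_pair_freq))
    # prints ['AT', 'NNS', 'VB']

For the toy sentence "the ships sail", the tag-pair frequencies resolve "ships" to NNS rather than VBZ and "sail" to VB rather than NN.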
Although CLAWS does not yield a full grammatical parse of input sentences, this level of analysis is still useful for some applications; for example, Atwell (83, 86c) showed that the first-order Markov model could be used in detecting grammatical errors in ill-formed input English text. The main components of the first-order Markov model or grammar used by CLAWS were:
i) a set of 133 grammatical class labels or TAGS, e.g.</Paragraph> <Paragraph position="2"> NN (singular common noun) or JJR (comparative adjective);
ii) a 133 x 133 tag-pair matrix, giving the frequency of co-occurrence of every possible pair of tags (the row-sums or column-sums giving frequencies of individual tags);
iii) a wordlist associating each word with a list of possible tags (with some indication of the relative frequency of each tag where a word has more than one), supplemented by a suffixlist, prefixlist, and other default routines to deal with input words not found in the wordlist;
iv) a set of formulae to use in calculating likelihood-in-context, to disambiguate word-tags in tagging new text.
The last item, the formulae underlying the CLAWS system (see (Atwell 83)), constitutes the Markovian mathematical model, and it is too much to ask of any expert system to devise or extract this from data. At least in theory, the first three components could be automatically extracted from sample text WHICH HAS ALREADY BEEN TAGGED, providing there is enough of it (in particular, there should be many examples of each word in the wordlist, to ensure relative tag likelihoods are accurate). However, this is effectively &quot;learning by example&quot;: the tagged texts constitute examples of correct analyses, and the program extracting word-tag and tag-pair frequencies could be said to be &quot;learning&quot; the parameters of a Markov model compatible with the example data. Such a learning system is not a truly generalised discovery procedure. Ideally, we would like to be able to extract the parameters of a compatible Markov model from RAW, untagged text.</Paragraph> </Section> </Paper>
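The "learning by example" step described above, extracting word-tag and tag-pair frequencies from text which has already been tagged, can be sketched in a few lines of Python. The word_TAG token format and the toy sentence are assumptions made for illustration, not necessarily the actual LOB Corpus encoding.

    # Count tag-pair and word-tag frequencies from already-tagged sentences.
    from collections import Counter, defaultdict

    def learn_markov_parameters(tagged_sentences):
        """Extract the countable components of a first-order Markov model."""
        tag_pair_freq = Counter()            # cf. component ii), the tag-pair matrix
        word_tags = defaultdict(Counter)     # cf. component iii), the wordlist
        for sentence in tagged_sentences:
            tags = []
            for token in sentence.split():
                word, tag = token.rsplit("_", 1)   # assumed word_TAG token format
                word_tags[word.lower()][tag] += 1
                tags.append(tag)
            # Count tag pairs, with ^ and $ as sentence boundary markers.
            for prev, nxt in zip(["^"] + tags, tags + ["$"]):
                tag_pair_freq[(prev, nxt)] += 1
        return tag_pair_freq, word_tags

    pairs, lexicon = learn_markov_parameters(["The_AT cat_NN sat_VBD ._."])
    print(pairs[("AT", "NN")], dict(lexicon["cat"]))   # prints: 1 {'NN': 1}

Extracting compatible parameters from raw, untagged text, the goal stated above, is a much harder problem.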