<?xml version="1.0" standalone="yes"?>
<Paper uid="E87-1010">
  <Title>Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text</Title>
  <Section position="4" start_page="57" end_page="59" type="concl">
    <SectionTitle>
RUNNEWTAGSET
</SectionTitle>
    <Paragraph position="0"> Statistical patXem recognition techniques have been used in many fields of scientific computing for data classification and pattern detection. In a typical application, there will be a large number of data records, each of which will have a fairly complex internal structure; the task is to somehow group together sets of data records with 'similar' internal structures, and/or to note types of internal structures which occur frequently in data records. For example, a speech pattern recognition system is 'trained' with repeated examples of each word in its vocabulary to recognise the stereotypical structure of the given speech signal, and then when given a 'new' sound it must classify it in terms of the 'known' patterns. In attempting to devise a grarranaticai classification system for words in text, a record consists of the word itself, and its grammatical context A reasonably large sample of text such as the million-word LOB Corpus corresponds to a huge amount of data if the 'grammatical context' considered with each word is very large. The simplest model is to assume that only the single word immediately to the left and/or right of each TARGET word is important in the context; and even this oversimplification of context entails vast amounts of processing.</Paragraph>
    <Paragraph position="1"> If we assume that each word can belong to one and only one word*class, then whenever two words tend to occur in the same set of immediate (lexical) contexts, they will probably belong to the s~Lme word*class. This idea was tested using a suite of programs called RUNNEWTAGSET to group words in a c200,000-word subsection of the LOB Corpus into word*classes. The system only attempted to classify wordforms which occurred a hundred times or more, the minimum sample size for lexical collocation analysis suggested by Sinclair et al (70). All possible pairings of one wordfurm with another wordform (wl,w2) were compared: if the immediate lexical contexts in which wl occurred were significantly similar to the immediate contexts of w2, the two were deemed to belong to the same word*class, and the two context-sets were merged. A threshold was used to test &amp;quot;significant similarity&amp;quot;; initially, only words which occurred very frequently in the same contexts were classified together, but then the threshold was lowered in stages, allowing less and less similar context-sets to be merged at each stage.</Paragraph>
    <Paragraph position="2"> Unfortunately, the 200,000-word sample turned out to be far too small for conclusive results: even in a sample of this size, only 175 words occur 1(30 times or more. However, this program run took several weeks, so it was impractical to try a much larger text sample. There were some promising trends; for example, at the initial threshold level, &lt;will should could must may might&gt;, &lt;in for on by at during&gt;, &lt;is was&gt;, &lt;had has:,, &lt;it he there&gt;, &lt;they we&gt;, &lt;but if when while&gt;, &lt;make take&gt;, &lt;end use point question&gt;, and &lt;sense number&gt; were grouped into word-classes on the basis of their immediate lexical contexts, and in subsequent reductions of the threshold these classes were enlarged and new classes were added. However, even if the mammoth computing requirements could be met, this approach to automatic generation of a tagset or word*classification system is unlikely to be wholely successful because it tries to assign every word to one and only one word*class, whereas intuitively many words can have more than one possible tag.</Paragraph>
    <Paragraph position="3"> For example, this technique will tend to form three separate classes for nouns, verbs, and words which can function in both ways. For further details of the RUNNEWTAGSET experiment, see (Atwell 86a, 86b).</Paragraph>
    <Paragraph position="4"> Baker's algorithm Baker (75, 79) gives a technique which might in theory solve this problem. Baker showed that if we assume that a language is generated by a Markov process, then it is theoretically possible, given a sufficiently large sample of data, to automatically calculate the parameters of a Markov model compatible with the data. Baker's method was proposed as a technique for automatic training of the parameters of a model of an acoustic processor, but it could in theory be applied to the syntactic description of text. In Baker's technique, the principle parameters of the Markov model were two matrices, a(i,j) and b(i,j,k). For the word-tagging application, i and j correspond to tags, while k corresponds to a word; a(i,j) is the probability of tag i being followed by tag j, and b(i,j,k) is the probability of a word with tag i being followed by the word k with tag j. a(i,j) is the direct equivalent of the tag-pair matrix in the CLAWS model above, b(i,j,k) is analogous to the wordlist, except  that the information associated with each word is more detailed: instead of just a relative frequency for each tag that can appear with the word, there is a frequency for every possible pair of &lt;previous tag - this tag&gt;. Baker's model is mathematically equivalent to the one used in CLAWS; and it has the advantage that if the true matrices a(i,j) and b(i,j,k) are not known, then they can be calculated by analysing raw text. We start with initial estimates for each value, and then use an iterative procedure to repeatedly improve on these estimates of a(i,j) and b(i,j,k).</Paragraph>
    <Paragraph position="5"> Unfortunately, although this grammar discovery procedure might work in theory, the amount of computation in practice rams out to be vast We must iteratively estimate a likelihood for every &lt;tag-tag&gt; pair for a(i,j), and for every possible &lt;tag-tag-word&gt; triple for h(i,j,k). Work on tagging the LOB Corpus has shown that a tag-set of the order of 133 tags is reasonable for English (if we include separate tags for different inflections, since different inflexJons can appear in distinguishable syntactic contexts). Furthermore, the LOB Corpus has roughly 50,000 word-forms in it (counting, for example, &amp;quot;man&amp;quot;, &amp;quot;men&amp;quot;, &amp;quot;roans&amp;quot;, &amp;quot;manned&amp;quot;, &amp;quot;manning&amp;quot;, etc as separate wordfonns). Working from the 'raw' LOB Corpus, we would have to estimate c18,000 values for a(i,j), and 900,000,000 values for b(i,j,k). As the process of estimating each a(i,j) and b(i,j,k) value is in itself computationally expensive, it is impractical to use Baker's formulae unmodified to automatically extract word-classes from the LOB Corpus.</Paragraph>
    <Paragraph position="6"> Grouping by suffix To cut down the number of variables, we tried the simplifying assumption that the last five letters of a word determine which grammatical class(es) it belongs to. In other words, we assumed words ending in the same suffix shared the same wordclass; a not unreasonable assumption, at least for English. CLAWS was able to assign grammatical classes to almost any given word using a wordlist of only c7000 words supplemented by a suffixliat, so the assumption seemed intuitively reasonable for most words. To further reduce the computation, we used tag-pair probabilities from the tagged LOB Corpus to initialise a(i,j): by using 'sensible' starting values rather than completely arbitrary ones, convergence should have been much more rapid. Unfortunately, there were still far too many interdependent variables for computation in a reasonable time: we estimated that even with a single LOB text instead of the complete Corpus, the first iteration alone in Baker's scheme would take c66 hours\[ Alternative constraints An alternative approach was to abandon Baker's algorithm and introduce other constraints into the First Order Markov model. Another intuitively acceptable constraint was to allow each word to belong to only a small number of possible word classes (Baker's algorithm allowed words to belong to many different classes, up to the total number of classes in the system). This allowed us to try entirely different algorithms suggested by (Wolff 76) and (Wolff 78), based on the assumption that the claas(es) a word belongs to are determined by the immediate contexts that word appears in in the example texts. Unfortunately, these still involved prohibitive computing times. Wolffs second model was the more successful of the two, coming up with putative classes such as &lt;and at for in of to&gt;, &lt;had was&gt;, &lt;a an it one the&gt;, &lt;at by in not on to with&gt; and &lt;but he i it one there&gt;; yet our implementation took 5 hours CPU time to extract these classes from an 11,000 word sample.</Paragraph>
    <Paragraph position="7"> Heuristic constraints We are beginning to investigate alternative strategies; for instance, Artificial Intelligence techniques such as heuristics to reduce the 'search space' would seem appropriate.</Paragraph>
    <Paragraph position="8"> However, any heuristics must not be tied too closely to our intuitive knowledge of the English language, or else the resultant grammar discovery procedure will effectively have some of the grammar '&amp;quot;ouilt in&amp;quot; to it. For example, one might try constraining the number of tags allowed for each specific word (e.g &amp;quot;the&amp;quot;, &amp;quot;of&amp;quot;, &amp;quot;sexy&amp;quot; can have only one tag; &amp;quot;to&amp;quot;, &amp;quot;her&amp;quot;, &amp;quot;book&amp;quot; have two possible tags; &amp;quot;cold&amp;quot;, &amp;quot;base&amp;quot;, &amp;quot;about&amp;quot; have three tags; &amp;quot;hack&amp;quot;, &amp;quot;bid&amp;quot;, &amp;quot;according&amp;quot; have four tags; &amp;quot;hound&amp;quot;, &amp;quot;beat&amp;quot;, &amp;quot;round&amp;quot; have five tags; and so on); but this is clearly against the spirit of a tvaly automatic discovery procedure in the Chomskyan sense. A more 'acceptable' constraint would be a general limit of, say, up to five tags per word. A discovery procedure would start by assuming that the context-set of every word could be partitioned into five subsets, and then it would attempt a Prolog-style 'unification' of pairs of similar context-subsets, using belief revision techniques from Artificial Intelligence (see, for example, (Drakos 86)).</Paragraph>
    <Paragraph position="9">  Overall, we concede that the case for statistical pattern-matching for syntactic classification is not proven. However, there have been some promising results, which deserve further investigation, since there would be useful applications for any successful pattern recognition technique for the acquisition of a grammatical classification system from Unrestricted English text.</Paragraph>
    <Paragraph position="10"> Note that variables in formulae mentioned above such as i and j are not tag names (NN, VB, ete), but just integers denoting positions in a tag-pair matrix. In a Markov model,  a tag is defined entirely by its couccurrence likelihoods with other tags, and with words: labels like NN, VB will not be generated by a pattern recognition technique. However, if we assumed initially that there are 133 tags, e.g. if we initialised a(i,j) to a 133&amp;quot;133 matrix, then hopefully there should be some correlation between distributions of tags in the LOB tagset and the automatically generated tagset. If there is poor correlation for some tags (e.g. if the automatically-derived tagset includes some tags whose collocational distributions are unlike those of any of the tags used in the LOB Corpus), then this constitutes empirical, objective evidence that the LOB tagset could be improved upon.</Paragraph>
    <Paragraph position="11"> In general, any alternative wordclass system could be empirically assessed in an analogous way. The Longman Dictionary of Contemporary English (LDOCE; Procter 78) and the Oxford Advanced Learner's Dictionary of Cunent English (OALD; Hornby 74) give detailed grammatical codes with each entry, but the two classification systems are quite different; if samples of text tagged according to the LDOCE and OALD tag.sets were available, a pattern recognition technique might give us an empirical, objective way to compare and assess the classification systems, and suggest particular areas for improvement in forthcoming revised editions of LPSX~E and OALD. This would be particularly useful for Machine Readable versions of such dictionaries, for use in Natural Language Processing systems (see, for example, (Akkerman et al 85), (Alshawi et ai 85), (Atweil forthcoming a)); these could be tailored to a given application domain (semi-)automatically.</Paragraph>
    <Paragraph position="12"> Even though the experiments mentioned achieved only limited success in discovering a complete grammatical classification system, a more restricted (and hence more achievable) aim is to concentrate on specific word classes which are traditionally recognised as difficult to define. For example, the techniques were particularly successful at finding groups of words corresponding to invariant function word classes, such as particles; Atwell (forthcoming c) explores this further.</Paragraph>
    <Paragraph position="13"> A bottleneck in commercial exploitation of current research ideas in NIP is the problem of tailoring systems to specialised linguistic registers, that is, application-specific variations in lexicon and grammar. This research, we hope, points the way to (semi-)automating the solution for a wide range of applications (such as described, for example, by Atwell (86d)). Particularly appropriate to the approach outlined in this paper are applications systems based on statistical models of grammar, such as (Atwell 86c). If grammar discovery can be made to work not just for variant registers of English, but for completely different languages as wall, then it may be possible to automate (or at least greatly simplify) the transfer of systems such as that described by Atweil (86c) to a wide variety of natural languages.</Paragraph>
    <Paragraph position="14"> Conclusion Automatic grammar discovery procedures are a tantalising possibility, but the techniques we have tried so far are far from perfect. It is worth continuing the search because of the enormous potential benefits: a discovery procedure would provide a solution to a major bottleneck in commercial exploitation of NLP technology. We are keen to find collaborators and sponsors for further research.</Paragraph>
  </Section>
</Paper>