<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1103">
  <Title>CLAWS4: THE TAGGING OF THE BRITISH NATIONAL CORPUS</Title>
  <Section position="3" start_page="0" end_page="623" type="metho">
    <SectionTitle>
2 THE DESIGN OF THE GRAMMATICAL TAGGER (CLAWS4)
</SectionTitle>
    <Paragraph position="0"> The CLAWS4 tagger is a successor of the CLAWS 1 tagger described in outline in Marshall (1983), and more fully in Garside et al (1987), and has the same basic architecture. The system (if we include input ,and output procedures) has five major sections:  (c) rule-driven contextual part-of-speech assignment null (d) probabilistie tag disambiguation \[Markov process\] null \[(c') second pass of (c)\] (e) output in intermediate form.</Paragraph>
    <Paragraph position="1">  The intermediate form of text output is the form suitable for post-editing (see 1 above; also Table 1), which can then be converted into other formats according to particular output needs, as already noted. The pre-processing section (a) is not trivial, since, in any large and varied corpus, there is a need to handle unusual text structures (such as those of many popular and technical magazines), less usual graphic features (e.g. non-roman alphabetic characters, mathematical symbols), and features of conversation transcriptions: e.g. false starts, incomplete words and utterances, unusual expletives, unplanned repetitions, and (sometimes multiple) overlapping speech.</Paragraph>
    <Paragraph position="2"> Sections (b) and (d) apply essentially a Hidden Markov Model (HMM) to the assignmeut and disambiguation of tags. But file intervening section (c) has become increasingly important as CLAWS4 tuLs developed the need for versatility across a range of text types. This task of rule-driven contextualpartof-speech assignment began in 1981 as an 'idiom-tagging' program for dealing, in the main, with parts of speech extending over more than one orthographic word (e.g. complex prepositions such as according to and complex conjunctions such as so that). In the more fully developed form it now has, this section utilises several different idiom lexicons dealing, for example, with (i) general idioms such as as much as (which is on one analysis a single coordinator, and on another analysis, a word sequence), (it) complex munes such as Dodge City and Mrs Charlotte Green (where the capital letter alone would not be enough to show that Dodge mad Green are proper nouns), (iii) foreign expressions such as annus horribilis.</Paragraph>
    <Paragraph position="3"> These idiom lexicons (with over 3000 entries in all) can match on both tags and word-tokens, employing a regular expression formalisnl at the level both of the individual item and of the sequence of items. Recognition of unspecified words with initial capitals is also incorporated. Conceptually, each entry has two parts: (a) a regular-expression-based 'template' specifying a set of conditions on sequences of word-tag pairs, and (b) a set of tag assigmnents or substitutions to be performed on any sequence matching the set of conditions in (a). Examples of entries from each of tile above kinds of idiom lexicon entry are:  Let &amp;quot;IT be any tag, and let ww be any word.</Paragraph>
    <Paragraph position="4"> Let n,m be arbitrary integers. Then:  ww TT represents a word and its associated tag , separates a word from its predecessor \[TT\] represents :m already assigned tag \[WICI represents an unspecified word with a Word Initial Capilal &amp;quot;I&amp;quot;I'/'I'T means 'either '71&amp;quot; or TT'; ww/ww means 'either WW or WW' ww'13'TT represenls an unresolved ambiguity between &amp;quot;lq' and TI&amp;quot; TT* represents a tag wilh * marking the location of unspecified ch:u'acters (\[TTI)n represents the number of words (up to n) which may optionally intervene at a giveq point in the template TTnm represents the 'ditto tag' attached to ~m orthographic word to indicate it is part of a complex sequence (e.g.so that is tagged so CJS21 , that CJS22). The variable n indicates the number of orthographic words in the sequence, and m indicates that the current word is in tile ttzth position in that sequence.</Paragraph>
    <Paragraph position="5"> (b) Ex~unples of word lags (in the C5 'basic' tagset):  (c) Explanation of the three rules above: Rule (i) ensures that following a finite form of do and (optionally) up to two adverbs, negators or ordinals, a base form of the verb is tagged as an infinitive. Rule (ii) ensures that in complex names such as Monte Alegre, Mount Pleasant, Mount Palomar Observatory, Mt Rushmore National Memorial, all the  words with word-initi,'d caps are tagged as proper nouns.</Paragraph>
    <Paragraph position="6"> Rule (iil) ensures that the Latin expression ad hoc is tagged as a single word, either an adjective or an adverb.</Paragraph>
    <Paragraph position="7"> We have also now moved to a more complex, two-pass application of these idiomlist entries. It is possible, onthe first pass, to specify ambiguous output of an idiom assignment (as is necessary, e.g., for as much as, mentioned earlier), so that this can then be input to the probabilistic disambiguation process (d). On the second pass, however, after probabilistic disambiguation, the idiom entry is deterministic in both its input and output conditions, replacing one or more tags by others. In effect, this last kind of idiom application is used to correct a tagging error arising from earlier procedures. For exanlple, a not uncommon result from Sections (a)-(d) is that the base form of the verb (e.g. carry) is wrongly tagged as a finite present tense form, rather than an infinitive. This can be retrospectively corrected by replacing VVB (= finite base form) by VVI (= infinitive) in appropriate circumstances.</Paragraph>
    <Paragraph position="8"> While the HMM-type process employed in Sections (b) and (d) affirms our faith in probabilistic methods, the growing importance of the contextual part-of-speech assigxwaent in (c) and (c') demonstrates the extent to which it is important to transcend the limitations of the orthographic word, as the basic unit of grammatical tagging, and also to selectively adopt non-probabilistic solutions. The term 'idiom-tagging' originally used was never particularly appropriate for these sections, which now handle more generally the interdependence between grammatical and lexical processing which NLP systems ultimately have to cope with, and are also able to incorporate parsing information beyond the range of the one-step Markov process (based on tag bi-gram frequences) employed in (d). 3 Perhaps the term 'phraseological component' would be more appropriate here. The need to combine probabilistic 3We have experimented with a two-step Markov process model (using tag trigrams), and found little benefit over the one-step model (using tag bigrams).</Paragraph>
    <Paragraph position="9"> and non-probabilistic methods in tagging has been widely noted (see, e.g., Voutilainen et al. 1992:14).</Paragraph>
  </Section>
  <Section position="4" start_page="623" end_page="624" type="metho">
    <SectionTitle>
3 EXTENDING ADAPTABILITY:
SPOKEN DATA AND TAGSETS
</SectionTitle>
    <Paragraph position="0"> The tagging of 10 million words of spoken data (including c.4.6 million words of conversation) presents particular challenges to the versatility of the system: renderings of spoken pronunciations such ,as 'avin' (for having) cause difficulties, as do unplanned repetitions such as I et, mean, I mean, 1 mean to go. Our solution to the latter pn~blem has been to recognize such repetitions by a special procedure, and to disregard, in most cases, the repeated occurrences of tile same word or phrase for the purposes of tagging. It has become clear that the CLAWS4 resources (lexicon, idiomllsts, and tag transition matrix), developed for written English, need to be adapted if certain frequent and rather consistent errors in the tagging of spoken data are to be avoided (words such as I, well, and right are often wrongly tagged, because their distribution in conversation differs mazkedly from that in written texts). We have moved in this direction by allowing CLAWS4 to 'slot in' different resources according to the text type being processed, by e.g. providing a separate supplementary lexicon and idlomlist for the spoken material. Eventually, probabilistic analysis of the tagged BNC will provide the necessary information for adapting datastructures at run time to the special demands of particular types of data, but there is much work to be done before this potential benefit of having tagged a large corpus is realised.</Paragraph>
    <Paragraph position="1"> The BNC tagging takes place within the context of a larger project, in which a major task (undertaken by OUCS at Oxford) is to encode the texts in a TEIconformant mark-up (CDIF). Two tagsets have been employed: one, more detailed than the other, is used for tagging a 2-million-word Core Corpus (an epitome of the whole BNC), which is being post-edited for maximum accuracy. Thus tagsets, like text formats and resources, are among the features which are task-definable in CLAWS4. In general, the system has been revised to allow many adaptive decisions to be made at run time, and to render it suitable for non-specialist researchers to use.</Paragraph>
  </Section>
  <Section position="5" start_page="624" end_page="624" type="metho">
    <SectionTitle>
4 ERROR RATES AND WHAT
THEY MEAN
</SectionTitle>
    <Paragraph position="0"> Cun~ntly, judged in terms of major categories, 4 the system has an error-rate of approximately 1.5%, and leaves c.3.3% ,'unbiguitics unresolved (as portmanteau tags) in the output. However, it is all too easy to quote error rates, without giving enough information to enable them to be properly assessed. ~ We believe that any evaluation of the accuracy of automatte grammatical tagging should take account of a number of factors, some of which arc extremely difficult to measure:</Paragraph>
    <Section position="1" start_page="624" end_page="624" type="sub_section">
      <SectionTitle>
4.1 Consistency
</SectionTitle>
      <Paragraph position="0"> It is necessary to measure tagging practice against some standard of what is an appropriate tag for a given word in a given context. For example, is horrifying in a horrifying adventure, or washing in a washing machine an adjective, a norm, or a verb participle? Only if this is specified independently, by an annotation scheme, c:m we feel confident in judging where the tagger is 'correct' or 'incorrect'.</Paragraph>
      <Paragraph position="1"> For the tagging of the LOB Corpus by the earliest version of CLAWS, the :umotation scheme was published in some detail (Johzmssou et al 1986). We are working on a similar annotation scheme document (at present a growing in-house document) for the tagging of the BNC.</Paragraph>
      <Paragraph position="2"> 4&amp;quot;lqqe error rate and ambiguity rate are less favourable if we take account of errors and ambiguities which occtu&amp;quot; within major categories. E.g. the porlmanteau tag NP0-NN1 records contidently that a word is a noun, but not whether it is a proper or common noun. If such cases are added to the count, then the estimated error rate rises lo 1.78%, and the estimated ambiguity tale to 4.60%.</Paragraph>
      <Paragraph position="3"> 5One reasonable attempt to evaluate competing accuracy of different taggcrs is that in Voutilaincn ct al (1992:11-13), where it is argued, on the basis of tlte tagging of sample written texls, that the performance of the llclsinki constraint grammar p:u'ser ENGCG is superior to that of CLAWS 1 (the e:u-liest version of CLAWS, completed in 1983), which is in turn is somewhat superior to Church's Parts tagger. While recognizing that the accuracy of the tlelsinki system is impressive, we note also that the method of evaluation (in terms of 'precision' and 'recall') employed by Voutilainen ct al in not easy to comp~u'e with the method employed here. Further, a strict attempt at measuring compmability would have to take fuller account of the 'consistency' and 'qu'dily' criteria we mention, and of the need it) compare across a broader rtmge of texts, spoke~ and written. This issue c,-umot be taken further in this paper, but will hopefully be tile bztsis of future research.</Paragraph>
    </Section>
    <Section position="2" start_page="624" end_page="624" type="sub_section">
      <SectionTitle>
4.2 Size of Tagset
</SectionTitle>
      <Paragraph position="0"> It might be supposed that tagging with a finer-grained tagset which contains more tags is more likely to produce error than tagging with a smaller and cruder tagset. Ill tile BNC project, we have used a tagset of 58 tags (the C5 tagsct) for the whole cor~ pus, ,'rod in addition we have used a larger tagset of 138 tags (the C6 tagset) 6 for the Core Corpus of 2 million words. The evidence so far is that this makes little difference to the error rate. llut size of tagset must, in the absence of more conclusive evidence, remain a factor to be considered.</Paragraph>
    </Section>
    <Section position="3" start_page="624" end_page="624" type="sub_section">
      <SectionTitle>
4.3 Discriminative Value of Tags
</SectionTitle>
      <Paragraph position="0"> Tile difficulty of grammatical tagging is directly related to the nuinber of words for which a given tag distinction is made. This measure may be called 'discriminative value'. For example, in the C5 tagset, one tag (VDI) is used for the inlinitive of just one verb -- to do -- where:ts ~mother tag (VVI) is used for the infinitive of all lexical verbs. On lhe other hand, VDB is used for linite base forms of to do (including tile present tense, imperative, and subjunctive), whereas VVB is used of finite t)ase forms of all lexic:d verbs. It is clc,'u&amp;quot; the tags VDI and VDB have a low discriminative value, whereas VVI and VVB have a high one -- since there are thousands of lexieal verbs in Englisb. It is ~dso clear that a tagset of the lowest possible discriminative value -- one which assigned a single tag to each word and a single word to each tag -- would bc utterly valueless.</Paragraph>
    </Section>
    <Section position="4" start_page="624" end_page="624" type="sub_section">
      <SectionTitle>
4.4 Linguistic Quality
</SectionTitle>
      <Paragraph position="0"> This is a very elusive, but crucial concept, tIow far are tile tags in a particular tagset valuable, by criteria either of linguistic thet~ry/description, or of usefulness in NLP? For example, tile tag VDI, mentioned ill C. above, appears trivial, but it can be argued that this is nevertheless a useful category tbr English grammar, where the verb do (unlike its equivalent in most other Europe,'m languages) has a vet-y special ftmction, e.g. in forming questions and negatives. On the other band, if we had decided to assign a special tag to the ved~ become, this would have been more questionable. Linguistic quality is,  on the face of it, determined only in a judgemental manner. Arguably, in the long term, it can be determined only by the contribution a particular tag distinction makes to Success in particular applications, such as speech recognition or machine-aided translation. At present, this issue of linguistic quality is the Achilles' heel of grammatical tagging evaluation, and we must note that without judgement on linguistic quality, evaluation in terms of b. and c. is insecurely anchored, It seems reasonable, therefore, to lump criteria b.-d. together as 'quality criteria', and to say that evaluation of tagging accuracy must be undertaken in conjunction with (i) consistency \[How far has the armotation scheme been consistently applied?\], and (ii) quality of tagging \[How good is the annotation scheme?\]. 7 Error rates are useful interim indications of success, but they have to be corroborated by checking, if only impressionistically, in terms of qualitative criteria. Our work, since 1980, has been based on the assumption that qualitative criteria count, and that it is worth building 'consensual' linguistic knowledge into the datastructures used by the tagger, to make sure that the tagger's decisions are fully informed by qualitative considerations.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>