<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1022">
  <Title>A syntax-based part-of-speech analyser</Title>
  <Section position="3" start_page="157" end_page="160" type="metho">
    <SectionTitle>
2 System description
</SectionTitle>
    <Paragraph position="0"> The tagger consists of the following sequential components:</Paragraph>
    <Section position="1" start_page="157" end_page="158" type="sub_section">
      <SectionTitle>
2.1 Morphological analysis
</SectionTitle>
      <Paragraph position="0"> The tokeniser is a rule-based system for identifying words, punctuation marks, document markers, and fixed syntagms (multiword prepositions, certain compounds etc.).</Paragraph>
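To make the fixed-syntagm step concrete, here is a minimal sketch of longest-match multiword tokenisation; the multiword entries and the three-word window are invented for illustration and do not reflect the actual ENGCG tokeniser.

```python
# Longest-match lookup of fixed syntagms before ordinary tokenisation;
# the multiword entries and window size are invented for illustration.

MULTIWORDS = {("in", "spite", "of"), ("because", "of")}
MAX_LEN = 3   # longest multiword entry, in words

def tokenise(words):
    tokens, i = [], 0
    while i != len(words):
        for n in range(MAX_LEN, 1, -1):      # try longer matches first
            candidate = tuple(words[i:i + n])
            if candidate in MULTIWORDS:
                tokens.append("_".join(candidate))
                i += n
                break
        else:
            tokens.append(words[i])          # no multiword matched here
            i += 1
    return tokens

print(tokenise("he left in spite of the rain".split()))
```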
      <Paragraph position="1"> The morphological description consists of two rule components: (i) the lexicon and (ii) heuristic rules for analysing unrecognised words.</Paragraph>
      <Paragraph position="2"> The English Koskenniemi-style lexicon contains over 80,000 lexical entries, each of which represents all inflected and some derived surface forms. 3However, CLAWS4 (Leech, Garside and Bryant 1994) leaves some ambiguities unresolved; it uses portmanteau tags for representing them.</Paragraph>
      <Paragraph position="3">  The lexicon employs 139 tags mainly for part of speech, inflection and derivation; for instance:</Paragraph>
      <Paragraph position="5"> The morphological analyser produces about 180 different tag combinations. To contrast the ENGCG morphological description with the well-known Brown Corpus tags: ENGCG is more distinctive in that a part-of-speech distinction is spelled out in the description of (i) determiner-pronoun, (ii) preposition-conjunction, (iii) determiner-adverb-pronoun, and (iv) subjunctive-imperative-infinitive-present tense homographs. On the other hand, ENGCG does not spell out part-of-speech ambiguity in the description of (i) -ing and nonfinite -ed forms, (ii) noun-adjective homographs with similar core meanings, or (iii) abbreviation-proper noun-common noun homographs.</Paragraph>
      <Paragraph position="6"> &amp;quot;Morphological heuristics&amp;quot; is a rule-based module for the analysis of those 1-5% of input words not represented in the lexicon. This module employs ordered hand-crafted rules that base their analyses on word shape. If none of the pattern rules applies, a nominal reading is assigned as a default.</Paragraph>
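The heuristic module can be sketched as an ordered list of word-shape rules falling back to a nominal default; the patterns and tag strings below are invented examples, not the actual ENGCG heuristics.

```python
import re

# Ordered word-shape rules with a nominal default; the patterns and the tag
# strings are invented examples, not the actual ENGCG heuristic rules.
HEURISTIC_RULES = [
    (re.compile(r".+ly$"), ["ADV"]),                 # derivational -ly: adverb
    (re.compile(r".+(tion|ness|ment)$"), ["N NOM SG"]),
    (re.compile(r".+ed$"), ["V PAST", "PCP2"]),      # ambiguous -ed form
    (re.compile(r"^[A-Z]"), ["N NOM SG PROP"]),      # capitalised: proper noun
]

def analyse_unknown(word):
    """Assign readings to a word not found in the lexicon."""
    for pattern, readings in HEURISTIC_RULES:
        if pattern.search(word):
            return readings
    return ["N NOM SG"]   # fall back to a nominal reading

print(analyse_unknown("slowly"), analyse_unknown("brillig"))
```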
    </Section>
    <Section position="2" start_page="158" end_page="158" type="sub_section">
      <SectionTitle>
2.2 ENGCG disambiguator
</SectionTitle>
      <Paragraph position="0"> A Constraint Grammar can be viewed as a collection 4 of pattern-action rules, no more than one for each ambiguity-forming tag. Each rule specifies one or more context patterns, or &amp;quot;constraints&amp;quot;, where the tag is illegitimate. If any of these context patterns are satisfied during disambiguation, the tag is deleted; otherwise it is left intact. The context patterns can be local or global, and they can refer to ambiguous or unambiguous analyses. During disambiguation, the context can become less ambiguous. To help a pattern defining an unambiguous context match, several passes are made over the sentence during disambiguation.</Paragraph>
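A minimal sketch of the pattern-action idea, assuming a toy tag set and a single hypothetical constraint; real ENGCG constraints and their context language are far richer than this.

```python
# Minimal Constraint Grammar sketch (illustrative; the real ENGCG constraint
# language is far richer). Each word carries a set of candidate tags; a
# constraint deletes a tag when its context pattern is satisfied. Several
# passes let newly unambiguous words help later pattern matches.

def unambiguous(cohort, tag):
    return cohort == {tag}

def constraint_det_not_verb(sentence, i):
    """Delete V when the previous word is unambiguously DET (hypothetical)."""
    if i and unambiguous(sentence[i - 1], "DET"):
        return "V"
    return None

def disambiguate(sentence, constraints, passes=3):
    for _ in range(passes):
        for i, cohort in enumerate(sentence):
            for rule in constraints:
                tag = rule(sentence, i)
                # a constraint may never remove the last remaining reading
                if tag in cohort and len(cohort) != 1:
                    cohort.discard(tag)
    return sentence

print(disambiguate([{"DET"}, {"N", "V"}], [constraint_det_not_verb]))
```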
      <Paragraph position="1"> The current English grammar contains 1,185 linguistic constraints on the linear order of morphological tags. Of these, 844 specify a context that extends beyond the neighboring word; in this limited sense, 71% of the constraints are global.</Paragraph>
      <Paragraph position="2"> Interestingly, the constraints are partial and often negative paraphrases of 23 general, essentially syntactic generalisations about the form of the noun phrase, the prepositional phrase, the finite verb chain etc. (Voutilainen 1994).</Paragraph>
      <Paragraph position="3"> 4Actually, it is possible to define additional heuristic rule collections that can optionally be applied after the more reliable ones for resolving remaining ambiguities.</Paragraph>
      <Paragraph position="4"> The grammar avoids risky predictions; therefore 3-7% of all words remain ambiguous (an average of 1.04-1.08 alternative analyses per output word).</Paragraph>
      <Paragraph position="5"> On the other hand, at least 99.7% of all words retain the correct morphological analysis. Note in passing that the ratio 1.04-1.08/99.7% compares very favourably with other systems; cf. 3.0/99.3% by POST (Weischedel et al. 1993) and 1.04/97.6% or 1.09/98.6% by de Marcken (1990).</Paragraph>
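The quoted figures relate total surviving readings and retained correct analyses to the word count; the counts below are invented, chosen only to land inside the reported ranges.

```python
# Illustrative arithmetic behind the figures quoted above; the counts are
# invented, chosen only to fall inside the reported ranges.
words = 10000
analyses_kept = 10600        # total readings surviving in the output
correct_retained = 9970      # words whose correct reading survived

readings_per_word = analyses_kept / words    # within the 1.04-1.08 band
recall = 100.0 * correct_retained / words    # the "99.7% of all words" figure
print(readings_per_word, recall)
```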
      <Paragraph position="6"> There is an additional collection of 200 optionally applicable heuristic constraints that are based on simplified linguistic generalisations. They resolve about half of the remaining ambiguities, increasing the overall error rate to about 0.5%.</Paragraph>
      <Paragraph position="7"> Most of even the remaining ambiguities are structurally resolvable. ENGCG leaves them pending mainly because it is prohibitively difficult to express certain kinds of structural generalisation using the available rule formalism and grammatical representation.</Paragraph>
    </Section>
    <Section position="3" start_page="158" end_page="160" type="sub_section">
      <SectionTitle>
2.3 Syntactic analysis
</SectionTitle>
      <Paragraph position="0"> Syntactic analysis is carried out in another reductionistic parsing framework known as Finite-State Intersection Grammar (Koskenniemi 1990; Koskenniemi, Tapanainen and Voutilainen 1992; Tapanainen 1992; Voutilainen and Tapanainen 1993; Voutilainen 1994). A short introduction follows. * Here, too, syntactic analysis means the resolution of structural ambiguities. Morphological, syntactic and clause boundary descriptors are introduced as ambiguities with simple mappings; these ambiguities are then resolved in parallel.</Paragraph>
      <Paragraph position="1"> * The formalism does not distinguish between various types of ambiguity; nor are ambiguity class specific rule sets needed. A single rule often resolves all types of ambiguity, though superficially it may look e.g. like a rule about syntactic functions.</Paragraph>
      <Paragraph position="2"> * The grammarian can define constants and predicates using regular expressions. For instance, the constants &amp;quot;.&amp;quot; and &amp;quot;..&amp;quot; accept any features within a morphological reading and a finite clause (that may even contain centre-embedded clauses), respectively. Constants and predicates can be used in rules, e.g. implication rules that are of the form</Paragraph>
      <Paragraph position="4"> Here X, LC1, RC1, LC2 etc. are regular expressions. The rule reads: &amp;quot;X is legitimate only if it occurs in context LC1 _ RC1 or in context LC2 _ RC2 ... or in context LCn _ RCn&amp;quot;.</Paragraph>
      <Paragraph position="5">  * The ambiguous sentences, too, are represented as regular expressions.</Paragraph>
      <Paragraph position="6"> * Before parsing, rules and sentences are compiled into deterministic finite-state automata. * Parsing means intersecting the (ambiguous) sentence automaton with each rule automaton. Those sentence readings accepted by all rule automata are proposed as parses.</Paragraph>
      <Paragraph position="7"> * In addition, heuristic rules can be used for ranking alternative analyses accepted by the strict rules.</Paragraph>
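The compile-and-intersect steps above can be approximated at a much smaller scale by enumerating the readings of an ambiguous sentence and keeping those accepted by every rule; the cohorts and rules below are toy examples, and real parsing operates on automata rather than enumerated readings.

```python
from itertools import product

# Toy model of intersection parsing: the real system intersects deterministic
# finite-state automata, while here the ambiguous sentence is expanded into
# candidate reading sequences and each "rule automaton" is a predicate.

sentence = [("DET",), ("N", "V"), ("N", "V")]   # one cohort of readings per word

def rule_det_then_nominal(reading):
    """Reject any reading where a determiner is directly followed by a verb."""
    for a, b in zip(reading, reading[1:]):
        if a == "DET" and b == "V":
            return False
    return True

def rule_has_verb(reading):
    """Accept only readings containing a verb somewhere."""
    return "V" in reading

rules = [rule_det_then_nominal, rule_has_verb]
parses = [r for r in product(*sentence) if all(rule(r) for rule in rules)]
print(parses)   # the readings surviving every rule play the role of parses
```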
      <Paragraph position="8">  The grammatical representation used in the Finite State framework is an extension of the ENGCG syntax. Surface-syntactic grammatical relations are encoded with dependency-oriented functional tags. Functional representation of phrases and clauses has been introduced to facilitate the expression of syntactic generalisations. The representation is introduced in (Voutilainen and Tapanainen 1993; Voutilainen 1994); here, only the main characteristics are given: * Each word boundary is explicitly represented as one of five alternatives: the sentence boundary &amp;quot;@@&amp;quot;; the boundary separating juxtaposed finite clauses &amp;quot;@/&amp;quot;; the boundaries &amp;quot;@&lt;&amp;quot; and &amp;quot;@&gt;&amp;quot; flanking centre-embedded (sequences of) finite clauses; and the plain word boundary &amp;quot;@&amp;quot;. * Each word is furnished with a tag indicating a surface-syntactic function (subject, premodifier, auxiliary, main verb, adverbial, etc.). All main verbs are furnished with two syntactic tags, one indicating their main verb status, the other indicating the function of the clause. * An explicit difference is made between finite and nonfinite clauses. Members in nonfinite clauses are indicated with lower-case tags; the rest with upper case.</Paragraph>
      <Paragraph position="9"> * In addition to syntactic tags, morphological tags, e.g. for part of speech, are provided for each word. Let us illustrate with a simplified example.</Paragraph>
      <Paragraph position="10">  Here Mary is a subject in a finite clause (hence the upper case); told is a main verb in a main clause; the, fat and butcher's are premodifiers; wife and daughters are indirect objects; that is a subordinating conjunction; remembers is a main verb in a finite clause that serves the Object role in a finite clause (the regent being told); seeing is a main verb in a nonfinite clause (hence the lower case) that also serves the Object role in a finite clause; dream is an object in a nonfinite clause; night is an adverbial. Because only boundaries separating finite clauses are indicated, there is only one sentence-internal clause boundary, &amp;quot;@/&amp;quot;, between daughters and that.</Paragraph>
      <Paragraph position="11">  This kind of representation seeks to be (i) sufficiently expressive for stating grammatical generalisations in an economical and transparent fashion and (ii) sufficiently underspecific to make for a structurally resolvable grammatical representation. For example, the present way of functionally accounting for clauses enables the grammarian to express rules about the coordination of formally different but functionally similar entities. Regarding the resolvability requirement, certain kinds of structurally unresolvable distinctions are never introduced. For instance, the premodifier tag @&gt;N only indicates that its head is a nominal in the right hand context.</Paragraph>
      <Paragraph position="12">  Here is a realistic implication rule that partially defines the form of prepositional phrases:</Paragraph>
      <Paragraph position="14"> A preposition is followed by a coordination or a preposition complement (here hidden in the constant ..PrepComp that accepts e.g. noun phrases, nonfinite clauses and nominal clauses), or it (as a 'deferred' preposition) is preceded by a passive verb chain PassVChain.. or a postmodifying clause PostModiCl.. (the main verb in a postmodifying clause is furnished with the postmodifier tag N&lt;@), or it occurs in a WH-question (i.e. in the same clause, there is a WH-word). If the tag PREP occurs in none of the specified contexts, the sentence reading containing it is discarded.</Paragraph>
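As a rough model of such an implication rule, occurrences of a tag can be checked against a disjunction of left/right context patterns; the regular expressions standing in for constants like ..PrepComp and PassVChain.. are toy stand-ins, not the grammar's actual definitions.

```python
import re

def implication_rule(x, contexts):
    """contexts: (left_regex, right_regex) pairs. The returned checker accepts
    a tag sequence only if every occurrence of x matches some context pair."""
    compiled = [(re.compile(lc), re.compile(rc)) for lc, rc in contexts]
    def accepts(tags):
        for i, tag in enumerate(tags):
            if tag == x:
                left = " ".join(tags[:i])
                right = " ".join(tags[i + 1:])
                if not any(lc.search(left) and rc.search(right)
                           for lc, rc in compiled):
                    return False
        return True
    return accepts

# Toy version of the PREP rule: a preposition is legitimate only if a
# complement follows, or a passive verb chain precedes it ("deferred" use).
prep_ok = implication_rule("PREP", [
    (r"", r"^(DET )?N\b"),        # followed by a preposition complement
    (r"\bPASS\b$", r""),          # preceded by a passive verb chain
])
print(prep_ok(["PREP", "DET", "N"]), prep_ok(["N", "PREP"]))
```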
      <Paragraph position="15"> A comprehensive parsing grammar is under development. Currently it accounts for all major syntactic structures of English, but in a somewhat underspecific fashion. Though the accuracy of the  grammar at the level of syntactic analysis can still be considerably improved, the syntactic grammar is already capable of resolving morphological ambiguities left pending by ENGCG.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="160" end_page="160" type="metho">
    <SectionTitle>
3 An experiment with part-of-speech disambiguation
</SectionTitle>
    <Paragraph position="0"> The system was tested against a 38,202-word test corpus consisting of previously unseen journalistic, scientific and manual texts.</Paragraph>
    <Paragraph position="1"> The finite-state parser, the last module in the system, can in principle be &amp;quot;forced&amp;quot; to produce an unambiguous analysis for each input sentence, even for ungrammatical ones. In practice, the present implementation sometimes fails to give an analysis to heavily ambiguous inputs, regardless of their grammaticality. 5 Therefore two kinds of output were accepted for the evaluation: (i) the unambiguous analyses actually proposed by the finite-state parser, and (ii) the ENGCG analysis of those sentences for which the finite-state parser gave no analyses. From this nearly unambiguous combined output, the success of the hybrid was measured by automatically comparing it with a benchmark version of the test corpus at the level of morphological (including part-of-speech) analysis (i.e. the syntax tags were ignored).</Paragraph>
    <Section position="1" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
3.1 Creation of benchmark corpus
</SectionTitle>
      <Paragraph position="0"> The benchmark corpus was created by first applying the preprocessor and morphological analyser to the test text. This morphologically analysed, still ambiguous text was then independently disambiguated by two experts, whose task was also to detect any errors potentially produced by the previously applied components. They worked independently, consulting written documentation of the grammatical representation when necessary. The two manually disambiguated versions were then automatically compared. At this stage, slightly over 99% of all analyses were identical. When the differences were examined collectively, it was agreed that virtually all were due to inattention. 6 One of the two corpus versions was modified to represent the consensus, and this 'consensus corpus' was used as the benchmark in the evaluation. 7</Paragraph>
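The double-annotation check described above reduces to a word-by-word agreement rate over the two independently disambiguated versions; the toy tag sequences below are invented.

```python
# Word-by-word agreement between two independently disambiguated versions;
# disagreements would then be collectively adjudicated (toy data below).

def agreement(version_a, version_b):
    same = sum(1 for a, b in zip(version_a, version_b) if a == b)
    return 100.0 * same / len(version_a)

annotator1 = ["N", "V", "DET", "N", "PREP"]
annotator2 = ["N", "V", "DET", "A", "PREP"]
print(agreement(annotator1, annotator2))
```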
    </Section>
    <Section position="2" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> The results are given in Figure 1 (next page).</Paragraph>
      <Paragraph position="1"> Let us examine the results. ENGCG accuracy was close to normal, except that the heuristic constraints (tagger D2) performed somewhat poorer than usual. 5Voutilainen and Järvinen (this volume).</Paragraph>
      <Paragraph position="3"> The finite-state parser gave an analysis to about 80% of all words. Overall, 0.6% of all words remained ambiguous (due to the failure of the finite-state parser; cf. Section 3). Parsing speed varied greatly (0.1-150 words/sec.); refinement of the Finite State software is still underway.</Paragraph>
      <Paragraph position="4"> The overall success of the system is very encouraging - 99.26% of all words retained the correct morphological analysis. Compared to the 95-97% accuracy of the best competing probabilistic part-of-speech taggers, this accuracy, achieved with an entirely rule-based description, suggests that part-of-speech disambiguation is a syntactic problem. The misanalyses have not been studied in detail, but some general observations can be made: * Many misanalyses made by the Finite State parser were due to ENGCG misanalyses (the &amp;quot;domino effect&amp;quot;).</Paragraph>
      <Paragraph position="5"> * The choice between adverbs and other categories was sometimes difficult. The distributions of adverbs and certain other categories overlap; this may explain this error type.</Paragraph>
      <Paragraph position="6"> Lexeme-oriented constraints could be formulated for some of these cases.</Paragraph>
      <Paragraph position="7"> * Some ambiguities, e.g. noun-verb and participle-past tense, were problematic. This is probably due to the fact that while the parsing grammar always requires a regent for a dependent, it is much more permissive on dependentless regents. Clause boundaries, and hence the internal structure of clauses, could probably be determined more accurately if the heuristic part of the grammar also contained rules for preferring e.g. verbs with typical complements over verbs without complements.</Paragraph>
    </Section>
  </Section>
</Paper>