<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1023">
  <Title>Techniques for Accelerating a Grammar-Checker</Title>
  <Section position="3" start_page="0" end_page="155" type="metho">
    <SectionTitle>
2 Lexicalization of Error Search
</SectionTitle>
    <Paragraph position="0"> Very many of the errors to be discovered by the system can be traced to mismatches of (values of) features projected into the syntactic structure from the lexicon. Even though the error-searching capabilities of the system are not limited in principle to these lexically induced errors, for a practical implementation it turned out to be useful to narrow the error search down almost exclusively to errors of this kind, for the following reasons:
 1. the loss of generality of the system is in fact only minimal, since the majority of errors which the system is able to detect are of this nature anyway (the only exception being agreement errors involving NPs with complicated internal structure, e.g., ellipses or coordination)
 2. this loss of error coverage (almost negligible for a real application) is outweighed by a substantial gain in the overall (statistical) speed of the system, which is achieved by adding a preprocessing phase consisting of a finite state automaton passing through the input string and looking for a lexical trigger of a possible error:
  * if this automaton does not find any such trigger, the time-consuming grammar-checking process proper (i.e. parsing, possibly also reparsing with relaxed constraints) is not started at all and the sentence is immediately marked as one containing no detectable error
  * if this automaton finds such a lexical trigger of an error, it 'remembers' its nature so that in the following phases only the respective constraints are relaxed (which helps to cut down the search space, as compared to reparsing with relaxation of all constraints related to predefined errors)
 As an example of this idea, let us consider a system dealing with errors in subject-verb agreement in Czech (and, for the purpose of this example only, detecting no other errors). Since the errors of this kind that realistically occur in Czech concern the '-I/-Y' dichotomy on homophonic past tense verb endings occurring on plural verbs (the '-I' ending standing with plural masculine animate subjects, the '-Y' ending with plural masculine inanimate and feminine subjects), the preprocessing finite-state automaton marks all sentences not containing any of these forms (i.e. all sentences containing only singular verbs, or plural verbs but in present tense or in neuter gender, or non-finite verb forms) as 'containing no detectable error', without any actual grammar-checking taking place (it is, however, obvious that this does not necessarily mean that the sentences are truly correct: they just do not contain the kind of error the system is able to detect).</Paragraph>
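The preprocessing phase described above can be sketched as follows. This is only a minimal illustration, not the system's implementation: it assumes the sentence is already tokenized and morphologically tagged, and the tag names `PAST_PL_I` and `PAST_PL_Y` as well as the functions `find_triggers` and `precheck` are hypothetical.

```python
# Hypothetical morphological tags for the two homophonic plural past-tense
# endings; the tag set of the real system is not given in the text.
TRIGGER_TAGS = {"PAST_PL_I", "PAST_PL_Y"}


def find_triggers(tagged_sentence):
    """Return the set of trigger tags found in a list of (word, tag) pairs."""
    return {tag for _word, tag in tagged_sentence if tag in TRIGGER_TAGS}


def precheck(tagged_sentence):
    """Decide whether the expensive grammar check needs to run at all.

    Returns ("no detectable error", set()) when no trigger is present,
    otherwise ("parse", triggers), so that later phases can relax only the
    constraints related to the triggers that were actually seen.
    """
    triggers = find_triggers(tagged_sentence)
    if not triggers:
        return ("no detectable error", triggers)
    return ("parse", triggers)


# Singular verb only: the parser is never started.
print(precheck([("Pes", "NOUN"), ("štěkal", "PAST_SG"), (".", "PUNCT")]))
# Plural past-tense form in -y: the sentence is passed on to the parser.
print(precheck([("Ženy", "NOUN"), ("zpívaly", "PAST_PL_Y"), (".", "PUNCT")]))
```

Returning the set of triggers alongside the decision mirrors the idea that the automaton 'remembers' which kind of error may be present.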
  </Section>
  <Section position="4" start_page="155" end_page="156" type="metho">
    <SectionTitle>
3 Alternative Error-Classification and Error Search by Finite Automata
</SectionTitle>
    <Paragraph position="0"> Another important step towards the application of FSA to error detection was developing a new dimension for classifying the errors to be detected: apart from the more standard criteria of frequency and performance/competence, we developed a scale based on the complexity of the formal apparatus needed for the detection of the particular error type (for the error typology developed for the purpose of the error detection techniques used in the project, cf. (Rodríguez Selles, Gálvez, and Oliva, 1996)).</Paragraph>
    <Paragraph position="1"> On the one end of this scale were errors recognizable within a strictly local context, such as commas missing in front of a certain kind of complementizers (subordinating conjunctions) or incorrect vocalization of a preposition (in both Bulgarian and Czech, certain prepositions normally ending with a consonant get a supporting vowel when the word that follows them also starts with a consonant; the parallel in English would be the opposition between the two forms a and an of the indefinite article). On the other end of the scale we put, e.g., the general case of subject-verb agreement errors. Practically more important was the question whether there exists a class of errors with complexity of detection lying between the "trivial errors" and the errors for the detection of which a full-fledged analysis is necessary. In other words, the question is whether there exist errors for the recognition of which
 * on the one hand, a limited local context is insufficient (i.e. it is necessary to process a substring whose length cannot be set in advance, in general the whole input string),
 * on the other hand, it is not necessary to use the power of the full-fledged parser and, in particular, it is sufficient to use the power of a finite state automaton or only a slight augmentation thereof.</Paragraph>
    <Paragraph position="2"> Following some linguistic research, two such error types have been selected for implementation. While one of them is just a marginal subtype of an error in subject-verb agreement, the other is an error type of its own, and in addition one of really crucial importance for practical grammar-checking due to its high frequency of occurrence.</Paragraph>
    <Paragraph position="3"> The former error to be detected by the finite state machinery is a particular instance of an error where a plural masculine animate subject is combined with a verb in a plural feminine form (cf. also the example above). The idea of detecting some particular cases of this error by a finite state automaton results from the combination of the following observations:
 * the nominative plural form of masculine animate nouns of the declension types pán and předseda is not ambiguous (homonymous) with any other case forms (apart from the vocative case, which we shall deal with immediately below); this means that if such a form occurs in a sentence, then this form can only be
   - either a subject,
   - or a nominal predicate (with copula),
   - or a comparison to these, adjoined by means of the conjunctions jako, jakožto or coby,
   - or an exclamative expression (in nominative or vocative case)
 * due to the rules of Czech punctuation, any exclamative expression has to be separated from the rest of the sentence by commas
 * also due to the rules of Czech punctuation, two finite verbs in Czech must be separated from each other either by a comma or by one of the following coordinating conjunctions: a, i and nebo
 Hence, if we build a finite state automaton able to recognize the following substrings:
 1. &lt;unambiguous masculine animate noun in nominative plural&gt; followed by any string containing neither a finite verb form nor a comma nor one of the conjunctions a, i and nebo, followed by &lt;unambiguous past participle in plural feminine&gt;
 or (due to free word order)
 2. &lt;unambiguous past participle in plural feminine&gt; followed by any string containing neither a finite verb form nor a comma nor one of the conjunctions a, i and nebo, followed by &lt;unambiguous masculine animate noun in nominative plural&gt;
 and combine it with a simple automaton able to detect the absence of the words jako, jakožto and coby as well as the absence of any finite form of the copula být ('to be') in the sentence, then we may conclude that we have built a device able to detect whether a sentence contains a particular instance of a subject-verb agreement violation.</Paragraph>
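The two substring patterns and the sentence-level absence check can be sketched as a small three-state scanner. This is an illustration under stated assumptions rather than the authors' automaton: the input is assumed to be a list of (word, tag) pairs from an upstream tagger, and all tag names and the function `has_agreement_violation` are hypothetical.

```python
# Hypothetical tags; the text does not specify the tag set of the system.
NOUN = "N_MASC_ANIM_NOM_PL"   # unambiguous masculine animate noun, nom. pl.
PART = "PART_PL_FEM"          # unambiguous past participle, plural feminine
BLOCKER_TAGS = {"FINITE_VERB", "COMMA"}
BLOCKER_WORDS = {"a", "i", "nebo"}               # coordinating conjunctions
EXCLUDING_WORDS = {"jako", "jakožto", "coby"}    # comparison conjunctions


def has_agreement_violation(tagged_sentence):
    """Scan (word, tag) pairs left to right with the states START/NOUN/PART."""
    # Second, simpler automaton: the sentence must contain neither a
    # comparison conjunction nor a finite form of the copula.
    for word, tag in tagged_sentence:
        if word.lower() in EXCLUDING_WORDS or tag == "COPULA":
            return False

    state = "START"
    for word, tag in tagged_sentence:
        if tag in BLOCKER_TAGS or word.lower() in BLOCKER_WORDS:
            state = "START"      # the intervening string is not allowed
        elif tag == NOUN:
            if state == "PART":
                return True      # pattern 2: participle ... noun
            state = "NOUN"
        elif tag == PART:
            if state == "NOUN":
                return True      # pattern 1: noun ... participle
            state = "PART"
    return False


# Illustrative input: a masculine animate subject with a feminine participle.
print(has_agreement_violation(
    [("Předsedové", NOUN), ("včera", "ADV"), ("přišly", PART)]))  # True
```

A single state variable suffices because either pattern is completed as soon as the second element appears with no blocking token in between.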
    <Paragraph position="4"> The detection of the latter error is also based on the Czech punctuation rule prescribing that a comma or a coordinating conjunction must always occur between two finite verb forms. Hence, a simple finite state automaton checking whether a comma or a coordinating conjunction occurs between any two finite verb forms is able to detect many cases of the omission of a comma at the end of an embedded subordinate clause, which is one of the most frequent errors of all. (Of course, the word forms of the verb must be unambiguously identifiable as such, i.e. such forms as ~eu,., j~.d'u, lral.lm, holl etc. do not qualify due to their part-of-speech ambiguity, which means that this strategy cannot be used in sentences containing them.)</Paragraph>
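The check described in this paragraph amounts to a two-state scan. The following is a minimal sketch, assuming pre-tagged input in which every finite verb is unambiguously identified; the tag names and the function `missing_separator` are hypothetical.

```python
# Hypothetical tags for the tokens that may legitimately separate two
# finite verbs according to the punctuation rule quoted above.
SEPARATOR_TAGS = {"COMMA", "COORD_CONJ"}


def missing_separator(tagged_sentence):
    """Return True if two finite verbs occur with no comma or coordinating
    conjunction between them."""
    seen_verb = False
    for _word, tag in tagged_sentence:
        if tag == "FINITE_VERB":
            if seen_verb:
                return True      # second verb reached without a separator
            seen_verb = True
        elif tag in SEPARATOR_TAGS:
            seen_verb = False    # separator found, start over
    return False


# Illustrative input: the comma closing the relative clause is missing.
faulty = [("Muž", "NOUN"), (",", "COMMA"), ("který", "REL"),
          ("spí", "FINITE_VERB"), ("chrápe", "FINITE_VERB"), (".", "PUNCT")]
print(missing_separator(faulty))  # True
```

Sentences containing word forms that are ambiguous between a verb and another part of speech would have to be excluded before such a check is applied, as the text notes.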
  </Section>
  <Section position="5" start_page="156" end_page="156" type="metho">
    <SectionTitle>
4 Using FSA for Splitting a Sentence into Clauses
</SectionTitle>
    <Paragraph position="0"> The last idea for gaining efficiency is that of splitting the sentence (if possible) into clauses before processing, which has a two-fold positive effect on the overall process of grammar-checking:
 1. it is less time-consuming to parse two 'shorter' strings than one longer string (assuming that parsing is at least cubic in time, this follows trivially from the inequality A^3 + B^3 &lt; A^3 + 3A^2B + 3AB^2 + B^3 = (A+B)^3 for positive A, B, the lengths of the strings)
 2. it is possible to detect an error in one of the substrings (clauses) irrespective of the results of the analysis of (any of) the other one(s); in particular, this also holds in cases where at least one of them could not be analyzed, so that the parsing of the whole input (including the parsing with relaxed constraints) could not have been performed on the original string either, which would have prevented the error message pertinent to the substring successfully parsed with constraint relaxation from being issued.</Paragraph>
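The inequality in item 1 can be checked numerically. The sketch below assumes an idealized parser whose cost is exactly the cube of the input length; the function names `cubic_cost` and `saving_from_split` are hypothetical.

```python
def cubic_cost(n):
    """Idealized parsing cost for a string of n tokens, assuming cubic time."""
    return n ** 3


def saving_from_split(a, b):
    """Cost saved by parsing two clauses of lengths a and b separately
    instead of parsing the whole string of length a + b."""
    return cubic_cost(a + b) - (cubic_cost(a) + cubic_cost(b))


# The saving equals 3*a*b*(a + b), which is positive for positive lengths.
print(saving_from_split(10, 10))  # 8000 - 2000 = 6000
```

Under this assumption, splitting a 20-token sentence into two 10-token clauses reduces the idealized cost to a quarter.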
    <Paragraph position="1"> In particular, this means that measures are to be found which would allow for splitting the input sentence into clauses by purely superficial criteria. Obviously, this is not possible in all cases (for all sentences), but on the other hand it is also clear that in any language there exists a (statistically) huge subset of sentences where such techniques are applicable. For Czech, such an approach might be implemented using pattern matching techniques which would recognize, for example, the following patterns (and use them in an obvious way for splitting the sentence into clauses):  where the expressions have the following meaning(s):
 * &lt;any string&gt; is a variable for any string not containing elements of the following nature: finite verb or word form homonymous with a finite verb, coordinating conjunction (of any kind), complementizer, any punctuation sign
 * &lt;finite verb&gt; is a variable
   - for a main verb (not for an auxiliary) specified for person,
   - or for a past participle of a main verb;
   neither of these may be homonymous in part of speech (but they may be ambiguous within the defined class: such verbs as podrobí, proudí do qualify)
 * &lt;end of sentence&gt; is simply either a full stop, a question mark, an exclamation mark, a colon or a semicolon.</Paragraph>
    <Paragraph position="2"> All the remaining expressions have clear mnemonics, and the classes which they stand for do not contain elements which are ambiguous as to part of speech.</Paragraph>
  </Section>
</Paper>