<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1022">
  <Title>A Prototype of a Grammar Checker for Czech i</Title>
  <Section position="2" start_page="0" end_page="153" type="metho">
    <SectionTitle>
ZVOLEN6HO-SKONEi/CASE_DISAGR IN THE F
OBDOB\[-ZVOLEN6HO/CASE_DISAGR IN THE F
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> The user may get several types of messages about the correctness of the text: a) The macro changes the color of words in the text according to the type of the detected error - the unknown words are marked blue, the pairs of words involved in a syntactic error are marked red.</Paragraph>
    <Paragraph position="3"> b) The macro creates a message box with a warning each time there is an undesired result of grammar checking -- either there was no result or the sentence was too complicated.</Paragraph>
    <Paragraph position="4"> c) If the grammar checker identifies and localizes an error, it creates a message box with a short description of the error(s).</Paragraph>
    <Paragraph position="5"> Because the grammar checker runs as an independent application, the user may also look at the complete results it provides. When a message box containing an error message appears on the screen, the user may switch to GRAMMAR and get additional information. The main window of GRAMMAR can provide the complete list of errors, statistics such as the number of different syntactic trees built during grammar checking, or even the result in the form of a syntactic tree. We do not suppose that the last option is interesting for a typical user, but if we do have all this information, why should we throw it out?</Paragraph>
    <Paragraph position="7"> The architecture of the system The design of the whole system is shown in Fig. 1. The grammar checker is composed of three parts: 1. Morphological and lexical analysis This part is in fact an extended spelling checker. The input text is first checked for spelling errors, then the lexical and morphological analysis creates data, which are combined with the information contained in a separate syntactic dictionary. It would of course be possible to use only one dictionary containing morphosyntactic information about particular words (lemmas), but to make it easier to update the information during the development of the system we have decided to keep morphemic and syntactic data in separate files.</Paragraph>
    <Paragraph position="9"> Fig. 1: The architecture of the system 2. Grammar checking (extended variant of syntactic parsing) This is the main part of the system. It tries to analyze the input sentence. There are three possible results of the analysis: a) The analysis is successful and no syntactic inconsistencies were found (at this stage of processing it is too early to use the term syntactic error, because in our terminology the term error is reserved for something that is announced to the user after the evaluation) -- in this case the sentence is considered to be correct and no message is issued.</Paragraph>
    <Paragraph position="10"> b) The analysis is successful, but all results contain at least one syntactic inconsistency. In this case it is necessary to pass the results to the evaluation phase. c) The analysis fails and (probably because of the incompleteness of the grammar) it cannot say anything about the input sentence. In such a case no error message is issued. We do not use any partial results for the evaluation of the possible source of an error. Partial results are misleading, because it is often the case that the error is buried somewhere inside the partial tree and no operations performed on partial trees can provide a correct error message. Besides that, operations on hundreds or thousands of partial trees are very inefficient and can substantially slow down the processing of the given sentence.</Paragraph>
    <Paragraph position="11"> 3.Evaluation This phase takes the results of the previous phase in the form of syntactic trees containing markers describing individual syntactic inconsistencies. It tries to locate the source of the error using an algorithm that compares available trees. According to the settings given by the user the evaluation phase issues warnings or error messages.</Paragraph>
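    <Paragraph>The three possible results of the analysis can be sketched as a small decision function. This is only an illustration: the names Tree and classify and the outcome labels are assumptions made for the example and do not come from the described system.

<![CDATA[
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tree:
    """A parse result; `inconsistencies` lists the error markers it carries."""
    label: str
    inconsistencies: List[str]


def classify(trees):
    """Map a list of parse results to one of the outcomes a), b), c)."""
    if not trees:
        # c) the analysis failed: nothing can be said, no message is issued
        return "no_result"
    if any(not t.inconsistencies for t in trees):
        # a) some tree is free of inconsistencies: the sentence is accepted
        return "correct"
    # b) every tree carries an inconsistency: hand over to the evaluation
    return "evaluate"
```
]]>
    </Paragraph>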
    <Paragraph position="12"> The core of the system is the second, grammar checking phase, therefore we will concentrate on the description of that phase.</Paragraph>
    <Paragraph position="13"> Process of grammar checking The design of our system was motivated by a simple and natural idea -- the grammar checker should not spend too much time on simple correct sentences. The composition of a grammar checking module tries to stick to this idea as much as possible. The processing of an input sentence is divided into three phases: a) Positive projective This phase is in fact a standard parser -- it checks if it is possible to represent a given input sentence by means of a projective syntactic tree not containing any negative symbol (these symbols represent the application of a grammar rule with relaxed constraints or an error anticipating rule). If the answer is positive, the sentence is considered to be correct and no error message is issued.</Paragraph>
    <Paragraph position="14"> As an example we may take the following simple sentence: &amp;quot;Karlova žena zalévala květiny.&amp;quot; (Word for word translation: Charles'\[fem.sing\] wife watered flowers.) The sentence is accepted in this phase, therefore its processing ends here. The system recognizes the structure of this sentence in the following way:</Paragraph>
    <Paragraph position="16"> b) Positive nonprojective &amp; negative projective This phase tries to find a syntactic tree which contains either negative symbols or nonprojective constructions. A nonprojective subtree is a subtree with discontinuous coverage. It is often the case -- for example in wh-sentences -- that the sentence may be considered either syntactically incorrect or nonprojective -- see examples in \[COL94\]. If such a syntactic tree exists, the evaluation phase tries to decide whether there should be an error message, a warning or nothing.</Paragraph>
    <Paragraph position="17"> Let us present a slightly modified sentence from the previous paragraph: &amp;quot;Karlovy žena zalévala květiny.&amp;quot; (Word for word translation: Charles'\[fem.pl.\] wife watered flowers). This sentence is ambiguous: it is either correct and nonprojective (meaning: Woman watered Charles' flowers) or incorrect (disagreement in number between &amp;quot;Karlovy&amp;quot; and &amp;quot;žena&amp;quot;) and projective. Both results are achieved by this phase of the grammar checker. c) In the last phase both nonprojective constructions and negative symbols are allowed. If this phase succeeds, the evaluation module issues a relevant error message or warning. If none of the phases provides any result, no error message is issued. If the user wants to know which sentences were not analyzed properly, s/he may obtain a warning.</Paragraph>
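    <Paragraph>The division into phases can be pictured as a cascade that stops at the first phase which yields a tree. The sketch below is a simplification under stated assumptions: the three parse functions are toy stand-ins invented for the example, and the reuse of partial parses between phases is not modelled.

<![CDATA[
```python
def check_sentence(sentence, phases):
    """Run the phases in order and stop at the first one that yields trees.

    `phases` is a list of (name, parse_function) pairs; each parse function
    returns a list of trees (possibly empty).
    """
    for name, parse in phases:
        trees = parse(sentence)
        if trees:
            return name, trees
    return None, []


# Toy stand-ins for the three parser configurations (hypothetical).
def positive_projective(sentence):
    return ["tree"] if sentence == "Karlova žena zalévala květiny." else []


def relaxed_or_nonprojective(sentence):
    if sentence.startswith("Karlovy"):
        return ["nonprojective reading", "number disagreement reading"]
    return []


def fully_relaxed(sentence):
    return []


PHASES = [
    ("a", positive_projective),
    ("b", relaxed_or_nonprojective),
    ("c", fully_relaxed),
]
```
]]>
    A sentence accepted in phase a) never reaches the more expensive phases, which is the motivation stated for this design.</Paragraph>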
    <Paragraph position="18"> Although this division into phases worked fine for short sentences (for sentences of at most 15 words the first phase usually took about 1 second on a Pentium 75 MHz), the processing of long and complicated sentences was unacceptably slow (even tens of seconds). These results turned our attention to the problem of how to speed up the processing of correct sentences even further.</Paragraph>
    <Paragraph position="19"> With the growing length of sentences the parsing becomes more complex with respect both to the processing time and to the number of resulting syntactic structures. Let us demonstrate the problem on a sample sentence from the corpus of Czech newspaper texts from the newspaper Lidové noviny. Let us take the sentence: &amp;quot;KDS nepředpokládá spolupráci se stranou pana Sládka a není pravdou, že předseda křesťanských demokratů pan Benda v telefonickém rozhovoru s Petrem Pithartem prosazoval ing. Dejmala do funkce ministra životního prostředí.&amp;quot; (Word for word translation: &amp;quot;CDP \[does\] not suppose cooperation with party \[of\] Mister Sládek and \[it\] isn't true, that chairman \[of\] Christian democrats Mister Benda in telephone discussion with Petr Pithart enforced ing. Dejmal to function \[of\] minister \[of\] environment.&amp;quot;) In this basic form of the sentence, which is an exact transcription of the text from the corpus, the processing by the positive projective phase of our parser takes 13.07 s and provides 26 different variants of syntactic trees. During the processing 2272 items were derived. The testing of this sentence and of all the following ones was performed on a Pentium 75 MHz with 16 MB RAM.</Paragraph>
    <Paragraph position="20"> Such a relatively large number of variants is caused by the fact that our syntactic analysis uses only purely syntactic means - we do not take into account either semantics or textual or sentential context. That is the reason why free modifiers at the end of our sample sentence create a great number of variants of syntactic structures and thus make the processing longer and more complicated. In order to demonstrate this problem we will take this sentence and modify it step by step, trying to find out what the main source of inefficiency in its parsing is.</Paragraph>
    <Paragraph position="21"> If we look more closely at the number of ambiguities present with individual words, we notice that the most ambiguous word is the abbreviation &amp;quot;ing.&amp;quot; This word form is the same in all cases, genders and numbers. If we substitute this abbreviation by the full form of the word (&amp;quot;inženýra&amp;quot; \[engineer \[gen.\]\]) we get the following results: the sentence is processed in 8.95 s, the number of variants decreases by four (to 22) and the number of derived items is, of course, also smaller (1817). The gain in speed would be even greater had we worked with a negative or a nonprojective variant of the parser.</Paragraph>
    <Paragraph position="22"> The next step is to delete further groups of words from the input sentence. Among the suitable candidates there is, for example, the prepositional phrase &amp;quot;v telefonickém rozhovoru&amp;quot; (in \[the\] telephone discussion). This phrase can easily be checked for grammatical correctness locally, because it has clear left and right borders (the prepositions &amp;quot;v&amp;quot; and &amp;quot;s&amp;quot;). Here we can easily solve the problem of where the nominal group ends on the right hand side. In general, we need to parse the whole sentence in order to get this information, but in some specific cases we can rely only on the surface word order.</Paragraph>
    <Paragraph position="23"> After we had deleted this phrase, the processing time went down to 8.79 s, the same number of syntactic representations as in the previous case was derived (22) and the number of items was slightly lower (1789). This phrase is therefore certainly not the main source of inefficiency in parsing. In order to speed up the processing even more we have to use another type of simplification.</Paragraph>
    <Paragraph position="24"> The first step of simplifying the original input sentence represented almost 50% acceleration although it was only a cosmetic change from an abbreviation to a full word form. From the point of view of localisation of grammatical inconsistencies we can proceed even further: the group title+surname in fact represents only one item; if we remove titles preceding surnames we do not change the syntactic structure of the sentence. It is locally only a tiny bit simpler. When we look more closely at the resulting syntactic representation of the previous variants of the input sentence we may notice that the word &amp;quot;inženýra&amp;quot; \[engineer \[gen.\]\] figures (inadequately, of course, in this case) also as a right-hand attribute to the word &amp;quot;Pithartem\[instr.\]&amp;quot;, as shown in the following screenshots (for the sake of simplicity we show only the relevant part of the derivation trees).</Paragraph>
    <Paragraph position="26"> Let us remove the word &amp;quot;inženýra&amp;quot; from the input sentence altogether. This time the processing time is only 3.74 s, only 10 structures are created and 1021 items are derived. Another logical step is to remove all other first names and titles which are placed immediately in front of their governing words. Those words are &amp;quot;pana&amp;quot; \[mister \[gen.\]\], &amp;quot;pan&amp;quot; and &amp;quot;Petrem&amp;quot;. The claim that the first two words are unambiguous is supported by the fact that the form of the word &amp;quot;pán&amp;quot; \[mister\] in Czech differs depending on whether the word is &amp;quot;independent&amp;quot; or used as a title (pána vs. pana \[gen.,acc.\], pán vs. pan \[nom.\]). When we make this change the processing time becomes more than 50% shorter, namely 1.71 s, the number of resulting structures is half of the previous number (5) and only 587 items are derived. Another change we would like to demonstrate is the deletion of all other free modifiers, the result of which is a certain &amp;quot;backbone&amp;quot; of the sentence.</Paragraph>
    <Paragraph position="27"> After having carried out all deletions, we arrive at the following structure: &amp;quot;KDS nepředpokládá spolupráci a není pravdou, že Benda prosadil Dejmala.&amp;quot; (Word for word translation: &amp;quot;CDP \[does\] not suppose cooperation and \[it\] isn't true, that Benda enforced Dejmal.&amp;quot;) The result of the processing is a unique structure and 141 items are derived in 0.22 s. The last variant of the input sentence will serve as a contrast to the previous ones. Let us take the last clause of the sentence, namely &amp;quot;Předseda křesťanských demokratů pan Benda v telefonickém rozhovoru s Petrem Pithartem prosazoval inženýra Dejmala do funkce ministra životního prostředí.&amp;quot; \[&amp;quot;Chairman \[of\] Christian democrats Mister Benda in telephone discussion with Petr Pithart enforced ing. Dejmal to function \[of\] minister \[of\] environment.&amp;quot;\].</Paragraph>
    <Paragraph position="28"> If we take into account the results of the previous examples we should not be surprised by the results. The processing time is 2.25 s, 10 structures were created and 722 items were derived.</Paragraph>
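    <Paragraph>The measurements quoted in the preceding paragraphs can be collected in one place and compared. The figures below are the ones reported in the text; the short variant labels and the helper time_reduction are only illustrative names chosen for this sketch.

<![CDATA[
```python
# (variant, seconds, trees, derived items) as reported in the text
RUNS = [
    ("original sentence", 13.07, 26, 2272),
    ("'ing.' replaced by full form", 8.95, 22, 1817),
    ("prepositional phrase removed", 8.79, 22, 1789),
    ("title removed", 3.74, 10, 1021),
    ("first names and titles removed", 1.71, 5, 587),
    ("backbone only", 0.22, 1, 141),
    ("last clause alone", 2.25, 10, 722),
]


def time_reduction(before, after):
    """Relative reduction of processing time, as a fraction of `before`."""
    return (before - after) / before


baseline = RUNS[0][1]
for name, seconds, trees, items in RUNS:
    saved = 100 * time_reduction(baseline, seconds)
    print(f"{name:32s} {seconds:6.2f} s  {trees:3d} trees  {items:5d} items  "
          f"{saved:5.1f}% less time than the original")
```
]]>
    </Paragraph>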
    <Paragraph position="29"> This example and also other test data showed that the main source of inefficiency are clauses with a large number of free modifiers and adjuncts rather than complex sentences with many clauses. These results have led us to a layered design of the grammar for positive projective parsing. The core idea of this approach is the following: syntactic constructions which even in free word order languages may be parsed locally (certain adjectival or prepositional phrases etc.) should be parsed first in order to avoid their mutual combinations, which are unnecessary from the point of view of grammar checking. This means that the grammar should be divided into certain layers of rules (not necessarily disjoint), which will be applied one after the other (in principle they may even be applied in cycles, but this option is not used in our implementation).</Paragraph>
    <Paragraph position="30"> In the pivot version of our system we use the following layers: 1st layer: a metarule for processing titles and abbreviations preceding names  the right hand side sentential border The application of layers may slow down the processing of short sentences (it has a fixed cost of opening the description file and consulting it during the parsing process), therefore it is applied only to sentences longer than a certain threshold (currently 15 words).</Paragraph>
    <Paragraph position="31"> Another important point is that parsing in layers provides only positive information (i.e. it is able to sort out sentences which are certainly correct, but the failure of parsing in layers does not necessarily mean that the sentence is incorrect). The same approach therefore cannot be used for error localization and identification, although the cases when parsing in layers fails on a correct sentence are quite rare.</Paragraph>
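    <Paragraph>The way the layered parse is used only as a positive filter for long sentences might look roughly as follows. The two parser callables are hypothetical placeholders, and falling back to the ordinary parse after a failed layered parse is an assumption of this sketch rather than a documented behaviour of the system.

<![CDATA[
```python
LAYER_THRESHOLD = 15  # words; the value quoted in the text


def accept_quickly(words, parse_in_layers, parse_flat):
    """Return True if the sentence is certainly correct, None if undecided.

    `parse_in_layers` and `parse_flat` are hypothetical callables returning
    True when they find a positive projective tree for `words`.
    """
    if len(words) > LAYER_THRESHOLD:
        if parse_in_layers(words):
            return True
        # A failed layered parse carries no negative information,
        # so the ordinary parse is still consulted (assumption).
    return True if parse_flat(words) else None
```
]]>
    The function never returns False, which mirrors the point that parsing in layers can confirm correctness but cannot establish an error.</Paragraph>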
    <Paragraph position="32"> The implementation The implementation of our system was to a large extent influenced by the demand for efficiency. For this reason we had to abandon even feature structures as the form of the representation of lexical data. Our data structure is a set of attribute-value pairs with the data about valency frames of particular words as the only complex values (embedded attribute-value pairs).</Paragraph>
    <Paragraph position="33"> An example of the representation of the Czech word form &amp;quot;informoval&amp;quot; (\[he\] informed) follows:
      informoval
      lexf: informovat
      wcl: vb
      syntcl: v
      v_cl: full
      refl: 0
      aspect: prf
      frameset: ( \[ actant: act case: nom prep: 0 \]
                  \[ actant: adr case: acc prep: 0 \]
                  \[ actant: pat case: clause prep: \] )
      neg: no
      v_form: pastp
      gender: ? inan, anim
      num: sg</Paragraph>
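    <Paragraph>A flat set of attribute-value pairs with valency frames as the only embedded values can be mirrored by an ordinary dictionary. This is a hedged illustration, not the system's actual storage format: only attributes that are clearly legible in the example above are included, and the helper functions are invented for the sketch.

<![CDATA[
```python
# Flat attribute-value pairs; only `frameset` holds embedded pairs.
informoval = {
    "lexf": "informovat",
    "wcl": "vb",
    "aspect": "prf",
    "neg": "no",
    "num": "sg",
    "frameset": [
        {"actant": "act", "case": "nom", "prep": "0"},
        {"actant": "adr", "case": "acc", "prep": "0"},
        {"actant": "pat", "case": "clause"},
    ],
}


def complex_attributes(entry):
    """Names of attributes whose value is not a plain string."""
    return [name for name, value in entry.items() if not isinstance(value, str)]


def cases_of(entry):
    """The cases required by the valency frames, in order."""
    return [frame["case"] for frame in entry["frameset"]]
```
]]>
    </Paragraph>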
    <Paragraph position="35"> The grammar of the system is composed of metarules representing whole sets of rules of the background formalism called Robust Free Order Dependency Grammar (RFODG). The limited space of this paper does not allow us to present the full description of RFODG here. The definition may be found for example in \[TR96\].</Paragraph>
    <Paragraph position="36"> The RFODG provides a formal base for the description of nonprojective and incorrect syntactic constructions. It introduces three measures by means of which it is possible to classify the degree of nonprojectivity and incorrectness of a particular sentence. In this paper we would like to stress one important feature of this formalism, namely the classification of the set of symbols used by RFODG into three pairs of types: a) terminals and nonterminals b) deletable and nondeletable symbols c) positive and negative symbols The sets under a) have the usual meaning, the sets under b) serve for the classification of syntactic inconsistencies and the sets under c) serve for their localisation. The union of terminals and nonterminals is exactly the set of all symbols used by RFODG. The same holds for the union of deletable and nondeletable symbols and also for the union of positive and negative symbols. In other words, each symbol used by RFODG belongs to exactly one set from each pair of sets under a), b) and c).</Paragraph>
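    <Paragraph>The threefold classification can be pictured as three independent yes/no properties attached to every symbol, which guarantees that a symbol falls into exactly one set of each pair. The class below is only a sketch of that idea; the field names and the helper function are assumptions, not part of the RFODG definition.

<![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str
    terminal: bool   # a) terminal (True) or nonterminal (False)
    deletable: bool  # b) deletable (True) or nondeletable (False)
    negative: bool   # c) negative (True) or positive (False)


def has_negative_symbol(tree_symbols):
    """True if any symbol used in a tree is negative."""
    return any(symbol.negative for symbol in tree_symbols)
```
]]>
    </Paragraph>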
    <Paragraph position="37"> This classification therefore allows us to handle rules describing both correct and erroneous syntactic constructions in a uniform way and to use a single grammar for the description of both types of syntactic constructions. Whenever a metarule describing a syntactic inconsistency is used during the parsing process, a negative symbol is inserted into the tree created according to the grammar.</Paragraph>
    <Paragraph position="38"> The metarules express a procedural description of the process of checking the applicability of a given metarule to a particular pair of input items A and B (A stands to the left of B in the input). If a particular rule may be applied to items A and B, a new item X is created. It is possible to change values of the resulting item X by means of an assignment operator :=. The constraint relaxation technique is implemented in the form of so-called &amp;quot;soft constraints&amp;quot;: the constraints with an operator ? accompanied by an error marker may be relaxed in phases b) and c) (&amp;quot;hard constraints&amp;quot; with an operator = may never be relaxed).</Paragraph>
    <Paragraph position="39"> The error anticipating rules are marked by a keyword NEGATIVE at the beginning of the rule and are applied only in phases b) and c). The keyword PROJECTIVE indicates that the rule may be applied only in a projective way.</Paragraph>
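    <Paragraph>The difference between hard and soft constraints can be sketched as follows. The sketch reduces every constraint to an equality test on one attribute of the items A and B, which is a simplification made for the example; the function name, the argument layout and the marker NUM_DISAGR are assumptions.

<![CDATA[
```python
def apply_constraints(a, b, hard, soft, relaxed):
    """Check a pair of items against hard and soft agreement constraints.

    `hard` is a list of attribute names whose values must match.
    `soft` is a list of (attribute, error_marker) pairs.
    Returns None if the rule is not applicable, otherwise the list of
    error markers collected from relaxed soft constraints.
    """
    for attr in hard:
        if a.get(attr) != b.get(attr):
            return None  # hard constraints are never relaxed
    markers = []
    for attr, marker in soft:
        if a.get(attr) != b.get(attr):
            if not relaxed:
                return None  # phase a): a violated soft constraint blocks the rule
            markers.append(marker)
    return markers
```
]]>
    An empty list means the rule applied without any inconsistency, while a non-empty list corresponds to a result that carries error markers for the evaluation phase.</Paragraph>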
    <Paragraph position="40"> An example of a (simplified) metarule describing the attachment of a nominal modifier in genitive case from the right hand side of the noun:  The interpretation of the grammar is performed by means of a slightly modified CYK algorithm (a description of this algorithm may be found for example in \[S97\]). The grammar works with unambiguous input data (ambiguous words are represented as sets of unambiguous items). All partial parses from the first phase are used in phases b) and c). For the purpose of testing and debugging the system we use full parsing even in the first phase.</Paragraph>
    <Paragraph position="41"> Speeding up the performance It is often the case that with nondeterministic parsers the author of the grammar has to prevent an unnecessary multiplication of results by means of &amp;quot;tricks&amp;quot; which are not supported by the linguistic theory -- let us take for example the problem of the subject -- predicate -- object construction. If we do not put any additional restriction on the order of application of rules, then the rule filling the subcategorization slots for subject and object may be applied in two ways, either first filling the slot for the subject and then the slot for the object or vice versa. Both ways create the same syntactic structure.</Paragraph>
    <Paragraph position="42"> In such a case it is necessary to apply some additional constraints in the grammar -- for example a restriction on the order of subcategorization (an item to the left of a verb should be processed first). This approach makes the grammar more complicated than necessary and it may also influence the quality of results (an error on the left hand side of a verb may also prevent the attachment of items from the right hand side of the verb).</Paragraph>
    <Paragraph position="43"> The interpreter of our grammar solves these situations itself. Every time a new item is created, the interpreter checks whether an item with the same structure and coverage already exists. If it does, the new item is deleted.</Paragraph>
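    <Paragraph>The duplicate check can be sketched with a set keyed by structure and coverage. This is a minimal illustration under the assumption that the structure of an item can be given in a canonical, hashable form; the class name Chart and its methods are invented for the example.

<![CDATA[
```python
class Chart:
    """Stores items keyed by (structure, coverage); duplicates are dropped."""

    def __init__(self):
        self._seen = set()
        self.items = []

    def add(self, structure, coverage):
        """Add an item; return False if an identical one already exists."""
        key = (structure, frozenset(coverage))
        if key in self._seen:
            return False
        self._seen.add(key)
        self.items.append(key)
        return True
```
]]>
    With such a check, filling the subject slot before the object slot or the other way round produces the item only once, so the grammar does not need an ordering restriction for this purpose.</Paragraph>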
    <Paragraph position="44"> This property of the interpreter is used together with other kinds of pruning techniques in all phases of grammar checking. In addition, there are also some other techniques used especially in phases b) and c). The work with unambiguous input symbols allows fast parsing in phase a) (CYK is polynomial with respect to the length of the input), but creates some problems in the context of constraint relaxations used in subsequent phases. For example, a typical error in &amp;quot;free word order&amp;quot; languages is an error in agreement. Let us suppose that we have the following three input words (the actual lexical value of these words may be neglected): Preposition (accusative or locative); Adjective (animate or inanimate gender, genitive or accusative sing.); Noun (animate, genitive or accusative sing.). These words represent 2 + 4 + 2 = 8 unambiguous items. If we try to create a prepositional phrase without constraint relaxation, we get one resulting item PP(animate, accusative sing.). On the other hand, after the relaxation of constraints 16 items are created. One of them does not contain any syntactic inconsistency, the remaining 15 have one or two syntactic inconsistencies. In a nondeterministic parser all 16 variants are used in the subsequent parsing. This causes a combinatorial explosion of mostly incorrect results.</Paragraph>
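    <Paragraph>The counts in this example can be reproduced by enumerating the combinations. The sketch assumes that gender agreement and case agreement are counted as one inconsistency each; this way of counting is an assumption chosen because it matches the figures given above.

<![CDATA[
```python
from itertools import product

prepositions = [{"case": c} for c in ("acc", "loc")]
adjectives = [{"gender": g, "case": c}
              for g in ("anim", "inan") for c in ("gen", "acc")]
nouns = [{"gender": "anim", "case": c} for c in ("gen", "acc")]


def inconsistencies(prep, adj, noun):
    """Count violated agreement constraints in one candidate phrase."""
    count = 0
    if adj["gender"] != noun["gender"]:
        count += 1
    if not (prep["case"] == adj["case"] == noun["case"]):
        count += 1
    return count


candidates = list(product(prepositions, adjectives, nouns))
consistent = [c for c in candidates if inconsistencies(*c) == 0]
```
]]>
    </Paragraph>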
    <Paragraph position="45"> There are two ways to solve this problem.</Paragraph>
    <Paragraph position="46"> The first possible solution is to relax the constraints in a certain order (to apply a hierarchy on constraints). We have chosen the other possible way, which prefers the subtrees with a minimal number of errors. Every time a new branch or subtree is created, it is compared with the other branches or subtrees with the same structure and coverage and if it contains more errors than those already existing, it is not parsed further.</Paragraph>
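    <Paragraph>The preference for subtrees with a minimal number of errors can be sketched as a table of the best error count seen so far for each structure and coverage. The class and method names are illustrative assumptions, and the sketch does not retract a worse subtree that was already expanded before a better one appeared.

<![CDATA[
```python
class ErrorPruningChart:
    """Keeps, per (structure, coverage), the smallest error count seen."""

    def __init__(self):
        self.best = {}  # (structure, coverage) maps to smallest error count

    def should_expand(self, structure, coverage, errors):
        """Return False if an equivalent subtree with fewer errors exists."""
        key = (structure, frozenset(coverage))
        known = self.best.get(key)
        if known is not None and errors > known:
            return False  # a better variant exists; do not parse this one further
        self.best[key] = errors
        return True
```
]]>
    </Paragraph>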
    <Paragraph position="47"> This technique substantially speeds up the processing of rules with relaxed constraints, but it also has one rather unpleasant side effect: the syntactic inconsistencies may be suppressed and appear later in a different location. This makes the task of the evaluating part of our system a bit more difficult, but nevertheless the gain in efficiency, not accompanied by any loss of recall, justifies the use of this technique.</Paragraph>
    <Paragraph position="48"> Conclusion The main purpose of the demo of our system is to demonstrate a method of grammar based grammar checking of a &amp;quot;free word order&amp;quot; language. The system is far from being ready for commercial exploitation - the main obstacle is the size of the syntactic dictionary used. Grammar based methods require complex syntactic information about words. Building a syntactic dictionary of about 150 000 items is a task which exceeds our current capacities with respect both to manpower and funds. It would be interesting to continue the work on our system towards the development of statistical methods for this task.</Paragraph>
    <Paragraph position="49">  i The work was supported by the following research grants: GAČR 201/96/0195, RSS/HESP No. 85/1995 and JEP PECO 2824 ,,Language Technologies for</Paragraph>
  </Section>
</Paper>