<?xml version="1.0" standalone="yes"?> <Paper uid="C92-3155"> <Title>JDII: Parsing Italian with a Robust Constraint Grammar ANDREA BOLIOLI LUCA DINI GIOVANNI MALNATI Dima Logic</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We are going to present a system by which syntactic errors of Italian can be recognized and signalled. This system is called JDII (James Dean the Second) and has been developed in Turin by DIMA LOGIC.</Paragraph> <Paragraph position="1"> JDII can accept wrong Italian sentences, even long and complex ones, and it returns a comment on the type of the detected mistake(s), if any. A corpus of 400 grammatical sentences (from a minimum of 4 words to a maximum of 40 words) and possible ungrammatical variants has been used to test the system. The sentences contain the grammatical phenomena treated by the grammar (see below). The system is implemented in PROLOG and C. It runs in UNIX and DOS environments. The modularity of the system is guaranteed by the division of linguistic and software knowledge into two independent modules. The linguistic module is roughly made up of a morphological component and a syntactic one, while the computational framework (DIMACheck, cf. chapter 3) is mainly based on a parser and on a theorem prover. Software resources are shared with another syntax checker, aiming at an analogous system for English.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Processing errors 2.1 Error interpretation </SectionTitle> <Paragraph position="0"> An error is a violation of some constraint posed by linguistic rules on the language. The violation of these constraints causes, according to the standard classification, spelling, morphological, syntactic and semantic errors. 
This classification, which is still useful in defining the nature of linguistic violations from an informal point of view, poses some problems in an automatic treatment of errors. The fact is that an error can be properly classified only on the basis of the writer's intentions. For example, in a sentence like 1) we could assume: - a semantic error, since people generally do not write letters to bridges; - a syntactic error, since locative complements with words like ponte are realized by different prepositions (sotto, su); - a spelling error, if the writer had not wanted to write ponte, but conte (&quot;earl&quot;).</Paragraph> <Paragraph position="1"> The fact is that people correcting texts usually have a pragmatic context which allows the disambiguation of mistakes. By &quot;context&quot; we mean all the background information concerning the writer, the external conditions under which the text was written and, above all, the kind of facts and things which the text deals with. (Consider, e.g., the different interpretations of a sentence like 1 if it were found in the answer of a fool to a psychological test, in a novel about ancient chivalry, or in some essay titled 'Which is the best place for writing a letter?'). Since we are not able to handle contextual knowledge for error disambiguation, we decided, for our purposes, to drop the classification, and to make use of the notion of error interpretation: &quot;Given a wrong sentence, an error interpretation is any hypothesis of substitution that makes the sentence correct&quot;.</Paragraph> <Paragraph position="2"> In correction performed by humans, the number of error interpretations is constrained by pragmatic and contextual data. In automatic detection this number is linked to the capacity of admitting constraint violations on the rules. In the worst case (i.e. 
when all the constraints, including lexical constraints, are allowed to be violated) the set of error interpretations is infinite.</Paragraph> <Paragraph position="3"> 2.2 Limits of the system To get out of this impasse we chose to constrain the possible violations allowed by our grammar to syntactic ones only. We could also deal with spelling corrections by including a small set of incorrect variants for some words. This would prove meaningful only by tuning the system on the specific linguistic background of the user (in fact different misspellings are made by people speaking different languages and with different degrees of education). As for semantic errors, we are not able to deal with contexts, so they are simply ignored and sentences like 1 are considered correct (and in fact they are, see e.g. a context like Giovanni e' impazzito: dice di aver scritto una lettera a un ponte). (English translations will be word by word. In the last chapter we do not provide any translation, since the ill-formed phrases and sentences exemplifying the coverage are language specific.) ACTES DE COLING-92, NANTES, 23-28 AOUT 1992 1003 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992</Paragraph> <Paragraph position="4"> To sum up, in our system the error checking works as follows: 1) If a word is found the root of which is not present in the vocabulary, or which cannot be properly inflected, the message &quot;unknown word&quot; is reported.</Paragraph> <Paragraph position="5"> 2) If a word is found which is included only in the set of the incorrect variants, the morphology will return to the grammar the lexical unit properly inflected, but containing a violated constraint.</Paragraph> <Paragraph position="6"> 3) If all the words in a sentence have been analyzed by the morphological module, the syntactic processing starts.</Paragraph> <Paragraph position="7"> 4) As we will see, the syntactic parsing will produce either a comment on the grammaticality of the sentence or a general refusal of it ( 
i.e. a generic error such as &quot;unknown grammatical structure&quot;). Ha visto un crane -> Unknown word (crane) A visto un cane -> Spelling error (a missing h) Ha visto una cane -> Syntactic error (agreement) Ha visto cane uno -> Unknown structure 2.3 Principles of error diagnosis In the previous paragraph we stated that our attention will be drawn only to syntactic violations. However, this limitation does not solve the problem of multiple error interpretations at all, since, even from a purely syntactic point of view, a sentence can be wrong in different ways. Let us consider a sentence like: 2) * Il ragazzo e' stata affettuoso.</Paragraph> <Paragraph position="8"> the(masc) boy(masc) has been(fem) lovely(masc) We have at least two hypotheses of correction:</Paragraph> <Paragraph position="10"> If we take into account psychological plausibility, we should signal only the error on the word stata.</Paragraph> <Paragraph position="11"> This and other data support a principle of error correction which states that &quot;given a set of possible error interpretations, the right error interpretation is the one with the smallest number of violated constraints&quot;. This principle has been implemented as a built-in preference mechanism over the set of possible final interpretations, while the set itself is restricted by the power of the grammar. The restriction is obtained by implementing peculiar linguistic statements that i) impose linguistically plausible criteria rather than statistical ones; ii) prevent the explosion of all the possible error interpretations from making the system completely inefficient.</Paragraph> <Paragraph position="12"> An application of the above criteria is provided by the sentence: 3) * Il ragazzo che e' stata picchiata dai fascisti sta male. the(masc) boy(masc) who has been(fem) hit(fem) by fascists is suffering where we can hypothesize two agreement violations, either in the subject NP or in the VP of the relative clause. 
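The minimality principle quoted above ("the right error interpretation is the one with the smallest number of violated constraints") can be sketched as a simple preference function. This is an illustrative Python sketch, not the system's Prolog implementation; the candidate interpretations for sentence 2 and their violation lists are invented for the example.

```python
# Sketch of the paper's minimality principle: among the candidate error
# interpretations, prefer the one(s) with the fewest violated constraints.
# Data structures and names are hypothetical, not the authors' code.

def best_interpretation(interpretations):
    """Return the interpretation(s) with the fewest violated constraints."""
    fewest = min(len(i["violations"]) for i in interpretations)
    return [i for i in interpretations if len(i["violations"]) == fewest]

# Sentence 2: "Il ragazzo e' stata affettuoso" -- two correction hypotheses:
candidates = [
    # fix 'stata' -> 'stato': a single agreement violation
    {"fix": "stata -> stato", "violations": ["gender(aux-participle)"]},
    # fix 'ragazzo' -> 'ragazza' and 'affettuoso' -> 'affettuosa': two violations
    {"fix": "ragazzo -> ragazza, affettuoso -> affettuosa",
     "violations": ["gender(subject)", "gender(adjective)"]},
]
print(best_interpretation(candidates)[0]["fix"])  # stata -> stato
```

As in the system, the preference is applied over the set of final interpretations, while the grammar itself restricts how large that set can grow.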
In this case our system allows us to state that the agreement features of the head win over those of the modifiers, so that a gender agreement error is signalled in the relative clause.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 DIMACheck framework </SectionTitle> <Paragraph position="0"> DIMACheck is a general-purpose unification-based natural language parser that, while retaining computational effectiveness and linguistic expressive power, stresses the concepts of monotonicity, declarativity and robustness. These goals are achieved, on the one hand, by introducing several linguistic devices, like weak constraints, user-defined operators and functions, and on the other hand by enforcing strict data-type checking and implementing a time-independent evaluation function (the interpreter of the rules) that guarantees a high expressive power in a totally declarative environment.</Paragraph> <Paragraph position="1"> In order to maintain readability and ease of use of grammars, only two kinds of rules have been introduced, namely structure building rules and lexical rules.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 User Defined Operators </SectionTitle> <Paragraph position="0"> We think that a re-write system based only on equality constraints is inadequate to express linguistic knowledge, and the introduction of inequality constraints does not always solve the problem. In order to augment the linguistic expressive power without incurring redundancy and computational ineffectiveness we introduced the following tool. Formally, a User Defined Operator (UDO) is a function of the form Boolean <- DataType1 x DataType2 i.e. a UDO is a function mapping pairs of values belonging respectively to DataType1 and DataType2 onto boolean values. 
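As a rough illustration (in Python rather than the system's Prolog), a UDO declared by listing exactly the value pairs that evaluate to true might look like the sketch below; the 'agreem' operator and its value pairs are invented for the example.

```python
# Sketch of a User Defined Operator (UDO): a boolean function over pairs,
# declared by enumerating the pairs that map onto true -- every other pair
# maps onto false. All names and value pairs here are hypothetical.

def make_udo(true_pairs):
    """Build a UDO from the explicit list of pairs that evaluate to true."""
    table = set(true_pairs)
    return lambda left, right: (left, right) in table

# e.g. an 'agreem' operator relating auxiliary tense and participle form
agreem = make_udo([("pres", "past_part"), ("past", "past_part")])

print(agreem("pres", "past_part"))  # True  (listed pair)
print(agreem("fut", "past_part"))   # False (unlisted, defaults to false)
```

The closed truth table keeps the operator fully declarative: nothing is computed beyond membership in the enumerated set.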
The composition rule (the rule that associates the relevant boolean value to each pair) is given explicitly, by listing all the value pairs that map onto true (all the other ones are mapped automatically onto false).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 User Defined Functions </SectionTitle> <Paragraph position="0"> User Defined Functions (UDFs) have been introduced to stress the locality of computation. The basic idea is that each value inside a constraint (be it an equality constraint or not) may, in principle, be replaced by a function that computes it on the basis of some given parameters. So, whenever one must compute the value of an attribute which is known to depend on, and only on, a finite set of other features, instead of writing lots of rules which embed (and hide) this piece of knowledge into a larger description, it is possible to declare a UDF that manages the computation, thus reducing the number of rules from many to one.</Paragraph> <Paragraph position="1"> Formally, a UDF is a function that maps values from N data types into values of a given data type, in symbols:</Paragraph> <Paragraph position="2"> DataType <- DataType1 x ... x DataTypeN </Paragraph> <Paragraph position="3"> UDFs are declared explicitly, more or less like UDOs: for each n-tuple of relevant values the result value is stated. UDFs need not be deterministic: a given tuple of input parameters may map into more than one target value.</Paragraph> <Paragraph position="4"> 3.3 Constraints and constraint bundles As stated above, parsing is, in our view, the application of a finite set of constraints over an input list of words. We may therefore distinguish between structural constraints (the ones that deal with the order, the occurrences, etc. 
of parse trees) and feature constraints, which put restrictions on the value of a given variable (the value of an attribute inside the parse tree). The former kind is described in paragraph 3.4, while the latter is described here.</Paragraph> <Paragraph position="5"> A feature constraint, or simply a constraint, is a triple of the form: < Operator, AttributeName, ValueExpression > Operator is either the name of a system-defined operator (e.g. '=') or the name of a user-defined one (e.g. 'is a').</Paragraph> <Paragraph position="6"> AttributeName is a legal name for an attribute, the type of which matches the type foreseen by the operator for its left-hand side.</Paragraph> <Paragraph position="7"> ValueExpression may be - an atomic value - a single variable - a disjunction of atomic values - a user-defined function In our formalism a constraint is stated in infix form (e.g. 'tense = pres' or 'tense agreem T' or 'tense = compute_tense(M,T)').</Paragraph> <Paragraph position="8"> When a constraint is applied to an object it may evaluate either to true or to false: we can therefore say that a constraint is a boolean function. The way in which the result of the application of the constraint is handled by the system leads to the distinction between strong and weak constraints.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3.2 Strong and weak constraints </SectionTitle> <Paragraph position="0"> A strong constraint is a constraint that, if it fails, causes a strong failure, i.e. the object to which it applied is rejected. When a strong constraint succeeds nothing happens, apart from some possible variable binding. Strong constraints are used mainly to prevent useless overgeneration over irrelevant paths. (Usually, but not necessarily, constraints that involve the major syntactic categories are strong). 
They are also used to propagate values from lower nodes to upper ones.</Paragraph> <Paragraph position="1"> A weak constraint is a constraint that behaves like a strong one if it evaluates to true, but which otherwise produces a soft failure. A soft failure simply consists in recording in the object the information that a weak constraint has failed, without rejecting it. In order to maintain a trace of the failed constraints, they are annotated by the user with a number which is used at the end of parsing to generate a proper error message indicating which constraint failed and where. Apart from annotation, the syntax of weak constraints is the same as that of the strong ones, and the same restriction applies.</Paragraph> <Paragraph position="2"> A constraint bundle (CB) is a list of conjoined constraints (both weak and strong). Notationally, a CB is delimited by braces, single constraints are separated by commas; a slash ('/') splits the list in two parts: the strong one and the weak one. If the weak part is empty, the slash is omitted. Here are some examples of legal CBs: { cat = np / (1) nb = N, (2) gd = G } { cat = v_aux, vtype = V / (8) cat = v } { cat = pp } During parsing, the parse trees which are built are labelled with a list of pending constraints - i.e. of constraints that have not yet proved to be true or false - and a score - i.e. an indication of how many weak constraints associated to the tree have already proved to be false. Intuitively, the lower the score, the better the object. The constraint solver applies to the list of pending constraints of each final tree, trying to minimize the number of failed weak constraints. The constraint solver selects as final result the tree with the smallest associated error. It is worth noting that this is a global strategy, not a local one. All parse trees, independently of their score, are carried on up to the end of parsing, and only then is the selection made. There are two reasons for this choice. 
The first one is theoretical: it is not possible to assume that a locally well-formed subtree will lead to a better global tree than that produced by a locally ill-formed one. The second reason is pragmatic: since constraints are solved only when the variables they involve get instantiated, partial trees tend to contain few or no failed weak constraints but long lists of constraints still to be evaluated. Applying the constraint solver in the middle of parsing would be a waste of time, and making the choice disregarding the pending constraints is definitely wrong.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Grammatical rules </SectionTitle> <Paragraph position="0"> The system operates on the input data driven by rules. Rules mix together structure and feature constraints in order to produce a quasi-well-formed (sub)tree (the 'quasi' is there because the subtree may contain failed weak constraints: it would not be proper to call it a well-formed one). Rules are handled by a parser in order to produce all possible results.</Paragraph> <Paragraph position="1"> Currently the system uses a bottom-up, left-to-right algorithm. 
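The strong/weak distinction and the global selection strategy described above can be sketched as follows. This is an illustrative Python sketch under the paper's description, not the actual implementation; the NP object, the constraint identifiers, and the bundle contents are invented.

```python
# Sketch of strong vs. weak constraint application: a failed strong
# constraint rejects the object outright, while a failed weak constraint is
# only recorded (by its user-assigned number) and counted into the tree's
# score; the best tree is selected globally at the end of parsing.
# All names and data are hypothetical.

class Rejected(Exception):
    """Strong failure: the object the constraint applied to is discarded."""

def apply_bundle(obj, strong, weak):
    """Apply a constraint bundle; return the ids of failed weak constraints."""
    for c in strong:
        if not c["test"](obj):
            raise Rejected(c["id"])      # strong failure
    return [c["id"] for c in weak if not c["test"](obj)]   # soft failures

def select_final(trees):
    """Global strategy: keep every tree to the end, then pick the lowest score."""
    return min(trees, key=lambda t: len(t["failed_weak"]))

np = {"cat": "np", "nb": "sing", "gd": "masc"}
strong = [{"id": 0, "test": lambda o: o["cat"] == "np"}]
weak = [{"id": 1, "test": lambda o: o["nb"] == "sing"},
        {"id": 2, "test": lambda o: o["gd"] == "fem"}]
failed = apply_bundle(np, strong, weak)
print(failed)  # [2] -- weak constraint 2 failed; the tree survives with score 1
```

Deferring `select_final` to the end of parsing mirrors the paper's point that partial trees carry mostly pending, not yet failed, constraints.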
However, the result is totally independent of the parsing strategy. Structure building rules (sb-rules) are augmented rewrite rules used to describe the structure of quasi-well-formed subtrees. An sb-rule has the following form:</Paragraph> <Paragraph position="3"> &quot;< SubTreeExpressionN >.</Paragraph> <Paragraph position="4"> where: < RuleName > is a legal unique identifier, < TopConstraintBundle > is a legal CB, < SubTreeExpression1..N > are one of the following: - a CB, - a SubTree description (of any depth) - a regular expression over CBs. The error associated to the newly built tree is the sum of the errors contained in all its subtrees plus the errors originated by the application of all the constraints of the current rule, both in feature and in structure. Here is an example of sb-rule:</Paragraph> <Paragraph position="6"> Lexical rules are the interface between the external representations of words (i.e. strings) and the internal ones (i.e. CBs). A lexical rule has the general form: <RuleName> = <ConstraintBundle>.</Paragraph> <Paragraph position="7"> where < RuleName > is a legal unique identifier and < ConstraintBundle > is a legal CB.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3.5 Morphological Rules </SectionTitle> <Paragraph position="0"> In the morphological rules, each root is associated with a morphological class and with a lexical rule. 
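The interplay of lexical and morphological rules, as this section describes it (a lexical rule pairs a root with a CB, and the word's final CB unions it with the CB contributed by the ending), can be sketched as below. The lexicon entries, the root/ending split, and the feature names are all invented for the example.

```python
# Sketch of morphological analysis as described in the paper: the word's CB
# is the union of the CB from the root's lexical rule and the CB associated
# with the ending. Lexicon and endings here are hypothetical toy data.

LEXICON = {"can": {"cat": "n", "gd": "masc"}}         # lexical rules: root -> CB
ENDINGS = {"e": {"nb": "sing"}, "i": {"nb": "plur"}}  # morphological class CBs

def analyse(root, ending):
    """Return the word's CB: union of root CB and ending CB."""
    if root not in LEXICON:
        raise KeyError("unknown word")   # cf. step 1 of the error checking
    cb = dict(LEXICON[root])             # copy the root's CB
    cb.update(ENDINGS[ending])           # union in the ending's CB
    return cb

print(analyse("can", "e"))  # {'cat': 'n', 'gd': 'masc', 'nb': 'sing'}
```

A root absent from the lexicon triggers the "unknown word" case, matching the error-checking steps of section 2.2.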
Before syntactic parsing starts, every word in a sentence has to be processed by the morphological module.</Paragraph> <Paragraph position="1"> The resulting CB is the union of the CB associated with the ending of the word and the CB defined in the lexical rule associated with the root.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Linguistics </SectionTitle> <Paragraph position="0"> JDII does not make strict use of any linguistic theory, even if the guidelines of the implementation are, in a large number of cases, taken from theoretically well-founded works (such as Burzio (1986) for the verbal system, Gazdar (1981) for comparative structures, Cinque (1988) for relative clauses and so on). In this respect we fully agree with Dietmar Roesner when he says that &quot;Theorists tend to restrict their approaches to the very techniques available within their theories. In practical NLP systems it may be fruitful to freely combine elements from distinct 'linguistic schools'.&quot; Structurally, a binary recursive X-bar schema is followed. The reason why a binary grammar is used is that we lack a dedicated kind of rule for performing structural checking. As a consequence, the computation of constraints depending either on the occurrence or on the linear precedence of optional constituents would prove very difficult.</Paragraph> <Paragraph position="1"> Indeed, since UDOs and UDFs are not allowed to contain optional parameters, a non-binary grammar could handle strings like X[agreem = oper(V1,V2)] -> Y *B ^ A[s_a = V1] *B ^ A[s_a = V2] (where 'agreem' depends on the values of 'V1' and 'V2' and on the presence of 'A') only by exploding all the possible cases. 
This huge and inefficient explosion is avoided in a system of rules where 'X' is right-recursive and the 'agreem' value is updated by a UDF when every new projection is built:</Paragraph> <Paragraph position="3"> As for ambiguities, we are not interested in producing all possible interpretations of a sentence. Indeed, the production of all semantically plausible parses would be out of the scope of a syntax checker, which is supposed to handle only ambiguities relevant for error retrieval.</Paragraph> <Paragraph position="4"> 4.1 The coverage The coverage of JDII includes: - main sentences, both affirmative and negative - argumental clauses playing the role of either subject or object - hypothetical clauses - comparative and consecutive clauses - prepositional clauses - relative clauses - participial clauses - gerundive clauses As for the other constituents, we have a complete treatment of each possible phrasal projection (APs, NPs, VPs, ...). Particular attention has been paid to the following phenomena: - quantification (e.g. * tutte ragazze, * indite di ragazze, * nessuno delle ragazze, ...) - determination (e.g. * la Maria, * i cromi, * della ragazza verra', ...) - coordination (e.g. * la ragazza bella e sensuali, * la ragazza e la sua amica e' venuta, ...) - movements wh-movement (e.g. * la ragazza che Andrea ama Maria, * la ragazza che dicono che e' stato amato, * la ragazza che dicono che dovrebbe essere stato amato, ...) clitic climbing (e.g. * li ha amato, * li deve aver amato, * deve averli amato, ...) - dislocations topicalization in coordinate structures (e.g. * e' venuta Maria ma Moana, * Maria e' venuta ma Moana vs. non e' venuta Maria ma Moana, non Maria e' venuta ma Moana, ...) comparative structures (e.g. * ha dato tanti baci ieri a Maria che a Moana vs. ha dato piu' baci ieri a Maria che a Moana, ...) 
In particular, the last four phenomena worked as a test bench in order to check the power and the efficiency of the formalism w.r.t. hard tasks, such as unbounded distance structure checking, long distance agreement, discontinuous patterns and so on. On the contrary, the formalism proved to be inadequate to tackle context-sensitive phenomena such as ellipsis in coordination and comparison, when more than one constituent is bound by the deleted element. In these cases a principle applies which imposes a context-sensitive correspondence (X Y Z W ... X Y Z W ...) between the constituents in the second conjunct / comparative clause and the ones in the main clause: 4) Da' piu' baci Maria a Ugo che schiaffi Eva a Luca</Paragraph> </Section> </Paper>