File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/90/c90-1015_metho.xml

Size: 15,048 bytes

Last Modified: 2025-10-06 14:12:25

<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-1015">
  <Title>project note with software demonstration A parser without a dictionary as a tool for research into French syntax</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Jacques VERGNE
LIMSI 29 rue Titon
F-75011 Paris France
</SectionTitle>
    <Paragraph position="0"> A natural language is not a formal language That is why the syntax of a natural language ~ not be described by the rules of the syntax of a formal language. And that is why a natural language cannot be parsed the same way as a formal language. To parse a natural language, it would be necessary to know its syntax, Thus, major difficulties in parsing a natural language are not algorithmic but linguistic.</Paragraph>
    <Paragraph position="1"> Here, I assume that natural language has a very high formal redundancy. In other words, morphosyntaclic clues are numerous enough to make deductions upon categories and relations in many converging ways. Research into French syntax then consists in discovering these clues: morphology of words, agreements, relative positions of elements, segmentation in NPs, or in larger segments, typology of relations, structure of the relations net, formal and quantitative constraints upon this net.</Paragraph>
    <Paragraph position="2"> The parser presented here is an experimental device used to test and confirm the linguistic hypotheses upon large text corpora. The formal redundancy is high enough to parse without a dictionary with very few data (300 ending rules and a lexicon of grammatical words of 4 kB) and a NP stack grammar. Linguistic hypotheses A sentence is considered as a stack of NPs in our Western grammar traditions, action is the main interest and we place verbs in a central position. But, in scientific texts, NPs are the main carrier of meaning, as they name concepts. They arc also used as terms for indexation. From the statistical point of view, too, nouns are by far the most numerous category.</Paragraph>
    <Paragraph position="3"> That is the reason why the basis for this parser is that the sentence is considered as a stack of NPs, with relations of determination between them. An initial NP is laid (it is often the focus of the sentence), and the following phrases precise and determine it. Here determination is considered as adding more data. These different NPs are connected in a more or less narrow way by &amp;quot;links&amp;quot;: * &amp;quot;non verbal links&amp;quot;: prepositions, co-ordinations, subordinating conjunctions, relative pronouns, * verbs in all their forms: conjugated, present and past participles, infinitives.</Paragraph>
    <Paragraph position="4"> Thus the processing of the verb is unified in such a manner that every verb, conjugated or not, of a main or subordinated clause, acts as an adjective (is a &amp;quot;mmslatd&amp;quot; to adjective, according to Tesnigre's concept, in \[Tesni&amp;e 591), determines a NP and is a link in tile &amp;quot;chain&amp;quot; of NPs.</Paragraph>
    <Paragraph position="5"> Thereforc, the NP is considered as the basic and constitutive element of the sentence (as the cell is the basic and constitutive element of the living tissue). The inside and the outside of the NP are processed separately and differently, so we have: * a grammar of the NP, centered on a nominalized word (like the inside structure of the cell, centered on its nucleus), * and a grammar of the sentence, outside the NP, which is expressed in terms of NPs, each of them considered as a closed entity (like the structure of the tissue expressed as an architecture of cells). Let us name this grammar the &amp;quot;NP stack grammar&amp;quot; or &amp;quot;NPSG&amp;quot;. lnside the noun phrase The internal grammar of the NP is centered on a nominalized word, a noun, a nominalized adjective or a verb. It reigns over its determinant and adjectives at the root of a dependency-determination tree.</Paragraph>
    <Paragraph position="6"> The branches of the tree are made of partitive, determinant, indefinite adjective, anteposed adjective (very rare in scientific texts) before the nominalized word, and contiguous or co-ordinated postposed adjectives, after the nominalized word. The nominalized word is determined and qualified by the other words of the NP. This grammar has been presented for instance in \[Vergne 86\] and developed in \[Vergne 89\].</Paragraph>
    <Paragraph position="7"> Outside the noun phrase The NP slack grammar or NPSG, is used to: * validate the sentence structure as a stack of NPs, * confirm the function of words external to NPs, * compute some tight relations external to NPs, such as verb-object.</Paragraph>
    <Paragraph position="8"> Valuation functions are used to compute other looser relations external to NPs, such as prepositional phrase attachment or co-ordinations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Relations typology
</SectionTitle>
      <Paragraph position="0"> I propose to distinguish three types of relations: -1- relations internal to the NP (mainly the determination noun ~ adjective). These relations are computed during the internal analysis of a NP.</Paragraph>
      <Paragraph position="1"> -2- relations external to the NP, but internal to a recognition pattern, as for instance, the relations subject &lt;- verb and verb ~-- object in a SVO sentence.</Paragraph>
      <Paragraph position="2"> These relations are computed at the recognition time during the validation of the sentence at the NP level.</Paragraph>
      <Paragraph position="3"> This computation is algorithmic.</Paragraph>
      <Paragraph position="4"> Nota bene: to avoid the Chomsky's term ~, I propose the term feigner for Tesnibre's concept &amp;quot;regissant&amp;quot; (in \[Tesnihre 59\]). They have the same etymology, and both are verbal derivatives: the feigner reigns over its dependents.</Paragraph>
      <Paragraph position="5"> -3- relations external to the NP and external to the recognition patterns, as for instance, the relations &amp;quot;reigner&amp;quot; &lt;-prepositional phrase (PP): the PPs arc recognized by a pattern of the form: prel)osition-Nl ), a pattern which does not contain the reigner, which is either a verbal &amp;quot;link&amp;quot;, or a determined NP. These relations are computed by valuation t'unctions. The computation then is heuristic.</Paragraph>
      <Paragraph position="6"> The three types of relations are determination relations. They proceed from the tighter ones, inside the NP (-10, to the looser ones, outside the NP, and outskle the recognition patterns (-3-).</Paragraph>
      <Paragraph position="7"> The NP stack grammar Validating the NP stack pattern At the beginning of this step of the parsing, the sentence is represented by a pattern made of a sequence of letters, in which each letter represents either a NP, or a word external to the NP (preposition, verb, for insmm:e).</Paragraph>
      <Paragraph position="8"> It is possible to imagine this pattern as an artichoke, made of leaves around the heart.</Paragraph>
      <Paragraph position="9"> Validating the pattern then consists in plucking off parts of it progressively: * the leaves are replaced or removed m a precise order, until the heart is reached: - leaves are replaced by context sensitive rules which erase negations, adverbs, auxiliaries; - leaves are removed by context free rules which remove everything else but the heaFl; * simultaneously, each time a leaf is replaced or removed, the relations internal to this leaf are computed (relations of type -2- in the typology exposed above). That is the reason why I name this parser: &amp;quot;plucking off parser&amp;quot; or &amp;quot;POi'&amp;quot;. The final state of the pattern once plucked off must be one of the different possible hearts. They are:  simulated reclothing Relations which are internal to a leaf pattern must be transposed into the entire NP level pattern. From tile positions in a leaf pattern, we are able to compute the positions in tile entire NP level pattern.</Paragraph>
      <Paragraph position="10"> To retrieve these absolute positions, we have only to simulate tile reclothing of the heart, by using the historical account of the plucking off. After simulated reclothing (by applying the rules in the reverse order), we obtain the absolute positions in the entire NP level pattern.</Paragraph>
      <Paragraph position="11"> In a later step of the parsing, after the internal analysis of NPs, these relations will be transposed into the word level pattern.</Paragraph>
      <Paragraph position="12"> In such a way, all relations internal to a leaf pattern are computed inside the leaf pattern, then transposed by simulated reclothing into the entire NP level pattern, then at last Iransposed into the word level pattern. These two transpositions may be seen as rel'erence point changes, from a relative position in the leaf pattern (NP level), to an absolute position in the entire pattern (word level).</Paragraph>
      <Paragraph position="13"> Valuation functions: an heuristic way to choose Valuation functions have been described in derail in I Vergne 891.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Principle
</SectionTitle>
      <Paragraph position="0"> A valuation function is a clear and fine way to express an heuristic, when criteria are too fuzzy to make a choice with an algorithm (a binary tree of &amp;quot;if then else&amp;quot; for instance).</Paragraph>
      <Paragraph position="1"> The objective is to make an autoinatic choice without an algorithm. The principle is the following: * Determine the objects to valuate: the candidates distinguished from the non candidates.</Paragraph>
      <Paragraph position="2"> * Quantify vahmtion of the candidates, using criteria to discriminate thein. The criteria represents Ihe knowledge we have about the phenomenon.</Paragraph>
      <Paragraph position="3"> * The candidate who obtained the higher valuation is choosen.</Paragraph>
      <Paragraph position="4"> When to use valuation functions Valuation functions are used to compute the relations of type -2.- : , to search tk)r the &amp;quot;reigner&amp;quot; of a nominal or infinitive PP, of a present participle or of a gerundive introduced by &amp;quot;en&amp;quot;, of a past participle, of a subordinated clause; o to search for the left co-ordinated of a NP, of a nominal or infinitive PP, of an infinitive, of an attribute, of a main or subordinated clause; * to search for a referent which agrees with an anaphoric.</Paragraph>
      <Paragraph position="5"> Using valuation functions with formal criteria is based on the hypothesis of the high formal redundancy of natural language.</Paragraph>
      <Paragraph position="6"> -2- 71 Attaching prepositional phrases principle: At the beginning of the computation, a &amp;quot;power to reign&amp;quot; is affected to each word according to its category and its eventual verbal derivational origin.</Paragraph>
      <Paragraph position="7"> This computation is thought of as the simulation of the conflicts that words have between them to reign over other words. During this computation, powers are variable: cancelled, reduced, augmented or transmitted. the valuation formula: The function used to valuate a reigner-candidate is a linear function of its power and of its distance to the PP. The valuation is the ~ power of the feignercandidate, when seen from the dependent.</Paragraph>
      <Paragraph position="8"> The apparent or relative dimension of an object depends on its absolute dimension, and on the distance between the object and the observer. It is possible to think too of the physical analogy of the field, a concept which gives an explanation of the influence of an object on the objects in the space around: the universal attraction (gravitation field), for instance, is governed by a rule in 1 / distance 2 It is the purpose here to modelise the influence of a reigner ovcr its dependents. The candidate is valuated according to the following formula: \[ relative power = 2 * absolute power - 10 * distance \] error detection: If two valuations are very close, the higher is choosen, as usual, but marked as uncertain.</Paragraph>
      <Paragraph position="9"> final out tpu!2.' Then the reigner is lemmatizcd, and the relation is output, with its type, If the dependent is co-ordinated, the dependency of the right co-.ordinated noun is output too. It has the same reigncr as its left co-ordinate.</Paragraph>
      <Paragraph position="10"> Computing co-ordinations of noun phrases p~ The valuation function used to compute the co-ordinations is a linear function consisting of criteria that finely measure the isomorphism of the two co-ordinated phrases, it depends neither on the powers nor on the distances. Note that this isomorphism is not the fundamental feature of co-ordinated phrases, but that, more globally, it is the most common mark of an &amp;quot;isofunctionality&amp;quot;, perhaps also for some aesthetic reasons: euphony, taste for balance, symmetry, repetition (of structure).</Paragraph>
      <Paragraph position="11"> the valuation criteria: * a first group of criteria is used to measure the isomorphism of the determinants of the two potential co-ordinated NPs: both definite determinants, both indefinite determinants or both without determinant, both possess'ires or both demonstratives; * a second group of criteria is used to measure the isomorphism of nouns: identical nouns, both nominalized verbs, or same number; * a criterion concerns the &amp;quot;isoqualification&amp;quot; of nouns: both qualified or both not qualified.</Paragraph>
      <Paragraph position="12"> Other features of the parser  The parser outputs the results into following files: * results, sentence by sentence: relations, features of each word, and tile determination tree; * other results, grouped and classed on the whole text: - the lexicon of the text, lemmatized, classed by category, and reversed, - relations of the text, classed by syntactic type of relation, - problems met during the parsing, classed by type of problem, - statistics about text size, processing speed, number of tested patterns, categories, deduction modes, relations and patterns of the text.</Paragraph>
      <Paragraph position="13"> Conclusions The NP stack grammar is based on the central place of tile noun phrase.</Paragraph>
      <Paragraph position="14"> It allows, before the internal analysis of NPs, to verify a sentence pattern, by a fast and non recursive algorithm.</Paragraph>
      <Paragraph position="15"> The typology of relations gives clearly the separation between relations the computation of which can be algorithmic and relations the computation of which must be heuristic.</Paragraph>
      <Paragraph position="16"> Using valuation functions to compute relations is an efficient mean to exploit as best as possible purely formal criteria, without using semantic information, thus confirming the high formal redundancy of natural language.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML