<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1012">
  <Title>References</Title>
  <Section position="4" start_page="72" end_page="74" type="metho">
    <SectionTitle>
2 The incremental parser
2.1 Overview
</SectionTitle>
    <Paragraph position="0"> The input to the parser is a tagged text. We currently use a modified version of the Xerox French tagger (Chanod, Tapanainen, 1995). The revisions are meant to reduce the impact of the most frequent errors of the tagger (e.g. errors between adjectives and past participles), and to refine the tagset.</Paragraph>
    <Paragraph position="1"> Each input token is assigned a single tag, generally representing the part-of-speech and some limited morphological information (e.g. the number, but not the gender, of nouns). The sentence is initially represented by a sequence of wordform-plus-tag pairs.</Paragraph>
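As a concrete illustration, the wordform-plus-tag representation can be sketched as follows. The tag names are invented for the example and are not the actual Xerox tagset:

```python
# A sentence as a sequence of wordform-plus-tag pairs.
# Tag names are illustrative only, not the actual Xerox tagset.
sentence = [
    ("le", "DET-SG"),
    ("commutateur", "NOUN-SG"),
    ("tourne", "VERB-P3SG"),
]

def encode(pairs):
    # Flat "word/TAG" encoding, convenient for matching tag patterns
    # with regular expressions in later stages.
    return " ".join(f"{w}/{t}" for w, t in pairs)

print(encode(sentence))  # le/DET-SG commutateur/NOUN-SG tourne/VERB-P3SG
```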
    <Paragraph position="2"> The incremental parser consists of a sequence of transducers. These transducers are compiled from regular expressions that use finite-state calculus operators, mainly the Replace operators (Karttunen, 1996). Each of these transducers adds syntactic information represented by reserved symbols (annotations), such as brackets and names for segments and syntactic functions. The application of each transducer composes it with the result of previous applications. If the constraints stipulated in a given transducer are not verified, the string remains unchanged. This ensures that there is always an output string at the end of the sequence, with possibly underspecified segments.</Paragraph>
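The cascade can be approximated with ordinary regular-expression rewriting. This is a sketch only: the actual system uses compiled finite-state transducers, and the patterns and tags below are invented for illustration. Note that when a stage's pattern does not match, the string passes through unchanged, so there is always an output.

```python
import re

# An illustrative cascade of "transducers", each a regex rewrite that
# adds annotation symbols. If a stage's constraints are not met,
# re.sub leaves the string unchanged.
STAGES = [
    # bracket a determiner-noun pair as an NP
    lambda s: re.sub(r"(\S+/DET \S+/NOUN)", r"[NP \1 NP]", s),
    # bracket a finite verb as a VC
    lambda s: re.sub(r"(\S+/VFIN)", r"[VC \1 VC]", s),
    # mark an NP immediately left of a VC as subject
    lambda s: re.sub(r"(NP\])( \[VC)", r"\1/SUBJ\2", s),
]

def parse(s):
    for stage in STAGES:
        s = stage(s)  # composition: each stage sees the prior output
    return s

print(parse("le/DET chat/NOUN dort/VFIN"))
# [NP le/DET chat/NOUN NP]/SUBJ [VC dort/VFIN VC]
```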
    <Paragraph position="3"> Each transducer performs a specific linguistic task. For instance, some networks identify segments for NPs, PPs, APs (adjective phrases) and verbs, while others are dedicated to subject or object. The same task (e.g. subject assignment or verb segmentation) may be performed by more than one transducer. The additional information provided at each stage of the sequence is instrumental in the definition of the later stages of the sequence. Networks are ordered in such a way that the easiest tasks are addressed first.</Paragraph>
    <Section position="1" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
2.2 Non-monotonicity
</SectionTitle>
      <Paragraph position="0"> The replace operators allow one not only to add information but also to modify previously computed information. It is thus possible to reassign syntactic markings at a later stage of the sequence. This has two major uses: * assigning some segments a default marking at some stage of the process, in order to provide preliminary information that is essential to the subsequent stages, and correcting the default marking later if the context so requires; * assigning some segments a very general marking, and refining the marking later if the context so permits.</Paragraph>
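A minimal sketch of this default-then-revise mechanism, with invented patterns (accents omitted for simplicity): an early stage marks any preverbal NP as subject by default, and a later stage revokes the marking for an appositive NP set off by a comma.

```python
import re

# Illustrative non-monotonic marking: a default assignment followed
# by a later correction. Patterns are invented, not the real grammar.
def mark_default(s):
    # default: an NP directly left of a VC is marked subject
    return re.sub(r"(NP\]) (\[VC)", r"\1/SUBJ \2", s)

def revise(s):
    # later correction: an appositive NP preceded by a comma
    # loses the default subject marking
    return re.sub(r"(, \[NP [^\]]+ NP\])/SUBJ", r"\1", s)

s = "[NP le president NP] , [NP Jacques NP] [VC a decide VC]"
print(revise(mark_default(s)))
# [NP le president NP] , [NP Jacques NP] [VC a decide VC]
```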
      <Paragraph position="1"> In that sense, our incremental parser is non-monotonic: earlier decisions may be refined or even revised. However, all the transducers can, in principle, be composed into a single transducer which produces the final outcome in a single step.</Paragraph>
    </Section>
    <Section position="2" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
2.3 Cautious segmentation and syntactic marking
</SectionTitle>
      <Paragraph position="0"> Each transducer defines syntactic constructions using two major operations: segmentation and syntactic marking. Segmentation consists of bracketing and labeling adjacent constituents that belong to the same partial construction (e.g. a nominal or verbal phrase, or a more primitive/partial syntactic chain if necessary). Segmentation also includes the identification of clause boundaries. Syntactic marking annotates segments with syntactic functions (e.g. subject, object, PPObj).</Paragraph>
      <Paragraph position="1"> The two operations, segmentation and syntactic marking, are performed throughout the sequence in an interrelated fashion. Some segmentations depend on previous syntactic marking and vice versa.</Paragraph>
      <Paragraph position="2"> If a construction is not recognized at some point of the sequence because the constraints are too strong, it can still be recognized at a later stage, using other linguistic statements and different background information. This notion of delayed assignment is crucial for robust parsing, and requires that each statement in the sequence be linguistically cautious. Cautious segmentation prevents us from grouping syntactically independent segments.</Paragraph>
      <Paragraph position="3"> This is why we avoid the use of simplifying approximations that would block the possibility of performing delayed assignment. For example, unlike (Abney, 1991), we do not systematically use longest pattern matching for segmentation. Segments are restricted by their underlying linguistic indeterminacy (e.g. post-nominal adjectives are not attached to the immediate noun on their left, and coordinated segments are not systematically merged, until strong evidence is established for their linkage).</Paragraph>
    </Section>
    <Section position="3" start_page="73" end_page="73" type="sub_section">
      <SectionTitle>
2.4 Incremental parsing and linguistic description
</SectionTitle>
      <Paragraph position="0"> The parsing process is incremental in the sense that the linguistic description attached to a given transducer in the sequence: * relies on the preceding sequence of transducers; * covers only some occurrences of a given linguistic phenomenon; * can be revised at a later stage.</Paragraph>
      <Paragraph position="1"> This has a strong impact on the linguistic character of the work. The ordering of the linguistic descriptions is in itself a matter of linguistic description: i.e. the grammarian must split the description of phenomena into sub-descriptions, depending on the available amount of linguistic knowledge at a given stage of the sequence.</Paragraph>
      <Paragraph position="2"> This may sound like a severe disadvantage of the approach, as deciding on the order of the transducers relies mostly on the grammarian's intuition. But we argue that this incremental view of parsing is instrumental in achieving robust parsing in a principled fashion. When it comes to parsing, no statement is fully accurate (one may, for instance, find perfectly correct French sentences in which even the subject and the verb do not agree). However, one may construct statements which are true almost everywhere, that is, which are always true in some frequently occurring context.</Paragraph>
      <Paragraph position="3"> By identifying the classes of such statements, we reduce the overall syntactic ambiguity and we simplify the task of handling less frequent phenomena. The less frequent phenomena apply only to segments that are not covered by previous linguistic description stages.</Paragraph>
      <Paragraph position="4"> To some extent, this is reminiscent of Optimality Theory, in which: * constraints are ranked; * constraints can be violated.</Paragraph>
      <Paragraph position="5">  Transducers at the top of the sequence are ranked higher, in the sense that they apply first, thus blocking the application of similar constructions at a later stage in the sequence.</Paragraph>
      <Paragraph position="6"> If the constraints attached to a given transducer are not fulfilled, the transducer has no effect. The output annotated string is identical to the input string and the construction is bypassed. However, a bypassed construction may be reconsidered at a later stage, using different linguistic statements. In that sense, bypassing allows for the violation of constraints.</Paragraph>
    </Section>
    <Section position="4" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
2.5 An example of incremental description:
French Subjects
</SectionTitle>
      <Paragraph position="0"> As French is typically SVO, the first transducer in the sequence to mark subjects checks for NPs on the left side of finite verbs.</Paragraph>
      <Paragraph position="1"> Later in the sequence, other transducers allow for subject inversion (thus violating the constraint on subject-verb order), especially in some specific contexts where inversion is likely to occur, e.g. within relative or subordinate clauses, or with motion verbs. Whenever a transducer defines a verb-subject construction, it is implicitly known at this stage that the initial subject-verb construction was not recognized for that particular clause (otherwise, the application of the verb-subject construction would be blocked).</Paragraph>
      <Paragraph position="3"> Further down in the sequence, transducers may allow for verb-subject constructions outside the previously considered contexts. If none of these subject-pickup constructions applies, the final sentence string remains underspecified: the output does not specify where the subject stands.</Paragraph>
      <Paragraph position="4"> It should be observed that in real texts, not only may one find subjects that do not agree with the verb (even in correct sentences), but one may also find finite verbs without a subject. This is the case, for instance, in elliptic technical reports (esp. failure reports) or on cigarette packs with inscriptions like Nuit gravement à la santé 1.</Paragraph>
      <Paragraph position="5"> This is a major feature of shallow and robust parsers (Jensen et al., 1993; Ejerhed, 1993): they may provide partial and underspecified parses when full analyses cannot be performed; the issue of grammaticality is independent from the parsing process; the parser identifies the most likely interpretations for any given input.</Paragraph>
      <Paragraph position="6"> An additional feature of the incremental parser derives from its modular architecture: one may handle underspecified elements in a tractable fashion, by adding optional transducers to the sequence. For instance, one may use corpus-specific transducers (e.g. sub-grammars for technical manuals are especially useful to block analyses that are linguistically acceptable but unlikely in technical manuals: a good example in French is to forbid second-person singular imperatives in technical manuals, as they are often ambiguous with nouns in a syntactically undecidable fashion). One may also use heuristics which go beyond the cautious statements of the core grammar (to return to the example of French subjects, heuristics can identify any underspecified NP as the subject of a finite verb if the slot is still available at the end of the sequence). How specific grammars and heuristics can be used is obviously application dependent.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="74" end_page="74" type="metho">
    <SectionTitle>
3 Architecture
</SectionTitle>
    <Paragraph position="0"> The parser has four main linguistic modules, each of them consisting of one or several sequenced transducers. (Footnote 1: Seriously endangers your health. This example represents an interesting case of deixis and, at the same time, a challenge for the POS tagger, as Nuit is more likely to be recognized as a noun (Night) than as a verb (Endangers) in this particular context.)</Paragraph>
    <Paragraph position="1"> The input text is first tagged with part-of-speech information using the Xerox tagger. The tagger uses 44 morphosyntactic tags such as NOUN-SG for singular nouns and VERB-P3SG for 3rd person singular verbs.</Paragraph>
    <Paragraph position="2"> The morphosyntactic tags are used to mark AP, NP, PP and VP segments. We then use the segmentation tags and some additional information (including typography) to mark subjects which, in turn, determine to what extent VCs (Verb Chunks) can be expanded. Finally, other syntactic functions are tagged within the segments.</Paragraph>
    <Paragraph position="3"> Marking transducers are compiled from regular expressions of the form A @-&gt; T1 ... T2, which contain the left-to-right longest-match replace operator @-&gt;. Such a transducer marks, in a left-to-right fashion, the maximal instances of A by adding the bracketing strings T1 and T2.</Paragraph>
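The effect of such a marking transducer can be approximated in ordinary regular-expression terms. This is a sketch: Python's leftmost scanning with greedy quantifiers stands in for the left-to-right longest-match semantics of the replace operator, and the AP pattern below is invented for the example.

```python
import re

# Sketch of a marking "transducer" A @-> T1 ... T2: bracket the
# left-to-right maximal matches of pattern A with T1 and T2.
def mark(pattern, t1, t2, s):
    return re.sub(f"({pattern})", f"{t1} \\1 {t2}", s)

# e.g. bracket a maximal run of adjectives as an AP
out = mark(r"\S+/ADJ(?: \S+/ADJ)*", "[AP", "AP]",
           "tres/ADV grand/ADJ vieux/ADJ chien/NOUN")
print(out)  # tres/ADV [AP grand/ADJ vieux/ADJ AP] chien/NOUN
```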
  </Section>
  <Section position="6" start_page="74" end_page="74" type="metho">
    <SectionTitle>
4 Primary Segmentation
</SectionTitle>
    <Paragraph position="0"> A segment is a continuous sequence of words that are syntactically linked to each other or to a main word (the Head). In the primary segmentation step, we mark segment boundaries within sentences as shown below where NP stands for Noun Phrase, PP for Preposition Phrase and VC for Verb Chunk (a VC contains at least one verb and possibly some of its arguments and modifiers).</Paragraph>
    <Paragraph position="1">  All the words within a segment should be linked to words in the same segment at the same level, except the head. For instance, in the NP le commutateur (the switch), le should be linked to commutateur (the head) which, in turn, should be linked to the verb tourne, and not to the verb retourne because the two words are not in the same segment. The main purpose of marking segments is therefore to constrain the particular linguistic space that determines the syntactic function of a word.</Paragraph>
    <Paragraph position="2"> (Footnote 2: Turning the starter switch to the auxiliary position, the pointer will then return to zero.)</Paragraph>
    <Paragraph position="3"> As one can notice from the example above, segmentation is very cautious, and structural ambiguity inherent to modifier attachment (even postnominal adjectives), verb arguments and coordination is not resolved at this stage.</Paragraph>
    <Paragraph position="4"> In order to get more robust linguistic descriptions and networks that compile faster, segments are not defined by marking sequences that match classical regular expressions of the type \[Det (Coord Det) Adj* Noun\], except in simple or heavily constrained cases (APs, infinitives, etc.). Rather, we take advantage of the fact that, within a linguistic segment introduced by some grammatical words and terminated by the head, there is no attachment ambiguity, and therefore these words can be safely used as segment delimiters (Bès, 1993). We first mark possible beginnings and endings of a segment and then associate each beginning tag with an ending if some internal constraints are satisfied. Hence, the main steps in segmentation are:</Paragraph>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
4.1 AP Segmentation
</SectionTitle>
      <Paragraph position="0"> Adjective phrases are marked by a replacement transducer which inserts the \[AP and AP\] boundaries around any word sequence that matches the</Paragraph>
      <Paragraph position="2"> ADVP stands for adverb phrase and is defined as:</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="74" end_page="76" type="metho">
    <SectionTitle>
\[ ADV+ \[\[COORD|COMMA\] ADV+\]* \]
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
4.2 NP Segmentation
</SectionTitle>
      <Paragraph position="0"> Unlike APs, NPs are marked in two steps where the basic idea is the following: we first insert a special mark wherever a beginning of an NP is possible, i.e. on the left of a determiner, a numeral, a pronoun, etc. The mark is called a temporary beginning of NP (TBeginNP). The same is done for all possible ends of NP (TEndNP), i.e. nouns, numerals, pronouns, etc. Then, using a replacement transducer, we insert the \[NP and NP\] boundaries around the longest sequence that contains at least one temporary beginning of NP followed by one temporary end of NP:</Paragraph>
      <Paragraph position="2"> This way, we implicitly handle complicated NPs such as le ou les responsables (the-SG or the-PL person(s) in charge), les trois ou quatre affaires (the three or four cases), etc.</Paragraph>
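The two-step NP marking can be sketched as follows. Tags and patterns are simplified inventions, and the greedy span in step 2 stands in for the longest-match replacement; the sketch handles one NP per input, whereas the real grammar constrains what may occur between the temporary marks.

```python
import re

# Sketch of two-step NP marking with temporary boundary symbols.
def mark_np(s):
    # Step 1: temporary beginnings (TB) and ends (TE) of NP
    s = re.sub(r"(\S+/(?:DET|NUM))", r"TB \1", s)   # possible beginnings
    s = re.sub(r"(\S+/(?:NOUN|NUM))", r"\1 TE", s)  # possible ends
    # Step 2: bracket the longest span from a TB to a TE
    # (greedy .* = longest match), then drop the temporary marks
    s = re.sub(r"TB (.*) TE", r"[NP \1 NP]", s)
    return s.replace("TB ", "").replace(" TE", "")

print(mark_np("le/DET ou/COORD les/DET responsables/NOUN"))
# [NP le/DET ou/COORD les/DET responsables/NOUN NP]
```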
    </Section>
    <Section position="2" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.3 PP Segmentation
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
    <Section position="3" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
4.4 VC Segmentation
</SectionTitle>
      <Paragraph position="0"> A VC (Verb Chunk) is a sequence containing at least one verb (the head). It may include words or segments (NPs, PPs, APs or other VCs) that are possibly linked as arguments or adjuncts to the verb.</Paragraph>
      <Paragraph position="1"> There are three types of VCs: infinitives, present participle phrases and finite verb phrases. We first mark infinitive and present participle segments as they are simpler than finite verb phrases: they are not recursive and cannot contain other VCs.</Paragraph>
      <Paragraph position="2">  Here we use the basic idea described in the NP marking: temporary beginnings (TBeginVC) and ends (TEndVC) of VC are first marked.</Paragraph>
      <Paragraph position="3"> Temporary beginnings of VCs are usually introduced by grammatical words such as qui (relative pronoun), lorsque, et (coordination), etc. However, not all these words are certain VC boundaries: et could be an NP coordinator, while que (tagged as CONJQUE by the HMM tagger) could be used in comparatives (e.g. plus blanc que blanc). Therefore, we use three kinds of TBeginVC to handle different levels of uncertainty: a certain TBeginVC (TBeginVC1), a possible TBeginVC (TBeginVC2) and an initial TBeginVC (TBeginVCS), automatically inserted at the beginning of every sentence in the input text. With TBeginVCS, we assume that the sentence has a main finite verb, as is usually the case, but this is just an assumption that can be corrected later.</Paragraph>
      <Paragraph position="4"> A temporary end of VC (TEndVC) is then inserted on the right of any finite verb, and the process of recognizing VCs consists of the following steps: * Step 1: Each certain TBeginVC1 is matched with a TEndVC, and the sequence is marked with \[VC and VC\]. The matching is applied iteratively on the input text to handle the case of embedded clauses (arbitrarily bounded to three iterations in the current implementation).</Paragraph>
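Step 1's iterative matching can be sketched as follows. The tag strings are invented; the negative lookahead confines each pass to spans that contain no further certain beginning, so inner clauses are bracketed before the clauses that embed them, and the loop is bounded as in the paper.

```python
import re

# Sketch of iterative VC matching: pair each certain beginning (TB1)
# with a temporary end (TE, inserted right of a finite verb). Each
# pass brackets only spans with no embedded TB1, so iteration handles
# embedded clauses; the bound of three mirrors the implementation.
def mark_vc(s, max_iter=3):
    for _ in range(max_iter):
        new = re.sub(r"TB1 ((?:(?!TB1)\S+ )*?)TE", r"[VC \1VC]", s)
        if new == s:
            break
        s = new
    return s

print(mark_vc("TB1 lorsque/CONJ TB1 qui/REL vient/VFIN TE il/PRO dort/VFIN TE"))
# [VC lorsque/CONJ [VC qui/REL vient/VFIN VC] il/PRO dort/VFIN VC]
```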
    </Section>
  </Section>
  <Section position="8" start_page="76" end_page="77" type="metho">
    <SectionTitle>
5 Marking Syntactic Functions
</SectionTitle>
    <Paragraph position="0"> The process of tagging words and segments with syntactic functions is a good example of the non-monotonic nature of the parser and its hybrid constructive-reductionist approach. Syntactic functions within non-recursive segments (AP, NP and PP) are addressed first because they are easier to tag. Then other functions within verb segments and at sentence level (subject, direct object, verb modifier, etc.) are considered.</Paragraph>
    <Paragraph position="1"> Potential subjects are marked first: an NP is a potential subject if and only if it satisfies some typographical conditions (it should not be separated from the verb by a single comma, etc.). This prevents the NP Jacques, for example, from being marked as a subject in the sentence below: \[VC \[NP le président NP\]/SUBJ \[PP du CSA PP\], \[NP Jacques NP\] \[NP Boutet NP\] , a décidé VC\] \[VC de publier VC\] \[NP la profession NP\] \[PP de foi PP\] ./SENT Then constraints are applied to eliminate some of the potential subject candidates. The constraints are mainly syntactic: they concern subject uniqueness (unless there is a coordination), the necessary sharing of the subject function among coordinated NPs, etc. The remaining candidates are then considered as real subjects. The other syntactic functions, such as object, PP-Obj, verb modifier, etc., are tagged using similar steps.</Paragraph>
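The two-phase marking (over-generate candidates, then eliminate by constraints) can be sketched as follows. The data representation and the constraint set are invented for illustration; the real parser operates on annotated strings, not lists.

```python
# Sketch of two-phase subject marking: over-generate potential
# subjects, then eliminate candidates with syntactic constraints.
# Representation and constraints are invented for illustration.

def potential_subjects(nps):
    # nps: (text, separated_from_verb_by_single_comma) pairs; the
    # typographical condition filters out appositive NPs like Jacques
    return [text for text, single_comma in nps if not single_comma]

def apply_uniqueness(candidates, coordinated=False):
    # subject uniqueness: keep a single candidate unless coordinated
    return candidates if coordinated else candidates[:1]

nps = [("le president", False), ("Jacques", True), ("Boutet", True)]
print(apply_uniqueness(potential_subjects(nps)))  # ['le president']
```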
  </Section>
  <Section position="9" start_page="77" end_page="77" type="metho">
    <SectionTitle>
6 Expanding Verb Segments
</SectionTitle>
    <Paragraph position="0"> Because primary segmentation is cautious, verb segments end right after a verb in order to avoid arbitrary attachment of argument or adjunct segments (NPs, PPs and APs on the right of a verb). However, experiments have shown that in some kinds of texts, mainly in technical manuals written in a &quot;controlled language&quot;, it is worth applying the &quot;nearest attachment&quot; principle. We expand VCs to include segments and to consider them as arguments or adjuncts of the VC head. This reduces structural ambiguity in the parser output with a very small error rate. For instance, expanding VCs in the sentence given in the previous section leads to the following structure: \[VC \[NP le président NP\]/SUBJ \[PP du CSA PP\], \[NP Jacques NP\] \[NP Boutet NP\] , a décidé \[VC de publier \[NP la profession NP\] \[PP de foi PP\]</Paragraph>
  </Section>
  <Section position="10" start_page="77" end_page="77" type="metho">
    <SectionTitle>
VC\] VC\] ./SENT
</SectionTitle>
    <Paragraph position="0"> Nevertheless, as this principle leads to a significant number of incorrect attachments in the case of more free-style texts, the VC expansion network is optionally applied depending on the input text.</Paragraph>
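The optional VC expansion under nearest attachment can be sketched as a repeated rewrite (invented regex, accents omitted): a VC that ends right after its verb absorbs the NP/PP segments immediately to its right.

```python
import re

# Sketch of optional VC expansion under "nearest attachment":
# repeatedly pull the NP/PP segment immediately right of a VC
# boundary inside the VC, until no such segment remains.
def expand_vc(s):
    while True:
        new = re.sub(r"VC\] (\[(?:NP|PP) [^\]]+ (?:NP|PP)\])", r"\1 VC]", s)
        if new == s:
            return s
        s = new

print(expand_vc("[VC de publier VC] [NP la profession NP] [PP de foi PP]"))
# [VC de publier [NP la profession NP] [PP de foi PP] VC]
```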
  </Section>
  <Section position="11" start_page="77" end_page="77" type="metho">
    <SectionTitle>
7 Performance
</SectionTitle>
    <Paragraph position="0"> As mentioned above, the parser is implemented as a sequence of finite-state networks. The total size of the 14 networks we currently use is about 500 KBytes of disk space. The speed of analysis is around 150 words per second on a SPARCstation 10 machine running in a development environment that we expect to optimize in the future. As for linguistic performance, we conducted a preliminary evaluation of subject recognition over a technical manual text (2320 words, 157 sentences) and newspaper articles from Le Monde (5872 words, 249 sentences). The precision and recall rates were respectively 99.2% and 97.8% in the first case, and 92.6% and 82.6% in the case of the newspaper articles. This difference in performance is due to the fact that, on the one hand, we used the technical manual text to develop the parser, and, on the other hand, the manual exhibits much less rich syntactic structures than the newspaper text.</Paragraph>
    <Paragraph position="1"> We are currently conducting wider experiments to evaluate the linguistic accuracy of the parser.</Paragraph>
  </Section>
  <Section position="12" start_page="77" end_page="78" type="metho">
    <SectionTitle>
8 Parsing Samples
</SectionTitle>
    <Paragraph position="0"> Below are some parsing samples, where the output is slightly simplified to make it more readable. In particular, morphosyntactic tags are hidden and only the major functions and the segment boundaries appear. À l'interprétation des sentiments présidentiels s'ajoute l'atmosphère de surenchère politique qui précède tout congrès du Parti socialiste.</Paragraph>
    <Paragraph position="1"> À l'heure, vendredi soir, où les troupes soviétiques s'apprêtaient à pénétrer dans Bakou, la minuscule République autonome du Nakhitchevan, territoire azéri enclavé en Arménie à la frontière de l'Iran, proclamait unilatéralement son indépendance, par décision de son propre Soviet suprême.</Paragraph>
  </Section>
  <Section position="13" start_page="78" end_page="78" type="metho">
    <SectionTitle>
9 Conclusion
</SectionTitle>
    <Paragraph position="0"> The incremental finite-state parser presented here merges both constructive and reductionist approaches. As a whole, the parser is constructive: it makes incremental decisions throughout the parsing process. However, at each step, linguistic constraints may eliminate or correct some of the previously added information. Therefore, the analysis is non-monotonic and handles uncertainty.</Paragraph>
    <Paragraph position="1"> The linguistic modularity of the system makes it tractable and easy to adapt for specific texts (e.g. technical manuals or newspaper texts). This is done by adding specialized modules into the parsing sequence. This way, the core grammar is clearly separated from optional linguistic descriptions and heuristics.</Paragraph>
    <Paragraph position="3"> Ongoing work includes expansion of the French grammar, a wider evaluation, and grammar development for new languages. We will also experiment with our primary target applications, information retrieval and translation assistance.</Paragraph>
  </Section>
</Paper>