<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1043"> <Title>A language-independent shallow-parser Compiler</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Presentation of the compiler </SectionTitle> <Paragraph position="0"> Our tool has been developed using JavaCC (a compiler compiler similar to Lex & Yacc, but for Java). The program takes as input a file containing rules. These rules aim at identifying constituent boundaries for a given language.</Paragraph> <Paragraph position="1"> For example, for English, one such rule could say "When encountering a preposition, start a PP". Rules may rely either on function words or on morphological information (e.g. gender), whichever is appropriate for the language being considered.</Paragraph> <Paragraph position="2"> These rule files specify: * A mapping between the "abstract" morpho-syntactic tags used in the rules and the "real" morpho-syntactic tags as they will appear in the input.</Paragraph> <Paragraph position="3"> * A declaration of the syntactic constituents which will be detected (e.g. NP, VP, PP ...) * A set of unordered rules From this rule file, the compiler generates a Java program, which is a shallow-parser based on the rule file. One can then run this shallow-parser on an input to obtain a shallow-parsed text3.</Paragraph> <Paragraph position="4"> The compiler itself is quite simple, but we have decided to compile the rules rather than interpret them, essentially for efficiency reasons. Also, it is language-independent, since a rule file may be written for any given language and compiled into a shallow-parser for that language.</Paragraph> <Paragraph position="5"> 3 The input is generally POS-tagged, although this is not an intrinsic requirement of the compiler.</Paragraph> <Paragraph position="6"> Each rule is of the form: {Preamble} disjunction of patterns then actions</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 A concrete example: compiling a simple NP-chunker for English </SectionTitle> <Paragraph position="0"> In this section we present a very simple "toy" example which aims at identifying some NPs in the Penn Treebank4 (Marcus & al 93).</Paragraph> <Paragraph position="1"> In order to do so, we write a rule file, shown in figure 1. The top of the file declares a mapping between the abstract tagset we use in our rules and the tagset of the Penn Treebank. For example, commonN corresponds to the 3 tags NN, NNS, NNPS in the Penn Treebank. The file then declares the labels of the constituents which will be detected (here there is only one: NP). Finally, it declares 3 rules.</Paragraph> <Paragraph position="2"> %% A small NP-chunker for the Penn-treebank</Paragraph> <Paragraph position="4"> Rule 1 says that when a determiner, a quantity adverb or a demonstrative pronoun is encountered, the current constituent must be closed and an NP must be opened. Rule 2 says that, when not inside an NP, if a common noun, an adjective or a proper noun is encountered, then the current constituent should be closed and an NP should be opened. Finally, Rule 3 says that when some other tag is encountered (i.e. a verb, a preposition, a punctuation mark, a conjunction or an adverb), the current constituent should be closed.</Paragraph> <Paragraph position="5"> 4 This example is kept very simple for the sake of clarity. It does not aim at yielding a very accurate result.</Paragraph>
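Since the body of figure 1 is not reproduced in this extraction, here is a rough sketch of what such a rule file might look like. It follows the rule format described above and the one rule quoted later in this section; the declaration syntax, the empty preamble {}, the disjunction operator and all abstract tag names except commonN are assumptions, not the paper's actual figure:

    %% A small NP-chunker for the Penn-treebank
    %% tagset mapping (abstract tag -> Penn Treebank tags)
    commonN : NN NNS NNPS
    %% further mappings (det, adj, properN, ...) would be declared the same way
    %% constituents to be detected
    NP
    %% rules
    {} (:$det | :$advQuant | :$proDem) then close(), open(NP);        %% Rule 1
    {not NP} (:$commonN | :$adj | :$properN) then close(), open(NP);  %% Rule 2
    {} (:$verb | :$prep | :$punct | :$conj | :$adv) then close();     %% Rule 3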
<Paragraph position="6"> This rule file is then compiled into an NP-chunker. If one inputs (a) to the NP-chunker, it outputs the same sentence with NP boundaries inserted. So contrary to standard finite-state techniques, only constituent boundaries are made explicit, and it is not necessary (or even possible) to specify all the possible ways a constituent may be realized.</Paragraph> <Paragraph position="7"> As shown in section 3, this greatly reduces the number of rules in the system (from several dozen to fewer than 60 for a wide-coverage shallow-parser). Also, focusing only on constituent boundaries ensures determinism: there is no need for determinizing or minimizing the automata we obtain from our rules.</Paragraph> <Paragraph position="8"> Our tool is robust: it never fails to provide an output and can be used to create a parser for any text from any domain in any language.</Paragraph> <Paragraph position="9"> It is also important to note that the parsing is done incrementally: the input is scanned strictly from left to right, in one single pass, and for each pattern matched, the associated actions are taken (i.e. constituent boundaries are added). Since there is no backtracking, this allows an output in linear time. If several patterns match, the longest one is applied. Hence our rules are declarative and unordered. Although in theory conflicts could appear between 2 patterns of the same length (as shown in (c1) and (c2)), this has never happened in practice. Of course the case is nonetheless dealt with in the implementation, and a warning is then issued to the user.</Paragraph> <Paragraph position="11"> As is seen in figure 1, one can write disjunctions of patterns for a given rule.</Paragraph> <Paragraph position="12"> In this very simple example, only non-recursive NP-chunks are marked, by choice. But this is not an intrinsic limitation of the tool, since any amount of embedding can be obtained (as shown in section 3 below) through the use of a stack. From a formal point of view, our tool has the power of a deterministic push-down automaton.</Paragraph> <Paragraph position="13"> When there is a match between the input and the pattern in a rule, the following actions may be taken: * close(): closes the constituent last opened, by inserting </X> in the output, where X is the syntactic label at the top of the stack.</Paragraph> <Paragraph position="14"> * open(X): opens a new constituent by inserting the label <X> in the output. * closeWhenOpen(X,Y): delays the closing of the constituent labeled X until a constituent labeled Y is opened.</Paragraph> <Paragraph position="15"> * closeWhenClose(X,Y): delays the closing of the constituent labeled X until a constituent labeled Y is closed.</Paragraph> <Paragraph position="16"> * doNothing(): used to "neutralize" a shorter match.</Paragraph> <Paragraph position="17"> Examples for the actions open() and close() were provided in figure 1. The actions closeWhenOpen(X,Y) and closeWhenClose(X,Y) allow one to perform some attachments. For example, a rule for English could say: {NP} (:$conjCoord) then close(), open(NPcoord), closeWhenClose(NPcoord,NP); This rule says that when, inside an NP, a coordinating conjunction is encountered, an NPcoord should be opened, and it should be closed only when the next NP to the right is closed. This allows one to obtain, for example, the kind of bracketing sketched below.</Paragraph>
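The example that originally appeared at this point (the coordination bracketing), and the two PP rules discussed in the next paragraph, are not reproduced in this extraction. A plausible reconstruction, in the rule syntax used above (the empty preamble {} and the two-preposition pattern are assumptions, not the paper's actual rules), would be:

    <NP> the chairman </NP> <NPcoord> and <NP> the treasurer </NP> </NPcoord>
    Rule 1: {} (:$prep) then close(), open(PP);
    Rule 2: {} (:$prep :$prep) then doNothing();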
<Paragraph position="19"> The first rule says that when a preposition is encountered, a PP should be opened. The second rule says that when a preposition is encountered, if the previous tag was also a preposition, nothing should be done. Since the pattern for rule 2 is longer than the pattern for rule 1, it will apply when the second preposition in a row is encountered, hence "neutralizing" rule 1. This allows one to obtain "flatter" structures for PPs, such as the one in (e1). Without this rule, one would obtain the structure in (e2) for the same input.</Paragraph> <Paragraph position="20"> 5 This is shown as an example of how this action can be used; it does not aim at imposing this structure on coordinations, which could be dealt with differently using other rules.</Paragraph> <Paragraph position="21"> (e1) This costs <PP> up to 1000 $ </PP> (e2) This costs <PP> up <PP> to 1000 $ </PP> </PP> 3. Some "real world" applications In this section, we present some uses which have been made of this shallow-parser compiler. First we explain how the tool has been used to develop a 1 million word Treebank for French, along with an evaluation. Then we present an evaluation for English.</Paragraph> <Paragraph position="22"> It is well known that evaluating a parser is a difficult task, and this is even more true for shallow-parsers, because there is no real standard task (some shallow-parsers have embedded constituents, some encode syntactic functions, some encode constituent information, some others dependencies, or even a mixture of the two). There are also no standard evaluation measures for such tools. To perform an evaluation, one can compare the output of the parser to a well-established Treebank developed independently (assuming one is available for the language considered), but the result is unfair to the parser, because generally in Treebanks all constituents are attached. One can also compare the output of the parser to a piece of text which has been manually annotated just for the purpose of the evaluation, but then it is difficult to ensure an objective measure (esp. if the person developing the parser and the person doing the annotation are the same). Finally, one can automatically extract, from a well-established Treebank, information that is relevant to a given, widely agreed on, non-ambiguous task such as identifying bare non-recursive NP-chunks, and compare the output of the parser for that task to the extracted information. But this yields an evaluation that is valid only for this particular task and may not reflect well the overall performance of the parser. In what follows, in order to be as objective as possible, we use these 3 types of evaluation, both for French and for English6, and use the standard measures of recall and precision. Please bear in mind though that these measures, although very fashionable, have their limits7. Our goal is not to show that our tool is the one which provides the best results when compared to other shallow-parsers, but rather to show that it obtains similar results, although in a much simpler way, with a limited number of rules compared to finite-state techniques, with more tolerance to POS errors, and even in the absence of available training data (i.e. cases where probabilistic techniques could not be used). To achieve this goal, we also present samples of the parsed outputs we obtain, so that the reader may judge for himself/herself.</Paragraph>
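For reference, recall and precision are used below in their standard bracket-level sense: precision = number of correct constituent brackets produced by the parser / total number of brackets produced, and recall = number of correct brackets produced / number of brackets in the reference annotation.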
3.1 A shallow-parser for French <Paragraph position="23"> We used our compiler to create a shallow-parser for French. Contrary to English, very few shallow-parsers exist for French, and no Treebank actually exists to train a probabilistic parser (although one is currently being built using our tool, cf. (Abeille & al. 00)).</Paragraph> <Paragraph position="24"> Concerning shallow-parsers, one can mention (Bourigault 92), who aims at isolating NPs representing technical terms, whereas we wish to have information on other constituents as well, and (Ait-Moktar & Chanod 97), whose tool is not publicly available. One can also mention (Vergne 99), who developed a parser for French which also successfully relies on function words to identify constituent boundaries. But contrary to us, his tool does not embed constituents8. And it is also not publicly available.</Paragraph> <Paragraph position="25"> In order to develop a set of rules for French, we had to examine the linguistic characteristics of this language. It turns out that although French has a richer morphology than English (e.g. gender for nouns, marked tense for verbs), most constituents are nonetheless triggered by the occurrence of a function word. Following the linguistic tradition, we consider as function words all words associated with a POS which labels a closed class, i.e.: determiners, prepositions, clitics, auxiliaries, pronouns (relative, demonstrative), conjunctions (subordinating, coordinating), punctuation marks and adverbs belonging to a closed class (e.g. the negation adverbs "ne" and "pas")9.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <Paragraph position="0"> 6 Of course, manual annotation was done by a different person than the one who developed the rules.</Paragraph> <Paragraph position="1"> 7 For instance, in a rule-based system, performance may often be increased by adding more rules.</Paragraph> <Paragraph position="2"> 8 Instead, it identifies chunks and then assigns some syntactic functions to these chunks.</Paragraph> <Paragraph position="3"> The presence of function words makes the detection of the beginning of a constituent rather easy. For instance, contrary to English, subordinating conjunctions (que/that) are never omitted when a subordinating clause starts.</Paragraph> <Paragraph position="4"> Similarly, determiners are rarely omitted at the beginning of an NP.</Paragraph> <Paragraph position="5"> Our aim was to develop a shallow-parser which dealt with some embedding, but did not commit to attaching potentially ambiguous phrases such as PPs and verb complements. We wanted to identify the following constituents: NP, PP, VN (verbal nucleus), VNinf (infinitivals introduced by a preposition), COORD (coordination), SUB (sentential complements), REL (relative clauses), SENT (sentence boundaries), INC (constituents of unknown category) and AdvP (adverbial phrases).</Paragraph> <Paragraph position="6"> We wanted NPs to include all adjectives but not other postnominal modifiers (i.e. postposed relative clauses and PPs), in order to obtain a structure similar to (f).</Paragraph> <Paragraph position="7"> (f) <NP> Le beau livre bleu </NP> <PP> de <NP>ma cousine</NP> </PP> ... (my cousin's beautiful blue book)</Paragraph> <Paragraph position="8"> Relative clauses also proved easy to identify, since they begin when a relative pronoun is encountered.
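A rule opening relative clauses could thus look roughly like the following hypothetical sketch, in the rule syntax of section 2 (the abstract tag name proRel and the empty preamble {} are assumptions):

    {} (:$proRel) then close(), open(REL);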
The ending of clauses occurs essentially when a punctuation mark or a coordinating conjunction is encountered, or when another clause begins, or when a sentence ends (g1). These rules for closing clauses work fairly well in practice (see the evaluation below), but they could be further refined, since in some configurations they will yield a wrong closing boundary for the relative. (9 Considering punctuation marks as function words may be "extending" the linguistic tradition. Nonetheless, they form a closed class, since there is a small finite number of punctuation marks.)</Paragraph> <Paragraph position="9"> Concerning clitics, we have decided to group them with the verb (h1), even when dealing with subject clitics (h2). One motivation is the possible inversion of the subject clitic (h3). Sentences are given a flat structure, that is, complements are not included in a verbal phrase10 (i). From a practical point of view this eases our task. From a theoretical point of view, the traditional VP (with complements) is subject to much linguistic debate and is often discontinuous in French, as is shown in (j1) and (j2): in (j1) the NP subject (IBM) is postverbal and precedes the locative complement (sur le marche); in (j2), the adverb certainement is also postverbal and precedes the NP object (une augmentation de capital).</Paragraph> <Paragraph position="10"> When we began our task, we had at our disposal a 1 million word POS-tagged and hand-corrected corpus (Abeille & Clement 98). The corpus was meant to be syntactically annotated for constituency. To achieve this, precise annotation guidelines for constituency had been written, and a portion of the corpus (approx. 25 000 words) had been hand-annotated (independently of the development of the shallow-parser) to test the guidelines.</Paragraph> <Paragraph position="11"> To evaluate the shallow-parser, we proceeded as described at the beginning of section 3: we parsed the 1 million words. We set aside 500 sentences (approx. 15 000 words) for quickly tuning our rules. We also set aside the 25 000 words that had been independently annotated, in order to compare the output of the parser to a portion of the final Treebank. In addition, an annotator hand-corrected the output of the shallow-parser on 1000 new randomly chosen sentences (approx. 30 000 words).</Paragraph> <Paragraph position="12"> Contrary to the 25 000 words which constituted the beginning of the Treebank, for these 30 000 words verb arguments, PPs and modifiers were not attached. Finally, we extracted bare non-recursive NPs from the 25 000 words, in order to evaluate how the parser did on this particular task.</Paragraph> <Paragraph position="13"> When compared to the hand-corrected output of the parser, for opening brackets we obtain a recall of 94.3% and a precision of 95.2%. For closing brackets, we obtain a precision of 92.2% and a recall of 91.4%. Moreover, 95.6% of the correctly placed brackets are labeled correctly; the remaining 4.4% are not strictly speaking labeled incorrectly, since they are labeled INC (i.e. unknown). These unknown constituents, rather than errors, constitute a mechanism of underspecification (the idea being to assign as little wrong information as possible)11.</Paragraph> <Paragraph position="14"> When compared to the 25 000 words of the Treebank, for opening brackets the recall is 92.9% and the precision is 94%. For closing brackets, the recall is 62.8% and the precision is 65%.
These lower results are normal, since the Treebank contains attachments that the parser is not supposed to make.</Paragraph> <Paragraph position="15"> Finally, on the specific task of identifying non-recursive NP-chunks, we obtain a recall of 96.6% and a precision of 95.8% for opening brackets, and a recall and precision of respectively 94.3% and 92.9% for closing brackets.</Paragraph> <Paragraph position="16"> 11 These underspecified labels can be removed at a deeper parsing stage, or one can add a guesser.</Paragraph> <Paragraph position="17"> To give an idea about the coverage of the parser, sentences are on average 30 words long and comprise 20.6 opening brackets (and thus as many closing brackets). Errors that are difficult to correct with access to a limited context mainly involve "missing" brackets (e.g. "comptez vous * ne pas le traiter" (do you expect not to treat him) appears as a single constituent, while there should be 2), whereas "spurious" brackets can often be eliminated by adding more rules (e.g. for multiple prepositions: "de chez"). Most errors for closing brackets are due to clause boundaries (i.e. SUB, COORD and REL).</Paragraph> <Paragraph position="18"> To obtain these results, we had to write only 48 rules.</Paragraph> <Paragraph position="19"> Concerning speed, as argued in (Tapanainen & Jarvinen, 94), we found that rule-based systems are not necessarily slow, since the 1 million words are parsed in 3 minutes and 8 seconds.</Paragraph> <Paragraph position="20"> One can compare this to (Ait-Moktar & Chanod 97), who, in order to shallow-parse French, resort to 14 networks and parse 150 words/sec (which amounts to approx. 111 minutes for one million words)12. It is difficult to compare our result to other results, since most shallow-parsers pursue different tasks and use different evaluation metrics. However, to give an idea, standard techniques typically produce an output for one million words in 20 minutes and report a precision and a recall ranging from 70% to 95%, depending on the language, kind of text and task. Again, we are not saying that our technique obtains the best results, but simply that it is fast and easy to use on unrestricted text in any language. To give a better idea to the reader, we provide an output of the shallow-parser for French in figure 2.</Paragraph> <Paragraph position="21"> In order to improve our tool and our rules, a demo is available online on the author's homepage.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 A Shallow-Parser for English </SectionTitle> <Paragraph position="0"> We wanted to evaluate our compiler on more than one language, to make sure that our results were easily replicable. So we wrote a new set of rules for English using the Penn Treebank tagset, both for POS and for constituent labels.</Paragraph> <Paragraph position="1"> 12 They report a recall ranging from 82.6% to 92.6% depending on the type of texts, and a precision of 98% for subject recognition, but their results are not directly comparable to ours, since the task is different.</Paragraph> <Paragraph position="2"> We set aside sections 00 and 01 of the WSJ for evaluation (i.e. approx.
3900 sentences), and used other sections of the WSJ for tuning our rules.</Paragraph> <Paragraph position="3"> Contrary to the French Treebank, the Penn Treebank contains non-surface constructions such as empty nodes, and constituents that are not triggered by a lexical item.</Paragraph> <Paragraph position="4"> Therefore, before evaluating our new shallow-parser, we automatically removed from the test sentences all opening brackets that were not immediately followed by a lexical item, together with their corresponding closing brackets, as well as all the constituents which contained an empty element. We also removed all information on pseudo-attachment. We then compared the output of the shallow-parser to the test sentences. For bare NPs, we compared our output to the POS-tagged version of the test sentences (since bare NPs are marked there).</Paragraph> <Paragraph position="5"> For the shallow-parsing task, we obtain a precision of 90.8% and a recall of 91% for opening brackets, and a precision of 65.7% and a recall of 66.1% for closing brackets. For the NP-chunking task, we obtain a precision of 91% and a recall of 93.2%, using an "exact match" measure (i.e. both the opening and closing boundaries of an NP must match to be counted as correct).</Paragraph> <Paragraph position="6"> The results were as satisfactory as for French. Concerning linguistic choices when writing the rules, we didn't really make any, and simply followed closely those of the Penn Treebank syntactic annotation guidelines (modulo the embeddings, the empty categories and the pseudo-attachments mentioned above).</Paragraph> <Paragraph position="7"> Concerning the number of rules, we used 54 of them in order to detect all constituents, and 27 rules for NP-chunk identification. In sections 00 and 01 of the WSJ there were 24553 NPs, realized as 1200 different POS patterns (e.g. CD NN, DT $ JJ NN, DT NN...). Even though these 1200 patterns corresponded to a lower number of regular expressions, a standard finite-state approach would have to resort to more than 27 rules. One can also compare this result to the one reported in (Ramshaw & Marcus 95), who obtain up to 93.5% recall and 93.1% precision on the same task, but using between 500 and 2000 rules.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Tolerance to POS errors </SectionTitle> <Paragraph position="0"> To test the tolerance to POS tagging errors, we extracted the raw text from the English corpus of section 3.2 and retagged it using the publicly available tagger TreeTagger (Schmid, 94), without retraining it. The authors of the tagger advertise an error rate between 3 and 4%. We then ran the NP-chunker on the output of the tagger, and still obtain a precision of 90.2% and a recall of 92% on the "exact match" NP identification task: the fact that our tool does not rely on regular expressions describing "full constituent patterns" allows it to ignore some POS errors, since mistagged words which do not appear at constituent boundaries (i.e. essentially lexical words) have no influence on the output. This improves accuracy and robustness.
For example, if "first" has been mistagged as a noun instead of an adjective in [NP the first man ] on the moon ..., it won't prevent detecting the NP, as long as the determiner has been tagged correctly.</Paragraph> <Paragraph position="1"> Conclusion We have presented a tool which allows one to generate a shallow-parser for unrestricted text in any language. This tool is based on the use of a limited number of rules which aim at identifying constituent boundaries. We then presented evaluations on French and on English, and concluded that our tool obtains results similar to other shallow-parsing techniques, but in a much simpler and more economical way.</Paragraph> <Paragraph position="2"> We are interested in developing new sets of rules for new languages (e.g. Portuguese and German) and new styles (e.g. French oral texts).</Paragraph> <Paragraph position="3"> It would also be interesting to test the tool on inflectional languages.</Paragraph> <Paragraph position="4"> The shallow-parser for French is also being used in the SynSem project, which aims at syntactically and semantically annotating several million words of French texts distributed by ELRA13. Future improvements of the tool will consist in adding a module to annotate syntactic functions and complete valency information for verbs, with the help of a lexicon (Kinyon, 00).</Paragraph> <Paragraph position="5"> Finally, from a theoretical point of view, it may be interesting to see if our rules could be acquired automatically from raw text (although this might not be worth it in practice, considering the small number of rules we use, and the fact that acquiring the rules in such a way would most likely introduce errors).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Acknowledgement </SectionTitle> <Paragraph position="0"> We especially thank F. Toussenel, who has performed most of the evaluation for French presented in section 3.1.1.</Paragraph> <Paragraph position="1"> 13 European Language Resources Association.</Paragraph> </Section> </Paper>