A Robust and Efficient Parser for Non-Canonical Inputs 
 
 
Philippe Blache 
CNRS & Université de Provence 
29, Avenue Robert Schuman 
13621 Aix-en-Provence, France 
pb@lpl.univ-aix.fr
 
  
 
Abstract 
We present in this paper a parser relying 
on a constraint-based formalism called 
Property Grammar. We show how con-
straints constitute an efficient solution in 
parsing non canonical material such as 
spoken language transcription or e-mails. 
This technique, provided that it is imple-
mented with some control mechanisms, 
is very efficient. Some results are pre-
sented, from the French parsing evalua-
tion campaign EASy. 
1 Introduction 
Parsing spoken languages and non canonical in-
puts remains a challenge for NLP systems. Many 
different solutions have been experimented, de-
pending on the kind of material to be parsed or 
the kind of application: in some cases, superficial 
information such as bracketing is enough 
whereas in other situations, the system needs 
more details. The question of robustness, and 
more generally the parsing strategy, is addressed 
differently according to these parameters. Classi-
cally, three families of solutions are proposed: 
 
- Reducing the complexity of the output 
- Controlling the parsing strategy 
- Training and adapting the system to the 
type of input 
 
In the first case, the idea consists in building 
structures with little information, even under-
specified (which means the possibility of build-
ing partial structures). We find in this family the 
different shallow parsing techniques (see for ex-
ample [Hindle83], [Abney96]). Unsurprisingly, 
the use of statistical methods is very frequent and 
efficient in this kind of application (see [Tjong 
Kim Sang00] for some results of a comparison 
between different shallow parsers). Generally, 
such parsers (being them symbolic or not) are 
deterministic and build non recursive units. In 
some cases, they can also determine relations 
between units.  
 
The second family contains many different tech-
niques. The goal is to control a given parsing 
strategy by means of different mechanisms. 
Among them, we can underline three proposals: 
 
- Implementing recovering mechanisms, 
triggering specific treatments in case of 
error (cf. [Boulier05]) 
- Controlling the parsing process by 
means of probabilistic information (cf. 
[Johnson98]) 
- Controlling deep parsers by means of 
shallow parsing techniques (cf. [Crys-
mann02], [UszKoreit02], [Marimon02]) 
 
The last kind of control mechanism consists in 
adapting the system to the material to be parsed. 
This can be done in different ways: 
 
- Adding specific information in order to 
reduce the search space of the parsing 
process. This kind of information can 
appear under the form of ad hoc rules or 
information depending on the kind of 
data to be treated. 
- Adapting the resources (lexicon, gram-
mars) to the linguistic material 
 
These different strategies offer several advan-
tages and some of them can be used together. 
Their interest is that the related questions of ro-
bustness and efficiency are both taken into ac-
count. However, they do not constitute a generic 
19
solution in the sense that something has to be 
modified either in the goal, in the formalism or in 
the process. In other words, they constitute an 
additional mechanism to be plugged into a given 
framework. 
 
We propose in this paper a parsing technique 
relying on a constraint-based framework being 
both efficient and robust without need to modify 
the underlying formalism or the process. The 
notion of constraints is used in many different 
ways in NLP systems. They can be a very basic 
filtering process as proposed by Constraint 
Grammars (see [Karlsson90]) or can be part to 
an actual theory as with HPSG (see [Sag03]), the 
Optimality Theory (see [Prince03]) or Constraint 
Dependency Grammars (cf. [Maruyama90]). Our 
approach is very different: all information is rep-
resented by means of constraints; they do not 
stipulate requirements on the syntactic structure 
(as in the above cited approaches) but represent 
directly syntactic knowledge. In this approach, 
robustness is intrinsic to the formalism in the 
sense that what is built is not a structure of the 
input (for example under the form of a tree) but a 
description of its properties. The parsing mecha-
nism can then be seen as a satisfaction process 
instead of a derivational one. Moreover, it be-
comes possible, whatever the form of the input, 
to give its characterization. The technique relies 
on constraint relaxation and is controlled by 
means of a simple left-corner strategy. One of its 
interests is that, on top of its efficiency, the same 
resources and the same parsing technique is used 
whatever the input.  
 
After a presentation of the formalism and the 
parsing scheme, we describe an evaluation of the 
system for the treatment of spoken language. 
This evaluation has been done for French during 
the evaluation campaign Easy.  
 
2 Property Grammars: a constraint-
based formalism 
 
We present in this section the formalism of Prop-
erty Grammars (see [Bès99] for preliminary 
ideas, and [Blache00], [Blache05] for a presenta-
tion). The main characteristics of Property 
Grammars (noted hereafter PG), is that all infor-
mation is represented by means of constraints. 
Moreover, grammaticality does not constitute the 
core question but become a side effect of a more 
general notion called characterization: an input is 
not associated to a syntactic structure, but de-
scribed with its syntactic properties.  
 
PG makes it possible to represent syntactic in-
formation in a decentralized way and at different 
levels. Instead of using sub-trees as with classical 
generative approaches, PG specifies directly con-
straints on features, categories or set of catego-
ries, independently of the structure to which they 
are supposed to belong. This characteristic is 
fundamental in dealing with partial, underspeci-
fied or non canonical data. It is then possible to 
stipulate relations between two objects, inde-
pendently from their position in the input or into 
a structure. The description of the syntactic prop-
erties of an input can then be done very pre-
cisely, including the case of non canonical or non 
grammatical input. We give in the remaining of 
the section a brief overview of GP characteristics 
 
All syntactic information is represented in PG by 
means of constraints (also called properties). 
They stipulate different kinds of relation between 
categories such as linear precedence, imperative 
co-occurrence, dependency, repetition, etc. There 
is a limited number of types of properties. In the 
technique described here, we use the following 
ones: 
 
- Linear precedence: Det < N (a determiner 
precedes the noun) 
- Dependency: AP → N (an adjectival phrase 
depends on the noun) 
- Requirement: V[inf] ⇒ to (an infinitive 
comes with to) 
- Exclusion: seems ≠ ThatClause[subj] (the 
verb seems cannot have That clause subjects) 
- Uniqueness : UniqNP{Det}(the determiner is 
unique in a NP) 
- Obligation : ObligNP{N, Pro}(a pronoun or a 
noun is mandatory in a NP) 
 
This list can be completed according to the needs 
or the language to be parsed. In this formalism, a 
category, whatever its level is described with a 
set of properties, all of them being at the same 
level and none having to be verified before an-
other.  
 
Parsing a sentence in PG consists in verifying for 
each category the set of corresponding properties 
in the grammar. More precisely, the idea consists 
in verifying for each constituent subset its rele-
vant constraints (i.e. the one applying to the ele-
20
ments of the subset). Some of these properties 
can be satisfied, some other can be violated. The 
result of this evaluation, for a category, is a set of 
properties together with their evaluation. We call 
such set the characterization of the category. 
Such an approach makes it possible to describe 
any kind of input. 
 
Such flexibility has however a cost: parsing in 
PG is exponential (cf. [VanRullen05]). This 
complexity comes from several sources. First, 
this approach offers the possibility to consider all 
categories, independently from its corresponding 
position in the input, as possible constituent for 
another category. This makes it possible for ex-
ample to take into account long distance or non 
projective dependencies between two units. 
Moreover, parsing non canonical utterances re-
lies on the possibility of building characteriza-
tions with satisfied and violated constraints. In 
terms of implementation, a property being a con-
straint, this means the necessity to propose a 
constraint relaxation technique. Constraint re-
laxation and discontinuity are the main complex-
ity factors of the PG parsing problem. The tech-
nique describe in the next section propose to con-
trol these aspects.  
 
3 Parsing in PG 
 
Before a description of the controlled parsing 
technique proposed here, we first present the 
general parsing schemata in PG. The process 
consists in building the list of all possible sets of 
categories that are potentially constituents of a 
syntactic unit (also called constructions). A char-
acterization is built for each of this set. Insofar as 
constructions can be discontinuous, it is neces-
sary to build all possible combinations of catego-
ries, in other words, the subsets set of the catego-
ries corresponding to the input to be parsed, 
starting from the lexical categories. We call as-
signment such a subset. All assignments have 
then, theoretically, to be evaluated with respect 
to the grammar. This means, for each assign-
ment, traversing the constraint system and evalu-
ating all relevant constraints (i.e. constraints in-
volving categories belonging to the assignment). 
For some assignments, no property is relevant 
and the corresponding characterization is the 
empty set: we say in this case that the assignment 
in non productive. In other cases, the characteri-
zation is formed with all the evaluated properties, 
whatever their status (satisfied or not). At the 
first stage, all constructions contain only lexical 
categories, as in the following example: 
 
Construction Assignment Characterization 
AP  {Adv, Adj} {Adv < Adj; Adv → Adj; 
...} 
NP {Det, N} {Det < N; Det → N; N ≠ 
Pro; ...} 
  
An assignment with a productive characteriza-
tion entails the instantiation of the construction 
as a new category; added to the set of categories. 
In the previous examples, AP and NP are then 
added to the initial set of lexical categories. A 
new set of assignments is then built, including 
these new categories as possible constituents, 
making it possible to identify new constructions. 
This general mechanism can be summarized as 
follows: 
 
Initialization
∀ word at a position i:
create the set ci of its possible 
categories
 K ←  {ci | 1<i<number of words}
 S ←  set of subsets of K 
Repeat
∀ Si ∈ S
     if Si  is a productive assignment 
    add ki the characterization 
   label to K 
 S ←  set of subsets of K 
Until new characterization are built 
 
This parsing process underlines the complexity 
coming from the number of assignments to be 
taken into account: this set has to be rebuilt at 
each step (i.e. when a new construction is 
added).  
 
As explained above, each assignment has to be 
evaluated. This process comes to build a charac-
terization formed by the set of its relevant prop-
erties. A property p is relevant for an assignment 
A when A contains categories involved in the 
evaluation of p. In the case of unary properties 
constraining a category c, the relevance is di-
rectly known. In the case of n-ary properties, the 
situation is different for positive or negative 
properties. The former (e.g. cooccurrence con-
straints) concern two realized categories. In this 
case, c1 and c2 being these categories, we have 
{c1, c2} ⊂ A.  In the case of negative properties 
(e.g. cooccurrence restriction), we need to have 
either c1 ∉A or c2 ∉A.  
 
When a property is relevant for a given A, its 
satisfiability is evaluated, according to the prop-
21
erty semantics, each property being associated to 
a solver. The general process is described as fol-
lows:  
 
Let G the set of properties in the gram-
mar, let A an assignment 
∀ pi ∈ G, if pi is relevant 
 Evaluate the satisfiability of pi
  for A
 Add pi and its evaluation to the
  characterization C of A
Check whether C is productive 
 
In this process, for all assignments, all properties 
have to be checked to verify their relevance and 
eventually their satisfiability.  
 
The last aspect of this general process concerns 
the evaluation of the productivity of the charac-
terization or an assignment. A productive as-
signment makes it possible to instantiate the cor-
responding category and to consider it as real-
ized. A characterization is obviously productive 
when all properties are satisfied. But it is also 
possible to consider an assignment as productive 
when it contains violated properties. It is then 
possible to build categories, or more generally 
constructions, even for non canonical forms. In 
this case, the characterization is not entirely posi-
tive. This process has to be controlled. The basic 
control consists in deciding a threshold of vio-
lated constraints. It is also possible to be more 
precise and propose a hierarchization of the con-
straint system: some types of constraints or some 
constraints can play a more important role than 
others (cf. [Blache05b]).  
 
A controlled version of this parsing schema, im-
plemented in the experimentation described in 
the next section, takes advantage of the general 
framework, in particular in terms of robustness 
implemented as constraint relaxation. The proc-
ess is however controlled for the construction of 
the assignment. 
 
This control process relies on a left-corner strat-
egy, adapted to the PG parsing schema. This 
strategy consists in identifying whether a cate-
gory can start a new phrase. It makes it possible 
to drastically reduce the number of assignments 
and then control ambiguity. Moreover, the left 
corner suggests a construction label. The set of 
properties taken into consideration when build-
ing the characterization is then reduced to the set 
of properties corresponding to the label. These 
two controls, plus a disambiguation of the lexical 
level by means of an adapted POS tagger, render 
the parsing process very efficient.  
 
The left corner process relies on a precedence 
table, calculated for each category according to 
the precedence properties in the grammar. This 
table is built automatically in verifying for each 
category whether, according to a given construc-
tion, it can precede all the other categories. The 
process consists in verifying that the category is 
not a left member of a precedence property of the 
construction. If so, the category is said to be a 
possible left corner of the construction. The 
precedence table contains then for each category 
the label of the construction for which it can be 
left corner. 
 
During the process, when a category is a poten-
tial left corner of a construction C, we verify that 
the C is not the last construction opened by a left 
corner. If so, a new left corner is identified, and 
C is added to the set of possible constituents (us-
able by other assignments). Moreover, the char-
acterization of the assignment beginning with ci 
is built in verifying the subset of properties de-
scribing C.  
 
The generation of the assignments can also be 
controlled by means of a co-constituency table. 
This table consists for each category, in indicat-
ing all the categories with which it belongs to a 
positive property. This table is easily built with a 
simple traversal of the constraint system. Adding 
a new category ci  to an assignment A is possible 
only when ci appears as a co-constituent of a 
category belonging to A. 
 
S initial set of lexical categories 
Identification all the left corners 
For all C, construction opened by a left
  corner ci with G’ the set of
  properties describing C 
 Build assignments beginning by ci
 Build characterizations verifying G’ 
  
The parsing mechanism described here takes ad-
vantage of the robustness of PG. All kind of in-
put, whatever its form, can be parsed because if 
the possibility of relaxing constraints. Moreover, 
the control technique makes it possible to reduce 
the complexity of the process without modifying 
its philosophy. 
 
4 Evaluation  
 
22
We experimented this approach during the 
French evaluation campaign EASy (cf. 
[Paroubek05]). The test consisted in parsing sev-
eral files containing various kinds of material: 
literature, newspaper, technical texts, questions, 
e-mails and spoken language. The total size of 
this corpus is one million words. Part of this cor-
pus was annotated with morpho-syntactic (POS 
tags) and syntactic annotations. The last one pro-
vides bracketing as well as syntactic relations 
between units. The annotated part of the corpus 
represents 60,000 words and constitutes the gold 
standard.  
 
The campaign consisted for the participants to 
parse the entire corpus (without knowing what 
part of the corpus constituted the reference). The 
results of the campaign are not yet available con-
cerning the evaluation of the relations. The fig-
ures presented in this section concern constituent 
bracketing. The task consisted in identifying 
minimal non recursive constituents described by 
annotation guidelines given to the participants. 
The different categories to be built are: GA (ad-
jective group: adjective or passed participle), GN 
(nominal group: determiner, noun adjective and 
its modifiers), GP (prepositional group), GR (ad-
verb), NV (verbal nucleus: verb, clitics) and PV 
(verbal propositional group). 
 
Our system parses the entire corpus (1 million 
words) in 4 minutes on a PC. It presents then a 
very good efficiency.  
 
We have grouped the different corpora into three 
different categories: written texts (including 
newspapers, technical texts and literature), spo-
ken language (orthographic transcription of 
spontaneous speech) and e-mails. The results are 
the following: 
 
  Precision Recall F-mesure
Written texts 77.78 82.96 79.84 
Spoken lan-
guage 75.13 78.89 76.37 
E-Mails 71.86 79.06 74.42 
 
These figures show then very stable results in 
precision and recall, with only little loss of effi-
ciency for non-canonical material. When study-
ing more closely the results, some elements of 
explanation can be given. The e-mail corpus is to 
be analyzed separately: many POS tagging er-
rors, due to the specificity of this kind of input 
explain the difference. Our POS-tagger was not 
tuned for this kind of lexical material.  
 
The interpretation of the difference between writ-
ten and oral corpora can have some linguistic 
basis. The following figures give quantitative 
indications on the categories built by the parser. 
The first remark is that the repartition between 
the different categories is the same. The only 
main difference concerns the higher number of 
nucleus VP in the case of written texts. This 
seems to support the classical idea that spoken 
language seems to use more nominal construc-
tions than the written one.  
 
Constituents for Written Corpora
7847
18693
10706
4605
17599
1012
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
GA GN GP GR NV PV
Constituants
 
 
Constituents for Oral Corpora
5460
16726
9608
4992
13020
1001
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
GA GN GP GR NV PV
Constituants
 
 
The problem is that our parser encounters some 
difficulties in the identification of the NP bor-
ders. It very often also includes some material 
belonging in the grammar given during the cam-
paign to AP or VP. The higher proportion of NPs 
in spoken corpora is an element of explanation 
for the difference in the results. 
 
23
5 Conclusion 
The first results obtained during the evaluation 
campaign described in this paper are very inter-
esting. They illustrate the relevance of using 
symbolic approaches for parsing non-canonical 
material. The technique described here makes it 
possible to use the same method and the same 
resources whatever the kind of input and offers 
the possibility to do chunking as well as deep 
analysis. Moreover, such techniques, provided 
that they are implemented with some control 
mechanisms, can be very efficient: our parser 
treat more than 4,000 words per second. It con-
stitutes then an efficient tool capable of dealing 
with large amount of data. On top of this effi-
ciency, the parser has good results in terms of 
bracketing, whatever the kind of material parsed. 
This second characteristics also shows that the 
system can be used in real life applications. 
 
In terms of theoretical results, such experimenta-
tion shows the interest of using constraints. First, 
they makes it possible to represent very fine-
level information and offers a variety of control 
mechanisms, relying for example on the possibil-
ity of weighting them. Moreover, constraint re-
laxation techniques offer the possibility of build-
ing categories violating part of syntactic descrip-
tion of the grammar. They are then particularly 
well adapted to the treatment of non canonical 
texts. The formalism of Property Grammars be-
ing a fully constraint-based approach, it consti-
tutes an efficient solution for the description of 
any kind of inputs. 

References
[Abney 96] Abney S. (1996) “Partial Parsing via Fi-
nite-State Calculus”, in proceedings of ESSLLI'96 
Robust Parsing Workshop 
[Bès99] Bès G. (1999) “La phrase verbale noyau en 
français”, in Recherches sur le français parlé, 15, 
Université de Provence. 
[Blache00] Blache P. (2000) “Constraints, Linguistic 
Theories and Natural Language Processing”, in 
Natural Language Processing, D. Christodoulakis 
(ed), LNAI 1835, Springer-Verlag 
[Blache05a] Blache P. (2005) “Property Grammars: A 
Fully Constraint-Based Theory”, in Constraint 
Solving and Language Processing, H. Christiansen 
& al. (eds), LNAI 3438, Springer 
[Boullier 05] Boullier P. & B. Sagot (2005) “Efficient 
and robust LFG parsing: SxLfg”, in Proceedings of 
IWPT '05. 
[Crysmann02] Crysmann B. A. Frank, B. Kiefer, S. 
Müller, G. Neumann, J. Piskorski, U. Schäfer, M. 
Siegel, H. Uszkoreit, F. Xu, M. Becker & H. 
Krieger (2002) “An Integrated Architecture for 
Shallow and Deep Processing”, in proceedings of 
ACL-02. 
[Frank03] Frank A., M. Becker, B. Crysmann, B. 
Kiefer & U. Schäfer (2003) “Integrated Shallow 
and Deep Parsing: TopP meets HPSG”, in proceed-
ings of ACL-03. 
 [Hindle83] Hindle D. (1983) User manual for Fid-
ditch, a deterministic parser, Technical memoran-
dum 7590-142, Naval Research Laboratory.  
[Johnson98] Johnson M. (1998) “PCFG Models of 
Linguistic Tree Representations'”, in Computa-
tional Linguistics, 24:4. 
[Karlsson90] Karlsson F. (1990) “Constraint grammar 
as a framework for parsing running texts”, in pro-
ceedings of ACL-90. 
[Marimon02] Marimon M. (2002) “Integrating Shal-
low Linguistic Processing into a Unification-Based 
Spanish Grammar”, in proceedings of COLING-02. 
[Maruyama90] Maruyama H. (1990) “Structural Dis-
ambiguation with Constraint Propagation'”, in pro-
ceedings of ACL'90. 
 [Paroubek05] Paroubek P., L. Pouillot, I. Robba & A. 
Vilnat (2005) “EASy : campagne d’évaluation  des 
analyseurs syntaxiques”, in proceedings of the 
workshop EASy, TALN-2005. 
[Prince93] Prince A. & Smolensky P. (1993) “Opti-
mality Theory: Constraint Interaction in Generative 
Grammars”, Technical Report RUCCS TR-2, Rut-
gers Center for Cognitive Science. 
[Tjong Kim Sang00] Tjong Kim Sang E. & S 
Buchholz (2000) “Introduction do the CoNLL-
2000 Shared Task: Chunking”, in proceedings of 
CoNLL-2000. 
[Uszkoreit02] Uszkoreit H. (2002) “New Chances for 
Deep Linguistic Processing”, in proceedings of 
COLING-02. 
[VanRullen05] Van Rullen T. (2005), Vers une ana-
lyse syntaxique à granularité variable, PhD Thesis, 
Université de Provence. 
