Subject and Object Dependency Extraction Using Finite-State Transducers 
Salah Ait-Mokhtar, Jean-Pierre Chanod 
Rank Xerox Research Centre, 6 Chemin de Maupertuis 
F-38240 Meylan, France 
{Ait,Chanod}@grenoble.rxrc.xerox.com 
Abstract 
We describe and evaluate an approach for fast 
automatic recognition and extraction of subject 
and object dependency relations from large 
French corpora, using a sequence of finite-state 
transducers. The extraction is performed in two 
major steps: incremental finite-state parsing and 
extraction of subject/verb and object/verb rela- 
tions. Our incremental and cautious approach 
during the first phase allows the system to deal 
successfully with complex phenomena such as 
embeddings, coordination of VPs and NPs or 
non-standard word order. The extraction re- 
quires no subcategorisation information. It relies 
on POS information only. After describing the 
two steps, we give the results of an evaluation 
on various types of unrestricted corpora. Preci- 
sion is around 90-97% for subjects (84-88% for 
objects) and recall around 86-92% for subjects 
(80-90% for objects). We also provide some er- 
ror analysis; in particular, we evaluate the im- 
pact of POS tagging errors on subject/object de- 
pendency extraction. 
1 Introduction 
Dependency extraction from large corpora is mainly 
used in two major directions: automatic acquisition of 
lexical patterns \[Brent, 1991; Grishman and Sterling, 
1992; Briscoe and Carrol, 1994; Sanfilippo, 1994\] ~ or 
for end-user applications such as document indexing or 
information retrieval \[Grefenstette, 1994\]. 
We describe and evaluate an approach for fast auto- 
matic recognition and extraction of subject and object 
dependency relations from large French corpora, using a 
sequence of finite-state transducers. The extraction is 
based on robust shallow parsing. We extract syntactic 
relations without producing complete parse trees in the 
1 See also the SPARKLE European project, 
http.//www.dc.p~ cnr.it/sparkle.html. 
traditional sense. The extraction requires no subcategori- 
sation information. It relies on POS information only. 
The extraction is performed in two major steps: 
1. incremental finite-state parsing annotates the 
input string with syntactic markings; 
2. the annotated string is transduced in order to 
extract subject/verb and object/verb relations. 
Our incremental and cautious approach during the first 
phase allows the system to deal successfully with com- 
plex phenomena such as embeddings, coordination of 
VPs and NPs or non-standard word order. 
We evaluated subject and object dependency extrac- 
tion on various types of unrestricted corpora. Precision 
is around 90-97% for subjects (84-88% for objects) and 
recall around 86-92% for subjects (80-90% for objects). 
The paper also provides some error analysis; in par- 
ticular, we evaluate the impact of POS tagging errors on 
the extraction process. 
2 Incremental Finite-State Parsing 
\[AB-Mekhtar and Chanod, 1997\] fully describes the 
incremental finite-state parser. Unlike previous work in 
this area \[Abney, 1991; Roche, 1993; Appelt et al., 
1993; Koskenniemi et al, 1992; Voutilainen and Ta- 
panainen, 1993; Chanod and Tapanainen, 1996\] the 
parser combines constructivist and reductionist ap- 
proaches, in order to maximise efficiency and robust- 
ness. 
Before the parsing proper, we perform tokenisation, 
iexical lookup and part-of-speech disarnbiguation, using 
the French Xerox tagger \[Chanod and Tapanainen, 
1995\]. The input to the parser is a tagged string repre- 
sented by a sequence of word-form + tag pairs of the 
type: 
le bon vin (the good wine) 
<Ie+DET-SG> 
<bon+ADJ-SG> 
<vin+NOUN-SG> 
71 
The parser output is a shallow parse where phrasal and 
clausal constructs are bracketed and more or less richly 
annotated as in: 
Jean aime le ben vin (lit.: John likes the 
good wine) 
\[VC \[NP Jean NP\]/SUBJ :v airne v: VC\] 
\[NP le \[AP bon AP\] vin NP\]/OBJ 
The parser consists of a sequence of transducers. 
These transducers are compiled from regular expressions 
that mainly involve the longest match replace operator 2 
\[Karttunen, 1996\]. Each of these transducers adds syn- 
tactic information represented by reserved symbols 
(annotations), such as segment names, boundaries and 
tags for syntactic functions. At any given stage of the 
sequence, the input and the output represent the whole 
sentence as an annotated string. The output of a trans- 
ducer is the input to the next transducer. 
The parsing process is incremental in the sense that 
the linguistic description attached to a given transducer 
in the sequence: 
- adds information relying on the preceding sequence 
of transductions; 
- applies only to some instances of a given linguistic 
phenomenon; 
- may be revised at a later stage. 
Each step defines a syntactic construction using two 
major operations: segmentation and syntactic marking. 
Segmentation consists in bracketing and labeling adja- 
cent elements belonging to a same partial construction 
(e.g. a nominal or a verbal phrase, or a more partial 
syntactic chain if necessary). Segmentation includes also 
the identification of clause boundaries. Syntactic mark- 
ing annotates segments with syntactic functions (e.g. 
subject, object, PPObj). 
The two operations of segmentation and syntactic 
marking are performed throughout the cascade in an 
interrelated fashion. Some segmentations depend on 
previous syntactic marking and vice versa. 
If the constraints encoded by a given transducer do 
not hold, the string remains unchanged. This ensures that 
there is always an output string at the end of the cascade, 
with possibly under-specified segments. 
The additional information provided at each stage in 
the sequence is instrumental in the definition of the later 
stages of the cascade. Networks are ordered in such a 
way that the easiest tasks are addressed first. They can 
be performed accurately with less background informa- 
tion. 
2.1 Primary Segmentation 
A segment (or chunk) is a continuous sequence of words 
that are syntactically linked to each other or to a gov- 
erning head (see \[Federici et al., 1996; A~t-Mokhtar and 
Chanod, 1997\] for a more detailed description). 
Segments are marked using regular expressions such as 
the following one for NPs: 
\[TBeginNP ~$\[TEndNP\] TEndNP \] 
@-> 
\[NP ... NP\] 
where the longest-match left-to-right Replace Operator 
(noted @->) inserts \[NP and NP\] boundaries around the 
longest sequence that begins with a potential NP start 
(TBeginNP) and ends with the first NP end (TEndNP) to 
the right. 
In the primary segmentation step, we mark segment 
boundaries within sentences as shown below. Here NP 
stands for Noun Phrase, PP for Preposition Phrase and 
VC for Verb Chunk (a VC contains at least one verb and 
possibly some of its arguments and modifiers). 
Example: 
Nc 
\[VC Lorsqu' \[NP on NP\] tourne VC\] 
\[NP le commutateur NP\] 
\[PP de d6marrage PP\] 
\[PP sur la position PP\] \[AP auxiliaire AP\] , 
\[NP r aiguille NP\] retoume 
vc\] 
alors \[PP ~t z6ro PP\] ./SENT 3 
All the words within a segment, except the head, are 
linked to words in the same segment at the same level. 
The main purpose of marking segments is therefore to 
constrain the linguistic space that determines the syn- 
tactic function of a word. 
As one can see from the example above, segmentation 
is very cautious. Structural ambiguity inherent to modi- 
fiers attachment, verb arguments and coordination is not 
resolved at this stage. 
2.2 Marking Syntactic Functions 
Syntactic functions within nonrecursive segments (AP, 
NP and pp4) are handled first because they are easier to 
mark. The other functions within verb segments and at 
sentence level (subject, object, verb modifier, etc.) are 
considered later. The string marked with segmentation 
and syntactic annotations is the input to the extraction 
component described below. 
2 In brmf, the expression: A @-> B \]1 Left _ Rtght m&cates 
that the longest string that matches the regular expressmn A is 
replaced by B, if it appears in the context defined by Left and 
Right. 
3 In English: Turning the starter switch to the auxdlary posi- 
tron, the pointer will then return to zero. 
4 As a consequence of the cautious segmentation such seg- 
ments are always nonrecursive. 
72 
Extracting dependency relations from annotated 
strings is not straightforward, as the annotations do not 
explicitly relate arguments to their governing heads. For 
instance, shared arguments of coordinated verbs or de- 
pendencies accross embedded structures need to be re- 
solved during the extraction phase, as illustrated in the 
example belowS: 
La ville de Lattes rejette toujours la proposi- 
tion de remonter le niveau du fleuve pour fa- 
ciliter la circulation des bateaux et n'exclut 
pus a priori l'idee d'instaurer un p6age. 
The parser outputs is: 
\[VC \[NP La ville NP\]/SUBJ \[PP de Lattes PP\] 
:v rejette v: VC\] toujours 
\[NP la proposition NP\]/OBJ 
\[VC de remonter VC\] \[NP le niveau NP\]/OBJ 
\[PP du fleuve PP\] \[VC pour faciliter VC\] 
\[NP la circulation NP\]/OBJ 
\[PP des bateaux PP\] 
\[VC et :v n' exclut v: VC\] pasa priori 
\[NP I' id6e NP\]/OBJ \[VC d' instaurer VC\] 
\[NP un p6age NP\]/OBJ. 
As objects are marked outside the verb chunks (VC), 
the extraction phase must still expand the VCs in order 
to identify object dependencies. Moreover, the extrac- 
tion must identify la ville as the shared subject of the 
coordinated verbs rejette and exclut, which are separated 
by two infinitival clauses. After the preliminary parsing 
stage, the two coordinated verbs still belong to two dif- 
ferent VCs. 
3 Subject and Object Recognition and 
Extraction 
The task consists in recognizing the subject and object 
segments and extracting them along with their respective 
verbs. Function tagging is performed during the shallow 
parsing process; then a special transducer recognizes 
subject/verb and object/verb dependencies (i.e. it finds 
out which is the subject or object of which verb) and 
extracts them. 
3.1 Subject and object tagging 
Potential subjects are marked first. An NP is a potential 
subject if and only if it is followed by a finite verb and it 
satisfies some typographical conditions (it should not be 
separated from the verb with only one comma, etc.). This 
prevents the NP Jacques Boutet for instance from 
being marked as a subject in the sentence below: 
5 In Enghsh The city of Lattes still rejects the proposal to 
raise the level of the river to facdttate the boat traffic and 
does not exclude a priori the idea of imposing a toll 
\[VC \[NP le pr6sident NP\]/SUBJ 
\[PP du CSA PP\], \[NP Jacques Boutet NP\], 
a d6cid6 VC\] 
\[VC de publier VC\] 
\[NP la profession NP\]/OBJ 
\[PP de foi PP\]./SENT 
If this type of subject does not exist, we look for in- 
verted subjects under the same typographical constraints. 
Other constraints are applied later to eliminate some 
of the potential subject hypotheses. These constraints are 
mainly syntactic: 
1. A potential subject is not a subject if it has no de- 
terminer, unless it is a proper noun or it is a coordinated 
common noun. 
2. A potential subject is not a subject if it immediately 
follows a PP or an NP (i.e. with no connector in be- 
tween) and is preceded or followed by another potential 
subject. 
3. A potential subject is not a subject if it is followed 
by another potential subject with no coordination. 
The remaining potential subjects are taken to be actual 
subjects. The whole process of tagging and correcting 
subject tags consists of a sequence of replace expres- 
sions. 
Once the subjects are tagged, another transducer per- 
forms object tagging with similar steps and constraints 
but now only non subject NPs are considered. 
3.2 Embeddings and regular expressions 
To make this approach work properly, we must be very 
careful to take into account embedded contexts 
(embedded clauses, text within parentheses, and so on). 
For instance, subject constraint 3 above should not apply 
if the two potential subjects are not at the same level, 
like the subjects \[NP l'usine NP\] and \[NP le ministre 
NP\] in the following sentence: 
\[VC \[NP L' usine NP\]/SUBJ, 
\[VC que \[NP le ministre NP\]/SUBJ :v devrait 
v: VC\] 
\[VC implanter VC\] \[PP ~t Eloyes PP\], 
:v repr6sente v: VC\] \[NP un investissement 
NP\]/OBJ 
\[PP d'environ 148 milliards NP\]/N 
\[PP de francs PP\] ./SENT 6 
This also applies throughout the dependency extrac- 
tion process as described in the next section. 
In order to handle this difficulty properly with the fi- 
nite state calculus, we take advantage of the verb seg- 
menting marks produced by the shallow parser and de- 
fine a maximal embedding level: 
6The factory which the minister should establish at Eloyes 
represents approximately a 148 bdhon francs investment 
73 
level0 = ~$\[BeginVC I EndVC\] 
level1 = \[ level0 I \[BeginVC level0 EndVC\] \]* 
curlev = \[ level1 \[ \[BeginVC levell EndVC\] \]* 
The regular expression levelO matches any string 
which does not contain any BeginVC or EndVC marks, 
i.e, an embedded clause, level1 matches any string that 
contains strings matching levelO or entire embedded 
clauses that contains only levelO matching strings. 
Therefore, curlev (which stands for Current Level) 
matches strings that either do not contain any embedded 
clauses, or may contain entire embedded clauses to a 
maximal depth of 2 embedded levels. 
We can easily extend the definition of curlev to handle 
punctuated embeddings such as texts within parentheses 
or hyphens. We define Bs as the beginning of such em- 
beddings (either an opening parenthesis or a hyphen) 
and Es as the ending (a closing parenthesis or another 
hyphen): 
Bs = \["(" I \] 
Es = \[")" I \] 
and redefine the curler expression as: 
level0 = ~$\[BeginVC \[ EndVC I Bs\[ Es\] 
level1 = \[ level0 I \[BeginVC level0 EndVC\] 
I \[ Bs level0 Es\] \]* 
curlev = \[ levell \] \[BeginVC levell EndVC\] 
l \[ Bs level1 Es\] \]* 
Given these definitions, we can control the linguistic 
space in which rules and constraints apply. For instance, 
Subject constraint 3 can be written this way: 
TSUBJ -> \[\] II - 
\[curlev 
\\[COORD \[ BeginVC \[ EndVC i Bs \[Es\] 
BeginNP ~$EndNP EndNP TSUBJ 
~\[ curlev TSUBJ curlev\] Beginv \] 
which means: remove a potential subject tag (TSUBJ) 
whenever there is another potential subject on the right 
which is not preceded by a coordination. The other po- 
tential subject should be the last one before the finite 
verb. All these considerations are stated over the same 
sentence level (expressed as curlev) so that embeddings 
relative to a given level and what they contain are not 
concerned at all. 
3.3 Extraction of Dependency Relations 
Once the subjects and objects are tagged, we get a shal- 
low parse of the sentence where phrasal units and func- 
tion tags appears, as shown in the example below: 
\[VC \[NP Le pr6sident NP\]/SUBJ \[PP du CSA 
PP\], \[NP Jacques Boutet NP\] , a d6cid6 VC\] 
\[VC de publier VC\] \[NP la profession 
NP\]/OBJ \[PP de foi PP\] ./SENT 
One nontrivial task consists in associating the subjects 
and objects with the right verb. There are two main dif- 
ficulties: with subjects and objects of embedded sen- 
tences, and with coordinated verbs and shared subjects 
and objects. 
To find out which is the subject/object of a given 
verb, only the Nps tagged as subjects/objects that are in 
the same level should be considered. This is necessary in 
order to avoid overgeneration of dependencies, that is, 
extracting the subject of an embedded sentence as the 
subject of the main verb or of another embedded sen- 
tence. The definition of curlev is very useful for such 
task. For instance, the transducer obtained from the fol- 
lowing regular expression7: 
\[ 
\[ 0 .x. \["SUBJ:'\] \] 
\[\[7"I .x. o I 
\[ BeginNP ~$\[EndNP\] EndNP\] \[SUBJ : ".-)" \] 
\[ curlev .x. 0 \] 
\[Beginv .x. BeginVC\] \[~$Endv\] \[Endv .x. 
EndVC\] 
.x. O l \] 
can be applied on the previous sentence to extract nor- 
mal subject/verb dependencies: 
SUBJ: \[NP le president NP\] --> 
\[VC a d6cid6 VC\] 
Other similar expressions are written for the extraction 
of inverted subjects, shared subjects and objects. All the 
extracting expressions are joined using the Union op- 
erator \] and compiled into a single transducer which 
takes as input shallow parses and produces subject/verb 
and object/verb dependencies. 
3.4 Samples 
Below are some samples from the output of the extract- 
ing transducer. 
7 This expression applies to the parsed sentence. It leaves un- 
changed any NP marked as SUBJ and the corresponding VC 
found at the same level. All other segments are erased and 
relational markings (the "SUB J:" tag and the arrow 
"~'°) are inserted. 
74 
Sentence 1: A l'interpr6tation des sentiments pr6sidentiels 
s'ajoute l'atmosph6re de surench6re politique qui pr6c6de 
tout congr6s du Parti socialiste. 
Shallow parse: 
\[VC \[PP A I' interpr6tation PP\] 
\[PP des sentiments PP\] lAP pr6sidentiels AP\] 
:v s' ajoute v: VC\] 
\[NP l'atmosphare NP\]/<SUBJ 
\[PP de surenchare PP\] \[AP politique AP\] 
\[VC \[NP qui NP\]/SUBJ :v pr6cade v: VC\] tout 
\[NP congr6s NP\]/OBJ \[PP du Patti PP\] \[AP 
socialiste AP\] . 
Dependencies: 
SUBJ: \[VC s' ajoute VC\] 
\[NP r atmosph6re NP\] 
SUBJ: \[NP qui NP\] ---> \[VC pr6cade VC\] 
OBJ: \[VC pr6cade VC\] ~-- \[NP congas NP\] 
Sentence 2: En relanfant ainsi d61ib~r6ment ragitation sur 
des enjeux de la gr~ve de septembre 1988, Mr Guilhaume 
met en difficult6 le gouvernement et force les autres PDG 
de raudiovisuel public it choisir leur camp. 
Shallow parse: 
\[VC \[VC En relan~ant VC\] ainsi dalib~r6ment 
\[NP 1' agitation NP\]/OBJ 
\[PP sur des enjeux PP\] \[PP de la grave PP\] 
\[PP de septembre PP\] \[NP 1988 NP\]/N, 
\[NP Mr Guilhaume NP\]/SUBJ :v met v: VC\] 
\[PP en difficult6 PP\] 
\[NP le gouvernement NP\]/OBJ 
\[VC et :v force v: VC\] 
\[NP les \[AP autres AP\] PDG NP\]/OBJ 
\[PP de l'audiovisuel PP\] lAP public AP\] 
\[VC & choisir VC\] \[NP leur camp NP\]/OBJ . 
Dependencies: 
SUB.I: \[NP Mr Guilhaume NP\] ~ \[VC met VC\] 
SUB J: \[NP Mr Guilhaume NP\] ---> \[VC force VC\] 
OBJ: \[VC En relan~ant VC\] ~ \[NP 1' agitation NP\] 
OBJ: \[VC met VC\] ,- \[NP le gouvernement NP\] 
OBJ: \[VC force VC\] 
\[NP les lAP autres AP\] PDG NP\] 
OBJ: \[VC ~t choisir VC\] ~ \[NP leur camp NP\] 
Sentence 3: Mais, d6j/t, l'id6e que ceux-ci puissent avoir la 
responsabilit~ d' un mus6e class6 (6tablissement d6pendant 
juridiquement d'une ville ou d' un d6partement, mais dont 
les responsables sont nomm6s par l'Etat) provoque une 
certaine 6motion chez les nationaux (voir le Monde du 23 
novembre 1989). 
Shallow parse: 
\[VC Mais, d6jtt, \[NP l'id6e NP\]/SUBJ 
\[VC que \[NP ceux-ci NP\]/SUBJ 
:v puissent v: VC\] 
\[VC avoir VC\] \[NP la responsabilit6 NP\]/OBJ 
\[PP d' un mus6e PP\] \[AP class6 AP\] 
( \[NP 6tablissement NP\]/N 
lAP d6pendant AP\] juridiquement 
\[PP d' une viUe PP\] 
ou \[PP d' un d6partement PP\], mais 
\[VC dont \[NP les responsables NP\]/SUBJ 
:v sont nomm6s v: VC\] \[PP par r Etat PP\] ) 
:v provoque v: VC\] 
\[NP une \[AP certaine AP\] 6rnotion NP\]/OBJ 
\[PP chez les nationaux PP\] 
( \[VC voir VC\] \[NP le Monde NP\]/OBJ 
\[PP du 23 novembre PP\] \[NP 1989 NP\]/N ) . 
Dependencies: 
SUBJ: \[NP 1' id6e NP\] ---> \[VC provoque VC\] 
SUB J: \[NP ceux-ci NP\] ~ \[VC puissent VC\] 
SUB J: \[NP les responsables NP\] .-.q, 
\[VC sont nomm~s VC\] 
OBJ: 
OBJ: 
OBJ: 
\[VC avoir VC\] ~ \[NP la responsabilit6 NP\] 
\[VC provoque VC\] 
\[NP une lAP certaine AP\] 6motion NP\] 
\[VC voir VC\] ~-- \[NP le Monde NP\] 
4 Evaluation 
The 15 networks of the parser and extractor require 
about 750 KBytes of disk space. The speed of analysis is 
around 150 words per second on a SPARCstation 10, 
including preprocessing (tokenisation, lexical lookup, 
POS tagging), parsing and dependency extraction s. 
We evaluated the extraction on three different types of 
corpus: newspaper Le Monde (187 sentences, 4560 
words, average 24.3 words/sentence), financial reports 
(524 sentences, 12551 words, average 24 
words/sentence) and technical documentation (294 sen- 
tences, 5300 words, average 18 words/sentence). The 
sentences were selected randomly; they are independent 
from the corpus used for grammar development. 
The evaluation concentrated on recognition of surface 
nominal and pronominal subjects and on nominal direct 
object. Relations are counted as correct only if both the 
governing head and its argument are correct (for in- 
stance, we counted the subject/Verb relation as errone- 
ous if the verb group was not fully correct, even if the 
subject was). 
Results are given in tables 1 and 2. 
8 Future work includes optimismg the implementauon to 
speed up the process. 
75 
Corpus 
Newspa- 
per 
Fmancml 
reports 
Techmc 
Doc 
Actual 
Subj. 
350 
452 
275 
Taggedas Correct 
Su~ect Sub Tag 
324 306 
430 418 
263 238 
Preci- 
sion % 
94 4 
97.2 
90.5 
Recall 
% 
87 4 
92.5 
86 5 
Table 1: SubJect recognmon on three different corpora 
Preci- ' Recall Corpus Tagged as Correct 
Objects Object Obj Tag slon % % 
Newspa- 214 198 171 86 3 79.9 
per 
Financial 135 160 124 84.4 91 8 
reports 
Techmc. 337 311 275 88.4 81 6 
Doc 
Table 2: Object recognmon on three different corpora 
Actual 
4.1 Error analysis 
We conducted a detailed error analysis on the newspaper 
corpus. The major source of subject/verb and object/verb 
errors was identified for each sentence. If the same 
source of error accounts for more than one error in a 
given sentence, it is counted once. There may be also 
more than one source of error for the same sentence. We 
identified 81 sources of errors, distributed across the 68 
sentences that included at least 1 error (table 3). 
Source of errors Number of 
occurrences 
errors due to tagger 20 
errors due to coordination 9 
errors due to apposition or NP enumeration 9 
errors due to time expressions 7 
errors due to pronouns 7 
errors due to mctse 2 
errors due to subject inversion (not incise) 3 
messed input (typos, missing blanks) 3 
numerical expressions 3 
punctuation 3 
NP adverbials (other than time expressions) 2 
errors due to missing determiners 2 
errors due to frozen expressions 2 
errors due to clause boundaries 2 
errors due to negation (e.g. pas moins de) 2 
predicate of non-finite verbs 2 
bugs 1 
emphatic constructions 1 
book title (as object) 1 
Table 3. Error Analysis 
Among the errors, some can be corrected easily with 
limited additional work. For instance, the tested version 
of the parser did not include preprocessing for time ad- 
verbials or numerical expressions. The tagset associated 
with the POS tagger did not provide enough distinctions 
between different types of pronouns 9. 
The most significant errors are due to long NP se- 
quences (coordination, enumeration, apposition) and to 
the POS tagger. NP sequences are challenging, as it is 
difficult for the parser to distinguish between apposi- 
tions, enumeration and coordinated NPs. Various pa- 
rameters are to be taken into account (determiners, 
proper nouns, punctuation, semantic relations, etc.) 
many of which go beyond the scope of a shallow (or 
even non shallow) syntactic parser. 
4.2 Errors due to POS tagger 
There is little to be found in the literature on the impact 
of POS tagging errors on syntactic analysis. Out of 187 
sentences in our newspaper corpus, 20 had subject or 
object errors due primarily to tagging errors. This does 
not mean that other sentences did not have tagging er- 
rors, but such errors had no impact on the assignment of 
subjects and objects (neither on the identification of 
head verbs). This is encouraging, because taggers with 
97% acccuracy at word level may have a limited accu- 
racy at sentence level (70%). 
Table 4 shows the extraction results from sentences 
where errors due to the POS tagger had no impact. The 
figures between parentheses refer to the results obtained 
on the whole corpus, i.e. including tagging errors. 
Corpus 
Subjects 
Objects 
Table 4. Sub: 
errors 
Actual Tagged Correct Precl- Recall 
as Tag slon % % 
296 278 266 95.6 89 8 
(94 4) (87 4) 
177 175 152 86.8 85 8 
(86 3) (79 9) 
ect and object recognition without POS tagging 
The improvement is mostly for recall of objects. This 
is due to the French specific ambiguity for the words de, 
du, des which can be indefinite determiners or preposi- 
tion+determiner sequences. The tagger does not do a 
good job at distinguishing between the two tags, espe- 
cially in postverbal position. This has an obvious impact 
on the dependency analysis, as segments mistagged as 
PPs cannot be identified as subjects or objects. 
9 This project started around mid-96; some of the obvious 
improvements are already part of a recently updated version. 
76 
5 Conclusion 
We presented a finite-state approach to subject and ob- 
ject dependency extraction from unrestricted French 
corpora. The method involves incremental finite-state 
parsing and transductions for dependency extraction, 
using only part-of-speech information. The evaluation 
on different corpora (22,000 words) shows high accu- 
racy and precision. 
The method is being expanded to other dependency 
relations and to other languages. 
The extracted dependencies can be used to automati- 
cally build subcategorisation frames or semantic restric- 
tions for verb arguments, which, in turn, can be reused 
in further steps of incremental parsing. They can also be 
used for knowledge extraction from large corpora and 
for document indexing. 
Acknowledgments: 
We are grateful to Annie Zaenen and Lauri Karttunen for 
their editorial advices. 

References 
\[Abney, 1991 \] Steve Abney. Parsing by chunks. Princtpled- 
BasedParsmg, R. Berwick, S. Abney, and C. Tenny, edi- 
tors. Kluwer Academic Publishers, Dordrecht, 1991. 
\[AR-Mokhtar and Chanod, 1997\] Salah AR-Mokhtar and 
Jean-Pierre Chanod. Incremental Finite-State Parsing. 
Proceedings of ANLP-97, Washington, 1997. 
\[Appelt et al, 1993\] Douglas E. Appelt, Jerry R. Hobbs, 
John Bear, David Israel, and Mabry Tyson. FASTUS: A 
Finite-State Processor for Information Extraction from 
Real-World Text. 
Proceedings 1JCA1-93, Charnbery, France, August 1993. 
\[Brent, 1991\] Michael Brent. Automatic Acquisition of 
subcategorization frames from untagged texts. Proceedmgs 
of the 29th Annual Meetmg of the Association 
for Computattonal Lmgutsttcs. Berkeley, CA, 1991. 
Briscoe and Carroll, 1994\] Edward Briscoe and John Car- 
roll. Towards Automatic Extraction of Argument Structure 
from Corpora. Technical Report MLTT-006, Rank Xerox 
Research Centre, Meylan, 1994. 
\[Chanod and Tapanainen, 1995\] Jean-Pierre Chanod and 
Pasi Tapanainen. Tagging French - comparing a statistical 
and a constraint-based method. Proceedings of the Seventh 
Conference of the European Chapter of the Association for 
Computat:onal Linguistics, pages 149-- 156, Dublin, 1995. 
\[Chanod and Tapanainen, 1996\] Jean-Pierre Chanod and 
Pasi Tapanainen. A Robust Finite-State Parser for French. 
Proceedmgs of ESSLLI'96 Workshop on Robust Parsing 
Prague, 1996. 
\[Federici etal, 1996\] Stefano Federici, Simonetta Monte- 
magni and Vito Pirrelli. Shallow Parsing and Text Chunk- 
ing: a View on Underspeeification in Syntax. Proceedings 
SLLI'96 Workshop on Robust Parsing, Prague, 1996. 
\[Grefenstette, 1994\] Gregory Grefenstette. FExplorattons m 
Automat:e Thesaurus Dtscovery. Kluwer Academic Press, 
Boston, 1994. 
\[Grishman and Sterling, 1992\] Ralph Grishman and John 
Sterling. Acquisition of Selectional Patterns. Proceedings of 
Cohng-92 Nantes, 1992. 
\[Karttunen, 1996\] Lauri Karttunen. Directed replacement. 
Proceedings of the 34th Annual Meeting of the Assoczatzon 
for Computatwnal Lmgulsttcs, Santa Cruz, USA, ssociation 
for Computational Linguistics. 1996. 
\[Koskenniemi et al., 1992\] Kimmo Koskenniemi, Pasi Ta- 
panainen, and Atro Voutilainen. Compiling and using finite- 
state syntactic rules. Proceedings of the Fourteenth Interna- 
nonal Conference on Computatzonal Lmgulstzcs COLING- 
92, Nantes, 1992. 
\[Roche, 1993\] Emmanuel Roche. Analyse syntaxique trans- 
formationnelle du francais par transducteurs et lexique- 
grammaire, Ph.D. dissertation, Universit6 de 
Paris 7, 1993. 
\[Sanfilippo, 1994\] Antonio Sanfilippo. Word Knowledge 
Acquisition, Lexicon Construction and Dictionary Compi- 
lation. Proceedings ofCohng-94, Kyoto, 1994. 
\[Voutilainen and Tapanainen, 1993\] Atro Voutilainen and 
Pasi Tapanainen. Ambiguity resolution in a reductionistic 
parser. Proceedings of the Sixth Conference of the European 
Chapter of the Assoetatton for Computatwnal Lmgulstws, 
Utrecht, 1993. 
