Projecting POS tags and syntactic dependencies from English and French
to Polish in aligned corpora
Sylwia Ozdowska
ERSS - CNRS & Université Toulouse-le Mirail
Maison de la Recherche
5 allées Antonio Machado
F-31058 Toulouse Cedex 9
ozdowska@univ-tlse2.fr
Abstract
This paper presents the first step towards projecting POS tags and dependencies from English and French to Polish in aligned corpora. Both the English and French parts of the corpus are analysed with a POS tagger and a robust parser. The English/Polish bi-text and the French/Polish bi-text are then aligned at the word level with the GIZA++ package. The intersection of IBM-4 Viterbi alignments for both translation directions is used to project the annotations from English and French to Polish. The results show that the precision of direct projection varies according to the type of induced annotations as well as the source language. Moreover, the performances are likely to be improved by defining regular conversion rules among POS tags and dependencies.
1 Introduction
A clear imbalance may be observed between languages such as English or French, for which a number of NLP tools as well as different linguistic resources exist (Leech, 1997), and those for which they are sparse or even absent, such as Polish. One possible option to enrich resource-poor languages consists in taking advantage of resource-rich/resource-poor language aligned corpora to induce linguistic information for the resource-poor side from the resource-rich side (Yarowsky et al., 2001; Borin, 2002; Hwa et al., 2002). For Polish, this has been made possible by its accession to the European Union (EU), which has resulted in the construction of a large multilingual corpus of EU legislative texts and a growing interest in the languages of the new Member States.
This paper presents a direct projection of various kinds of morpho-syntactic information from English and French to Polish. First, a short survey of related works is made in order to motivate the issues addressed in this study. Then, the principle of annotation projection is explained and the framework of the experiment is described (corpus, POS tagging and parsing, word alignment). The results of applying the annotation projection principle from two different source languages are finally presented and discussed.
2 Background
Yarowsky, Ngai and Wicentowski (2001) have used annotation projection from English in order to induce statistical NLP tools, for instance for Chinese, Czech, Spanish and French. Different kinds of analysis were produced (POS tagging, noun phrase bracketing, named entity tagging and inflectional morphological analysis) and relied on to train statistical tools for each task. The authors report that training makes it possible to overcome the problem of erroneous and incomplete word alignment, thus improving accuracy compared with direct projection: 96% for core POS tags in French.
The study proposed by Hwa, Resnik, Weinberg and Kolak (2002) aims at quantifying the degree to which syntactic dependencies are preserved in English/Chinese aligned corpora. Syntactic relationships are projected to Chinese either directly or using elementary transformation rules, which leads to 68% precision and about 66% recall.
Finally, Borin (2002) has tested the projection of major POS tags and associated grammatical information (number, case, person, etc.) from Swedish to German. 95% precision has been obtained for major POS tags [1], whereas associated grammatical information has turned out not to be applicable across the studied languages. A rough comparison has been made between Swedish, German and additional languages (Polish, English and Finnish). It tends to show that it should be possible to derive indirect yet regular POS correspondences, at least across fairly similar languages.
The projection from French and English to Polish presented in this paper is basically a direct one. It concerns different kinds of linguistic information: POS tags and associated grammatical information as well as syntactic dependencies. In view of the works mentioned above, uneven results are expected depending on the type of annotations induced. This is the first point this study considers. The second one is to identify regularities in rendering some French or English POS tags or dependencies with some Polish ones. Finally, the idea is to test whether the results vary significantly with respect to the source language used for the induction.
3 Projecting morpho-syntactic annotations
We take as the starting point of annotation projection the direct correspondence assumption as formulated in (Hwa et al., 2002): “for two sentences in parallel translation, the syntactic relationships in one language directly map the syntactic relationships in the other”, and extend it to POS tags as well. The general principle of annotation projection in aligned corpora may be explained as follows:
    if two words w1 and w2 are translation equivalents within aligned sentences, the morpho-syntactic information associated with w1 is assigned to w2
In this study, the projected annotations are POS tags, with gender and number subcategories for nouns and adjectives, on the one hand, and syntactic dependencies on the other hand.
Let us take the example of Commission and Komisja, respectively w1_i and w2_m, two aligned words (figure 1). In accordance with the annotation projection principle, Komisja is first assigned the POS N (noun) as well as the information on its number, sg (singular), and gender, f (feminine). Furthermore, the dependencies connecting w1_i to other words w1_j are examined. For each w1_j, if there is an alignment linking w1_j and w2_n, the dependency identified between w1_i and w1_j is projected to w2_m and w2_n. For example, the noun Commission (w1_i) is syntactically connected to the verb adopte (w1_j) through the subject relation and adopte is aligned to przyjmuje (w2_n). Therefore, it is possible to induce a dependency relation, namely a subject one, between Komisja (w2_m) and przyjmuje (w2_n) [2].

[1] Assessed on correct alignments.
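The projection principle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the function name `project_annotations`, the index-based data structures and the alignment links of the example are assumptions, while the tags and the subject relation follow the Commission/Komisja example.

```python
from typing import Dict, List, Tuple

# A dependency is (governor index, label, dependent index); all names here
# are hypothetical and chosen for illustration only.
Dependency = Tuple[int, str, int]


def project_annotations(
    src_tags: Dict[int, str],
    src_deps: List[Dependency],
    alignment: Dict[int, int],
) -> Tuple[Dict[int, str], List[Dependency]]:
    """Project POS tags and dependencies through one-to-one alignment links.

    `alignment` maps a source word index to the index of its target
    translation equivalent; unaligned source words are simply absent.
    """
    # POS tags: the tag of w1 is copied to the aligned word w2.
    tgt_tags = {alignment[i]: tag for i, tag in src_tags.items() if i in alignment}
    # Dependencies: projected only if both connected words are aligned.
    tgt_deps = [
        (alignment[gov], label, alignment[dep])
        for gov, label, dep in src_deps
        if gov in alignment and dep in alignment
    ]
    return tgt_tags, tgt_deps


# French: La(0) Commission(1) adopte(2) un(3) programme(4) annuel(5)
# Polish: Komisja(0) przyjmuje(1) roczny(2) program(3)
french_tags = {0: "DET", 1: "Nfsg", 2: "V", 3: "DET", 4: "Nmsg", 5: "ADJmsg"}
french_deps = [(2, "SUBJ", 1)]        # adopte governs Commission (subject)
links = {1: 0, 2: 1, 4: 3, 5: 2}      # assumed one-to-one alignment links

polish_tags, polish_deps = project_annotations(french_tags, french_deps, links)
print(polish_tags)   # tags induced for the aligned Polish words
print(polish_deps)   # the induced subject relation
```

The two determiners have no alignment link in this example, so nothing is projected from them; this mirrors the choice of favouring precision over recall.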
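[Figure 1, diagram not reproduced. French: La_DET Commission_Nfsg adopte_V un_DET programme_Nmsg annuel_ADJmsg, with a subj relation between Commission and adopte. Polish: Komisja_Nfsg przyjmuje_V roczny_ADJmsg program_Nmsg, with the projected subj relation between Komisja and przyjmuje.]
Figure 1: Projection of POS tags and dependencies from French to Polish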
The induced dependencies are given the same label as the source dependencies, that is to say that the noun Komisja and the verb przyjmuje are connected through the subject relation. Moreover, in this preliminary study, the projection is basically limited to cases where there is exactly one relation between w1_i and w1_j on the one hand, and between w2_m and w2_n on the other hand. Thus, as shown in figure 2, the relation connecting Komisja and przyjmuje could not be induced from English since Commission and adopt are not linked directly but by means of the modal shall.
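[Figure 2, diagram not reproduced. English: The_DET Commission_N shall_AUX adopt_V an_DET annual_ADJ program_N, with the subj and aux relations involving the modal shall. Polish: Komisja_N przyjmuje_V roczny_ADJ program_N.]
Figure 2: Projection of POS tags and dependencies from English to Polish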
[2] The POS and the additional grammatical information available are also projected from the verb adopte to przyjmuje.
The only exception concerns the complement and prepositional complement relations. Indeed, Polish is a highly inflected language, which means that: 1) word order is less constrained than in French and English; 2) syntactic relations between words are indicated by case. This is the reason why, going back to figure 1, the projection from the nouns programme and travail, linked by the preposition de, results in the induction of a relation between the nouns program and pracy.
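One possible way to handle this exception is to collapse the two relations that go through the source preposition into a single relation before projection. The sketch below is an assumption about how this could be done, not the procedure actually used in the study: the function name `collapse_prepositions` is hypothetical, and the composite label format is modelled on the PMOD_de_PCOMP label that appears in the dependency correspondence example of section 6.2.

```python
from typing import List, Tuple

Dependency = Tuple[int, str, int]  # (governor index, label, dependent index)


def collapse_prepositions(words: List[str], deps: List[Dependency]) -> List[Dependency]:
    """Replace governor -PMOD-> prep -PCOMP-> dependent by one direct relation."""
    # Map each preposition index to the word it introduces (its PCOMP dependent).
    pcomp = {gov: dep for gov, label, dep in deps if label == "PCOMP"}
    # Prepositions that are both attached by PMOD and followed by a PCOMP.
    bridged = {dep for gov, label, dep in deps if label == "PMOD" and dep in pcomp}

    collapsed = []
    for gov, label, dep in deps:
        if label == "PMOD" and dep in bridged:
            # Link the governor directly to the word introduced by the preposition.
            collapsed.append((gov, f"PMOD_{words[dep]}_PCOMP", pcomp[dep]))
        elif label == "PCOMP" and gov in bridged:
            continue  # already folded into the composite relation
        else:
            collapsed.append((gov, label, dep))
    return collapsed


words = ["programme", "de", "travail"]
deps = [(0, "PMOD", 1), (1, "PCOMP", 2)]
print(collapse_prepositions(words, deps))
```

After collapsing, the relation holds directly between programme and travail, so it can be projected to program and pracy even though the preposition de has no Polish counterpart.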
4 Experimental framework
4.1 Bi-texts
The countries wishing to join the EU first have to approve the Acquis communautaire. The Acquis communautaire encompasses the core EU law, its resolutions and declarations as well as the common aims pursued since its creation in the 1950s. It comprises about 8,000 documents that have been translated and published by official institutions [3], thus ensuring a high quality of translation. Each language version of the Acquis is considered semantically equivalent to the others and legally binding. This collection of documents is made available on Europe’s website [4].
The AC corpus is made of a part of the Acquis texts in 20 languages [5], and in particular the languages of the new Member States [6]. It has been collected and aligned at the sentence level by the Language Technology team at the Joint Research Centre working for the European Commission [7] (Erjavec et al., 2005; Pouliquen and Steinberger, 2005). It is one of the largest parallel corpora regarding its size [8] and the number of different languages it covers. A portion of the English, French and Polish parts forms the multilingual parallel corpus selected for this study. Table 1 gives the main features of each part of the corpus.
[3] Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.
[4] http://europa.eu.int/eur-lex/lex
[5] German, English, Danish, Spanish, Estonian, Finnish, French, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Dutch, Polish, Portuguese, Slovak, Slovene, Swedish and Czech.
[6] In 2004, the EU welcomed ten new Member States: Cyprus, Estonia, Hungary, Latvia, Lithuania, Malta, Poland, Czech Republic, Slovakia, Slovenia.
[7] http://www.jrc.cec.eu.int/langtech/index.html
[8] The number of word forms goes from 6 up to 13 million according to the language. The parts corresponding to the languages of the new Member States range from 6 up to 10 million word forms as compared to 10 up to 13 million for
              English    French     Polish
word forms    562,458    809,036    764,684
sentences                52,432
Table 1: AC – the English/French/Polish parallel corpus
4.2 Bi-text processing
4.2.1 POS tagging
Both the English and French parts of the corpus have been POS tagged and parsed. The POS tagging has been performed using the TreeTagger (Schmid, 1994). Among the morpho-syntactic information provided by the TreeTagger’s tagset, only the main distinctions are kept for further analysis: noun, verb, present participle, adjective, past participle, adverb, pronoun and conjunction (coordination and subordination). Nouns, adjectives and past participles are assigned data related to their number and gender, and verbs are assigned information on voice, gender and form (infinitive or not), if available (table 2). The TreeTagger’s output is given as input to the parser after a post-processing stage which modifies the tokenization. Some multi-word units are conflated, for example complex prepositions (in accordance with, as well as for English; conformément à, sous forme de for French), adverbs (in particular, at least, en particulier, au moins) or even verbs (prendre en considération, avoir recours).
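The conflation of multi-word units can be illustrated with a small sketch. It assumes a greedy longest-match strategy over a hand-made list of units and joins the conflated tokens with underscores; the function name `conflate` and the matching strategy are assumptions for illustration, not a description of the actual post-processing stage.

```python
from typing import Iterable, List


def conflate(tokens: List[str], units: Iterable[str]) -> List[str]:
    """Merge known multi-word units into single underscore-joined tokens.

    `units` are lowercase, space-separated multi-word units; longer units
    are tried first so that the longest match wins.
    """
    patterns = sorted((tuple(u.split()) for u in units), key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            window = tuple(t.lower() for t in tokens[i:i + len(pattern)])
            if window == pattern:
                out.append("_".join(tokens[i:i + len(pattern)]))
                i += len(pattern)
                break
        else:
            # No multi-word unit starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


units = ["in accordance with", "as well as", "in particular", "at least"]
sentence = "They must be constituted in accordance with the law".split()
print(conflate(sentence, units))
```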
4.2.2 Parsing
Each post-processed POS-tagged corpus is analysed with a deep and robust dependency parser: SYNTEX (Fabre and Bourigault, 2001; Bourigault et al., forthcoming). For each sentence, SYNTEX identifies syntactic relations between words such as subject (SUBJ), object (OBJ), prepositional modifier (PMOD), prepositional complement (PCOMP), modifier (MOD), etc. Both versions of the parser are being developed according to the same procedure and architecture. The outputs are quite homogeneous in both languages since the dependencies are identified and represented in the same way, thus allowing the comparison of annotations induced from either French or English. Table 2 gives some examples of the basic relations taken into account as well as the tags assigned to the syntactically connected words. The parts of speech are in upper case (N represents a noun, V a verb, etc.) and the grammatical information (number, gender) is in lower case (sg represents the singular, pl the plural, f the feminine and m the masculine).

[8, continued] the languages of the “pre-enlargement” EU.
(the) Regulation_Nsg ←SUBJ− establishes_Vsg
(le) règlement_Nmsg ←SUBJ− détermine_Vsg

covering_PPR −OBJ→ placing_PPR −PMOD→ on_PREP −PCOMP→ (the) market_Nsg
(qui) régissent_Vpl −OBJ→ (la) mise_Nfsg −PMOD→ sur_PREP −PCOMP→ (le) marché_Nmsg

further_ADJ ←MOD− calls_Npl
appels_Nmpl −MOD→ supplémentaires_ADJpl

(the) Member_Nsg ←MOD− States_Npl
(les) États_Nmpl −MOD→ Membres_Nmpl

(the debates) clearly_ADV ←MOD− illustrate_Vpl
(les débats) montrent_Vpl −MOD→ clairement_ADV

(placing on) the_DET ←DET− market_Nsg
la_DET ←DET− mise (sur) le_DET ←DET− marché_Nmsg

Table 2: Syntactic dependencies identified with SYNTEX
4.2.3 Word alignment
The English/Polish parts of the corpus on the one hand, and the French/Polish parts on the other hand, have been aligned at the word level using the GIZA++ package [9] (Och and Ney, 2003). GIZA++ consists of a set of statistical translation models of different complexity, namely the IBM ones (Brown et al., 1993). For both corpora, the tokenization resulting from the post-processing stage prior to parsing was used in the alignment process for the English and French parts in order to keep the same segmentation, especially to facilitate manual annotation for evaluation purposes. Moreover, each word being assigned a lemma at the POS tagging stage, the sentences given as input to GIZA++ were lemmatized, as lemmatization has proven to boost statistical word alignment performances. On the Polish side, a rough tokenization using blanks and punctuation was carried out; no lemmatization was performed. The IBM-4 model has been trained on each bi-text in both translation directions and the intersection of the Viterbi alignments obtained has been used to project the morpho-syntactic annotations. In other words, our first goal was to test the extent to which the direct projection across English or French and Polish was accurate. Therefore, we relied only on one-to-one alignments, thus favouring precision to the detriment of recall for this preliminary study. Figure 3 shows an example of word alignment output. The intersection in both directions is represented with plain arrows; the dotted ones represent unidirectional alignments. It shows that the intersection results in an incomplete alignment which may differ depending on the pair of languages considered and the segmentation performed in each language [10].

[9] GIZA++ is available at http://www.jfoch.com/GIZA++.html.
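The intersection step can be sketched as follows. In IBM-style Viterbi alignments each word on one side is linked to at most one word on the other side, so a link that survives in both translation directions is necessarily one-to-one. The sketch assumes that the two alignments have already been read into lists of index pairs; the function name and the toy links are hypothetical and do not reproduce the GIZA++ output format.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]


def intersect_alignments(
    src_to_tgt: Iterable[Link],
    tgt_to_src: Iterable[Link],
) -> Set[Link]:
    """Keep only the links found in both translation directions.

    `src_to_tgt` holds (source index, target index) pairs; `tgt_to_src`
    holds (target index, source index) pairs from the reverse model.
    """
    forward = set(src_to_tgt)
    # Flip the reverse-direction links so both sets use (source, target).
    backward = {(s, t) for t, s in tgt_to_src}
    return forward & backward


# Toy example with invented links, for illustration only.
forward_links = [(0, 0), (1, 1), (2, 1), (3, 2)]
backward_links = [(0, 0), (1, 1), (2, 3), (2, 4)]

print(sorted(intersect_alignments(forward_links, backward_links)))
```

Links proposed in one direction only, such as (2, 1) here, are discarded, which is why the intersection yields an incomplete but more reliable alignment.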
[Figure 3, alignment arrows not reproduced; the three aligned sentences are:]
French:  Les sanctions sont réglées dans la convention de subvention
Polish:  Sankcje sa uregulowane w porozumiewaniach o dotacji
English: Sanctions are_regulated in grant agreements
Figure 3: Intersection of IBM-4 model Viterbi alignments in both translation directions
5 Evaluation
5.1 Method
In order to evaluate the annotation projection, an a posteriori reference was constructed, which means that a sample of the output was selected randomly and annotated manually. There are some advantages to working with this kind of reference. First, it is less time-consuming than an a priori reference built independently from the output obtained. Second, it makes it possible to skip the cases for which it is difficult to decide whether they are correct or not: syntactic analysis may be ambiguous and translation often makes it difficult to determine which source unit corresponds to which target one (Och and Ney, 2003). A better level of confidence may thus be ensured with an a posteriori reference in comparison with a human annotation task where a choice is to be made for each case. Finally, whatever strategy is adopted, there is always a part of subjectivity in human annotation. Thus, the results may vary from one person to another. The major drawback of an a posteriori reference is that it makes it possible to assess only precision and not recall, since it contains only data provided as output of the algorithm subjected to evaluation.

[10] The underscore indicates token conflation.
5.2 Parameters
The sample used in order to constitute the a posteriori reference is made of 50 French/Polish sentences and 50 English/Polish sentences. The same sentences in each language version were selected. Indeed, one of the goals of this study is to determine if the choice of the source language has an influence on annotation projection results. These 50 sentences correspond to 800 evaluated POS tags and 400 evaluated dependencies in the French/Polish bi-text, and 782 evaluated POS tags and 391 dependencies in the English/Polish bi-text.
Several parameters have been taken into account for each type of annotation projection by answering yes or no to the points listed below.
For POS tags:
1a. the projected POS is the correct one;
2a. the gender and number of nouns, adjectives and past participles are correct.
The gender parameter has been evaluated only for the projection from French to Polish as this information was not available in English.
For dependencies:
1b. there is a dependency relation between two given Polish words regardless of its label;
2b. the label of the dependency is correct.
Each time the answer to points 2a and 2b was no, the information about the correct annotation was added.
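Given such yes/no judgements, the precision for each parameter is simply the proportion of yes answers among the evaluated projections. A minimal sketch, assuming the judgements are stored as (parameter, answer) pairs; the function name and the toy judgements are invented for illustration and are not taken from the reference set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def precision_by_parameter(judgements: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute, for each parameter, the share of projections judged correct."""
    yes: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for parameter, is_correct in judgements:
        total[parameter] += 1
        yes[parameter] += int(is_correct)
    return {parameter: yes[parameter] / total[parameter] for parameter in total}


# Toy judgements (hypothetical): four projected POS tags, two labeled dependencies.
toy_judgements = [
    ("1a", True), ("1a", True), ("1a", False), ("1a", True),
    ("2b", True), ("2b", False),
]
scores = precision_by_parameter(toy_judgements)
print(scores)
```

Since only projected annotations are judged, this computation says nothing about the annotations that were never projected, that is, about recall.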
6 Results
6.1 Performances
Table 3 presents the number of projected POS tags and dependencies with respect to each source language. It gives the precision for each parameter, POS tag (1a), number and gender (2a), unlabeled dependencies (1b) and labeled dependencies (2b), assessed against the a posteriori reference.
It shows that the number of projected POS tags as well as syntactic relations is slightly lower when English is used as source language. A lower number of identified alignment links or dependencies may explain this difference. It should also be noted that the evaluated projections are not necessarily the same in both corpora. As mentioned in section 5.2, the same sentences were chosen for evaluation. Nevertheless, since word alignment depends on the pair of languages involved, it has an impact on the projections obtained and the a posteriori reference built on their basis.

                                Fr/Pl    En/Pl
    projected POS tags           800      782
1a  POS tags                     .87      .88
2a  number                       .88      .91
2a  gender                       .59      –
    projected dependencies       400      391
1b  unlabeled dependencies       .83      .82
2b  labeled dependencies         .62      .67
Table 3: Precision according to each evaluated parameter
The precision rates vary according to the type of information induced. No significant difference is observed whether the source language is French or English. The number subcategory achieves the highest score: 0.88 and 0.91 respectively for French/Polish and English/Polish. Unlabeled dependencies score slightly lower (0.83 and 0.82), but a marked decrease in precision (0.21 for French, 0.15 for English) is observed when their labels are taken into account. Finally, for French, the gender category achieves the lowest score: 0.59. The main reasons for which annotation projection fails are investigated hereafter. The projection of the number and gender subcategories is not taken into account.
6.2 Result analysis
There are various reasons for the failure of the projection of POS tags and dependencies: a) word alignment, b) lexical density, c) tokenization, d) POS tagging/parsing errors and e) insertion (for dependencies). In the following examples, the aligned words carry matching indices and, in order to avoid confusion, the POS tags on the Polish side are the intended POS tags and not the induced POS tags.
a) The noun countries is aligned to trzecich [11], which is actually an adjective. Moreover, since participation and udział are aligned, the projected dependency is also erroneous.
Participation_N₁ of third countries_N₂
Udział_N₁ państw trzecich_ADJ₂
[11] The correct alignment is państw.
b) Under is translated by the prepositional phrase na podstawie but is aligned only to podstawie, which is a noun. Thus, the projected tag cannot be assigned just to podstawie, which is also the case with the PMOD dependency between zawarte and podstawie.
concluded_PPA₁ under_PREP₂ the general framework
zawarte_PPA₁ na podstawie_N₂ ogólnych ram
c) This case is similar to the previous one, but the difference in lexical density is partly caused by the conflation of in accordance with, which corresponds to the prepositional phrase zgodnie z, at the post-processing stage of the POS tagging.
They must be constituted in_accordance_with_PREP₁ the law_N₂
Muszą być ustanowione zgodnie_ADV₁ z prawem_N₂
d) The following example shows an error in PCOMP attachment resulting in an error in dependency projection: with is linked to pursue instead of activities and the same relation is assigned to o and zajmować.
They must pursue_V₁ activities with_PREP₂ a European dimension
Muszą zajmować_V₁ się działalnością o_PREP₂ europejskim wymiarze
e) On the Polish side, the inserted noun postanowień governs traktatu. Thus, the PCOMP dependency does not link dla and traktatu but dla and postanowień.
Without prejudice for_PREP₁ the Treaty_N₂
Bez uszczerbku dla_PREP₁ postanowień Traktatu_N₂
Considering the precision figures, in particular those accounting for the projection of dependencies, which decrease significantly when labels are considered, we tried to determine if there are indirect yet regular French/Polish and English/Polish correspondences. By indirect correspondence we mean that a given source POS tag or dependency is usually rendered by a given Polish POS tag or dependency. The correspondences are calculated provided there is no error prior to projection (word alignment, tagging or parsing).
Table 4 shows the direct and indirect correspondences among the POS tags which occur in the reference set. We can see that there is a direct correspondence among POS tags in 92% and 93% of the cases respectively for French/Polish and English/Polish projection. Moreover, the indirect correspondences, for example noun/adjective or verb/noun, are similar for both source languages. The following examples show occurrences of noun/adjective and verb/noun correspondences.

the exercise of implementing_N powers
l’exercice des compétences d’exécution_N
wykonywania uprawnień wykonawczych_ADJ

measures planned to ensure_V dissemination
mesures prévues pour assurer_V la diffusion
środki zaplanowane dla zapewnienia_N rozpowszechnienia

Some indirect correspondences are more probable than others that seem unexpected. Most of the time the latter come from the differences in tokenization mentioned above.
Fr POS       Pl POS                                    c
N_359        N_349; ADJ_6; PPA_3; V_1                  .97
ADJ_74       ADJ_69; N_3; V_1; DET_1                   .93
V_68         V_55; N_13                                .80
PPA_67       PPA_59; V_6; ADJ_1; N_1                   .88
PREP_35      PREP_24; N_7; DET_2; V_1; PPR_1           .68
others_61    same_56                                   .91
664          612                                       .92

En POS       Pl POS                                    c
N_374        N_364; ADJ_9; PPA_1                       .97
PREP_64      PREP_53; N_7; DET_4                       .83
V_51         V_35; PPA_10; N_6                         .69
ADJ_46       ADJ_42; N_2; V_1; DET_1                   .91
DET_36       DET_33; N_2                               .91
others_73    same_70                                   .95
644          597                                       .93
Table 4: French/Polish and English/Polish POS tag correspondences
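The direct correspondence rate c of a source tag is the number of cases in which the Polish tag is the same, divided by the number of occurrences of the source tag. The sketch below recomputes these rates from the French/Polish counts of Table 4 (the "others" row is left out) and picks the most frequent indirect correspondence for each tag. The function names are hypothetical; the recomputed values agree with the table up to rounding in the last digit.

```python
from typing import Dict, Optional

# Counts copied from the French/Polish part of Table 4:
# source tag -> {Polish tag: number of occurrences}.
fr_pl_counts: Dict[str, Dict[str, int]] = {
    "N": {"N": 349, "ADJ": 6, "PPA": 3, "V": 1},
    "ADJ": {"ADJ": 69, "N": 3, "V": 1, "DET": 1},
    "V": {"V": 55, "N": 13},
    "PPA": {"PPA": 59, "V": 6, "ADJ": 1, "N": 1},
    "PREP": {"PREP": 24, "N": 7, "DET": 2, "V": 1, "PPR": 1},
}


def direct_rate(src_tag: str, targets: Dict[str, int]) -> float:
    """Share of occurrences where the Polish tag equals the source tag."""
    return targets.get(src_tag, 0) / sum(targets.values())


def most_frequent_indirect(src_tag: str, targets: Dict[str, int]) -> Optional[str]:
    """Most frequent Polish tag that differs from the source tag, if any."""
    indirect = {tag: n for tag, n in targets.items() if tag != src_tag}
    return max(indirect, key=indirect.get) if indirect else None


rates = {tag: direct_rate(tag, targets) for tag, targets in fr_pl_counts.items()}
for tag, targets in fr_pl_counts.items():
    print(tag, f"{rates[tag]:.3f}", most_frequent_indirect(tag, targets))
```

A regular indirect correspondence such as V/N could then serve as the basis for a conversion rule of the kind discussed in section 7.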
Table 5 summarizes direct and indirect correspondences among syntactic dependency relations. It can be seen that direct correspondence rates for dependencies are lower than direct correspondences for POS tags: 78% when the source language is French and 83% when it is English. Moreover, the difference according to the source language (5% in favour of English) is more important than for POS tags (1% in favour of English). It is mainly due to the PMOD and PCOMP relations: the first connects a preposition to its governor and the second connects the dependent to a preposition. Since Polish is an inflected language, the connections between words are indicated through cases. In particular, it results in a noun not being necessarily linked to another noun by a preposition. This is also the case for English, as far as compounds are concerned, while in French a preposition is almost always required to form noun phrases. This is one of the reasons why the direct correspondence rate between English and Polish is higher than between French and Polish. The following example shows a direct MOD/MOD correspondence for the English/Polish pair and an indirect PMOD_PCOMP/MOD correspondence for the French/Polish one.

purity −MOD→ criteria_N substances_N listed
les critères_N −PMOD_de_PCOMP→ pureté des substances énumérés
kryteria_N −MOD→ czystości_N dla substancji wymienionych
Fr DEP       Pl DEP                                       c
PMOD_111     PMOD_56; MOD_51; OBJ_4                       .50
MOD_106      MOD_106                                      1
PCOMP_35     PCOMP_25; MOD_7; OBJ_2; PMOD_1               .71
OBJ_23       OBJ_16; MOD_5; PMOD_2                        .69
SUJ_19       SUJ_18; OBJ_1                                .94
others_38    same_38                                      1
332          259                                          .78

En DEP       Pl DEP                                       c
MOD_95       MOD_90; PMOD_5                               .94
PMOD_93      PMOD_59; MOD_26; PCOMP_4; OBJ_3; SUBJ_1      .63
PCOMP_64     PCOMP_49; MOD_8; PMOD_7                      .76
DET_29       DET_29                                       1
OBJ_23       OBJ_22; PMOD_1                               .95
others_18    same_18                                      1
322          267                                          .83
Table 5: French/Polish and English/Polish syntactic correspondences
7 Discussion
The results of the projection of POS tags and dependencies concur with those reported in the related works presented in section 2. First, concerning the number and gender subcategories, Borin (2002) has found that the former is applicable across languages whereas the latter is less relevant, at least for the German/Swedish language pair. As seen in section 6.1, the projection of the number subcategory offers the highest score and the projection of gender the lowest (0.59). It was to be expected that gender would perform the worst considering its arbitrary nature, at least in French and Polish. Indeed, there are three genders in Polish (masculine, feminine and neuter), as well as in English, and two in French. Thus, not only is the number of genders different across French and Polish, but they are also not distributed in the same way in both languages. The information on gender was not available for English, gender being assigned according to the human/non-human feature.
Considering POS tags, the level of direct correspondence is the highest one when compared to the number and gender subcategories as well as to dependencies. The precision achieved is, however, lower than the figures obtained by Borin (2002) on the one hand and by Yarowsky et al. (2001) on the other hand. In Borin’s study, precision was assessed provided the word alignments used to project POS tags were correct. In this study, precision has been evaluated regardless of possible errors prior to projection. When these errors are discarded, the precision rates are similar. In Yarowsky et al.’s work (2001), the evaluation did not concern annotation projection but an induced tagger trained on 500K occurrences of automatically derived POS tag projections. Indeed, the authors claim that direct annotation projection is quite noisy. This study shows that such a simple approach can perform fairly well as far as precision is concerned. The results are likely to be improved by implementing basic POS tag conversion rules as suggested in (Borin, 2002).
For the projection of dependencies, defining such conversion rules seems necessary, as suggested by the significant difference in precision when the projections of unlabeled and labeled dependencies are compared. Polish does not encode syntactic functions in the same way as English or French. Nevertheless, some of the syntactic divergences observed seem regular enough to be used to derive indirect correspondences. Hwa et al. (2002) have noticed that applying elementary linguistic transformations considerably increases precision and recall when projecting syntactic relations, at least for the English/Chinese language pair. The present study suggests that this kind of approach is promising for the English/Polish and French/Polish pairs as well.
The exceptional status of the corpus certainly influences the quality of the results. Legislative texts of the EU in their different language versions are legally binding. Thus, they have to be as close as possible semantically and this constraint may favour the direct correspondences observed.
8 Conclusion
We have presented a simple yet promising method based on aligned corpora to induce linguistic annotations in Polish texts. POS tags and dependencies are directly projected to the Polish part of the corpus from the automatically annotated English or French part. As far as precision is concerned, the direct projection is fairly efficient for POS tags but appears to be too restrictive for dependencies. Nevertheless, the results are encouraging since they are likely to be improved by applying indirect correspondence rules. They validate the idea of the existence of direct or indirect yet regular correspondences for the English/Polish and French/Polish language pairs, which has already been tested with some syntax-based alignment techniques (Ozdowska, 2004; Ozdowska and Claveau, 2005). The next step will consist in exploiting the indirect correspondences and the multiple sources of information provided by two different source languages. Moreover, using IBM-4 word alignments in one direction instead of the intersection will be considered.
This work mainly focuses on precision and thus lacks information on recall. Larger scale evaluations would be necessary to validate the approach, particularly evaluations that could measure recall, since the amount of evaluation data used in this study could be considered too limited.

References
Lars Borin. 2002. Alignment and tagging. In Lars Borin, editor, Parallel corpora, parallel worlds: selected papers from a symposium on parallel and comparable corpora at Uppsala University, pages 207–217. Rodopi, Amsterdam/New York.
Didier Bourigault, Cécile Fabre, Cécile Frérot, Marie-Paule Jacques, and Sylwia Ozdowska. forthcoming. Acquisition et évaluation sur corpus de propriétés de sous-catégorisation syntaxique. T.A.L (Traitement Automatique des Langues).
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.
Tomaž Erjavec, Camelia Ignat, Bruno Pouliquen, and Ralf Steinberger. 2005. Massive multilingual corpus compilation: Acquis communautaire and TOTALE. In 2nd Language and Technology Conference.
Cécile Fabre and Didier Bourigault. 2001. Linguistic clues for corpus-based acquisition of lexical dependencies. In Corpus Linguistics Conference.
Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluating translational correspondence using annotation projection. In 40th Annual Conference of the Association for Computational Linguistics.
Geoffrey Leech. 1997. Introducing corpus annotation. In Roger Garside, Geoffrey Leech, and Anthony McEnery, editors, Corpus Annotation. Linguistic Information from Computer Text Corpora, pages 1–18. Longman, London/New York.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Sylwia Ozdowska and Vincent Claveau. 2005. Alignement de mots par apprentissage de règles de propagation syntaxique en corpus de taille restreinte. In Conférence sur le Traitement Automatique des Langues Naturelles, pages 243–252.
Sylwia Ozdowska. 2004. Identifying correspondences between words: an approach based on a bilingual syntactic analysis of French/English parallel corpora. In Multilingual Linguistic Resources Workshop of COLING’04.
Bruno Pouliquen and Ralf Steinberger. 2005. The acquis communautaire corpus. In JRC Enlargement and Integration Workshop.
Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In 1st International Conference on New Methods in Natural Language Processing.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In 1st Human Language Technology Conference.
