A Model for Fine-Grained Alignment of Multilingual Texts
Lea CYRUS and Hendrik FEDDES 
Arbeitsbereich Linguistik
University of Münster
Hüfferstraße 27, 48149 Münster, Germany
{lea,feddes}@marley.uni-muenster.de
Abstract
While alignment of texts on the sentential level is often seen as being
too coarse, and word alignment as being too fine-grained, bi- or
multilingual texts which are aligned on a level in between are a useful
resource for many purposes. Starting from a number of examples of
non-literal translations, which tend to make alignment difficult, we
describe an alignment model which copes with these cases by explicitly
coding them. The model is based on predicate-argument structures and
thus covers the middle ground between sentence and word alignment. The
model is currently used in a recently initiated project of a parallel
English-German treebank (FuSe), which can in principle be extended with
additional languages.
1 Introduction
When building parallel linguistic resources, one of the most obvious
problems that need to be solved is that of alignment. Usually, in
sentence- or word-aligned corpora, alignments are unmarked relations
between corresponding elements. They are unmarked because the kind of
correspondence between two elements is either obvious or beyond
classification. E.g., in a sentence-aligned corpus, the n : m relations
that hold between sentences express the fact that the propositions
contained in n sentences in L1 are basically the same as the
propositions in m sentences in L2 (lowest common denominator). No
further information about the kind of correspondence could possibly be
added at this degree of granularity. On the other hand, in word-aligned
corpora, words are usually aligned as being "lexically equivalent" or
are not aligned at all.[1] Although there are many shades of "lexical
equivalence", these are usually not explicitly categorised. As
(Hansen-Schirra and Neumann, 2003) point out, for many research
questions neither type of alignment is sufficient, since the most
interesting phenomena can be found on a level between these two
extremes.

* We would like to thank our colleague Frank Schumacher for many
valuable comments on this paper.
[1] Cf. the approach described in (Melamed, 1998).

We propose a more finely grained model of alignment which is based on
monolingual predicate-argument structures, since we assume that, while
translations can be non-literal in a variety of ways, they must be based
on similar predicates and arguments for some kind of translational
equivalence to be achieved. Furthermore, our model explicitly encodes
the ways in which the two versions of a text deviate from each other.
(Salkie, 2002) points out that the possibility of investigating what
types of non-literal translations occur on a regular basis is one of the
major profits that linguists and translation theorists can draw from
parallel corpora.
In Section 2, we begin by describing some
ways in which translations can deviate from
one another. We then describe in detail the
alignment model, which is based on a monolin-
gual predicate-argument structure (Section 3).
In Section 4 we conclude by introducing the
parallel treebank project FuSe which uses the
model described in this paper to align German
and English texts from the Europarl parallel
corpus (Koehn, 2002).
2 Differences in Translations
In most cases, translations are not absolutely literal counterparts of
their source texts. In order to avoid translationese, i. e. deviations
from the norms of the target language, a skilled translator will apply
certain mechanisms, which (Salkie, 2002) calls "inventive translations"
and which need to be captured and systematised. The following section
will give some examples[2] of common discrepancies encountered between a
source text and its translation.

[2] As we work with English and German, all examples are taken from
these two languages. They are taken from the Europarl corpus (see
Section 4) and are abbreviated where necessary. Unfortunately, it is not
easily discernible from the corpus data which language is the source
language. Consequently, our use of the terms 'source', 'target', 'L1',
and 'L2' does not admit of any conclusions as to whether one of the
languages is the source language, and if so, which one.

2.1 Nominalisations
Quite frequently, verbal expressions in L1 are expressed by
corresponding nominalisations in L2. This departure from the source text
results in a completely different structure of the target sentence, as
can be seen in (1) and (2), where the English verb harmonise is
expressed as Harmonisierung in German. The argument of the English verb
functioning as the grammatical subject is realised as a postnominal
modifier in the German sentence.

(1) The laws against racism must be harmonised.[3]
(2) Die Harmonisierung der    Rechtsvorschriften gegen   den Rassismus ist dringend erforderlich.
    The harmonisation  of the laws               against the racism    is  urgently necessary.

This case is particularly interesting because it involves modality. In
the English sentence, the verb is modified by the modal auxiliary must.
In order to express the modality in the German version, a different
strategy is applied, namely the use of an adjective with modal meaning
(erforderlich, 'necessary'). Consequently, there are two predications in
the German sentence as opposed to only one predication in the English
sentence.

2.2 Voice
A further way in which translations can differ from their source is the
choice of active or passive voice. This is exemplified by (3) and (4).
Here, the direct object of the English sentence corresponds to the
grammatical subject of the German sentence, while the subject of the
English sentence is realised as a prepositional phrase with durch in the
German version.

(3) The conclusions of the Theato report safeguard them perfectly.[4]

[3] Europarl:de-en/ep-00-01-19.al, 489.
[4] Europarl:de-en/ep-00-01-18.al, 749.

(4) Durch die Schlußfolgerungen des    Berichts Theato werden sie  uneingeschränkt bewahrt.
    By    the conclusions       of the report   Theato are    they unlimitedly     safeguarded.

2.3 Negation
Sometimes, a positive predicate expression is translated by negating its
antonym. This is the case in (5) and (6): both sentences contain a
negative statement, but while the negation is incorporated into the
English adjective by means of the negative prefix in-, it is achieved
syntactically in the German sentence.

(5) the Directive is inapplicable in Denmark[5]
(6) die Richtlinie ist in Dänemark nicht anwendbar
    the Directive  is  in Denmark  not   applicable

2.4 Information Structure
Sentences and their translations can be organised differently with
regard to their information structure. Sentences (7) and (8) are a good
example of this type of non-literal translation.

(7) Our motion will give you a great deal of food for thought, Commissioner[6]
(8) Eine Reihe von Anregungen  werden wir Ihnen, Herr Kommissar,    mit  unserer Entschließung mitgeben
    A    row   of  suggestions will   we  you,   Mr.  Commissioner, with our     resolution    give

The German sentence is rather inconspicuous, with the grammatical
subject being a prototypical agent (wir, 'we'). In the English version,
however, it is the means that is realised in subject position and thus
perspectivised. The corresponding constituent in German (mit unserer
Entschließung, 'with our motion') is but an adverbial. In English, the
actual agent is not realised as such and can only be identified by a
process of inference based on the presence of the possessive pronoun
our. Thus, while being more or less equivalent in meaning, this sentence
pair differs significantly in its overall organisation.

[5] Europarl:de-en/ep-00-01-18.al, 2522.
[6] Europarl:de-en/ep-00-01-18.al, 53.

3 Alignment Model
The alignment model we propose is based on the assumption that a
representation of translational equivalence can best be approximated by
aligning the elements of monolingual predicate-argument structures.
Section 3.1 describes this layer of the model in detail and shows how
some of the differences in translations described in Section 2 can be
accommodated on such a level. We assume that the annotation model
described here is an extension to linguistic data which are already
annotated with phrase-structure trees, i. e. treebanks. Section 3.2
shows how the binding of predicates and arguments to syntactic nodes is
modelled. Section 3.3 describes the details of the alignment layer and
the tags used to mark particular kinds of alignments, thus accounting
for some more of the differences shown in Section 2.
3.1 Predicates and Arguments
The predicate-argument structures used in our model consist solely of
predicates and their arguments. Although there is usually more than one
predicate in a sentence, no attempt is made to nest structures or to
join the predications logically in any way. The idea is to make the
predicate-argument structure as rich as is necessary to be able to align
a sentence pair while keeping it as simple as possible so as not to make
it too difficult to annotate. In the same vein, quantification,
negation, and other operators are not annotated. In short, the
predicate-argument structures are not supposed to capture the semantics
of a sentence exhaustively in an interlingua-like fashion.
To have clear-cut criteria for annotators to determine what a predicate
is, we rely on the heuristic assumption that predicates are more likely
to be expressed by tokens belonging to some word classes than by tokens
belonging to others. Potential predicate expressions in this model are
verbs, deverbal adjectives and nouns[7] or other adjectives and nouns
which show a syntactic subcategorisation pattern. The predicates are
represented by the capitalised citation form of the lexical item (e. g.
HARMONISE). They are assigned a class based on their syntactic form (v,
n, a for 'verbal', 'nominal', and 'adjectival', respectively), and
derivationally related predicates form a predicate group.

[7] For all non-verbal predicate expressions for which a derivationally
related verbal expression exists it is assumed that they are deverbal
derivations, etymological counter-evidence notwithstanding.

Arguments are given short intuitive role names (e. g. ENT_HARMONISED,
i. e. the entity being harmonised) in order to facilitate the annotation
process. These role names have to be used consistently only within a
predicate group. If, for example, an argument of the predicate HARMONISE
has been assigned the role ENT_HARMONISED and the annotator encounters a
comparable role as argument to the predicate HARMONISATION, the same
role name for this argument has to be used.[8]
The usefulness of such a structure can be shown by analysing the
sentence pair (1) and (2) in Section 2.1. While the syntactic
constructions differ considerably, the predicate-argument structure
shows the correspondence quite clearly (see the annotated sentences in
Figure 1[9]): in the English sentence, we find the predicate HARMONISE
with its argument ENT_HARMONISED, which corresponds to the predicate
HARMONISIERUNG and its argument HARMONISIERTES in the German sentence.
The information that a predicate of the class v is aligned with a
predicate of the class n can be used to query the corpus for this type
of non-literal translation.
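
As a rough illustration of the kind of query meant here, the following Python sketch stores predicates together with their class and filters aligned pairs by class combination. The data layout and the names `Predicate` and `class_shifts` are assumptions made for this example and do not reflect the FuSe database schema; the predicate names SAFEGUARD and BEWAHREN are inferred from the role names given for sentences (3) and (4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    lemma: str       # capitalised citation form, e.g. "HARMONISE"
    pred_class: str  # "v", "n" or "a" (verbal, nominal, adjectival)
    roles: tuple     # argument role names used with this predicate


def class_shifts(aligned_pairs, class_l1, class_l2):
    """Return aligned predicate pairs whose classes are (class_l1, class_l2)."""
    return [
        (p1, p2)
        for p1, p2 in aligned_pairs
        if p1.pred_class == class_l1 and p2.pred_class == class_l2
    ]


# Hypothetical in-memory data mirroring sentence pairs (1)/(2) and (3)/(4).
aligned_pairs = [
    (Predicate("HARMONISE", "v", ("ENT_HARMONISED",)),
     Predicate("HARMONISIERUNG", "n", ("HARMONISIERTES",))),
    (Predicate("SAFEGUARD", "v", ("SAFEGUARDER",)),
     Predicate("BEWAHREN", "v", ("BEWAHRER",))),
]

# Find verbal predicates that are aligned with nominal ones.
verb_to_noun = class_shifts(aligned_pairs, "v", "n")
for p1, p2 in verb_to_noun:
    print(p1.lemma, "->", p2.lemma)
```

Only the nominalisation pair is returned, since the second pair aligns two verbal predicates.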
The active vs. passive translation in sentences (3) and (4) is another
phenomenon which is accommodated by a predicate-argument structure
(Figure 2): the subject NP502 in the English sentence corresponds to the
passivised subject NP502 (embedded in PP503) in the German sentence on
the basis of having the same argument role (SAFEGUARDER vs. BEWAHRER) in
a comparable predication.
It is sometimes assumed that predicate-argument structure can be derived
or recovered from constituent structure or functional tags such as
subject and object.[10] It is true that these annotation layers provide
important heuristic clues for the identification of predicates and
arguments and may eventually speed up the annotation process in a
semi-automatic way. But, as the examples above have shown,
predicate-argument structure goes beyond the assignment of phrasal
categories and grammatical functions, because the grammatical category
of predicate expressions and consequently the grammatical functions of
their arguments can vary considerably. Also, the predicate-argument
structure licenses the alignment relation by showing explicitly what it
is based on.

[8] Keeping the argument names consistent for all predicates within a
group while differentiating the predicates on the basis of syntactic
form are complementary principles, both of which are supposed to
facilitate querying the corpus. The consistency of argument names within
a group, for example, enables the researcher to analyse paradigmatically
all realisations of an argument irrespective of the syntactic form of
the predicate. At the same time, the differentiation of predicates makes
possible a syntagmatic analysis of the differences of argument
structures depending on the syntactic form of the predicate.
[9] All figures are at the end of the paper.
[10] See e. g. (Marcus et al., 1994).

3.2 Binding Layer
As mentioned above, we assume that the an-
notation model described here is used on top
of syntactically annotated data. Consequently,
all elements of the predicate-argument structure
must be bound to elements of the phrasal struc-
ture (terminal or non-terminal nodes). These
bindings are stored in a dedicated binding layer
between the constituent layer and the predicate-
argument layer.
A problem arises when there is no direct correspondence between argument
roles and constituents. For instance, this is the case whenever a noun
is postmodified by a participle clause: in Figure 3, the argument role
ENT_RAISED of the predicate RAISE is realised by NP525, but the
participle clause (IPA517) containing the predicate (raised_6) needs to
be excluded, because not excluding it would lead to recursion.
Consequently, there is no simple way to link the argument role to its
realisation in the tree.
In these cases, the argument role is linked to the appropriate phrase
(here: NP525) and the constituent that contains the predicate (IPA517)
is pruned out, which results in a discontinuous argument realisation.
Thus, in general, the binding layer allows for complex bindings, in
which more than one node of the constituent structure can be included
in, and sub-nodes explicitly excluded from, a binding to a predicate or
argument.[11]
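
The include/exclude mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mechanism documented in (Feddes, 2004): the tree is a plain dictionary, the function names are made up, and only the labels NP525, IPA517 and raised_6 come from the text, while the remaining tokens are invented.

```python
def terminals(tree, node):
    """Return the terminal nodes dominated by `node`, left to right."""
    children = tree.get(node)
    if children is None:  # not a key of the tree dict: a terminal
        return [node]
    result = []
    for child in children:
        result.extend(terminals(tree, child))
    return result


def bound_terminals(tree, include, exclude=()):
    """Terminals covered by a binding: included nodes minus pruned sub-nodes."""
    pruned = {t for node in exclude for t in terminals(tree, node)}
    return [
        t
        for node in include
        for t in terminals(tree, node)
        if t not in pruned
    ]


# Hypothetical tree fragment (tokens other than raised_6 are invented).
tree = {
    "NP525": ["the_4", "points_5", "IPA517"],
    "IPA517": ["raised_6", "by_7", "NP520"],
    "NP520": ["the_8", "rapporteur_9"],
}

# The argument role is bound to NP525 with the participle clause pruned out.
ent_raised = bound_terminals(tree, include=["NP525"], exclude=["IPA517"])
# The predicate itself is bound to a single terminal.
predicate = bound_terminals(tree, include=["raised_6"])
print(ent_raised, predicate)
```

Storing the excluded node explicitly, rather than listing the remaining terminals, keeps the binding tied to constituents of the tree even when the pruned clause sits in the middle of the phrase.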
When an expected argument is absent on the phrasal level due to specific
syntactic constructions, the binding of the predicate is tagged
accordingly, thus accounting for the missing argument. For example, in
passive constructions as in Table 1, the predicate binding is tagged as
pv. Other common examples are imperative constructions. Although
information of this kind may possibly be derived from the constituent
structure, it is explicitly recorded in the binding layer as it has a
direct impact on the predicate-argument structure and thus might prove
useful for the automatic extraction of valency patterns.

[11] See the database documentation (Feddes, 2004) for a more detailed
description of this mechanism.

Sentence   wenn  korrekt    gedolmetscht  wurde
Gloss      if    correctly  interpreted   was
                                 ↑
Binding                          pv
                                 |
Pred/Arg                    DOLMETSCHEN

Table 1: Example of a tagged predicate binding
(Europarl:de-en/ep-00-01-18.al, 2532)

Note that the passive tag can also be exploited in order to query for
sentence pairs like (3) and (4) (in Section 2.2), where an active
sentence is translated with a passive: it is straightforward to find
those instances of aligned predicates where only one binding carries the
passive tag.
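
Such a query amounts to an exclusive-or test on the two bindings of an aligned predicate pair. The sketch below assumes a hypothetical record layout (dictionaries with a predicate name and a set of binding tags); the predicate names SAFEGUARD, BEWAHREN and INTERPRET are illustrative and not taken from the annotated data.

```python
def voice_shifts(aligned_bindings, passive_tag="pv"):
    """Return aligned predicate pairs where exactly one binding is tagged passive."""
    return [
        (b1, b2)
        for b1, b2 in aligned_bindings
        if (passive_tag in b1["tags"]) != (passive_tag in b2["tags"])
    ]


# Hypothetical records: one predicate binding per language, with its tag set.
aligned_bindings = [
    # active English vs. passive German, as in sentences (3) and (4)
    ({"pred": "SAFEGUARD", "tags": set()},
     {"pred": "BEWAHREN", "tags": {"pv"}}),
    # passive on both sides: not a voice shift
    ({"pred": "INTERPRET", "tags": {"pv"}},
     {"pred": "DOLMETSCHEN", "tags": {"pv"}}),
]

shifted = voice_shifts(aligned_bindings)
for b1, b2 in shifted:
    print(b1["pred"], "<->", b2["pred"])
```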
3.3 Alignment Layer
On the alignment layer, the elements of a pair of
predicate-argument structures are aligned with
each other. Arguments are aligned on the basis
of corresponding roles within the predications.
Comparable to the tags used in the binding layer that account for
specific constructions (see Section 3.2), the alignments may also be
tagged with further information. These tags are used to classify types
of non-literalness like those discussed in Sections 2.3 and 2.4.[12]
Sentences (5) and (6) are an example of a tagged alignment. As Section
2.3 has shown, negation may be incorporated in a predicate in L1, but
not in L2. Since our predicate-argument structure does not include
syntactic negation, this results in the alignment of a predicate in L1
with its logical opposite in L2. To account for this fact, predicate
alignments of this kind are tagged as absolute opposites (abs-opp).
Similarly, alignment tagging is applied when predications are in some
way incompatible, as is the case with sentences (7) and (8) in Section
2.4. As can be seen in the aligned annotation (Figure 4), the different
information structure of these sentences has caused the two
corresponding argument roles of GIVER and MITGEBER to be realised by two
incompatible expressions representing different referents (NP500 vs.
wir_5). In this case, the alignment between the incompatible arguments
is tagged incomp.

[12] The deviant translations described in Sections 2.1 and 2.2 are
already represented via predicate class (see Section 3.1) and on the
binding layer (see Section 3.2), respectively.

If there is no corresponding predicate-
argument structure in the other language (as
e. g. the adjectival predicate in sentence (2)) or
if an argument within a structure does not have
a counterpart in the other language, there will
be no alignment.
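
To make the tagging scheme concrete, the following Python sketch represents alignments as records with an optional tag and counts how often each tag occurs. The record layout is an assumption for illustration only; the tags abs-opp and incomp come from the text, whereas the predicate names INAPPLICABLE, ANWENDBAR and ERFORDERLICH are inferred from the example sentences.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    kind: str                  # "predicate" or "argument"
    l1: str                    # element of the L1 predicate-argument structure
    l2: str                    # element of the L2 predicate-argument structure
    tag: Optional[str] = None  # None is the default case, else "abs-opp" or "incomp"


alignments = [
    Alignment("predicate", "INAPPLICABLE", "ANWENDBAR", tag="abs-opp"),  # (5)/(6)
    Alignment("argument", "GIVER", "MITGEBER", tag="incomp"),            # (7)/(8)
    Alignment("predicate", "HARMONISE", "HARMONISIERUNG"),               # (1)/(2)
]

# Elements without a counterpart are simply left out of the alignment list,
# e.g. the adjectival predicate of sentence (2).
unaligned_l2 = ["ERFORDERLICH"]


def tag_counts(alignments):
    """Count alignments per tag, with untagged ones reported as 'default'."""
    return Counter(a.tag or "default" for a in alignments)


counts = tag_counts(alignments)
print(dict(counts))
```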
Table 2 gives an overview of the annotation
layers as described in this section.
Layer      Function
Phrasal    constituent structure of language A
Binding    binding ↓ predicates/arguments to ↑ nodes
PA         predicate-argument structures
Alignment  aligning ↕ predicates and arguments
PA         predicate-argument structures
Binding    binding ↑ predicates/arguments to ↓ nodes
Phrasal    constituent structure of language B

Table 2: The layers of the predicate-argument annotation

All elements of the alignment structure are supposed to mark explicitly
the way they contribute to or distort the resulting translational
equivalence of a sentence pair.[13] First and foremost, if two elements
are aligned to each other, this alignment is licensed by their having
comparable roles in the predicate-argument structures. This is the
default case. If, however, a particular alignment relation, either of
predicates or of arguments, is deviant in some way, this deviance is
explicitly marked and classified on the alignment layer.
4 Application and Outlook
The alignment model we have described is currently being used in a
project to build a treebank of aligned parallel texts in English and
German with the following linguistic levels: POS tags, constituent
structure and functional relations, plus the predicate-argument
structure and the alignment layer to "fuse" the two; hence our working
title for the treebank, FuSe, which additionally stands for functional
semantic annotation (Cyrus et al., 2003; Cyrus et al., 2004).
Our data source, the Europarl corpus (Koehn, 2002), contains
sentence-aligned proceedings of the European parliament in eleven
languages and thus offers ample opportunity for extending the treebank
at a later stage.[14] For syntactic and functional annotation we
basically adapt the TIGER annotation scheme (Albert et al., 2003),
making adjustments where we deem appropriate and changes which become
necessary when adapting to English an annotation scheme which was
originally developed for German.

[13] Cf. the "translation network" described in (Santos, 2000) for a
much more complex approach to describing translation in a formal way;
this model, however, goes well beyond what we think is feasible when
annotating large amounts of data.

We use Annotate for the semi-automatic assignment of POS tags,
hierarchical structure, phrasal and functional tags (Brants, 1999;
Plaehn, 1998a). Annotate stores all annotations in a relational
database.[15] To stay consistent with this approach we have developed an
extension to the Annotate database structure to model the
predicate-argument layer and the binding layer.
Due to the monolingual nature of the Annotate database structure, the
alignment layer (Section 3.3) cannot be incorporated into it. Hence,
additional types of databases are needed. For each language pair
(currently English and German), an alignment database is defined which
represents the alignment layer, thus fusing two extended Annotate
databases. Additionally, an administrative database is needed to define
sets of two Annotate databases and one alignment database. The final
parallel treebank will be represented by the union of these sets
(Feddes, 2004).
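
The relation between the three kinds of databases can be pictured as follows. This is only a schematic Python sketch: the class `TreebankSet`, its field names, and the database names are hypothetical and are not taken from the FuSe database documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreebankSet:
    """One entry of the administrative database (hypothetical field names)."""
    db_l1: str         # extended Annotate database for the first language
    db_l2: str         # extended Annotate database for the second language
    alignment_db: str  # alignment database fusing the two


def treebank_databases(sets):
    """All databases that make up the treebank: the union over all sets."""
    databases = set()
    for s in sets:
        databases |= {s.db_l1, s.db_l2, s.alignment_db}
    return databases


# Invented database names for an English-German treebank in two parts.
sets = [
    TreebankSet("annotate_en_01", "annotate_de_01", "align_en_de_01"),
    TreebankSet("annotate_en_02", "annotate_de_02", "align_en_de_02"),
]

all_dbs = treebank_databases(sets)
print(sorted(all_dbs))
```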
While annotators use Annotate to enter phrasal and functional structures
comfortably, the predicate-argument structures and alignments are
currently entered into a structured text file which is then imported
into the database. A graphical annotation tool for these layers is under
development. It will make binding the predicate-argument structure to
the constituent structure easier for the annotators and suggest argument
roles based on previous decisions.
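
One simple way such role suggestions could work, given that role names are kept consistent within a predicate group, is to remember the roles already used per group. The class below is a hypothetical sketch in Python and says nothing about how the tool under development is actually implemented; the group identifiers are invented.

```python
class RoleInventory:
    """Remembers which role names have been used for each predicate group."""

    def __init__(self):
        self._roles = {}

    def record(self, group, role):
        """Store an annotator's decision, keeping first-use order."""
        roles = self._roles.setdefault(group, [])
        if role not in roles:
            roles.append(role)

    def suggest(self, group):
        """Return the roles previously used for this predicate group."""
        return list(self._roles.get(group, []))


inventory = RoleInventory()
# Decision made while annotating the verbal predicate HARMONISE.
inventory.record("harmonise-group", "ENT_HARMONISED")
inventory.record("harmonise-group", "ENT_HARMONISED")  # repeated use, stored once

# Later, when annotating HARMONISATION from the same group:
suggestions = inventory.suggest("harmonise-group")
print(suggestions)
```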
Possibilities of semi-automatic methods to speed up the annotation and
thus reduce the costs of building the treebank are currently being
investigated.[16] Still, quite a bit of manual work will remain. We
believe, however, that the effort that goes into such a gold-standard
parallel treebank is very much worthwhile since the treebank will
eventually prove useful for a number of fields and can be exploited for
numerous applications. To name but a few, translation studies and
contrastive analyses will profit particularly from the explicit
annotation of translational differences. NLP applications such as
Machine Translation could, e. g., exploit the constituent structures of
two languages which are mapped via the predicate-argument structure.
Also, from the disambiguated predicates and their argument structures, a
multilingual valency dictionary could be derived.

[14] There are a few drawbacks to Europarl, such as its limited register
and the fact that it is not easily discernible which language is the
source language. However, we believe that at this stage the easy
accessibility, the amount of preprocessing and particularly the lack of
copyright restrictions make up for these disadvantages.
[15] For details about the Annotate database structure see (Plaehn,
1998b).
[16] One track we follow is to investigate if it is feasible to have the
annotators mark predicate-argument structures on raw texts and have the
phrasal and functional layers added at a later stage, possibly supported
by methods which derive these layers partially from the
predicate-argument structures. This is, however, still very tentative.

References
Stefanie Albert et al. 2003. TIGER Annotationsschema. Technical report,
  Universität des Saarlandes, Universität Stuttgart, Universität
  Potsdam. Unpublished draft, 24 July 2003.
Thorsten Brants. 1999. Tagging and Parsing with Cascaded Markov Models:
  Automation of Corpus Annotation, volume 6 of Saarbrücken Dissertations
  in Computational Linguistics and Language Technology. Saarland
  University, Saarbrücken.
Lea Cyrus, Hendrik Feddes, and Frank Schumacher. 2003. FuSe - a
  multi-layered parallel treebank. Poster presented at the Second
  Workshop on Treebanks and Linguistic Theories, 14-15 November 2003,
  Växjö, Sweden (TLT 2003).
  http://fuse.uni-muenster.de/Publications/0311_tltPoster.pdf.
Lea Cyrus, Hendrik Feddes, and Frank Schumacher. 2004. Annotating
  predicate-argument structure for a parallel treebank. In Charles J.
  Fillmore, Manfred Pinkal, Collin F. Baker, and Katrin Erk, editors,
  Proc. LREC 2004 Workshop on Building Lexical Resources from
  Semantically Annotated Corpora, Lisbon, May 30, 2004, pages 39-46.
  http://fuse.uni-muenster.de/Publications/0405_lrec.pdf.
Hendrik Feddes. 2004. FuSe database structure. Technical report,
  Arbeitsbereich Linguistik, University of Münster.
  http://fuse.uni-muenster.de/Publications/dbStruktur.pdf.
Silvia Hansen-Schirra and Stella Neumann. 2003. The challenge of working
  with multilingual corpora. In Stella Neumann and Silvia Hansen-Schirra,
  editors, Proceedings of the workshop on Multilingual Corpora:
  Linguistic Requirements and Technical Perspectives. Corpus Linguistics
  2003, Lancaster, pages 1-6.
Philipp Koehn. 2002. Europarl: A multilingual corpus for evaluation of
  machine translation. Unpublished draft,
  http://www.isi.edu/~koehn/publications/europarl/.
Mitch Marcus, G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M.
  Ferguson, K. Katz, and B. Schasberger. 1994. The Penn Treebank:
  Annotating predicate argument structure. In Proc. ARPA Human Language
  Technology Workshop.
I. Dan Melamed. 1998. Manual annotation of translational equivalence:
  The Blinker project. Technical Report 98-07, IRCS, University of
  Pennsylvania. http://citeseer.ist.psu.edu/melamed98manual.html.
Oliver Plaehn. 1998a. Annotate Bedienungsanleitung. Technical report,
  Universität des Saarlandes, FR 8.7, Saarbrücken.
  http://www.coli.uni-sb.de/sfb378/negra-corpus/annotate-manual.ps.gz.
Oliver Plaehn. 1998b. Annotate Datenbank-Dokumentation. Technical
  report, Universität des Saarlandes, FR 8.7, Saarbrücken.
  http://www.coli.uni-sb.de/sfb378/negra-corpus/datenbank.ps.gz.
Raphael Salkie. 2002. How can linguists profit from parallel corpora? In
  Lars Borin, editor, Parallel Corpora, Parallel Worlds, pages 93-109.
  Rodopi, Amsterdam.
Diana Santos. 2000. The translation network: A model for a fine-grained
  description of translations. In Jean Véronis, editor, Parallel Text
  Processing: Alignment and Use of Translation Corpora, volume 13 of
  Text, Speech and Language Technology, chapter 8. Kluwer, Dordrecht.
