Methods and Practical Issues in Evaluating Alignment Techniques 
Philippe Langlais 
CTT/KTH, SE-10044 Stockholm 
CERI-LIA, AGROPARC BP 1228 
F-84911 Avignon Cedex 9 
Philippe.Langlais@speech.kth.se 
Michel Simard 
RALI-DIRO 
Univ. de Montréal 
Québec, Canada H3C 3J7 
simardm@IRO.UMontreal.CA 
Jean Véronis 
LPL, Univ. de Provence 
29, Av. R. Schuman 
F-13621 Aix-en-Provence Cedex 1 
veronis@univ-aix.fr 
Abstract 
This paper describes the work achieved in the 
first half of a 4-year cooperative research project 
(ARCADE), financed by AUPELF-UREF. The 
project is devoted to the evaluation of paral- 
lel text alignment techniques. In its first period 
ARCADE ran a competition between six sys- 
tems on a sentence-to-sentence alignment task 
which yielded two main types of results. First, 
a large reference bilingual corpus comprising 
texts of different genres was created, each pre- 
senting various degrees of difficulty with respect 
to the alignment task. 
Second, significant methodological progress 
was made both on the evaluation protocols and 
metrics, and the algorithms used by the dif- 
ferent systems. For the second phase, which is 
now underway, ARCADE has been opened to 
a larger number of teams who will tackle the 
problem of word-level alignment. 
1 Introduction 
In the last few years, there has been a growing 
interest in parallel text alignment techniques. 
These techniques attempt to map various tex- 
tual units to their translation and have proven 
useful for a wide range of applications and tools. 
A simple example of such a tool is probably 
the TransSearch bilingual concordancing system 
(Isabelle et al., 1993), which allows a user to 
query a large archive of existing translations in 
order to find ready-made solutions to specific 
translation problems. Such a tool has proved ex- 
tremely useful not only for translators, but also 
for bilingual lexicographers (Langlois, 1996) and 
terminologists (Dagan and Church, 1994). More 
sophisticated applications based on alignment 
technology have also been the object of recent 
work, such as the automatic building of bilin- 
gual lexical resources (Melamed, 1996; Klavans 
and Tzoukermann, 1995), the automatic verifi- 
cation of translations (Macklovitch, 1995), the 
automatic dictation of translations (Brousseau 
et al., 1995) and even interactive machine trans- 
lation (Foster et al., 1997). 
Enthusiasm for this relatively new field was 
sparked early on by the apparent demonstra- 
tion that very simple techniques could yield al- 
most perfect results. For instance, to produce 
sentence alignments, Brown et al. (1991) and 
Gale and Church (1991) both proposed meth- 
ods that completely ignored the lexical content 
of the texts and both reported accuracy lev- 
els exceeding 98%. Unfortunately performance 
tends to deteriorate significantly when aligners 
are applied to corpora which are widely differ- 
ent from the training corpus, and/or where the 
alignments are not straightforward. For instance 
graphics, tables, "floating" notes and missing 
segments, which are very common in real texts, 
all result in a dramatic loss of efficiency. 
The truth is that, while text alignment is 
mostly an easy problem, especially when consid- 
ered at the sentence level, there are situations 
where even humans have a hard time making 
the right decision. In fact, it could be argued 
that, ultimately, text alignment is no easier than 
the more general problem of natural language 
understanding. 
In addition, most research efforts were 
directed towards the easiest problem, that of 
sentence-to-sentence alignment (Brown et al., 
1991; Gale and Church, 1991; Debili, 1992; 
Kay and Röscheisen, 1993; Simard et al., 1992; 
Simard and Plamondon, 1996). Alignment at 
the word and term level, which is extremely 
useful for applications such as lexical resource 
extraction, is still a largely unexplored research 
area (Melamed, 1997). 
In order to live up to the expectations of the 
various application fields, alignment technology 
will therefore have to improve substantially. 
As was the case with several other language 
processing techniques (such as information 
retrieval, document understanding or speech 
recognition), it is likely that a systematic evalu- 
ation will enable such improvements. However, 
before the ARCADE project started, no formal 
evaluation exercise was underway; and 
worse still, there was no multilingual aligned 
reference corpus to serve as a "gold standard" 
(as the Brown corpus did, for example, for 
part of speech tagging), nor any established 
methodology for the evaluation of alignment 
systems. 
2 Organization 
ARCADE is an evaluation exercise financed 
by AUPELF-UREF, a network of (at least 
partially) French-speaking universities. It was 
launched in 1995 to promote research in the 
field of multilingual alignment. The first 2-year 
period (96-97) was dedicated to two main 
tasks: 1) producing a reference bilingual corpus 
(French-English) aligned at sentence level; 2) 
evaluating several sentence alignment systems 
through an ARPA-like competition. 
In the first phase of ARCADE, two types of 
teams were involved in the project: the corpus 
providers (LPL and RALI) and the participants 
(RALI, LORIA, ISSCO, IRMC and LIA). General coor- 
dination was handled by J. Véronis (LPL); a 
discussion group was set up and moderated by 
Ph. Langlais (LIA & KTH). 
3 Reference corpus 
One of the main results of ARCADE has been 
to produce an aligned French-English corpus, 
combining texts of different genres and various 
degrees of difficulty for the alignment task. It 
is important to mention that until ARCADE, 
most alignment systems had been tested on ju- 
dicial and technical texts which present rela- 
tively few difficulties for a sentence-level align- 
ment. Therefore, diversity in the nature of the 
texts was preferred to the collection of a large 
quantity of similar data. 
3.1 Format 
ARCADE contributed to the development 
and testing of the Corpus Encoding Standard 
(CES), which was initiated during the MUL- 
TEXT project (Ide et al., 1995). The CES is 
based on SGML and it is an extension of the 
now internationally-accepted recommendations 
of the Text Encoding Initiative (Ide and 
Véronis, 1995). Both the JOC and BAF parts 
of the ARCADE corpus (described below) are 
encoded in CES format. 
3.2 JOC 
The JOC corpus contains texts which were pub- 
lished in 1993 as a section of the C Series of the 
Official Journal of the European Community in 
all of its official languages. This corpus, which 
was collected and prepared during the MLCC 
and MULTEXT projects, contains, in 9 parallel 
versions, questions asked by members of the Eu- 
ropean Parliament on a variety of topics and the 
corresponding answers from the European Com- 
mission. JOC contains approximately 10 million 
words (ca. 1.1 million words per language). The 
part used in ARCADE was composed of one fifth 
of the French and English sections (ca. 200 000 
words per language). 
3.3 BAF 
The BAF corpus is also a set of parallel French- 
English texts of about 400 000 words per lan- 
guage. It includes four text genres: 1) INST, 
four institutional texts (including transcription 
of speech from the Hansard corpus) totaling 
close to 300 000 words per language, 2) SCI- 
ENCE, five scientific articles of about 50 000 
words per language, 3) TECH, technical doc- 
umentation of about 40 000 words per language 
and 4) VERNE, the Jules Verne novel: "De 
la terre à la lune" (ca. 50 000 words per lan- 
guage). This last text is very interesting because 
the translation of literary texts is much freer 
than that of other types of texts. Furthermore, 
the English version is slightly abridged, which 
adds the problem of detecting missing segments. 
The BAF corpus is described in greater detail 
in (Simard, 1998). 
4 Evaluation measures 
We first propose a formal definition of paral- 
lel text alignment, as defined in (Isabelle and 
Simard, 1996). Based on that definition, the 
usual notions of recall and precision can be used 
to evaluate the quality of a given alignment with 
respect to a reference. However, recall and preci- 
sion can be computed for various levels of gran- 
ularity: an alignment at a given level (e.g. sen- 
tences) can be measured in terms of units of a 
lower level (e.g. words, characters). Such a fine- 
grained measure is less sensitive to segmenta- 
tion problems, and can be used to weight errors 
according to the number of sub-units they span. 
4.1 Formal definition 
If we consider a text S and its translation T as two sets of segments $S = \{s_1, s_2, \ldots, s_n\}$ and $T = \{t_1, t_2, \ldots, t_m\}$, an alignment A between S and T can be defined as a subset of the Cartesian product $\wp(S) \times \wp(T)$, where $\wp(S)$ and $\wp(T)$ are respectively the sets of all subsets of S and T. The triple $(S, T, A)$ will be called a bitext. Each of the elements (ordered pairs) of the alignment will be called a bisegment. 
This definition is fairly general. However, in 
the evaluation exercise described here, segments 
were sentences and were supposed to be contigu- 
ous, yielding monotonic alignments. 
For instance, let us consider the fol- 
lowing alignment, which will serve as the 
reference alignment in the subsequent ex- 
amples. The formal representation of it is: 

$A_r = \{(\{s_1\}, \{t_1\}), (\{s_2\}, \{t_2, t_3\})\}$ 

s1: Phrase numéro un. 
s2: Phrase numéro deux qui ressemble à la 1ère. 

t1: The first sentence. 
t2: The 2nd sentence. 
t3: It looks like the first. 
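
To make the set-based definition concrete, the reference alignment above can be written directly as a set of pairs of frozensets. The following Python snippet is only an illustrative sketch (the identifiers s1, s2, t1, t2, t3 simply stand in for the sentences); it is not part of the original formalism.

```python
# A bisegment pairs a subset of source segments with a subset of target
# segments; an alignment is a set of such bisegments.
# Reference alignment A_r = {({s1}, {t1}), ({s2}, {t2, t3})}
A_r = {
    (frozenset({"s1"}), frozenset({"t1"})),
    (frozenset({"s2"}), frozenset({"t2", "t3"})),
}
```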
4.2 Recall and precision 
Let us consider a bitext $(S, T, A_r)$ and a proposed alignment A. The alignment recall with respect to the reference $A_r$ is defined as: 

$recall = |A \cap A_r| / |A_r|$ 

It represents the proportion of bisegments in A that are correct with respect to the reference $A_r$. The silence corresponds to $1 - recall$. The alignment precision with respect to the reference $A_r$ is defined as: 

$precision = |A \cap A_r| / |A|$ 

It represents the proportion of bisegments in A that are correct with respect to the number of bisegments proposed. The noise corresponds to $1 - precision$. 
We will also use the F-measure (Rijsbergen, 1979), which combines recall and precision into a single efficiency measure (the harmonic mean of precision and recall): 

$F = \dfrac{2 \times recall \times precision}{recall + precision}$ 
Let us assume the following proposed alignment: 

s1: Phrase numéro un. -- t1: The first sentence. 
(nothing) -- t2: The 2nd sentence. 
s2: Phrase numéro deux qui ressemble à la 1ère. -- t3: It looks like the first. 

The formal representation of this alignment is: 

$A = \{(\{s_1\}, \{t_1\}), (\{\}, \{t_2\}), (\{s_2\}, \{t_3\})\}$ 

We note that $A \cap A_r = \{(\{s_1\}, \{t_1\})\}$. Alignment recall and precision with respect to $A_r$ are 1/2 = 0.50 and 1/3 = 0.33 respectively. The F-measure is 0.40. 
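
As an illustration, alignment-level recall, precision and F-measure reduce to a few set operations on the representation sketched earlier. The code below is not ARCADE's scoring software, only a minimal rendering of the formulas; on the example it yields 0.50, 0.33 and 0.40.

```python
def alignment_scores(A, A_r):
    """Alignment-level recall, precision and F-measure against a reference."""
    correct = len(A & A_r)              # bisegments also found in the reference
    recall = correct / len(A_r)
    precision = correct / len(A)
    f = 2 * recall * precision / (recall + precision) if correct else 0.0
    return recall, precision, f

A_r = {(frozenset({"s1"}), frozenset({"t1"})),
       (frozenset({"s2"}), frozenset({"t2", "t3"}))}
A = {(frozenset({"s1"}), frozenset({"t1"})),
     (frozenset(), frozenset({"t2"})),
     (frozenset({"s2"}), frozenset({"t3"}))}

print(alignment_scores(A, A_r))         # (0.5, 0.333..., 0.4)
```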
Improving recall and improving precision are 
antagonistic goals: efforts to improve one often 
result in degrading the other. Depending on the 
applications, different trade-offs can be sought. 
For example, if the bisegments are used to auto- 
matically generate a bilingual dictionary, maxi- 
mizing precision (i.e. omitting doubtful couples) 
is likely to be the preferred option. 
Recall and precision as defined above are 
rather unforgiving. They do not take into ac- 
count the fact that some bisegments could be 
partially correct. In the previous example, the 
bisegment ({s2}, {t3}) does not belong to the 
reference, but can be considered as partially cor- 
rect: t3 does match a part of s2. To take partial 
correctness into account, we need to compute re- 
call and precision at the sentence level instead 
of the alignment level. 
Assuming the alignment $A = \{a_1, a_2, \ldots, a_m\}$ and the reference $A_r = \{a_{r1}, a_{r2}, \ldots, a_{rn}\}$, with $a_i = (as_i, at_i)$ and $a_{rj} = (ars_j, art_j)$, we can derive the following sentence-to-sentence alignments: 

$A' = \bigcup_i (as_i \times at_i)$ 
$A'_r = \bigcup_j (ars_j \times art_j)$ 

Sentence-level recall and precision can thus be defined in the following way: 

$recall = |A' \cap A'_r| / |A'_r|$ 
$precision = |A' \cap A'_r| / |A'|$ 
In the example above, $A' = \{(s_1, t_1), (s_2, t_3)\}$ and $A'_r = \{(s_1, t_1), (s_2, t_2), (s_2, t_3)\}$. Sentence-level recall and precision for this example are therefore 2/3 = 0.66 and 1 respectively, as compared to the alignment-level recall and precision, 0.50 and 0.33 respectively. The F-measure becomes 0.80 instead of 0.40. 
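
The sentence-level measures can be obtained by expanding each bisegment into the sentence pairs it covers before intersecting with the reference. The sketch below is again illustrative only (it reuses the A and A_r of the previous snippet) and reproduces the values 0.66, 1 and 0.80.

```python
from itertools import product

def to_sentence_pairs(alignment):
    """Expand each bisegment (source set, target set) into sentence pairs."""
    return {pair for src, tgt in alignment for pair in product(src, tgt)}

def sentence_level_scores(A, A_r):
    Ap, Arp = to_sentence_pairs(A), to_sentence_pairs(A_r)
    correct = len(Ap & Arp)
    recall, precision = correct / len(Arp), correct / len(Ap)
    f = 2 * recall * precision / (recall + precision) if correct else 0.0
    return recall, precision, f

# With the A and A_r of the previous example:
# A'  = {(s1, t1), (s2, t3)},  A'_r = {(s1, t1), (s2, t2), (s2, t3)}
# -> recall 2/3 = 0.66, precision 1, F = 0.80
```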
4.3 Granularity 
In the definitions above, the sentence is the unit 
of granularity used for the computation of recall 
and precision at both levels. This results in two 
difficulties. First, the measures are very sensi- 
tive to sentence segmentation errors. Secondly, 
they do not reflect the seriousness of misalign- 
ments. It seems reasonable that errors involving 
short sentences should be less penalized than 
errors involving longer ones, at least from the 
perspective of some applications. 
These problems can be avoided by taking ad- 
vantage of the fact that a unit of a given gran- 
ularity (e.g. sentence) can always be seen as 
a (possibly discontinuous) sequence of units of 
finer granularity (e.g. character). 
Thus, when an alignment A is compared to 
a reference alignment Ar using the recall and 
precision measures computed at the char-level, 
the values obtained are inversely proportional to 
the quantity of text (i.e. number of characters) 
in the misaligned sentences, instead of the num- 
ber of these misaligned sentences. For instance, 
in the example used above, we would have at 
sentence level: 
• using word granularity (punctuation marks are counted as words): 

$|A'| = 4 \times 4 + 0 \times 4 + 9 \times 6 = 70$ 
$|A'_r| = 4 \times 4 + 9 \times 10 = 106$ 
$|A' \cap A'_r| = 4 \times 4 + 9 \times 6 = 70$ 
recall = 70/106 = 0.66 
precision = 1 
F = 0.80 

• using character granularity (excluding spaces): 

$|A'| = 15 \times 17 + 0 \times 15 + 36 \times 20 = 975$ 
$|A'_r| = 15 \times 17 + 36 \times 35 = 1515$ 
$|A' \cap A'_r| = 15 \times 17 + 36 \times 20 = 975$ 
recall = 975/1515 = 0.64 
precision = 1 
F = 0.78 
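
The same computation extends to finer granularities by expanding each sentence into its words or characters before taking the cross products, so that a misalignment is weighted by the amount of text it involves. The sketch below (again illustrative, not the ARCADE scorer) uses the per-sentence unit counts of the example and reproduces the figures above.

```python
from itertools import product

def to_unit_pairs(alignment, size):
    """Expand bisegments into pairs of finer-grained units (words or chars).

    size[x] is the number of units in sentence x; a unit is identified by
    (sentence id, position) so units of different sentences never collide.
    """
    pairs = set()
    for src, tgt in alignment:
        src_units = [(s, i) for s in src for i in range(size[s])]
        tgt_units = [(t, j) for t in tgt for j in range(size[t])]
        pairs.update(product(src_units, tgt_units))
    return pairs

words = {"s1": 4, "s2": 9, "t1": 4, "t2": 4, "t3": 6}       # punctuation counted
chars = {"s1": 15, "s2": 36, "t1": 17, "t2": 15, "t3": 20}  # spaces excluded

A_r = {(frozenset({"s1"}), frozenset({"t1"})),
       (frozenset({"s2"}), frozenset({"t2", "t3"}))}
A = {(frozenset({"s1"}), frozenset({"t1"})),
     (frozenset(), frozenset({"t2"})),
     (frozenset({"s2"}), frozenset({"t3"}))}

for size in (words, chars):
    Ap, Arp = to_unit_pairs(A, size), to_unit_pairs(A_r, size)
    correct = len(Ap & Arp)
    print(correct / len(Arp), correct / len(Ap))   # ~0.66 1.0, then ~0.64 1.0
```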
5 Systems tested 
Six systems were tested, two of which were 
submitted by RALI. 
RALI/Jacal This system uses as a first step 
a program that reduces the search space only to 
those sentence pairs that are potentially inter- 
esting (Simard and Plamondon, 1996). The un- 
derlying principle is the automatic detection of 
isolated cognates (i.e. cognates for which no other similar 
word exists within a window of a given size). Once the 
search space is reduced, the system aligns the 
sentences using the well-known sentence-length 
model described in (Gale and Church, 1991). 
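
The paper does not spell out the length-based model; as a rough illustration only, here is a minimal dynamic-programming sentence aligner in the spirit of Gale and Church. The bead penalties and the length-mismatch cost are ad hoc assumptions for the sketch, not the published probabilistic model, and no search-space pruning is shown.

```python
# Toy length-based aligner: dynamic programming over "beads"
# (0-1, 1-0, 1-1, 1-2, 2-1, 2-2) with an illustrative cost function.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0,
         (1, 2): 2.0, (2, 1): 2.0, (2, 2): 3.0}

def match_cost(ls, lt):
    """0 when character lengths agree, growing with relative mismatch."""
    mean = (ls + lt) / 2 or 1
    return abs(ls - lt) / mean

def align(src, tgt):
    """Return (source slice, target slice) pairs minimizing the total cost."""
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}               # (i, j) -> (cost, predecessor)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) == (0, 0):
                continue
            cands = []
            for (di, dj), penalty in BEADS.items():
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                ls = sum(len(s) for s in src[pi:i])
                lt = sum(len(t) for t in tgt[pj:j])
                cands.append((best[(pi, pj)][0] + penalty + match_cost(ls, lt),
                              (pi, pj)))
            best[(i, j)] = min(cands)
    beads, (i, j) = [], (n, m)                 # backtrack from the end
    while (i, j) != (0, 0):
        pi, pj = best[(i, j)][1]
        beads.append((slice(pi, i), slice(pj, j)))
        i, j = pi, pj
    return beads[::-1]
```

A real aligner would use a probabilistic length model and, as in Jacal, first restrict the search space before running the dynamic programming.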
RALI/Salign The second method proposed 
by RALI is based on a dynamic programming 
scheme which uses a score function derived from 
a translation model similar to that of (Brown 
et al., 1990). The search space is reduced to a 
beam of fixed width around the diagonal (which 
would represent the alignment if the two texts 
were perfectly synchronized). 
LORIA The strategy adopted in this system 
differs from that of the other systems since sen- 
tence alignment is performed after the prelim- 
inary alignment of larger units (whenever pos- 
sible, using mark-up), such as paragraphs and 
divisions, on the basis of the SGML structure. 
A dynamic programming scheme is applied to 
all alignment levels in successive steps. 
IRMC This system involves a preliminary, 
rough word alignment step which uses a trans- 
fer dictionary and a measure of the proximity of 
words (Débili et al., 1994). Sentence alignment 
is then achieved by an algorithm which opti- 
mizes several criteria such as word-order con- 
servation and synchronization between the two 
texts. 
LIA Like Jacal, the LIA system uses a 
pre-processing step involving cognate recog- 
nition which restricts the search space, but 
in a less restrictive way. Sentence alignment 
is then achieved through dynamic program- 
ming, using a score function which combines 
sentence length, cognates, transfer dictionary 
and frequency of translation schemes (1-1, 1-2, 
etc.). 
ISSCO Like the LORIA system, the ISSCO 
aligner is sensitive to the macro-structure of 
the document. It examines the tree structure 
of an SGML document in a first pass, weighting 
each node according to the number of charac- 
ters contained within the subtree rooted at that 
node. The second pass descends the tree, first 
714 
by depth, then by breadth, while aligning sen- 
tences using a method resembling that of Gale 
& Church. 
6 Results 
Four sets of recall/precision measures were com- 
puted for the alignments achieved by the six 
systems for each text type described above: 
Align (alignment-level), Sent (sentence-level), 
Word (word-level) and Char (character- 
level). The global efficiency of the different sys- 
tems (average F-values) for each text type is 
given in Figure 1. 
Figure 1: Global efficiency (average F-values for 
Align, Sent, Word and Char measures) of the 
different systems (Jacal, Salign, LORIA, IRMC, 
ISSCO, LIA), by text type (logarithmic scale). 
First, note that the Char measures are higher 
than the Align measures. This seems to con- 
firm that systems tend to fail when dealing 
with shorter sentences. In addition, the refer- 
ence alignment for the BAF corpus combines 
several 1-1 alignments in a single n-n align- 
ment, for practical reasons owing to the sen- 
tence segmentation process. This results in de- 
creased Align measures. 
The corpus on which all systems scored high- 
est was the JOC. This corpus is relatively sim- 
ple to align, since it contains 94% of 1-1 align- 
ments, reflecting a translation strategy based 
on speed and absolute fidelity. In addition, this 
corpus contains a large amount of data that 
remains unchanged during the translation pro- 
cess (proper names, dates, etc.) and which can 
serve as anchor points for some systems. Note 
that the LORIA system achieves a slightly bet- 
ter performance than the others on this cor- 
pus, mainly because it is able to carry out a 
structure-alignment since paragraphs and divi- 
sions are explicitly marked. 
The worst results were achieved on the 
VERNE corpus. This is also the corpus for 
which the results showed the most scattering 
across systems (22% to 90% char-precision). 
These poor results are linked to the literary 
nature of the corpus, where translation is freer 
and more interpretative. In addition, since the 
English version is slightly abridged, the occa- 
sional omissions result in de-synchronization 
in most systems. Nevertheless, the LIA sys- 
tem still achieves a satisfactory performance 
(90% char-recall and 94% char-precision), 
which can be explained by the efficiency of its 
word-based pre-alignment step, as well as the 
scoring function used to rank the candidate 
bisegments. 
Significant discrepancies are also noted be- 
tween the Align and Char recalls on the TECH 
corpus. This document contained a large 
glossary as an appendix, and since the terms 
are sorted in alphabetic order, they are ordered 
differently in each language. This portion of 
text was not manually aligned in the reference. 
The size of this bisegment (250-250) drastically 
lowers the Char-recall. Aligning two glossaries 
can be seen as a document-structure alignment 
task rather than a sentence-alignment task. 
Since the goal of the evaluation was sentence 
alignment, the TECH corpus results were not 
taken into account in the final grading of the 
systems. 
The overall ranking for all systems (excluding 
the TECH corpus results) is given in Figure 2, 
in terms of the Sent and Char F-measures. The 
LIA system obtains the best average results and 
shows good stability across texts, which is an 
important criterion for many applications. 

Figure 2: Final ranking of the systems (average F-values). 
7 Conclusion and future work 
The ARCADE evaluation exercise has allowed 
for significant methodological progress on paral- 
lel text alignment. The discussions among par- 
ticipants on the question of a testing proto- 
col resulted in the definition of several evalu- 
ation measures and an assessment of their rela- 
tive merits. The comparative study of the sys- 
tems performance also yielded a better under- 
standing of the various techniques involved. As 
a significant spin-off, the project has produced 
a large aligned bilingual corpus, composed of 
several types of texts, which can be used as a 
gold standard for future evaluation. Grounded 
on the experience gained in the first test cam- 
paign, the second (1998-1999) has been opened 
to more teams and plans to tackle more difficult 
problems, such as word-level alignment. 1 
Acknowledgments 
This work has been partially funded by 
AUPELF-UREF. We are indebted to Lucie 
Langlois and Elliott Macklovitch for their 
fruitful comments on this paper. 

References 
J. Brousseau, C. Drouin, G. Foster, P. Isabelle, 
R. Kuhn, Y. Normandin, and P. Plamondon. 
1995. French Speech Recognition in an 
Automatic Dictation System for Translators: 
the TransTalk Project. In Proceedings of Eu- 
rospeech 95, Madrid, Spain. 
1 For more information check the Web site at 
http://www.lpl.univ-aix.fr/projects/arcade 
P. F. Brown, J. Cocke, S. A. Della Pietra, 
V. J. Della Pietra, F. Jelinek, J. D. Lafferty, 
R. L. Mercer, and P. S. Roossin. 1990. A Sta- 
tistical Approach to Machine Translation. 
Computational Linguistics, volume 16, pages 
79-85, June. 
P.F. Brown, J.C. Lai, and R.L. Mercer. 1991. 
Aligning Sentences in Parallel Corpora. In 
29th Annual Meeting of the Association for 
Computational Linguistics, pages 169-176, 
Berkeley, CA, USA. 
Ido Dagan and Kenneth W. Church. 1994. Ter- 
might: Identifying and Translating Techni- 
cal Terminology. In Proceedings of ANLP-94, 
Stuttgart, Germany. 
F. Débili, E. Sammouda, and A. Zribi. 1994. De 
l'appariement des mots à la comparaison de 
phrases. In 9ème Congrès de Reconnaissance 
des Formes et Intelligence Artificielle, Paris, 
January. 
F. Debili. 1992. Aligning Sentences in Bilingual 
Texts French - English and French - Arabic. 
In COLING, pages 517-525, Nantes, 23-28 
August. 
George Foster, Pierre Isabelle, and Pierre Pla- 
mondon. 1997. Target-Text Mediated Inter- 
active Machine Translation. Machine Trans- 
lation, 21(1-2). 
W. A. Gale and Kenneth W. Church. 1991. 
A Program for Aligning Sentences in Bilin- 
gual Corpora. In 29th Annual Meeting of 
the Association for Computational Linguis- 
tics, Berkeley, CA. 
N. Ide and J. Véronis. 1995. The Text Encod- 
ing Initiative: Background and Context. 
Kluwer Academic Publishers, Dordrecht, 
342 p. 
N. Ide, G. Priest-Dorman, and J. Véronis. 
1995. Corpus Encoding Standard. 
Report. Accessible on the World 
Wide Web: http://www.lpl.univ- 
aix.fr/projects/multext/CES/CES1.html. 
Pierre Isabelle and Michel Simard. 
1996. Propositions pour la 
représentation et l'évaluation des 
alignements de textes parallèles. 
http://www-rali.iro.umontreal.ca/arc-a2/- 
PropEval. 
Pierre Isabelle, Marc Dymetman, George Fos- 
ter, Jean-Marc Jutras, Elliott Macklovitch, 
François Perrault, Xiaobo Ren, and Michel 
Simard. 1993. Translation Analysis and 
Translation Automation. In Proceedings of 
TMI-93, Kyoto, Japan. 
M. Kay and M. Röscheisen. 1993. Text- 
translation alignment. Computational Lin- 
guistics, 19(1):121-142. 
Judith Klavans and Evelyne Tzoukermann. 
1995. Combining Corpus and Machine- 
readable Dictionary Data for Building Bilin- 
gual Lexicons. Machine Translation, 10(3). 
Lucie Langlois. 1996. Bilingual Concordances: 
A New Tool for Bilingual Lexicographers. In 
Proceedings of AMTA-96, Montreal, Canada. 
Elliott Macklovitch. 1995. TransCheck -- or 
the Automatic Validation of Human Trans- 
lations. In Proceedings of the MT Summit V, 
Luxembourg. 
I. Dan Melamed. 1996. Automatic Con- 
struction of Clean Broad-coverage Transla- 
tion Lexicons. In Proceedings of AMTA-96, 
Montreal, Canada. 
I. Dan Melamed. 1997. A portable algorithm 
for mapping bitext correspondence. In 35th 
Conference of the Association for Computa- 
tional Linguistics, Madrid, Spain. 
C. J. Van Rijsbergen. 1979. Information Re- 
trieval, 2nd edition. Butterworths, London. 
M. Simard and P. Plamondon. 1996. Bilingual 
sentence alignment: Balancing robustness and 
accuracy. In Proceedings of the Second Con- 
ference of the Association for Machine Trans- 
lation in the Americas (AMTA), Montreal, 
Quebec. 
M. Simard, G.F. Foster, and P. Isabelle. 1992. 
Using Cognates to Align Sentences in Bilin- 
gual Corpora. In Fourth International Con- 
ference on Theoretical and Methodological Is- 
sues in Machine Translation (TMI), pages 
67-81, Montréal, Canada. 
M. Simard. 1998. The BAF: A corpus of 
English-French Bitext. In First International 
Conference on Language Resources and Eval- 
uation, Granada, Spain. 
