Using GATE as an Environment for Teaching NLP
Kalina Bontcheva, Hamish Cunningham, Valentin Tablan, Diana Maynard, Oana Hamza
Department of Computer Science
University of Sheffield
Sheffield, S1 4DP, UK
{kalina,hamish,valyt,diana,oana}@dcs.shef.ac.uk
Abstract
In this paper we argue that the GATE
architecture and visual development
environment can be used as an effec-
tive tool for teaching language engi-
neering and computational linguistics.
Since GATE comes with a customis-
able and extendable set of components,
it allows students to get hands-on ex-
perience with building NLP applica-
tions. GATE also has tools for cor-
pus annotation and performance eval-
uation, so students can go through the
entire application development process
within its graphical development en-
vironment. Finally, it offers com-
prehensive Unicode-compliant multi-
lingual support, thus allowing stu-
dents to create components for lan-
guages other than English. Unlike
other NLP teaching tools which were
designed specifically and only for this
purpose, GATE is a system developed
for and used actively in language en-
gineering research. This unique dual-
ity allows students to contribute to re-
search projects and gain skills in em-
bedding HLT in practical applications.
1 Introduction
When students learn programming, they have
the benefit of integrated development environ-
ments, which support them throughout the en-
tire application development process: from writ-
ing the code, through testing, to documenta-
tion. In addition, these environments offer sup-
port and automation of common tasks, e.g., user
interfaces can be designed easily by assembling
them visually from components like menus and
windows. Similarly, NLP and CL students can
benefit from the existence of a graphical devel-
opment environment, which allows them to get
hands-on experience in every aspect of develop-
ing and evaluating language processing modules.
In addition, such a tool would enable students to
see clearly the practical relevance and need for
language processing, by allowing them to exper-
iment easily with building NLP-powered (Web)
applications.
This paper shows how an existing infrastruc-
ture for language engineering research – GATE
(Cunningham et al., 2002a; Cunningham, 2002)
– has been used successfully as an NLP teach-
ing environment, in addition to being a suc-
cessful vehicle for building NLP applications
and reusable components (Maynard et al., 2002;
Maynard et al., 2001). The key features of
GATE which make it particularly suitable for
teaching are:
• The system is designed to separate cleanly
low-level tasks such as data storage, data
visualisation, location and loading of com-
ponents and execution of processes from the
data structures and algorithms that actu-
ally process human language. In this way,
the students can concentrate on studying
and/or modifying the NLP data and algo-
rithms, while leaving the mundane tasks to
GATE.
Proceedings of the Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia, July 2002, pp. 54-62. Association for Computational Linguistics.
• It automates the measurement of the
performance of language processing
components and provides facilities for
creating the annotated corpora needed for
that.
• It provides a baseline set of language
processing components that can be extended
and/or replaced by students as required.
These modules typically separate the
linguistic data clearly from the algorithms
that use it, thus allowing teachers to
present them separately and students to
adapt the modules to new domains/languages
just by modifying the linguistic data.
• It comes with exhaustive documenta-
tion, tutorials, and online movie demon-
strations, available on its Web site
(http://gate.ac.uk).
GATE and its language processing modules
were developed to promote robustness and scala-
bility of NLP approaches and applications, with
an emphasis on language engineering research.
Therefore, NLP/LE courses based on GATE
offer students the opportunity to learn from
non-toy applications, running on big, realistic
datasets (e.g., the British National Corpus or news
collected by a Web crawler). This unique re-
search/teaching duality also allows students to
contribute to research projects and gain skills in
embedding HLT in practical applications.
2 GATE from a Teaching
Perspective
GATE (Cunningham et al., 2002a) is an archi-
tecture, a framework and a development envi-
ronment for human language technology mod-
ules and applications. It comes with a set of
reusable modules, which are able to perform ba-
sic language processing tasks such as POS tag-
ging and semantic tagging. These eliminate the
need for students to re-implement useful algo-
rithms and modules, which are pre-requisites
for completing their assignments. For
example, Marin Dimitrov from Sofia University
successfully completed his master’s degree by
implementing a lightweight approach to
pronominal coreference resolution for named
entities (see footnote 1), which uses GATE’s
reusable modules for the earlier processing
and builds upon their results (see Section 4).
For courses where the emphasis is more on
linguistic annotation and corpus work, GATE
can be used as a corpus annotation environment
(see http://gate.ac.uk/talks/tutorial3/). The
annotation can be done completely manually
or it can be bootstrapped by running some
of GATE’s processing resources over the cor-
pus and then correcting/adding new annota-
tions manually. These facilities can also be used
in courses and assignments where the students
need to learn how to create data for quantitative
evaluation of NLP systems.
If evaluated against the requirements for
teaching environments discussed in (Loper and
Bird, 2002), GATE covers them all quite well.
The graphical development environment and
the JAPE language facilitate otherwise difficult
tasks. Inter-module consistency is achieved by
using the annotations model to hold language
data, while extensibility and modularity are the
very reason why GATE has been successfully
used in many research projects (Maynard et al.,
2000). In addition, GATE also offers robustness
and scalability, which allow students to experi-
ment with big corpora, such as the British Na-
tional Corpus (approx. 4GB). In the following
subsections we will provide further detail about
these aspects of GATE.
2.1 GATE’s Graphical Development
Environment
GATE comes with a graphical development
environment (or GATE GUI) that helps students
inspect the language processing results and
debug the modules. The environment has
facilities to view documents, corpora,
ontologies (including the popular Protégé
editor (Noy et al., 2001)), and linguistic
data (expressed as annotations, see below);
e.g., Figure 1 shows the document viewer with
some annotations highlighted. It also shows
the resource panel on the left with all loaded
applications, language resources, and
processing resources (i.e., modules). There
are also viewers/editors for complex
linguistic data like coreference chains
(Figure 2) and syntax trees (Figure 3). New
graphical components can be integrated easily,
thus allowing lecturers to customise the
environment as necessary. The GATE team is
also developing new visualisation modules,
especially a visual JAPE rule development
tool.
Figure 1: GATE’s visual development environment
1 The thesis is available at
http://www.ontotext.com/ie/thesis-m.pdf
2.2 GATE API and Data Model
The central concept that needs to be learned by
the students before they start using GATE is
the annotation data model, which encodes all
linguistic data and is used as input and out-
put for all modules. GATE uses a single uni-
fied model of annotation - a modified form of
the TIPSTER format (Grishman, 1997) which
has been made largely compatible with the Atlas
format (Bird and Liberman, 1999). Annotations
are characterised by a type and a set of features
represented as attribute-value pairs. The anno-
tations are stored in structures called annotation
sets which constitute independent layers of an-
notation over the text content. The annotations
format is independent of any particular
linguistic formalism, in order to enable the
use of modules based on different linguistic
theories. This generality enables the
representation of a wide variety of linguistic
information, ranging from very simple (e.g.,
tokeniser results) to very complex (e.g.,
parse trees and discourse representation:
examples in (Saggion et al., 2002)). In
addition, the annotation format allows the
representation of incomplete linguistic
structures, e.g., partial-parsing results.
GATE’s tree viewing component has been written
especially to be able to display such
disconnected and incomplete trees.
Figure 2: The coreference chains viewer
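The annotation model described above can be illustrated with a small, self-contained Python sketch. It is not GATE's Java API: the class names, the `annotate` and `covered_text` helpers, the set names and the example sentence are all assumptions made for illustration. It shows only the ideas from the text: an annotation has a type, a span over the text and attribute-value features; annotation sets form independent layers; and partial structures are allowed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """One annotation: a type, a character span, and attribute-value features."""
    type: str
    start: int
    end: int
    features: tuple = ()  # sorted (name, value) pairs

    def feature(self, name, default=None):
        return dict(self.features).get(name, default)


@dataclass
class Document:
    """Text content plus named annotation sets (independent layers over the text)."""
    text: str
    annotation_sets: dict = field(default_factory=dict)

    def annotate(self, set_name, type_, start, end, **features):
        ann = Annotation(type_, start, end, tuple(sorted(features.items())))
        self.annotation_sets.setdefault(set_name, []).append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = Document("Acme Ltd. hired John Smith.")
# A tokeniser-style result in one layer (hypothetical set name "Default").
token = doc.annotate("Default", "Token", 0, 4,
                     kind="word", orthography="upperInitial")
# A partial analysis in another layer: a single phrase, no full parse required.
phrase = doc.annotate("Syntax", "NP", 16, 26)
```

Because every module reads and writes the same kind of structure, a module written later (for example, a chunker) can rely on the token layer without knowing how it was produced.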
GATE is implemented in Java, which makes
it easier for students to use, because
typically they are already familiar with this
language from their programming courses. The
GATE API (Application Programming Interface)
is fully documented in Javadoc, and examples
are given in the comprehensive User
Guide (Cunningham et al., 2002b). However,
students generally do not need to familiarise
themselves with Java and the API at all, be-
cause the majority of the modules are based on
GATE’s JAPE language, so customisation of ex-
isting and development of new modules only re-
quires knowledge of JAPE and the annotation
model described above.
JAPE is a version of CPSL (Common Pattern
Specification Language) (Appelt, 1996) and is
used to describe patterns to match and
annotations to be created as a result (for
further details see (Cunningham et al.,
2002b)). Once familiar with GATE’s data model,
students would not find it difficult to write
the JAPE pattern-based rules, because they are
effectively regular expressions, which is a
concept familiar to most CS students.
Figure 3: The syntax tree viewer, showing a
partial syntax tree for a sentence from a
telecom news text
An example rule from an existing named en-
tity recognition grammar is:
Rule: Company1
Priority: 25
(
({Token.orthography == upperInitial})+
{Lookup.kind == companyDesignator}
):companyMatch
-->
:companyMatch.NamedEntity =
{kind = "company", rule = "Company1"}
The rule matches a pattern consisting of one
or more words that start with an upper-case
letter (recognised by the tokeniser), followed
by one of the entries in the gazetteer list
for company designators (words which typically
indicate companies, such as ‘Ltd.’ and
‘GmbH’). It then annotates this pattern with
the annotation type “NamedEntity”, and gives
it a feature “kind” with value “company” and
another feature “rule” with value “Company1”.
The rule feature is used simply for debugging
purposes, so it is clear which particular rule
has fired to create the annotation.
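To make the behaviour of the Company1 rule concrete, the following Python sketch imitates it over a list of tokens. It is only an illustration under simplifying assumptions: tokens are plain dictionaries, the gazetteer result is stored directly on the token as a hypothetical `lookup_kind` key, and the matching strategy (longest run with backtracking) is our own simplification, not the algorithm of the JAPE transducer.

```python
def match_company1(tokens):
    """Imitate the Company1 rule over a list of token dictionaries.

    Each token has 'string' and 'orthography' keys and, if a gazetteer entry
    matched it, a 'lookup_kind' key (hypothetical representation).
    Returns NamedEntity records with spans given as token indices.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # ({Token.orthography == upperInitial})+ : longest run starting at i
        run_end = i
        while (run_end < len(tokens)
               and tokens[run_end].get("orthography") == "upperInitial"):
            run_end += 1
        # Shorten the run until a company designator follows it.
        found = None
        for k in range(run_end, i, -1):
            if (k < len(tokens)
                    and tokens[k].get("lookup_kind") == "companyDesignator"):
                found = (i, k + 1)
                break
        if found:
            matches.append({"type": "NamedEntity", "span": found,
                            "features": {"kind": "company",
                                         "rule": "Company1"}})
            i = found[1]
        else:
            i += 1
    return matches


tokens = [
    {"string": "We", "orthography": "upperInitial"},
    {"string": "visited", "orthography": "lowercase"},
    {"string": "Acme", "orthography": "upperInitial"},
    {"string": "Widgets", "orthography": "upperInitial"},
    {"string": "Ltd.", "orthography": "upperInitial",
     "lookup_kind": "companyDesignator"},
    {"string": "yesterday", "orthography": "lowercase"},
]
entities = match_company1(tokens)
```

Here the only match covers the tokens "Acme Widgets Ltd."; the capitalised "We" is not annotated because no company designator follows it.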
The grammars (which are sets of rules) do not
need to be compiled by the students, because
they are automatically analysed and executed
by the JAPE Transducer module, which is a
finite-state transducer over the annotations
in the document. Since the grammars are stored
in files in a plain text format, they can be
edited in any text editor such as Notepad or
Vi. The rule development process is performed
by the students using GATE’s visual
environment (see Figure 1) to execute the
grammars and visualise the results. The
process is actually a cycle, where the
students write one or more rules, re-initialise
the transducer in the GATE GUI by
right-clicking on it, then run it on the test
data, check the results, and go back to
improving the rules. The evaluation part of
this cycle is performed using GATE’s visual
evaluation tools, which also produce
precision, recall, and F-measure automatically
(see Figure 4).
Figure 4: The visual evaluation tool
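For students who have not met these measures before, the following Python sketch shows how precision, recall and the balanced F-measure can be computed from a manually annotated key and a system response. It assumes strict matching (an annotation counts as correct only if type and both offsets agree) and represents annotations as `(type, start, end)` tuples; both choices are assumptions of this sketch and are not a description of how GATE's evaluation tools are implemented.

```python
def evaluate(key, response):
    """Strict precision, recall and balanced F-measure.

    key, response: sets of (type, start, end) tuples.
    """
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


key = {("Person", 0, 10), ("Organization", 20, 28),
       ("Location", 40, 46), ("Organization", 50, 55)}
response = {("Person", 0, 10), ("Organization", 20, 28),
            ("Location", 41, 46)}
precision, recall, f_measure = evaluate(key, response)
```

In this made-up example two of the three response annotations are correct and two of the four key annotations are found, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.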
The advantage of using JAPE for the student
assignments is that, once learned, it enables
students to experiment with a variety of NLP
tasks, from tokenisation and sentence
splitting, to chunking, to template-based
information extraction. Because it does not
need to be compiled and supports incremental
development, JAPE is ideal for rapid
prototyping, so students can experiment with
alternative ideas.
Students who are doing bigger projects, e.g., a
final year project, might want to develop GATE
modules which are not based on the finite-state
machinery and JAPE. Or the assignment might
require the development of more complex gram-
mars in JAPE, in which case they might have
to use Java code on the right-hand side of the
rule. Since such GATE modules typically only
access and manipulate annotations, even then
the students would need to learn only that part
of GATE’s API (i.e., no more than 5 classes).
Our experience with two MSc students – Partha
Lal and Marin Dimitrov – has shown that they
do not have significant problems with using that
either.
2.3 Some useful modules
The tokeniser splits text into simple tokens,
such as numbers, punctuation, symbols, and
words of different types (e.g. with an initial capi-
tal, all upper case, etc.). The tokeniser does not
generally need to be modified for different ap-
plications or text types. It currently recognises
many types of words, whitespace patterns, num-
bers, symbols and punctuation and should han-
dle any language from the Indo-European group
without modifications. Since it is available as
open source, one student assignment could be
to modify its rules to cope with other languages
or specific problems in a given language. The to-
keniser is based on finite-state technology, so the
rules are independent from the algorithm that
executes them.
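As a rough illustration of the kind of output such a tokeniser produces, the sketch below splits text into word, number, space and punctuation tokens and records the orthography of words. The regular expression, the kind names and the orthography labels other than "upperInitial" are assumptions of this sketch; GATE's own tokeniser rules are more fine-grained (here, for instance, symbols are lumped together with punctuation).

```python
import re

# Digits, letter runs, whitespace runs, or a single other character.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]|_")


def orthography(word):
    """Classify the capitalisation pattern of a word token."""
    if word[0].isupper() and not any(c.isupper() for c in word[1:]):
        return "upperInitial"
    if word.isupper():
        return "allCaps"
    if word.islower():
        return "lowercase"
    return "mixedCaps"


def tokenise(text):
    """Return token dictionaries with offsets, kind and (for words) orthography."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        s = m.group()
        if s[0].isdigit():
            kind = "number"
        elif s.isspace():
            kind = "space"
        elif s[0].isalpha():
            kind = "word"
        else:
            kind = "punctuation"
        token = {"string": s, "start": m.start(), "end": m.end(), "kind": kind}
        if kind == "word":
            token["orthography"] = orthography(s)
        tokens.append(token)
    return tokens


tokens = [t for t in tokenise("IBM paid 25 Euros.") if t["kind"] != "space"]
```

Keeping the patterns in data (here, one regular expression) and the loop that applies them separate mirrors the separation of rules and algorithm mentioned above, so an assignment can change the patterns without touching the loop.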
The sentence splitter is a cascade of finite-
state transducers which segments the text into
sentences. This module is required for the tag-
ger. Both the splitter and tagger are domain-
and application-independent. Again, the split-
ter grammars can be modified as part of a stu-
dent project, e.g., to deal with specifically for-
matted texts.
The tagger is a modified version of the Brill
tagger, which assigns a part-of-speech tag to
each word or symbol. To modify the tagger’s
behaviour, students will have to re-train it on
relevant annotated texts.
The gazetteer consists of lists such as cities,
organisations, days of the week, etc. It not only
consists of entities, but also of names of useful
indicators, such as typical company designators
(e.g. ‘Ltd.’), titles, etc. The gazetteer lists are
compiled into finite state machines, which anno-
tate the occurrence of the list items in the given
document. Students can easily extend the exist-
ing lists and add new ones by double-clicking on
the Gazetteer processing resource, which brings
up the gazetteer editor if it has been installed,
or using GATE’s Unicode editor.
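The effect of gazetteer lookup can be sketched in a few lines of Python. This sketch searches for each list entry with a regular expression, which is much simpler (and slower) than compiling the lists into finite state machines as GATE does; the list names, the entries and the dictionary representation of the resulting Lookup annotations are all illustrative assumptions.

```python
import re


def gazetteer_lookup(text, lists):
    """Annotate occurrences of list entries in text.

    lists maps a list name (used as the Lookup kind) to its entries.
    """
    lookups = []
    for kind, entries in lists.items():
        for entry in entries:
            # Lookarounds keep matches on word boundaries, including
            # entries that end in punctuation such as 'Ltd.'.
            pattern = r"(?<!\w)" + re.escape(entry) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                lookups.append({"type": "Lookup", "start": m.start(),
                                "end": m.end(), "kind": kind})
    return sorted(lookups, key=lambda a: (a["start"], a["end"]))


text = "Acme Ltd. opened in Sheffield on Monday."
lists = {
    "companyDesignator": ["Ltd.", "GmbH"],
    "city": ["Sheffield"],
    "day": ["Monday"],
}
found = [(text[a["start"]:a["end"]], a["kind"])
         for a in gazetteer_lookup(text, lists)]
```

Extending coverage is then purely a matter of editing the lists, which is what makes gazetteer assignments accessible to students with little programming experience.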
The JAPE transducer is the module that
runs JAPE grammars, which could be doing
tasks like chunking, named entity recognition,
etc. By default, GATE is supplied with an NE
transducer which performs named entity recog-
nition for English and a VP Chunker which
shows how chunking can be done using JAPE.
An even simpler (in terms of grammar rules
complexity) and somewhat incomplete NP chun-
ker can be obtained by request from the first
author.
The orthomatcher is a module whose primary
objective is to perform co-reference, or
entity tracking, by recognising relations between
entities, based on orthographically matching
their names. It also has a secondary role in im-
proving named entity recognition by assigning
annotations to previously unclassified names,
based on relations with existing entities.
2.4 Support for languages other than
English
GATE uses Unicode (Unicode Consortium,
1996) throughout, and has been tested on a va-
riety of Slavic, Germanic, Romance, and Indic
languages. The ability to handle Unicode data,
along with the separation between data and
algorithms, makes it easy for students to
carry out even small-scale experiments with
porting NLP components to new languages. The
graphical development environment fully
supports the creation, editing, and
visualisation of linguistic data, documents,
and corpora in Unicode-supported languages
(see Tablan et al., 2002). In order to
make it easier for foreign students to use the
GUI, we are planning to localise its menus, er-
ror messages, and buttons which currently are
only in English.
2.5 Installation and Programming
Languages Support
Since GATE is 100% Java, it can run on any
platform that supports Java. To make it
easier to install and maintain, GATE comes with
installation wizards for all major platforms. It
also allows the creation and use of a site-wide
GATE configuration file, so settings need only
be specified once and all copies run by the stu-
dents will have the same configuration and mod-
ules available. In addition, GATE allows stu-
dents to have their own configuration settings,
e.g., specify modules which are available only
to them. The personal settings override those
from GATE’s default and site-wide configura-
tions. Students can also easily install GATE
on their home computers using the installation
program. GATE also allows applications to be
saved and moved between computers and plat-
forms, so students can easily work both at home
and in the lab and transfer their data and ap-
plications between the two.
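The layering of configuration settings described above can be pictured with the following Python sketch. The setting names and values are invented for illustration, and the sketch assumes that site-wide settings in turn override GATE's defaults (the text only states that personal settings override both); it does not reflect the format of GATE's actual configuration files.

```python
from collections import ChainMap


def effective_settings(default, site, personal):
    """Merge three configuration layers into one dictionary.

    Earlier maps win in a ChainMap, so the precedence is
    personal > site-wide > default.
    """
    return dict(ChainMap(personal, site, default))


# Hypothetical setting names, for illustration only.
default = {"look_and_feel": "metal", "save_state_on_exit": True}
site = {"module_urls": ["http://example.org/course-modules/"]}
personal = {"look_and_feel": "system"}

settings = effective_settings(default, site, personal)
```

A lecturer would maintain only the site-wide layer, while each student's personal layer stays small and contains just the values that differ.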
GATE’s graphical environment comes configured
by default to save its own state on exit, so
students will get their applications, modules,
and data restored automatically the next time
they load GATE.
Although GATE is Java-based, modules writ-
ten in other languages can also be integrated
and used. For example, Prolog modules are
easily executable using the Jasper Java-Prolog
linking library. Other programming languages
can be used if they support the Java Native
Interface (JNI).
3 Existing Uses of GATE for
Teaching
Postgraduates in locations as diverse as Bul-
garia, Copenhagen and Surrey are using the
system in order to avoid having to write sim-
ple things like sentence splitters from scratch,
and to enable visualisation and management
of data. For example, Partha Lal at Imperial
College is developing a summarisation system
based on GATE and ANNIE as a final-year
project for an MEng Degree in Computing
(http://www.doc.ic.ac.uk/~pl98/). His site
includes the URL of his components and once
given this URL, GATE loads his software over
the network. Another student project will be
discussed in more detail in Section 4.
Colleagues at the University of Edinburgh,
UMIST in Manchester, and the University of
Sussex (amongst others) have reported using
previous versions of the system for teaching,
and the University of Stuttgart produced a
tutorial in German for the same purposes.
Educational users of
early versions of GATE 2 include Exeter Univer-
sity, Imperial College, Stuttgart University, the
University of Edinburgh and others. In order to
facilitate the use of GATE as a teaching tool,
we have provided a number of tutorials, online
demonstrations, and exhaustive documentation
on GATE’s Web site (http://gate.ac.uk).
4 An Example MSc Project
The goal of this work was to develop a corefer-
ence resolution module to be integrated within
the named entity recognition system provided
with GATE. This required a number of tasks to
be performed by the student: (i) corpus anal-
ysis; (ii) implementation and integration; (iii)
testing and quantitative evaluation.
The student developed a lightweight approach
to resolving pronominal coreference for named
entities, which was implemented as a GATE
module and run after the existing NE modules
provided with the framework. This enabled him
also to use an existing annotated corpus from
an Information Extraction evaluation competi-
tion and the GATE evaluation tools to establish
how his module compared with results reported
in the literature. Finally, the testing process was
made simple, thanks to GATE’s visualisation fa-
cilities, which are already capable of displaying
coreference chains in documents.
GATE not only allowed the student to achieve
verifiable results quickly, but it also did not in-
cur substantial integration overheads, because
it comes with a bootstrap tool which automates
the creation of GATE-compliant NLP modules.
The steps that need to be followed are (for
further details and an example see (Cunningham
et al., 2002b)):
• use the bootstrap tool to create an empty
Java module, then add the implementation to
it. A Java development environment such as
JBuilder or VisualCafe can be used for this
and the next stages, if the students are
familiar with them;
• compile the class, and any others that it
uses, into a Java Archive (JAR) file (GATE
also generates a Makefile automatically to
facilitate this process);
• write some XML configuration data for the
new resource;
• tell GATE the URL of the new JAR and XML
files.
Figure 5: BootStrap Wizard Dialogue
5 Example Topics
Since GATE has been used for a wide range of
tasks, it can be used for the teaching of a number
of topics. Topics that can be covered in (part
of) a course based on GATE are:
• Language Processing, Language Engineer-
ing, and Computational Linguistics: differ-
ences, methodologies, problems.
• Architectures, portability, robustness, cor-
pora, and the Web.
• Corpora, annotation, and evaluation: tools
and methodologies.
• Basic modules: tokenisation, sentence split-
ting, gazetteer lookup.
• Part-of-speech tagging.
• Information Extraction: issues, tasks, rep-
resenting linguistic data in the TIPSTER
annotation format, MUC, results achieved.
– Named Entity Recognition.
– Coreference Resolution
– Template Elements and Relations
– Scenario Templates
• Parsing and chunking
• Document summarisation
• Ontologies and discourse interpretation
• Language generation
While language generation, parsing, summari-
sation, and discourse interpretation modules are
not currently distributed with GATE, they can
be obtained by contacting the authors. Modules
for text classification and learning algorithms in
general are to be developed in the near future.
A lecturer willing to contribute any such mod-
ules to GATE will be very welcome to do so and
will be offered integration support.
6 Example Assignments
The availability of example modules for a vari-
ety of NLP tasks allows students to use them
as a basis for the development of an entire NLP
application, consisting of separate modules built
during their course. For example, let us consider
two problems: recognising chemical formulae in
texts and making an IE system that extracts
information from dialogues. Both tasks require
students to make changes in a number of existing
components and also write some new grammars.
Some example assignments for the chemical
formulae recognition follow:
• tokeniser: while it will probably work well
for the dialogues, the first assignment would
be to make modifications to its regular ex-
pression grammar to tokenise formulae like
H4ClO2 and Al-Li-Ti in a more suitable
way.
• gazetteer: create new lists containing new
useful clues and types of data, e.g., all
chemical elements and their abbreviations.
• named entity recognition: write a new
grammar to be executed by a new JAPE
transducer module for the recognition of the
chemical formulae.
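As a starting point for the tokeniser assignment, the sketch below shows one possible "more suitable" tokenisation of the two example formulae, written in Python rather than in the tokeniser's own rule language. The choice of splitting into element symbols, counts and hyphens is an assumption of this sketch, and it does not check that a symbol is a real chemical element (that is what the new gazetteer list is for).

```python
import re

# An element symbol (capital letter plus optional lower-case letter),
# a run of digits, or a hyphen separating alloy components.
FORMULA_PART = re.compile(r"[A-Z][a-z]?|\d+|-")


def tokenise_formula(formula):
    """Split a chemical formula into symbols, counts and hyphens."""
    parts = FORMULA_PART.findall(formula)
    # findall silently skips characters it cannot match, so verify coverage.
    if "".join(parts) != formula:
        raise ValueError(f"not a formula this sketch understands: {formula!r}")
    return parts


compound = tokenise_formula("H4ClO2")
alloy = tokenise_formula("Al-Li-Ti")
```

With this scheme H4ClO2 becomes H, 4, Cl, O, 2 and Al-Li-Ti becomes Al, -, Li, -, Ti, which gives the later named entity grammar individual symbols to match against the gazetteer.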
Some assignments for the dialogue application
are:
• sentence splitter: modify it so that it
correctly splits dialogue texts, by taking
into account the speaker information (because
dialogues often do not have punctuation). For
example:
A: Thank you, can I have your full name?
C: Err John Smith
A: Can you also confirm your postcode and
telephone number for security?
C: Erm it’s 111 111 11 11
A: Postcode?
C: AB11 1CD
• corpus annotation and evaluation: use the
default named entity recogniser to boot-
strap the manual annotation of the test
data for the dialogue application; evaluate
the performance of the default NE gram-
mars on the dialogue texts; suggest possi-
ble improvements on the basis of the infor-
mation about missed and incorrect anno-
tations provided by the corpus benchmark
tool.
• named entity recognition: implement the
improvements proposed at the previous
step, by changing the default NE grammar
rules and/or by introducing rules specific to
your dialogue domain.
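For the sentence splitter assignment above, the idea of using speaker information as a segment boundary can be prototyped outside GATE before it is written as splitter rules. The Python sketch below assumes that each turn starts on a new line with a single capital letter followed by a colon, as in the example dialogue; real transcripts may use other conventions.

```python
import re

# Assumed label format: one capital letter and a colon at the start of a line.
SPEAKER = re.compile(r"^([A-Z]):\s*", re.MULTILINE)


def split_turns(text):
    """Return (speaker, utterance) pairs; each speaker label opens a new segment."""
    turns = []
    labels = list(SPEAKER.finditer(text))
    for label, following in zip(labels, labels[1:] + [None]):
        end = following.start() if following else len(text)
        # Join wrapped lines and normalise whitespace inside the turn.
        utterance = " ".join(text[label.end():end].split())
        turns.append((label.group(1), utterance))
    return turns


dialogue = """A: Thank you, can I have your full name?
C: Err John Smith
A: Can you also confirm your postcode and
telephone number for security?
C: Erm it's 111 111 11 11
A: Postcode?
C: AB11 1CD
"""
turns = split_turns(dialogue)
```

Turns without final punctuation, such as "Err John Smith", are still separated correctly because the boundary comes from the speaker label rather than from a full stop.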
Finally, some assignments which are not con-
nected to any particular domain or application:
• chunking: implement an NP chunker using
JAPE. Look at the VP chunker grammars
for examples.
• template-based IE: experiment with ex-
tracting information from the dialogues us-
ing templates and JAPE (an example im-
plementation will be provided soon).
• (for a group of students) building NLP-
enabled Web applications: embed one of the
IE applications developed so far into a Web
application, which takes a Web page and
returns it annotated with the entities. Use
http://gate.ac.uk/annie/index.jsp as an ex-
ample.
In the near future it will also be possible to
have assignments on summarisation and
generation, but these modules are still under
development. It will be possible to
demonstrate parsing
and discourse interpretation, but because these
modules are implemented in Prolog and some-
what difficult to modify, assignments based on
them are not recommended. However, other
such modules, e.g., those from NLTK (Loper
and Bird, 2002), can be used for such assign-
ments.
7 Conclusions
In this paper we have outlined the GATE sys-
tem and its key features that make it an effective
tool for teaching NLP and CL. The main
advantage is that GATE is a framework and a
graphical development environment which is
suitable both for research and teaching, thus
making it easier to connect the two, e.g., by
allowing a student to carry out a final-year
project which contributes to novel research
carried out by their lecturers. The
development environment comes with
a comprehensive set of tools, which cover the
entire application development cycle. It can be
used to provide students with hands-on experi-
ence in a wide variety of tasks. Universities will-
ing to use GATE as a teaching tool will benefit
from the comprehensive documentation, several
tutorials, and online demonstrations.

References
D.E. Appelt. 1996. The Common Pattern Specifi-
cation Language. Technical report, SRI Interna-
tional, Artificial Intelligence Center.
S. Bird and M. Liberman. 1999. A Formal Frame-
work for Linguistic Annotation. Technical Re-
port MS-CIS-99-01, Department of Computer and
Information Science, University of Pennsylvania.
http://xxx.lanl.gov/abs/cs.CL/9903003.
H. Cunningham, D. Maynard, K. Bontcheva, and
V. Tablan. 2002a. GATE: A framework and
graphical development environment for robust
NLP tools and applications. In Proceedings of the
40th Anniversary Meeting of the Association for
Computational Linguistics.
H. Cunningham, D. Maynard, K. Bontcheva,
V. Tablan, and C. Ursu. 2002b. The GATE User
Guide. http://gate.ac.uk/.
H. Cunningham. 2002. GATE, a General Archi-
tecture for Text Engineering. Computers and the
Humanities, 36:223–254.
R. Grishman. 1997. TIPSTER Architecture
Design Document Version 2.3. Technical
report, DARPA.
http://www.itl.nist.gov/div894/894.02/related_projects/tipster/.
E. Loper and S. Bird. 2002. NLTK: The Natural
Language Toolkit. In ACL Workshop on Effective
Tools and Methodologies in Teaching NLP.
D. Maynard, H. Cunningham, K. Bontcheva,
R. Catizone, G. Demetriou, R. Gaizauskas,
O. Hamza, M. Hepple, P. Herring, B. Mitchell,
M. Oakes, W. Peters, A. Setzer, M. Stevenson,
V. Tablan, C. Ursu, and Y. Wilks. 2000.
A Survey of Uses of GATE. Technical Report
CS–00–06, Department of Computer Science,
University of Sheffield.
D. Maynard, V. Tablan, C. Ursu, H. Cunningham,
and Y. Wilks. 2001. Named Entity Recognition
from Diverse Text Types. In Recent Advances
in Natural Language Processing 2001 Conference,
Tzigov Chark, Bulgaria.
D. Maynard, V. Tablan, H. Cunningham, C. Ursu,
H. Saggion, K. Bontcheva, and Y. Wilks. 2002.
Architectural elements of language engineering ro-
bustness. Journal of Natural Language Engineer-
ing – Special Issue on Robust Methods in Analysis
of Natural Language Data. forthcoming.
N.F. Noy, M. Sintek, S. Decker, M. Crubézy, R.W.
Fergerson, and M.A. Musen. 2001. Creating
Semantic Web Contents with Protégé-2000. IEEE
Intelligent Systems, 16(2):60–71.
H. Saggion, H. Cunningham, K. Bontcheva, D. May-
nard, C. Ursu, O. Hamza, and Y. Wilks. 2002.
Access to Multimedia Information through Mul-
tisource and Multilanguage Information Extrac-
tion. In 7th Workshop on Applications of Natural
Language to Information Systems (NLDB 2002),
Stockholm, Sweden.
V. Tablan, C. Ursu, K. Bontcheva, H. Cunningham,
D. Maynard, O. Hamza, T. McEnery, P. Baker,
and M. Leisher. 2002. A Unicode-based
environment for creation and use of language
resources. In Proceedings of 3rd Language Resources
and Evaluation Conference. forthcoming.
Unicode Consortium. 1996. The Unicode Standard,
Version 2.0. Addison-Wesley, Reading, MA.
