A flexible distributed architecture for NLP system 
development and use 
Freddy Y. Y. Choi 
Artificial Intelligence Group 
University of Manchester 
Manchester, U.K. 
choif@cs.man.ac.uk 
Abstract 
We describe a distributed, modular architecture 
for platform independent natural language sys- 
tems. It features automatic interface genera- 
tion and self-organization. Adaptive (and non- 
adaptive) voting mechanisms are used for inte- 
grating discrete modules. The architecture is 
suitable for rapid prototyping and product de- 
livery. 
1 Introduction 
This article describes TEA 1, a flexible architec- 
ture for developing and delivering platform in- 
dependent text engineering (TE) systems. TEA 
provides a generalized framework for organizing 
and applying reusable TE components (e.g. to- 
kenizer, stemmer). Thus, developers are able 
to focus on problem solving rather than imple- 
mentation. For product delivery, the end user 
receives an exact copy of the developer's edition. 
The visibility of configurable options (different 
levels of detail) is adjustable along a simple gra- 
dient via the automatically generated user inter- 
face (Edwards, Forthcoming). 
Our target application is telegraphic text
compression (Choi (1999b); Rollfs of Roelofs
(Forthcoming); Grefenstette (1998)). We aim to
improve the efficiency of screen readers for the
visually disabled by removing uninformative 
words (e.g. determiners) in text documents. 
This produces a stream of topic cues for rapid 
skimming. The information value of each word 
is to be estimated based on an unusually wide 
range of linguistic information. 
TEA was designed to be a development environment
for this work. However, the target application
has led us to produce an interesting architecture
and techniques that are more generally applicable,
and it is these which we will focus on in this paper.
1 TEA is an acronym for Text Engineering Architecture.
2 Architecture 
[Figure 1 diagram; recoverable labels: System input and output, Plug-ins, Shared knowledge structure, System control]
Figure 1: An overview of the TEA system 
framework. 
The central component of TEA is a frame- 
based data model (F) (see Fig.2). In this model, 
a document is a list of frames (Rich and Knight, 
1991) for recording the properties of each
token in the text (example in Fig.2). A typical 
TE system converts a document into F with an 
input plug-in. The information required at the 
output determines the set of process plug-ins to 
activate. These use the information in F to add 
annotations to F. Their dependencies are auto- 
matically resolved by TEA. System behavior is 
controlled by adjusting the configurable param- 
eters. 
Frame 1: (:token An :pos art :begin_s 1) 
Frame 2: (:token example :pos n) 
Frame 3: (:token sentence :pos n) 
Frame 4: (:token . :pos punc :end_s 1) 
Figure 2: "An example sentence." in a frame- 
based data model 
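The frame-based model in Fig.2 can be illustrated with a minimal sketch. It assumes each frame is a plain dictionary of properties; the function names (to_frames, mark_sentence_bounds, tag_pos) and the toy lexicon are hypothetical and are not part of TEA's actual interface.

```python
from typing import Any, Dict, List

Frame = Dict[str, Any]  # one frame per token; keys are property names


def to_frames(tokens: List[str]) -> List[Frame]:
    """Hypothetical input plug-in: wrap each token in its own frame."""
    return [{"token": tok} for tok in tokens]


def mark_sentence_bounds(frames: List[Frame]) -> None:
    """Hypothetical process plug-in: flag the first and last token."""
    if frames:
        frames[0]["begin_s"] = 1
        frames[-1]["end_s"] = 1


def tag_pos(frames: List[Frame], lexicon: Dict[str, str]) -> None:
    """Hypothetical process plug-in: add a pos annotation by lexicon lookup."""
    for frame in frames:
        pos = lexicon.get(frame["token"].lower())
        if pos is not None:
            frame["pos"] = pos


# Toy lexicon, assumed for illustration only.
LEXICON = {"an": "art", "example": "n", "sentence": "n", ".": "punc"}

document = to_frames(["An", "example", "sentence", "."])
mark_sentence_bounds(document)
tag_pos(document, LEXICON)
```

Each plug-in reads from and writes to the same list of frames, which is the sense in which F acts as the shared knowledge structure.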
This type of architecture has been imple- 
mented, classically, as a 'blackboard' system 
such as Hearsay-II (Erman, 1980), where inter- 
module communication takes place through a 
shared knowledge structure; or as a 'message- 
passing' system where the modules communi- 
cate directly. Our architecture is similar to 
blackboard systems. However, the purpose of 
F (the shared knowledge structure in TEA) is 
to provide a single extendable data structure for 
annotating text. It also defines a standard
interface for inter-module communication, and
thus improves system integration and ease of
software reuse.
2.1 Voting mechanism 
A feature that distinguishes TEA from similar 
systems is its use of voting mechanisms for sys- 
tem integration. Our approach has two distinct 
but uniformly treated applications. First, for 
any type of language analysis, different
techniques t_i will return successful results P(r)
on different subsets of the problem space. Thus,
combining the outputs P(r | t_i) from several t_i
should give a result more accurate than any one
in isolation. This has been demonstrated in sev- 
eral systems (e.g. Choi (1999a); van Halteren 
et al. (1998); Brill and Wu (1998); Veronis and 
Ide (1991)). Our architecture currently offers 
two types of voting mechanisms: weighted av- 
erage (Eq.1) and weighted maximum (Eq.2). A 
Bayesian classifier (Weiss and Kulikowski, 1991) 
based weight estimation algorithm (Eq.3) is in- 
cluded for constructing adaptive voting mecha- 
nisms. 
$P(r) = \sum_{i=1}^{n} w_i P(r \mid t_i)$ (1)
$P(r) = \max\{w_1 P(r \mid t_1), \ldots, w_n P(r \mid t_n)\}$ (2)
$\ldots = \ldots P(r \mid t_i))$ (3)
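The two voting mechanisms (Eq.1 and Eq.2) can be sketched as follows. This is a minimal illustration, assuming each technique t_i reports its output as a dictionary mapping candidate results r to P(r | t_i); the function names and the numbers in the example are made up and do not come from TEA.

```python
from typing import Dict, List

Distribution = Dict[str, float]  # candidate result r -> P(r | t_i)


def weighted_average(outputs: List[Distribution],
                     weights: List[float]) -> Distribution:
    """Eq.1: P(r) = sum over i of w_i * P(r | t_i)."""
    combined: Distribution = {}
    for dist, w in zip(outputs, weights):
        for r, p in dist.items():
            combined[r] = combined.get(r, 0.0) + w * p
    return combined


def weighted_maximum(outputs: List[Distribution],
                     weights: List[float]) -> Distribution:
    """Eq.2: P(r) = max over i of w_i * P(r | t_i)."""
    combined: Distribution = {}
    for dist, w in zip(outputs, weights):
        for r, p in dist.items():
            combined[r] = max(combined.get(r, 0.0), w * p)
    return combined


def best(dist: Distribution) -> str:
    """Return the candidate with the highest combined score."""
    return max(dist, key=dist.get)


# Three hypothetical taggers voting on the POS of one token.
outputs = [
    {"n": 0.95, "v": 0.05},
    {"n": 0.20, "v": 0.80},
    {"n": 0.20, "v": 0.80},
]
weights = [1 / 3, 1 / 3, 1 / 3]

average_choice = best(weighted_average(outputs, weights))
maximum_choice = best(weighted_maximum(outputs, weights))
```

With these made-up numbers the weighted average follows the two agreeing taggers and selects v, while the weighted maximum follows the single most confident tagger and selects n.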
Second, different types of analysis a_i will
provide different information about a problem;
hence, a solution is improved by combining several
a_i. For telegraphic text compression, we estimate
E(w), the information value of a word, based on a
wide range of different information sources (Fig.3
shows a subset of our working system). The outputs
of the a_i are combined by a voting mechanism to
form a single measure.
[Figure 3 diagram; recoverable labels: Voting mechanism, Process, Technique combination, Analysis combination]
Figure 3: An example configuration of TEA for 
telegraphic text compression. 
Thus, for example, if our system encoun- 
ters the phrase 'President Clinton', both lexical 
lookup and automatic tagging will agree that 
'President' is a noun. Nouns are generally infor- 
mative, so should be retained in the compressed 
output text. However, grammar-based syntac- 
tic analysis gives a lower weighting to the first 
noun of a noun-noun construction, and bigram 
analysis tells us that 'President Clinton' is a 
common word pair. These two modules overrule 
the simple POS value, and 'President Clinton' 
is reduced to 'Clinton'. 
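The 'President Clinton' example can be worked through numerically. The sketch below assumes each module returns an estimate of E(w) between 0 and 1 and that the estimates are combined by an unweighted average; the module names, the scores and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
from typing import Dict, List

# Hypothetical per-module estimates of E(w); none of these numbers
# come from the working system.
MODULE_SCORES: Dict[str, Dict[str, float]] = {
    "lexical_lookup": {"President": 0.8, "Clinton": 0.8},
    "pos_tagger":     {"President": 0.8, "Clinton": 0.8},
    "syntax":         {"President": 0.2, "Clinton": 0.9},
    "bigram":         {"President": 0.1, "Clinton": 0.7},
}


def information_value(word: str) -> float:
    """Unweighted average of the module estimates for one word."""
    values = [scores[word] for scores in MODULE_SCORES.values()]
    return sum(values) / len(values)


def compress(words: List[str], threshold: float = 0.5) -> List[str]:
    """Keep only the words whose combined value reaches the threshold."""
    return [w for w in words if information_value(w) >= threshold]


result = compress(["President", "Clinton"])
```

Under these assumed scores, the low syntax and bigram estimates pull 'President' below the threshold even though both POS-based modules rate it as informative, so only 'Clinton' is retained.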
3 Related work 
Current trends in the development of reusable 
TE tools are best represented by the Edinburgh 
tools (LTGT) 2 (LTG, 1999) and GATE 3 (Cun- 
ningham et al., 1995). Like TEA, both LTGT 
and GATE are frameworks for TE. 
LTGT adopts the pipeline architecture for 
module integration. For processing, a text doc- 
ument is converted into SGML format. Pro- 
cessing modules are then applied to the SGML 
file sequentially. Annotations are accumulated 
as mark-up tags in the text. The architecture is 
simple to understand, robust and future proof. 
The SGML/XML standard is well developed 
and supported by the community. This improves
the reusability of the tools. However, the
architecture encourages tool development
rather than reuse of existing TE components.
2 LTGT is an acronym for the Edinburgh Language
Technology Group Tools.
3 GATE is an acronym for General Architecture for
Text Engineering.
GATE is based on an object-oriented data 
model (similar to the TIPSTER architecture 
(Grishman, 1997)). Modules communicate by 
reading and writing information to and from a 
central database. Unlike LTGT, both GATE 
and TEA are designed to encourage software 
reuse. Existing TE tools are easily incorporated 
with Tcl wrapper scripts and Java interfaces, re- 
spectively. 
Features that distinguish LTGT, GATE and
TEA are the configuration methods, portability
and motivation. Users of LTGT write shell
scripts to define a system (as a chain of LTGT
components). With GATE, a system is constructed
manually by wiring TE components together
using the graphical interface. TEA assumes
the user knows nothing but the available
input and required output. The appropriate set
of plug-ins is automatically activated. Module
selection can be manually configured by adjusting
the parameters of the voting mechanisms.
This ensures a TE system is accessible to complete
novices and yet has sufficient control for
developers.
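One way to picture automatic plug-in activation is as backward chaining from the required output to the available input. The sketch below is an assumption about how such a resolver could work, not a description of TEA's implementation: the PlugIn record, its requires and provides fields, and the four example plug-ins are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Set


class PlugIn(NamedTuple):
    name: str
    requires: FrozenSet[str]  # annotations that must already exist
    provides: FrozenSet[str]  # annotations this plug-in adds


def plan(plugins: List[PlugIn], available: Set[str],
         wanted: Set[str]) -> List[str]:
    """Return plug-in names in an order that yields every wanted annotation."""
    have = set(available)
    order: List[str] = []

    def satisfy(annotation: str, pending: Set[str]) -> None:
        if annotation in have:
            return
        if annotation in pending:
            raise ValueError(f"cyclic dependency on {annotation!r}")
        # Simplification: take the first plug-in that provides it.
        provider = next((p for p in plugins if annotation in p.provides), None)
        if provider is None:
            raise ValueError(f"no plug-in provides {annotation!r}")
        for need in sorted(provider.requires):
            satisfy(need, pending | {annotation})
        order.append(provider.name)
        have.update(provider.provides)

    for annotation in sorted(wanted):
        satisfy(annotation, set())
    return order


PLUGINS = [
    PlugIn("tokenizer", frozenset({"text"}), frozenset({"token"})),
    PlugIn("stemmer", frozenset({"token"}), frozenset({"stem"})),
    PlugIn("tagger", frozenset({"token"}), frozenset({"pos"})),
    PlugIn("info_value", frozenset({"stem", "pos"}), frozenset({"info"})),
]

activation_order = plan(PLUGINS, available={"text"}, wanted={"info"})
```

The user states only what is available ("text") and what is required ("info"); the order in which the intermediate plug-ins run is derived from their declared dependencies.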
LTGT and GATE are both open-source C ap- 
plications. They can be recompiled for many 
platforms. TEA is a Java application. It can
run directly (without compilation) on any
Java-supported system. However, applications
constructed with the current release of GATE and
TEA are less portable than those produced with
LTGT. GATE and TEA encourage reuse of existing
components, not all of which are platform
independent 4. We believe this is a worthwhile
trade-off since it allows developers to construct
prototypes with components that are only avail- 
able as separate applications. Native tools can 
be developed incrementally. 
4 An example 
Our application is telegraphic text compression.
The examples were generated with a subset of
our working system using a section of the book
HAL's Legacy (Stork, 1997) as test data. First,
we use different compression techniques to
generate the examples in Fig.4. This was done by
simply adjusting a parameter of an output plug-in.
It is clear that the output is inadequate for
rapid text skimming. To improve the system,
the three measures were combined with an
unweighted voting mechanism. Fig.5 presents two
levels of compression using the new measure.
4 This is not a problem for LTGT since the
architecture does not encourage component reuse.
1. With science fiction films the more science 
you understand the less you admire the film or 
respect its makers 
2. fiction films understand less admire respect 
makers 
3. fiction understand less admire respect makers 
4. science fiction films science film makers 
Figure 4: Three measures of information value: 
(1) Original sentence, (2) Token frequency, (3) 
Stem frequency and (4) POS. 
1. science fiction films understand less admire 
film respect makers 
2. fiction makers 
Figure 5: Improving telegraphic text compres- 
sion by analysis combination. 
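Different levels of compression can be obtained from a single combined measure by changing one parameter of the output stage. The sketch below assumes each frame already carries a combined score under an "info" key and that the parameter is the fraction of tokens to keep; the function name, the key and the scores are hypothetical and are not taken from the working system.

```python
import math
from typing import Any, Dict, List

Frame = Dict[str, Any]


def telegraphic_output(frames: List[Frame], keep: float) -> str:
    """Hypothetical output plug-in: keep the highest-scoring fraction
    of tokens and print them in their original text order."""
    if not 0.0 <= keep <= 1.0:
        raise ValueError("keep must lie in [0, 1]")
    n = math.ceil(keep * len(frames))
    # Rank token positions by score, highest first; ties keep text order.
    ranked = sorted(range(len(frames)),
                    key=lambda i: frames[i]["info"], reverse=True)
    chosen = sorted(ranked[:n])
    return " ".join(frames[i]["token"] for i in chosen)


# Made-up scores for the opening words of the example sentence.
frames = [
    {"token": "With", "info": 0.1},
    {"token": "science", "info": 0.9},
    {"token": "fiction", "info": 0.8},
    {"token": "films", "info": 0.6},
]

half = telegraphic_output(frames, keep=0.5)
quarter = telegraphic_output(frames, keep=0.25)
```

Lowering keep gives a more aggressive compression of the same annotated text without rerunning any of the analysis modules.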
5 Conclusions and future directions 
We have described an interesting architecture 
(TEA) for developing platform independent 
text engineering applications. Product delivery, 
configuration and development are made sim- 
ple by the self-organizing architecture and vari- 
able interface. The use of voting mechanisms 
for integrating discrete modules is original. Its 
motivation is well supported. 
The current implementation of TEA is geared 
towards token analysis. We plan to extend 
the data model to cater for structural annotations.
The tool set for TEA is constantly being extended;
recent additions include a prototype symbolic
classifier, a shallow parser (Choi, Forthcoming),
a sentence segmentation algorithm (Reynar and
Ratnaparkhi, 1997) and a POS tagger (Ratnaparkhi,
1996). Other adaptive voting mechanisms are to be
investigated. Future releases of TEA will support
concurrent execution (distributed processing) over
a network.
Finally, we plan to investigate means of im- 
proving system integration and module orga- 
nization, e.g. annotation, module and tag set 
compatibility. 

References 
E. Brill and J. Wu. 1998. Classifier combina- 
tion for improved lexical disambiguation. In 
Proceedings of COLING-ACL '98, pages 191-195,
Montreal, Canada, August.
F. Choi. 1999a. An adaptive voting mechanism 
for improving the reliability of natural lan- 
guage processing systems. Paper submitted 
to EACL'99, January. 
F. Choi. 1999b. Speed reading for the 
visually disabled. Paper submitted to 
SIGART/AAAI'99 Doctoral Consortium, 
February. 
F. Choi. Forthcoming. A probabilistic ap- 
proach to learning shallow linguistic patterns. 
In Proceedings of ECAI'99 (Student Session),
Greece. 
H. Cunningham, R.G. Gaizauskas, and 
Y. Wilks. 1995. A general architecture for 
text engineering (GATE) - a new approach
to language engineering research and de- 
velopment. Technical Report CD-95-21, 
Department of Computer Science, University 
of Sheffield. http://xxx.lanl.gov/ps/cmp- 
lg/9601009. 
M. Edwards. Forthcoming. An approach to 
automatic interface generation. Final year 
project report, Department of Computer Sci- 
ence, University of Manchester, Manchester, 
England. 
L. Erman. 1980. The Hearsay-II speech understanding
system: Integrating knowledge to resolve
uncertainty. In ACM Computing Surveys, volume 12.
G. Grefenstette. 1998. Producing intelligent 
telegraphic text reduction to provide an audio 
scanning service for the blind. In AAAI'98 
Workshop on Intelligent Text Summariza- 
tion, San Francisco, March. 
R. Grishman. 1997. Tipster architecture de- 
sign document version 2.3. Technical report, 
DARPA. http://www.tipster.org. 
LTG. 1999. Edinburgh University, HCRC, LTG
software. WWW.
http://www.ltg.ed.ac.uk/software/index.html. 
H. Rollfs of Roelofs. Forthcoming. Telegraph- 
ese: Converting text into telegram style. 
Master's thesis, Department of Computer Sci- 
ence, University of Manchester, Manchester, 
England. 
G. M. P. O'Hare and N. R. Jennings, editors.
1996. Foundations of Distributed Artificial
Intelligence. Sixth generation computer series.
Wiley Interscience Publishers, New York.
ISBN 0-471-00675.
A. Ratnaparkhi. 1996. A maximum entropy 
model for part-of-speech tagging. In Proceed- 
ings of the empirical methods in NLP confer- 
ence, University of Pennsylvania. 
J. Reynar and A. Ratnaparkhi. 1997. A max- 
imum entropy approach to identifying sen- 
tence boundaries. In Proceedings of the fifth 
conference on Applied NLP, Washington D.C. 
E. Rich and K. Knight. 1991. Artificial Intel- 
ligence. McGraw-Hill, Inc., second edition. 
ISBN 0-07-100894-2. 
D. Stork, editor. 1997. HAL's Legacy: 2001's
Computer as Dream and Reality. MIT Press.
http://mitpress.mit.edu/e-books/Hal/.
H. van Halteren, J. Zavrel, and W. Daelemans. 
1998. Improving data driven wordclass tag- 
ging by system combination. In Proceedings 
of COLING-ACL'98, volume 1.
J. Veronis and N. Ide. 1991. An assessment of
semantic information automatically extracted
from machine readable dictionaries. In
Proceedings of EACL'91, pages 227-232, Berlin.
S. Weiss and C. Kulikowski. 1991. Computer 
Systems That Learn. Morgan Kaufmann. 
