Web tools for introductory computational linguistics 
Dafydd Gibbon 
Fakultilt fllr Ling. 8z Lit. 
Pf. 100131, D-33501 Bielefeld 
gibbon~spe ct rum. uni-bielef eld. de 
Julie Carson-Berndsen 
Department of Computer Science 
Belfield, Dublin 4, Ireland 
Julie. Berndsen@ucd. ie 
Abstract 
We introduce a notion of training 
methodology space (TM space) for 
specifying training methodologies in 
tile different disciplines and teaching 
traditions associated with computa- 
tional linguistics and the human lan- 
guage technologies, and pin our ap- 
proach to the concept of operational 
model; we also discuss different gen- 
eral levels of interactivity. A num- 
ber of operational models are intro- 
duced, with web interfaces for lexical 
databases, DFSA matrices, finite- 
state phonotactics development, and 
DATR lexica. 
1 Why tools for CL training? 
In computational linguistics, a number of 
teaching topics and traditions meet; for ex- 
ample: 
• tbrmal mathematical training, 
• linguistic argumentation using sources of 
independent evidence, 
• theory development and testing with em- 
pirical models, 
• corpus processing with tagging and sta- 
tistical classification. 
Correspondingly, teachers' expectations 
and teaching styles vary widely, and, likewise, 
students' expectations and accustomed styles 
of learning are very varied. Teaching methods 
and philosophies fluctuate, too, between more 
behaviouristic styles which are more charac- 
teristic of practical subjects, and the more 
rationalistic styles of traditional mathemat- 
ics training; none, needless to say, covers the 
special needs of all subjects. 
Without specifying the dimensions in de- 
tail, let us call this complex field training 
method space (TM space). The term train- 
ing is chosen because it is neutral between 
teaching and learning, and implies the inten- 
sive acquisition of both theoretical and prac- 
tical abilities. Let us assume, based on the 
variations outlined above, that we will need 
to navigate this space in sophisticated ways, 
but as easily as possible. What could be at 
the centre of TM space? As the centre of TM 
space, let us postulate a model-based training 
method, with the following properties: 
1. The models in TM space are both formal, 
and with operational, empirical interpre- 
tations. 
2. The empirical interpretations of models 
in TM space are in general operational 
models implemented in software. 
3. The models in TM space may be under- 
stod by different users from several differ- 
ent perspectives: from the point of view 
of the mathematician, the programmer, 
the software user etc., like 'real life pro- 
grammes'. 
4. Typical lingware and software models are 
grammars, lexica, annotated corpora, op- 
erationalised procedures, parsers, com- 
pilers; more traditional models are 
graphs, slides, blackboards, three- 
dimensional block or ball constructions, 
calculators. 
Why should operational models, in the 
sense outlined here, be at the centre of TM- 
i 
space? There are several facets to the answer: 
First, the use of operational models permits 
practice without succumbing to the naiveti@s 
of stimulus-response models. Second, this no- 
tion of model is integrative, that is, they are 
on the one hand mathematical, in that they 
are structures which are involved in the in- 
terpretation of theories, and at the same time 
they are empirical, in representing chunks of 
the world, and operational, in that they map 
temporal sequences of states on to real time 
sequences. But, third, working with opera- 
tional models is more fun. Ask our kids. 
This paper describes and motivates a range 
of such models: fbr arithmetic, for manipu- 
lating databases, for experimenting with fi- 
nite state devices, for writing phonological 
(or, analogously, orthographic) descriptions, 
for developing sophisticated inheritance lex- 
ica. 
2 What kind of interactivity? 
The second kind of question to be asked is: 
Why the Web? Interactive training tools are 
not limited to the Web; they have been dis- 
tributed on floppy disk for over two decades, 
and on CD-ROM for over a decade. In pho- 
netics, interactive operational models have a 
long history: audio I/O, visualisations as os- 
cillogrammes, spectrogrammes, pitch traces 
and so on, have been leading models for 
multi-media in teacher training and speech 
therapy education since the 1970s. So why 
the Web? The answers are relatively straight- 
fbrward: 
• The Web makes software easy to dis- 
tribute. 
• The Web is both a distributed user plat- 
form and a distributed archive. 
• New forms of cooperative distance learn- 
ing become possible. 
• Each software version is instantly avail- 
able. 
• The browser client software used for 
accessing the Web is (all but) univer- 
sal (modulo minor implementation dif- 
ferences) in many ways: platform inde- 
pendent, exists in every office and many 
houles, ... 
The tools describe here embody three dif- 
ferent approaches to the dependence of stu- 
dents on teachers with regard to the provision 
of materials: 
1. Server-side applications, realised with 
standard CGI scripts: practically un- 
limited functionality, with arbitrary pro- 
gramming facilities in the background, 
but with inaccessible source code. 
2. Compiled client-side applications, re- 
alised with Java: practically unlimited 
functionality, particularly with respect to 
graphical user interfaces, typically with 
inaccessible source code. 
3. Interpreted client-side applications, re- 
alised with JavaScript: limited func- 
tionality with respect to graphical 
user interfaces, functionality limited to 
text manipulation and manipulation of 
HTML attributes (including CGI pre- 
processing), typically with immediately 
accessible code. 
From the formal point of view, these pro- 
gramming environments are equally suitable. 
From the (professional) programming point of 
view, the object oriented programming style 
of Java is often the preferred, homogeneous 
environment, though it is hard to relate it to 
other styles. CGI provides an interface for 
arbitrary programming languages, and script- 
ing languages are highly relevant to linguis- 
tic tasks, particularly modern varieties such 
as perl, with respect to corpus tagging and 
lexicon processing, or Tcl to the visualisa- 
tion of formal models or speech transforma- 
tions. JavaScript is a pure client-side ap- 
plication, and has a number of practical ad- 
vantages which outweigh many of its limi- 
tations: JavaScript is interpreted, not com- 
piled, and the code is immediately available 
for inspection by the user; despite its sire- 
plicity, it permits arbitrarily complex textual 
and numerical manipulation and basic win- 
dow management; like other scripting lan- 
guages, Javascript is not designed for modu- 
lar programme development or library deploy- 
ment, but is best restricted to small applica- 
tions of the kind used in introductory work. 
There is another issue of interactivity at a 
very general level: in software development, 
perhaps less in the professional environment 
than in the training of non-professionals to 
understand what is 'going on under the bon- 
net', or to produce small custom applications: 
the open software, shared code philosophy. In 
the world 'outside' salaries are obviously de- 
pendent on measurable product output, and 
intellectual property right (IPR) regulations 
for shareware, licences and purchase are there 
to enable people to make a legitimate living 
from software development, given the prevail- 
ing structures of our society. 
As far as teaching is concerned, the de- 
bate mainly affects programmes with medium 
functionality such as basic speech editors or 
morphological analysers, often commercial, 
with products which can be produced in prin- 
ciple on a 'hidden' budget by a small group 
of advanced computer science or engineering 
students (hence the problem). Obviously, it 
is easy for those with in stable educational 
institutions to insist that software is common 
property; indeed it may be said to be their 
duty to provide such software, particularly in 
the small and medium functionality range. 
Finally, it is essential to consider design is- 
sues for interactive teaching systems, an area 
which has a long history in teaching method- 
ology, going back to the programmed learning 
and language lab concepts of the 1960s, and 
is very controversial (and beyond the scope of 
the present paper). We suggest that the dis- 
cussion can be generalised via the notion of 
TM space introduced above to conventional 
software engineering considerations: require- 
ments spec~ifieation (e.g. specification of loca- 
tion in TM space by topic, course and student 
type), system design (e.g. control structures, 
navigation, windowing, partition of material, 
use of graphics, audio etc.), implementation 
(e.g. server-side vs. client side), verifica- 
tion (e.g. 'subjective', by users; 'objective', 
in course context). 
Only a small amount of literature is avail- 
able on teaching tools; however, cf. (HH1999) 
{br speech applications, and the following 
for applications in phonetics and phonology 
(CB1998a), (CBG1999), English linguistics 
(CBG1997), and multimedia communication 
(G1997). The following sections will discuss 
a number of practical model-based applica- 
tions: a basic database environment; an in- 
terpreter for deterministic finite automata; 
a development environment for phonotactic 
and orthographic processing; a testbed and 
scratchpad for introducing the DATR lexi- 
cal representation language. The languages 
used are JavaScript (JS) for client-side appli- 
cations, and Prolog (P) or C (C) for server- 
side CGI applications. 
3 Database query interface 
generator 
Database methodology is an essential part of 
computational linguistic training; tradition- 
ally, UNIX ASCII databases have been at the 
core of many NLP lexical databases, though 
large scale applications require a professional 
DBMS. The example shown in Figure 1 shows 
a distinctive feature matrix (Jakobson and 
Halle consonant matrix) as a database rela- 
tion, with a query designed to access phono- 
logical 'natural classes'; any lexical database 
relation can be implemented, of course. In 
this JavaScript application with on-the-fly 
query interface generation the following func- 
tionality is provided: 
i. Input and query of single database rela- 
tions. 
2. Frame structure, with a control frame 
and a display/interaction frame which 
is allocated to on-the-fly or pre-stored 
database information. 
3. The control frame permits selection of: 
(a) a file containing a pre-compiled 
database in JavaScript notation, (b) on-the-fly generation of a query in- 
terface from the first record of the 
database, which contains the names 
of the fields/attributes/flolumns, 
(c) on-the-fly generation of tabular rep- 
resentation of the database, 
(d) input of databases in tabular form. 
4. Query interface with selection of arbi- 
trary conjunctions of query attributes 
and values, and output attributes. 
Figure 1: Database interface generator (JavaScript). 
5. Compilation of database into a 
JavaScript data structure: a one- 
dimensional array, with a presentation 
parameter for construction of the 
on-the-fly query interface. 
Typical applications include basic dictio- 
naries, simple multillingual dictionaries, rep- 
resentation of feature structures as a database 
relation with selection of natural classes by 
means of an appropriate conjunction of query 
attributes and values. 
Tasks range from user-oriented activities 
such as the construction of 'flat' databases, 
or of feature matrices, to the analysis of the 
code, and the addition of further input modal- 
ities. Advanced tasks include the analysis of 
the code, addition of character coding conven- 
tions, addition of further database features. 
4 DFSA interpreter 
There are many contexts in computational 
linguistics, natural language processing and 
spoken language technology in which devices 
based on finite state automata are used; for 
example, tokenisation, morphological analy- 
sis and lemmatisation, shallow parsing, syl- 
lable parsing, prosodic modelling, plain and 
hidden markov models. A standard compo- 
nent of courses in these disciplines is con- 
cerned with formal languages and automata 
theory. The basic form of finite state automa- 
ton is the deterministic finite state automa- 
ton (DFSA), whose vocabulary is epsilon-free 
and which has no more than one transition 
with a given label from any state. There 
are several equivalent representation convert- 
tions for DFSAs, such as a full transition 
matrix (Vocabulary × StateSet) with target 
states as entries; or sparse matrix represen- 
tation as a relation, i.e. a set of triples con- 
stituting a subset of the Cartesian product 
StateSet × Stateset x Vocabulary; or transi- 
tion network representation. 
The interface currently uses the full matrix 
representation, and permits the entry of arbi- 
trary automata into a fixed size matrix. The 
example shown illustrates the language a'b, 
but symbols consisting of an arbitrary number 
of characters, e.g. natural language examples, 
may be used. A state-sequence trace, and de- 
tailed online help explanations, as well as task 
suggestions of varying degrees of difficulty are 
Fil~ l~clil Vlsw GO Wlrxlow Help I 
DFSAi~p~t:Ij~.. I-~ I~°~l -~'~ 
Q~. ~abs~ ~ Q, set ot fia~l ~tatea 
Q: finite s¢~ o| ~t~t~ 
D: umsi~ l+ancc+tion D Rom Q ~d g to Q, D: Q,V 
DFSA ~cm ~ eu~ ~emRRm matrix: 
+:\[~-"~,+:1 ~'~ I~ 
D: 
t 
IX~RA log mrs: 
\[ 
qO. qO = . H zs~.+ qo r.~i., xmaqp~,~, al~,.,r. • .l 
qO. qO 151rllf+tJ~ qO ~618e, 15mF'l~ i~pa~ =~11~ 
qo ~ qO 
qo b q) 
F T1 i7-1F-\] F--117-1 
F-1F--IE~a-lg-17--\] + +, 
17117-1F-117--1F-1 t \[\] "~ ,I 
F-17-1F-17--1F-\] ,~ 
, ,'~'~ m,,~-~h,,,?,~,,*~0~, ,,., -~,~, v,~,, ~,.~. ........ ' .... , ,,,, ............. 
i! \[~ .................. I Dalydd Oi~bom Stm Feb 7 23:36:$0 ~'r 1999 
Figure 2: Deterministic finite state automaton (DFSA) workpad (JavaScript). 
provided. 
5 Phonological NDFSA 
development environment 
Phonology and prosody were the first areas 
in which finite state technologies were shown 
to be linguistically adequate and computa- 
tionally efficient, in the 1970s; a close second 
was morphological alternations in the 1980s 
(CB1998a). The goal of this application is 
to demonstrate the use of finite state tech- 
niques in computational linguistics, partic- 
ularly in the area of phonotactic and allo- 
phonic description for computational phonol- 
ogy. However, the tool does not contain any 
level specific information and can therefore 
be used also to demonstrate finite state de- 
scriptions of orthographic information. In 
this CGI application, implemented in Prolog 
(see (CBG1999)), the following functionality 
is provided: 
1. Display/Alter/Add to the current (non- 
deterministic) FSA descriptions. 
2. Generate the combinations described in 
the FSA. 
3. Compare two FSA descriptions. 
4. Parse a string using an FSA description. 
Typical applications of this tool include de- 
scriptions of phonological well-formedness of 
syllable models for various languages. Tasks 
range from testing and evaluating to parsing 
phonological (or orthographic) FSA descrip- 
tions. More advanced tasks include extension 
of the current toolbox functionality to cater 
for feature-based descriptions. 
6 Zdatr testbed and scratchpad 
DATR is a well-known theoretically well- 
founded and practically oriented lexicon rep- 
resentation language. It also has a high ra- 
tio of implementations to applications, and, 
until relatively recently, a low degree of stan- 
dardisation between implementations. In or- 
der to create a platform independent demon- 
stration and practice environment, a CGI 
interface was created. The engine was in- 
tended to be a Sicstus Prolog application; Sic- 
stus turned out to be non-CGI-compatible at 
the time (1995), so a UNIX shell version of 
DATR (mud, minimal UNIX DATR) was im- 
plemented using a combination of UNIX text 
stream processing tools, mainly awk. This 
File Edit View GO Window • "\] .:( ,,° " Help. 
, ...................................................................................... ~ ............................. .~--.~,.:,__~,~:,~ .................. ~. .................. - ......... ~ .......... ~.~. 
Back ¢or~¢.:~ Reload : Home ~;ea~ch Guide • Pdnt . Secuflty ~l~ 
ZI3A'FR HyprLex Scraud|pad 
THEORY 
QUERY 
Dafydd Gibbon, B Birdcield, 22 March 19~7 
( dr.: Ihgg"~rEt ¢o-galdmt._gmaple D~ ~ts~=i~ ( ........ f 11 h~. ~gl:i~l t=~taamt ot ~lg~ 
(al rza.dtt - Pinita St, ata ~ut~ar~=~ ~a= hzeq = ~*c 
(31 rsl .dr= F~ intezpzet~ ~ir.l~ type cc~ec4~ fo= h=~ - =b*c 
(11 tm.dtz - IL~('~ai ccmp~alti~l t:~t.~t og i~tqliEl~ 
~=~i gz: 
o 
Mo~fe 
(bu.* h o u • . 
wo=d : 
Figure 3: Zdatr scratchpad (CGI, UNIX shell, C). 
was later replaced by Zdatr Vl.n, and will 
shortly be replaced by Zdatr V2.0 (imple- 
mented by Grigoriy Strokin, Moscow State 
University). The Zdatr software is widely 
used in the teaching of lexicography, lexicol- 
ogy, and lexicon theory (Ginpress). 
Two interikces are driven by Zdatr: The 
testbed which permits interactions with pre- 
viously defined and integrated DATR theo- 
ries (CBG1999), and the scratchpad (shown 
in Figure 3), with which queries can be writ- 
ten and tested. The scratchpad permits the 
entry of short theories and queries, and the 
testbed has the following functionality: 
1. viewing of DATR theories; 
2. selection of manual, declared (#show and 
#hide) and pre-listed queries; 
3. selection of output properties (trace, 
atom spacing). 
4. a function for automatically integrating 
new theories sent by email (not docu- 
mented for external use). 
Sample DATR theories which are avail- 
able include a detailed model of composition- 
ality in English compound nouns (Gibbon), 
an application for developing alternative fea- 
ture or autosegmental phonological represen- 
tations (Carson-Berndsen), and a number of 
algorithm illustrations (bubble sort, shift reg- 
ister machine) for more theoretical purposes. 
7 Outlook 
Tools like those introduced here are not ubiq- 
uitous, and there are many areas of computa- 
tional linguistics, in particular formal train- 
ing in computing and training in linguistic 
argumentation, which require intensive face- 
to-face teaching. Our tools are restricted to 
'island' applications where we consider them 
to be most effective. For many students (and 
teachers), such tools provide an additional 
level of motivation because of their easy ac- 
cessibility, portability, and the absence of in- 
stallation problems, and can be used with dif- 
ferent levels of student accomplishment, from 
the relatively casual user in a foreign language 
or speech therapy context, to the more ad- 
vanced linguistic programmer in courses on 
database or automata theory or software de- 
velopment. 
For reasonably small scale applications, 
we favour client-side tools where possible. 
JavaScript is suitable in many cases, provided 
that minor browser incompatibilities are ham 
dled. The database application, for exam- 
ple, still provided very fast access when eval- 
uated with a 2000 record, 10 attributes per 
record database. JavaScript has a number 
of disadvantages (no mouse-graphics interac- 
tion, no library concept), but being an inter- 
preted language is very suitable for introduc- 
ing an 'open source code' policy in teaching. 
In contrast to CGI applications, where query 
and result transfer time can be considerable, 
client-side JavaScript (or Java) applications 
have a bandwidth dependent once-off down- 
load time for databases and scripts (or com- 
piled applets), but query and result transfer 
time are negligeable. 
The applications presented here are fully 
integrated (with references to related appli- 
cations at other commercial and educational 
institutions, e.g. parsers, morphology pro- 
grammes, speech synthesis demonstrations) 
into the teaching programme. Obvious areas 
where further development is possible and de- 
sirable are: 
• Automatic tool interface generation 
based more explicitly on general princi- 
ples of training methodology, e.g. with a 
more explicit account of TM space and 
with more systematic control, help, error 
detection, query and result panel design. 
• Automatic test generation for tool (and 
student) validation. 
• Further tools for formal language, pars- 
ing and automata theoretic applications. 
• Extension of database tool to include 
more database functionality. 
We plan to extend our repertoire of appli- 
cations in these directions, and will inte- 
grate more applications from other institu- 
tions when they become available. 

References 

Carson-Berndsen, J. 1998a. Computational 
Autosegmental Phonology in Pronunciation 
Teaching. In: Jager S; J. Nerbonne & A. van 
Essen (eds.) Language Teaching and Language 
Technology, Swets & Zeitlinger, Lisse. 

Carson-Berndsen, J. 1998b. Time Map Phonol- 
ogy: Finite State Methods and Event Logics 
in Speech Recognition. Kluwer Acadmic Press, 
Dordrecht. 

Carson-Berndsen J. & D. Gibbon 1997. In- 
teractive English, 2nd Bielefeld Multime- 
dia Day Demo, "coral. lili .uni-bielefeld. de/ 
MuMeT2", Universitit Bielefeld, November 1997. 

Carson-Berndsen J. & D. Gibbon 1999. Interac- 
tive Phonetics, Virtually! In: V. Hazan & M. 
Holland, eds., Method and Tool Innovations/or 
Speech Science Education. Proceedings off the 
MATISSE Workshop, University College, Lon- 
don, 16-17 April 1999, pp. 17-20. 

Gibbon, D. 1997. Phonetics and 
Multimedia Communication, Lecture 
Notes, "coral. lili. uni-bielefeld, de/ 
Classes/Winter97", Universit~it Bielefeld. 

Gibbon, D. in press. Computational lexicogra- 
phy. In: van Eynde, F. & D. Gibbon: Lexicon 
Development/or Speech and Language Process- 
ing. Kluwer Academic Press, Dordrecht. 

Hazan, V. & M. Holland, eds. 1999. Method and 
Tool Innovations/or Speech Science Education. 
Proceedings o/ the MATISSE Workshop, Uni- 
versity College, London, 16-17 April 1999. 
