Formal Language Theory for Natural Language Processing
Shuly Wintner
Computer Science Department
University of Haifa
Haifa 31905, Israel
shuly@cs.haifa.ac.il
Abstract
This paper reports on a course whose aim
is to introduce Formal Language Theory
to students with little formal background
(mostly linguistics students). The course
was first taught at the European Sum-
mer School for Logic, Language and In-
formation to a mixed audience of stu-
dents, undergraduate, graduate and post-
graduate, with various backgrounds. The
challenges of teaching such a course in-
clude preparation of highly formal, math-
ematical material for students with rela-
tively little formal background; attracting
the attention of students of different back-
grounds; preparing examples that will em-
phasize the practical importance of ma-
terial which is basically theoretical; and
evaluation of students’ achievements.
1 Overview
Computational linguistics students typically come
from two different disciplines: Linguistics or
Computer Science. As these are very different
paradigms, it is usually necessary to set up a com-
mon background for the two groups of students.
One way to achieve this goal is by introducing the
core topics of one paradigm to students whose back-
ground is in the other. This paper reports on such
an experiment: teaching Formal Language Theory,
a core computer science subject, to students with
no background in computer science or mathemat-
ics. The course was first taught at the 13th European
Summer School in Logic, Language and Informa-
tion (Helsinki, Finland) in the summer of 2001.
While formal language theory is not a core com-
putational linguistics topic, it is an essential prereq-
uisite for a variety of courses. For example, regular
expressions and finite-state technology are instru-
mental for many NLP applications, including mor-
phological analyzers and generators, part-of-speech
taggers, shallow parsers, intelligent search engines
etc. The mathematical foundations of context-free
grammars are necessary for a thorough understand-
ing of natural language grammars, and a discussion
of the Chomsky hierarchy is mandatory for students
who want to investigate more expressive linguistic
formalisms such as unification grammars.
The motivation for teaching such a course to stu-
dents with no background in formal methods, es-
pecially linguists, stems from the observation that
many students with background in linguistics are in-
terested in computational linguistics but are over-
whelmed by the requirements of computational lin-
guistics courses that are designed mainly for com-
puter science graduates. Furthermore, in order to es-
tablish a reasonable level of instruction even in intro-
ductory computational linguistics courses, I found it
essential to assume a firm knowledge of basic formal
language theory. This assumption does not hold
for many non-CS graduates, and the course described
here is aimed at exactly such students.
The challenges of teaching such a course are
many. Teaching at the European Summer School is
always a challenge, as this institution attracts stu-
dents from a variety of disciplines, and one never
knows what background students in one’s class will
have. In this particular case, the course was adver-
tised as a foundational computation course. Founda-
tional courses presuppose absolutely no background
knowledge, and should especially be accessible to
people from other disciplines. The material had to
be prepared in a way that would make it accessible
to students of linguistics, for example, who might
possess no knowledge of mathematics beyond
high-school level.
Proceedings of the Workshop on Effective Tools and Methodologies for Teaching
Natural Language Processing and Computational Linguistics, Philadelphia,
July 2002, pp. 71-76. Association for Computational Linguistics.
Another characteristic of the European Summer
Schools is that the students’ education levels vary
greatly. It is not uncommon to have, in one class,
undergraduate, graduate and post-graduate students.
This implies that the level at which the class is addressed
has to be determined very carefully: it is very easy to
bore most students or to speak over their heads. An
additional difficulty stems from the fact that while
the language of instruction at the Summer School
is English, most participants (students and lecturers
alike) are not native speakers of English.
Undoubtedly the greatest challenge was to pre-
pare the course in a way that will attract the atten-
tion of the class. Formal language theory is a highly
theoretical, mostly mathematical subject. Standard
textbooks (Hopcroft and Ullman, 1979; Harrison,
1978) present the material in a way that will appeal
to mathematicians: very formal, with one subject
built on top of its predecessor, and with very formal
(if detailed) examples. Even textbooks that aim at
introducing it to non-mathematicians (Partee et al.,
1990) use mostly examples of formal (as opposed to
natural) languages. In order to motivate the students,
I decided to teach the course in a way that empha-
sizes natural language processing applications, and
in particular, to use only examples of natural lan-
guages.
While this paper focuses on a particular course,
taught in a particular environment, I believe that the
lessons learned while developing and teaching it are
more generally applicable. A very similar course
can be taught as an introduction to NLP classes in
institutions where the majority of students come from
computer science, but which would like to attract
linguistics (and other non-CS) graduates and provide
them with the necessary background. I hope that the
examples given in the paper will prove useful for
developers of such courses. More generally, the paper
demonstrates a gentle approach to formal, mathematical
material that builds on terminology familiar
to its audience, rather than using the standard
mathematical paradigm in teaching. I believe that this
approach can be useful for other courses as well.
2 Structure of the course
Courses at the Summer School are taught in sessions
of 90 minutes, on a daily basis, for either five or ten
days. This course was taught for five days, totaling
450 minutes (the equivalent of ten academic hours,
approximately one third of the duration of a standard
course). However, the daily meetings eliminate
the need to recapitulate material, and the pace of
instruction can be increased.
I decided to cover a substantial subset of a stan-
dard Formal Language Theory course, starting with
the very basics (e.g., set theory, strings, relations
etc.), focusing on regular languages and their com-
putational counterpart, namely finite-state automata,
and culminating in context-free grammars (without
their computational device, push-down automata). I
sketch the structure of the course below.
The course starts with a brief overview of essen-
tial set theory: the basic notions, such as sets, rela-
tions, strings and languages, are defined. All exam-
ples are drawn from natural languages. For exam-
ple, sets are demonstrated using the vowels of the
English alphabet, or the articles in German. Set op-
erations such as union or intersection, and set rela-
tions such as inclusion, are demonstrated again us-
ing subsets of the English alphabet (such as vow-
els and consonants). Cartesian product is demon-
strated in a similar way (example 1) whereas rela-
tions, too, are exemplified in an intuitive manner
(example 2). Of course, it is fairly easy to define
strings, languages and operations on strings and lan-
guages – such as concatenation, reversal, exponen-
tiation, Kleene-closure etc. – using natural language
examples.
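The notions introduced in this part of the course can also be tried out directly on a computer. The following Python sketch is only an illustration and is not part of the course material: the names (VOWELS, CONSONANTS, AGREE, concat, power, reverse, closure_up_to) are assumptions made here, and the German data is a deliberately tiny fragment (three articles and three nouns, nominative singular, gender agreement only).

```python
from itertools import product

VOWELS = {"a", "e", "i", "o", "u"}
CONSONANTS = {"b", "d", "f", "k", "l", "m", "n", "p", "s", "t"}

# Set operations and inclusion on subsets of the alphabet.
LETTERS = VOWELS | CONSONANTS            # union
assert VOWELS & CONSONANTS == set()      # the intersection is empty
assert VOWELS <= LETTERS                 # inclusion

# Cartesian products: the order of the two sets matters.
CV_PAIRS = set(product(CONSONANTS, VOWELS))   # consonant-vowel pairs
VC_PAIRS = set(product(VOWELS, CONSONANTS))   # vowel-consonant pairs

# A relation is any subset of a Cartesian product. Toy assumption:
# nominative singular only, so agreement reduces to matching gender.
ARTICLES = {"der": "masc", "die": "fem", "das": "neut"}
NOUNS = {"Hund": "masc", "Katze": "fem", "Haus": "neut"}
AGREE = {(a, n) for a, n in product(ARTICLES, NOUNS)
         if ARTICLES[a] == NOUNS[n]}


def concat(l1, l2):
    """Concatenation of two languages (sets of strings)."""
    return {x + y for x in l1 for y in l2}


def power(language, n):
    """Exponentiation: the language concatenated with itself n times."""
    result = {""}
    for _ in range(n):
        result = concat(result, language)
    return result


def reverse(language):
    """Reversal of every string in the language."""
    return {word[::-1] for word in language}


def closure_up_to(language, n):
    """Finite approximation of the Kleene closure: powers 0 to n."""
    return set().union(*(power(language, k) for k in range(n + 1)))
```

The Kleene closure itself is infinite for any language containing a non-empty string, which is why the sketch only computes it up to a bound.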
Example 1 Cartesian product
Let $A$ be the set of all the vowels in some
language and $B$ the set of all consonants.
For the sake of simplicity, take $A$ to be
$\{a, e, i, o, u\}$ and $B$ to be
$\{b, d, f, k, l, m, n, p, s, t\}$.
The Cartesian product $B \times A$ is the set
of all possible consonant–vowel pairs:
$\{\langle b,a \rangle, \langle d,a \rangle, \langle d,i \rangle, \langle f,o \rangle, \langle k,o \rangle, \langle t,e \rangle, \langle t,u \rangle, \ldots\}$.
Notice that the Cartesian product $A \times B$ is
different: it is the set of all vowel–consonant pairs,
which is a completely different entity (albeit with
the same number of elements). The Cartesian
product $B \times B$ is the set of all possible
consonant–consonant pairs, whereas $A \times A$ is the
set of all possible diphthongs.
Example 2 Relation
Let $A$ be the set of all articles in German and $B$
the set of all German nouns. The Cartesian product
$A \times B$ is the set of all article–noun pairs. Any subset
of this set of pairs is a relation from $A$ to $B$. In
particular, the set
$R = \{\langle x,y \rangle \mid x \in A \text{ and } y \in B \text{ and } x \text{ and } y \text{ agree on number, gender and case}\}$
is a relation. Informally, $R$ holds for all pairs of article–noun
which form a grammatical noun phrase in German:
such a pair is in the relation if and only if the article
and the noun agree.
The second (and major) part of the course discusses
regular languages. The definitions of regular
expressions and their denotations are accompanied
by the standard kind of examples (example 3). After
a brief discussion of the mathematical properties of
regular languages (in particular, some closure
properties), finite-state automata are gently introduced.
Following the practice of the entire course, no
mathematical definitions are given; instead, a rigorous
textual description of the concept, accompanied by
several examples, serves as a substitute for
a standard definition. Very simple automata,
especially extreme cases (such as the automata
accepting the empty language, or $\Sigma^*$), are explicitly
depicted. Epsilon-moves are introduced, followed by
a brief discussion of minimization and determinization,
which culminates in examples such as 4.
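Determinism, epsilon-moves and the equivalence of automata can be made concrete with a small simulation. The Python sketch below is an illustration under stated assumptions, not the course's own material: the helper names (make, eps_closure, accepts, is_deterministic, language), the state numbering and the three encoded automata are hypothetical, modeled on the description of the automata of example 4 (one deterministic automaton, one with several arcs sharing a label, one with epsilon-arcs). Comparing the accepted words up to a length bound is only a sanity check, not a proof of equivalence.

```python
from itertools import product

EPS = ""  # label used in this sketch for epsilon-arcs


def make(triples, start, finals):
    """Build an automaton from (source, label, target) triples."""
    arcs = {}
    for source, label, target in triples:
        arcs.setdefault((source, label), set()).add(target)
    return arcs, start, set(finals)


def eps_closure(states, arcs):
    """All states reachable from `states` through epsilon-arcs only."""
    seen, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in arcs.get((state, EPS), ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def accepts(automaton, word):
    """Track the set of states the automaton may be in after each symbol."""
    arcs, start, finals = automaton
    current = eps_closure({start}, arcs)
    for symbol in word:
        moved = {t for q in current for t in arcs.get((q, symbol), ())}
        current = eps_closure(moved, arcs)
    return bool(current & finals)


def is_deterministic(automaton):
    """No epsilon-arcs, and at most one target per state and symbol."""
    arcs = automaton[0]
    return all(label != EPS and len(targets) == 1
               for (_, label), targets in arcs.items())


def language(automaton, alphabet, max_len):
    """The accepted words of length at most `max_len` (brute force)."""
    return {"".join(letters)
            for n in range(max_len + 1)
            for letters in product(alphabet, repeat=n)
            if accepts(automaton, "".join(letters))}


# Hypothetical encodings of three automata for {go, gone, going}.
A1 = make([(0, "g", 1), (1, "o", 2), (2, "n", 3), (3, "e", 4),
           (2, "i", 5), (5, "n", 6), (6, "g", 4)], 0, {2, 4})
A2 = make([(0, "g", 1), (1, "o", 2),
           (0, "g", 3), (3, "o", 4), (4, "n", 5), (5, "e", 6),
           (0, "g", 7), (7, "o", 8), (8, "i", 9), (9, "n", 10),
           (10, "g", 11)], 0, {2, 6, 11})
A3 = make([(0, "g", 1), (1, "o", 2), (2, EPS, 8),
           (2, "n", 3), (3, "e", 4), (4, EPS, 8),
           (2, "i", 5), (5, "n", 6), (6, "g", 7), (7, EPS, 8)], 0, {8})

for automaton in (A1, A2, A3):
    print(is_deterministic(automaton),
          sorted(language(automaton, "egino", 6)))
```

Tracking a set of possible states, as accepts does, is the idea behind determinization: each reachable set of states would become a single state of the deterministic automaton.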
Example 3 Regular expressions
Given the alphabet of all English letters,
$\Sigma = \{a, b, c, \ldots, y, z\}$, the language $\Sigma^*$ is denoted by the
regular expression $\Sigma^*$ (recall our convention of using
$\Sigma$ as a shorthand notation). The set of all strings
which contain a vowel is denoted by
$\Sigma^* \cdot (a + e + i + o + u) \cdot \Sigma^*$.
The set of all strings that begin in “un” is
denoted by $(un)\Sigma^*$. The set of strings that end in either
“tion” or “sion” is denoted by $\Sigma^* \cdot (s + t) \cdot (ion)$.
Note that all these languages are infinite.
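The expressions of example 3 translate almost symbol by symbol into the regular-expression syntax of programming languages. The sketch below uses Python's standard re module; the names SIGMA_STAR, PATTERNS and in_language are assumptions introduced here. Two notational differences should be kept in mind: union, written "+" in the example, is written "|" in Python, and the alphabet is taken to be the lowercase letters a to z only.

```python
import re

SIGMA_STAR = r"[a-z]*"  # Sigma*, for Sigma = the lowercase English letters

PATTERNS = {
    # Sigma* . (a + e + i + o + u) . Sigma*
    "contains_vowel": re.compile(SIGMA_STAR + r"(a|e|i|o|u)" + SIGMA_STAR),
    # (un) Sigma*
    "begins_with_un": re.compile(r"un" + SIGMA_STAR),
    # Sigma* . (s + t) . (ion)
    "ends_in_tion_or_sion": re.compile(SIGMA_STAR + r"(s|t)ion"),
}


def in_language(name, word):
    """True if the whole word belongs to the language denoted by `name`."""
    return PATTERNS[name].fullmatch(word) is not None
```

Using fullmatch rather than search matters: membership in a language is a property of the entire string, not of some substring of it.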
Example 4 Equivalent automata
The following three finite-state automata are equivalent:
they all accept the set $\{\text{go}, \text{gone}, \text{going}\}$.
[Diagrams of the three automata $A_1$, $A_2$ and $A_3$; the arcs are labeled
with the letters g, o, i, n, e, and $A_3$ additionally has $\epsilon$-arcs.]
Note that $A_1$ is deterministic: for any state and
alphabet symbol there is at most one possible transition.
$A_2$ is not deterministic: the initial state has
three outgoing arcs all labeled by g. The third
automaton, $A_3$, has $\epsilon$-arcs and hence is not
deterministic. While $A_2$ might be the most readable, $A_1$ is the
most compact as it has the fewest nodes.
To demonstrate the usefulness of finite-state
automata in natural language applications, some
operations on automata are directly defined,
including concatenation and union. Finally, automata are
shown to be a natural representation for dictionaries
and lexicons (example 5).
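The use of an automaton as a dictionary can be sketched in a few lines of Python. This is an illustration only; the class name Lexicon and its methods are assumptions made here. The structure built is a trie: a deterministic automaton in which words share prefixes. It is not minimized, since common suffixes are not shared, so it may have more states than the minimal automaton for the same word list.

```python
class Lexicon:
    """A word list stored as a deterministic automaton (a trie)."""

    def __init__(self, words=()):
        self.arcs = [{}]     # arcs[state] maps a letter to the next state
        self.final = set()   # states at which some word ends
        for word in words:
            self.add(word)

    def add(self, word):
        """Add one path for `word`, reusing arcs that already exist."""
        state = 0
        for letter in word:
            if letter not in self.arcs[state]:
                self.arcs.append({})
                self.arcs[state][letter] = len(self.arcs) - 1
            state = self.arcs[state][letter]
        self.final.add(state)

    def lookup(self, word):
        """Return (found, steps), where steps counts the arcs followed."""
        state, steps = 0, 0
        for letter in word:
            if letter not in self.arcs[state]:
                return False, steps
            state = self.arcs[state][letter]
            steps += 1
        return state in self.final, steps


lexicon = Lexicon(["go", "gone", "going"])
print(lexicon.lookup("going"))
```

A lookup follows at most one arc per letter of the word, so its cost grows with the length of the word and not with the number of entries stored.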
This part of the course ends with a presentation of
regular relations and finite-state transducers. The
former are shown to be extremely common in natu-
ral language processing (example 6). The latter are
introduced as a simple extension of finite-state au-
tomata. Operations on regular relations, and in par-
ticular composition, conclude this part (example 7).
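Composition can be illustrated on small finite relations. The Python sketch below is a simplification made for illustration: it represents a relation as a plain set of pairs rather than as a finite-state transducer, and the names R1, R2 and compose are assumptions. The word pairs follow example 7, including its spellings.

```python
# English -> German pairs, as in example 7.
R1 = {("tomato", "Tomate"), ("cucumber", "Gurke"),
      ("grapefruit", "Grapefruit"), ("grapefruit", "Pampelmuse"),
      ("pineapple", "Ananas"), ("coconut", "Koko"),
      ("coconut", "Kokusnuß")}

# French -> English pairs, as in example 7.
R2 = {("tomate", "tomato"), ("ananas", "pineapple"),
      ("pampelmousse", "grapefruit"), ("concombre", "cucumber"),
      ("cornichon", "cucumber"), ("noix-de-coco", "coconut")}


def compose(first, second):
    """All pairs (x, z) with (x, y) in `first` and (y, z) in `second`."""
    return {(x, z)
            for (x, y) in first
            for (y2, z) in second
            if y == y2}


# French -> German: the English words serve only as the connecting link.
FRENCH_TO_GERMAN = compose(R2, R1)
for pair in sorted(FRENCH_TO_GERMAN):
    print(pair)
```

Because a word may have several translations, the composed relation is not a function: some French words are paired with more than one German word.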
The third part of the course deals with context-free
grammars, which are motivated by the inability of
regular expressions to account for (and assign struc-
ture to) several phenomena in natural languages. Ex-
ample 8 is the running example used throughout this
part.
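The rules of example 8 can be tried out mechanically. The sketch below is not part of the course material: it uses a standard chart-based (CKY-style) procedure, which is applicable here because every rule of example 8 either rewrites a non-terminal as a single word or as exactly two non-terminals. The names LEXICAL, BINARY and count_trees are assumptions. For each span of the input, the chart records how many derivation trees each non-terminal has, which also makes ambiguity visible.

```python
from collections import Counter

# The rules of example 8, split into lexical and binary rules.
LEXICAL = {"the": {"D"}, "cat": {"N"}, "hat": {"N"}, "in": {"P"}}
BINARY = {("D", "N"): {"NP"}, ("P", "NP"): {"PP"}, ("NP", "PP"): {"NP"}}


def count_trees(words):
    """Map each non-terminal to its number of derivation trees over `words`."""
    n = len(words)
    # chart[i][j] counts the trees of each non-terminal over words[i:j]
    chart = [[Counter() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICAL.get(word, ()):
            chart[i][i + 1][category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):            # split point
                for left, n_left in chart[i][k].items():
                    for right, n_right in chart[k][j].items():
                        for parent in BINARY.get((left, right), ()):
                            chart[i][j][parent] += n_left * n_right
    return chart[0][n]


print(count_trees("the cat in the hat".split()))
print(count_trees("the cat in the hat in the hat".split()))
```

The second phrase is ambiguous: the final prepositional phrase can attach either to the whole preceding noun phrase or to the nearer one, so the chart records two trees for NP.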
Basic notions, such as derivation and derivation
Example 5 Dictionaries as finite-state automata
Many NLP applications require the use of lexicons
or dictionaries, sometimes storing hundreds of thou-
sands of entries. Finite-state automata provide an
efficient means for storing dictionaries, accessing
them and modifying their contents. To understand
the basic organization of a dictionary as a finite-state
machine, assume that an alphabet is fixed (we will
use $\Sigma = \{\text{a}, \text{b}, \ldots, \text{z}\}$ in the following discussion)
and consider how a single word, say go, can be
represented. As we have seen above, a naïve
representation would be to construct an automaton with a
single path whose arcs are labeled by the letters of the
word go:
[Diagram: an automaton with a single path whose arcs are labeled g, o.]
To represent more than one word, we can simply add
paths to our “lexicon”, one path for each additional
word. Thus, after adding the words gone and going,
we might have:
[Diagram: an automaton with three separate paths from the initial state, labeled g-o-i-n-g, g-o-n-e and g-o.]
This automaton can then be determinized and
minimized:
[Diagram: the determinized and minimized automaton for go, gone and going.]
With such a representation, a lexical lookup
operation amounts to checking whether a word $w$ is a
member of the language accepted by the automaton,
which can be done by “walking” the automaton
along the path indicated by $w$. This is an extremely
efficient operation: it takes exactly one “step” for
each letter of $w$. We say that the time required for
this operation is linear in the length of $w$.
trees are presented gently, with plenty of examples.
To motivate the discussion, questions of ambiguity
are raised. Context-free grammars are shown to be
sufficient for assigning structure to several natural
Example 6 Relations over languages
Consider a simple part-of-speech tagger: an applica-
tion which associates with every word in some nat-
ural language a tag, drawn from a finite set of tags.
In terms of formal languages, such an application
implements a relation over two languages. For
simplicity, assume that the natural language is defined
over $\Sigma_1 = \{a, b, \ldots, z\}$ and that the set of tags is
$\Sigma_2 = \{\text{PRON}, \text{V}, \text{DET}, \text{ADJ}, \text{N}, \text{P}\}$.
Then the part-of-speech relation might contain the following pairs,
depicted here vertically (that is, a string over $\Sigma_1$ is
depicted over an element of $\Sigma_2$):
I     know  some  new  tricks
PRON  V     DET   ADJ  N
said  the   Cat   in   the   Hat
V     DET   N     P    DET   N
As another example, assume that $\Sigma_1$ is as above, and
$\Sigma_2$ is a set of part-of-speech and morphological tags,
including {-PRON, -V, -DET, -ADJ, -N, -P, -1, -2, -3,
-sg, -pl, -pres, -past, -def, -indef}. A morphological
analyzer is basically an application defining a
relation between a language over $\Sigma_1$ and a language
over $\Sigma_2$. Some of the pairs in such a relation are
(vertically):
I               know
I-PRON-1-sg     know-V-pres
some            new          tricks
some-DET-indef  new-ADJ      trick-N-pl
said            the          Cat
say-V-past      the-DET-def  cat-N-sg
Finally, consider the relation that maps every English
noun in singular to its plural form. While the
relation is highly regular (namely, adding “s” to the
singular form), some nouns are irregular. Some
instances of this relation are:
cat   hat   ox    child     mouse  sheep
cats  hats  oxen  children  mice   sheep
language phenomena, including subject-verb agree-
ment, verb subcategorization, etc. Finally, some
mathematical properties of context-free languages
are discussed.
The last part of the course deals with questions
of expressivity, and in particular strong and weak
Example 7 Composition of finite-state transducers
Let $R_1$ be the following relation, mapping some English
words to their German counterparts:
$R_1 = \{$tomato:Tomate, cucumber:Gurke,
grapefruit:Grapefruit, grapefruit:Pampelmuse,
pineapple:Ananas, coconut:Koko,
coconut:Kokusnuß$\}$
Let $R_2$ be a similar relation, mapping French words
to their English translations:
$R_2 = \{$tomate:tomato, ananas:pineapple,
pampelmousse:grapefruit, concombre:cucumber,
cornichon:cucumber, noix-de-coco:coconut$\}$
Then $R_2 \circ R_1$ is a relation mapping French words to
their German translations (the English translations
are used to compute the mapping, but are not part of
the final relation):
$R_2 \circ R_1 = \{$tomate:Tomate, ananas:Ananas,
pampelmousse:Grapefruit,
pampelmousse:Pampelmuse, concombre:Gurke,
cornichon:Gurke, noix-de-coco:Koko,
noix-de-coco:Kokusnuß$\}$
Example 8 Rules
Assume that the set of terminals is $\{$the, cat, in, hat$\}$
and the set of non-terminals is $\{$D, N, P, NP, PP$\}$.
Then possible rules over these two sets include:
D $\rightarrow$ the    NP $\rightarrow$ D N
N $\rightarrow$ cat    PP $\rightarrow$ P NP
N $\rightarrow$ hat    NP $\rightarrow$ NP PP
P $\rightarrow$ in
Note that the terminal symbols correspond to words
of English, and not to letters as was the case in the
previous chapter.
generative capacity of linguistic formalisms. The
Chomsky hierarchy of languages is defined and ex-
plained, and substantial focus is placed on deter-
mining the location of natural languages in the
hierarchy. By this time, students will have ob-
tained a sense of the expressiveness of each of the
formalisms discussed in class, so they are more
likely to understand many of the issues discussed
in Pullum and Gazdar (1982), on which this part of
the course is based. The course ends with hints
to more expressive formalisms, in particular Tree-
Adjoining Grammars and various unification-based
formalisms.
3 Enrollment data
While the Summer School does not conduct teach-
ing evaluations, I felt that it would be useful to re-
ceive feedback from participants of the course. To
this end, I designed a standard teaching evaluation
form and asked students to fill it in during the last class.
The data in this section are drawn from the students’
responses.
The number of students who submitted the ques-
tionnaire was 52. Nationality was varied, with the
majority from Finland, Poland, Italy, Germany, the
United Kingdom and the United States, but also
from Canada, the Netherlands, Spain, Greece, Romania,
France, Estonia, Korea, Iran, Ukraine,
Belgium, Japan, Sweden, Russia and Denmark.
Thirty-six defined themselves as graduate students,
thirteen as undergraduates and three as post-PhD.
The most interesting item was background. Par-
ticipants had to describe their backgrounds by
choosing from Linguistics, Mathematics, Computer
Science, Logic or Other. Only 32% described their
background as Linguistics; 29% chose Computer
Science; 21% chose Mathematics; and 15% —
Logic. Other backgrounds included mostly Philos-
ophy but also Biology and Physics. Why students
of Computer Science, and in particular graduate stu-
dents, should take Formal Language Theory in such
an interdisciplinary Summer School is unclear to
me.
Students were asked to grade their impression of
the course, on a scale of 1–5, along the following
dimensions:
• The course is interesting
• The course covers important and useful material
• The course progresses at the right pace
• The course is fun
The average grade was 4.53 for the interest question;
4.47 for the usefulness question; 3.67 for the pace
question; and 4.13 for fun. These results show that
participants felt that the course was interesting and
useful, and even fun. However, many of them felt
that it did not progress at the right pace. This might
be partially attributed to the high proportion of computer
science and mathematics students in the audience:
many of them must have seen the material earlier,
and felt that progress was too slow for them.
4 Conclusions
This paper demonstrates that it is possible to teach
formal, mathematical material to students with little
or no formal background by introducing the material
gently, albeit rigorously. By the end of the course,
students with background in linguistics or philos-
ophy are able to understand the computer science
theoretical foundations underlying many aspects of
natural language processing, in particular finite-state
technology and formal grammars. This sets up a
common background for more advanced classes in
computational linguistics.
The course was taught once at an international,
interdisciplinary summer school. I intend to teach it
again this summer in a similar, albeit smaller event;
I also intend to teach it to graduate Humanities stu-
dents who express interest in computational linguis-
tics, in order to introduce them to some founda-
tional theoretical aspects of computer science essen-
tial for working on natural language processing ap-
plications. The positive reaction of most students
to the course is an encouraging incentive to develop
more courses along the same lines.
Acknowledgments
I wish to extend my gratitude to my students at
ESSLLI-2001, who made teaching this course such
an enjoyable experience for me. I am grateful to the
reviewers for their useful comments. This work was
supported by the Israeli Science Foundation (grant
no. 136/1).

References
Michael A. Harrison. 1978. Introduction to formal lan-
guage theory. Addison-Wesley, Reading, MA.
John E. Hopcroft and Jeffrey D. Ullman. 1979. In-
troduction to automata theory, languages and com-
putation. Addison-Wesley Series in Computer Sci-
ence. Addison-Wesley Publishing Company, Reading,
Mass.
Barbara H. Partee, Alice ter Meulen, and Robert E.
Wall. 1990. Mathematical Methods in Linguistics,
volume 30 of Studies in Linguistics and Philosophy.
Kluwer Academic Publishers, Dordrecht.
Geoffrey K. Pullum and Gerald Gazdar. 1982. Natural
languages and context-free languages. Linguistics and
Philosophy, 4:471–504.
