File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-0110_intro.xml
Size: 12,141 bytes
Last Modified: 2025-10-06 14:01:27
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0110"> <Title>Formal Language Theory for Natural Language Processing</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Structure of the course </SectionTitle> <Paragraph position="0"> Courses at the Summer School are taught in daily sessions of 90 minutes, over either five or ten days. This course was taught for five days, totaling 450 minutes (the equivalent of ten academic hours, approximately one third of the duration of a standard course). However, the daily meetings eliminate the need to recapitulate material, so the pace of instruction can be increased.</Paragraph> <Paragraph position="1"> I decided to cover a substantial subset of a standard Formal Language Theory course, starting with the very basics (e.g., set theory, strings, relations, etc.), focusing on regular languages and their computational counterpart, namely finite-state automata, and culminating in context-free grammars (without their computational device, push-down automata). I sketch the structure of the course below.</Paragraph> <Paragraph position="2"> The course starts with a brief overview of essential set theory: the basic notions, such as sets, relations, strings and languages, are defined. All examples are drawn from natural languages. For example, sets are demonstrated using the vowels of the English alphabet, or the articles in German. Set operations such as union or intersection, and set relations such as inclusion, are again demonstrated using subsets of the English alphabet (such as vowels and consonants). Cartesian product is demonstrated in a similar way (example 1), and relations, too, are exemplified in an intuitive manner (example 2). Of course, it is fairly easy to define strings, languages and operations on strings and languages - such as concatenation, reversal, exponentiation, Kleene-closure etc. 
- using natural language examples.</Paragraph>
<Paragraph position="3"> The second (and major) part of the course discusses regular languages. The definitions of regular expressions and their denotations are accompanied by the standard kind of examples (example 3). After a brief discussion of the mathematical properties of regular languages (in particular, some closure properties), finite-state automata are gently introduced. Following the practice of the entire course, no mathematical definitions are given; instead, a rigorous textual description of the concept, accompanied by several examples, serves as a substitute for a standard definition. Very simple automata, especially extreme cases (such as the automata accepting the empty language, or $\Sigma^*$), are explicitly depicted. Epsilon-moves are introduced, followed by a brief discussion of minimization and determinization, which culminates in examples such as example 4.</Paragraph>
<Paragraph position="4"> Example 1: Cartesian product. Let $A$ be the set of all the vowels in some language and $B$ the set of all consonants. For the sake of simplicity, take $A$ to be</Paragraph>
<Paragraph position="6"> The Cartesian product $B \times A$ is the set of all possible consonant-vowel pairs:</Paragraph>
<Paragraph position="8"> etc. Notice that the Cartesian product $A \times B$ is different: it is the set of all vowel-consonant pairs, which is a completely different entity (albeit with the same number of elements). The Cartesian product $B \times B$ is the set of all possible consonant-consonant pairs, whereas $A \times A$ is the set of all possible diphthongs.</Paragraph>
<Paragraph position="9"> Example 2: Relation. Let $A$ be the set of all articles in German and $B$ the set of all German nouns. The Cartesian product $A \times B$ is the set of all article-noun pairs. Any subset of this set of pairs is a relation from $A$ to $B$. In particular, the set $R = \{\langle x, y \rangle \mid x \in A \text{ and } y \in B \text{ and } x \text{ and } y \text{ agree on number, gender and case}\}$ is a relation. Informally, $R$ holds for all article-noun pairs which form a grammatical noun phrase in German: such a pair is in the relation if and only if the article and the noun agree.</Paragraph>
<Paragraph position="10"> Example 3: Regular expressions. Given the alphabet of all English letters, $\Sigma = \{a, b, c, \ldots, y, z\}$, the language $\Sigma^*$ is denoted by the regular expression $\Sigma^*$ (recall our convention of using $\Sigma$ as a shorthand notation). The set of all strings which contain a vowel is denoted by $\Sigma^* \cdot (a + e + i + o + u) \cdot \Sigma^*$. The set of all strings that begin with &quot;un&quot; is denoted by $(un)\Sigma^*$. The set of strings that end in either &quot;tion&quot; or &quot;sion&quot; is denoted by $\Sigma^* \cdot (s + t) \cdot (ion)$. Note that all these languages are infinite.</Paragraph>
<Paragraph position="11"> Example 4: Equivalent automata. The following three finite-state automata are equivalent: they all accept the set {go, gone, going}.</Paragraph>
<Paragraph position="13"> Note that $A_1$ is deterministic: for any state and alphabet symbol there is at most one possible transition. $A_2$ is not deterministic: the initial state has three outgoing arcs all labeled by $g$. The third automaton, $A_3$, has $\epsilon$-arcs and hence is not deterministic. While $A_2$ might be the most readable, $A_1$ is the most compact as it has the fewest nodes.</Paragraph>
<Paragraph position="14"> To demonstrate the usefulness of finite-state automata in natural language applications, some operations on automata are directly defined, including concatenation and union. 
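The determinism condition stated in example 4 (for any state and alphabet symbol there is at most one possible transition) can be illustrated with a minimal Python sketch. The automaton below accepts the set {go, gone, going}; its state numbers, the table layout and the names TRANSITIONS, START, FINAL and accepts are assumptions made for illustration and are not taken from the course material or from the figure of example 4.

```python
# Hypothetical deterministic automaton for the set {go, gone, going}.
# The state numbering is an assumption; it is not the automaton drawn
# in the paper's figure, only one automaton accepting the same set.
TRANSITIONS = {
    (0, "g"): 1,
    (1, "o"): 2,
    (2, "n"): 3,
    (3, "e"): 6,
    (2, "i"): 4,
    (4, "n"): 5,
    (5, "g"): 6,
}
START = 0
FINAL = {2, 6}


def accepts(word):
    """Walk the automaton one symbol at a time.

    The word is rejected as soon as no transition exists for the
    current (state, symbol) pair, and accepted only if the walk
    ends in a final state.
    """
    state = START
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in FINAL


if __name__ == "__main__":
    for candidate in ["go", "gone", "going", "goin", "gones"]:
        print(candidate, accepts(candidate))
```

Storing the transitions in a dictionary keyed by (state, symbol) makes the at-most-one-transition property explicit, and checking a word takes one table lookup per letter.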
Finally, automata are shown to be a natural representation for dictionaries and lexicons (example 5).</Paragraph>
<Paragraph position="15"> This part of the course ends with a presentation of regular relations and finite-state transducers. The former are shown to be extremely common in natural language processing (example 6). The latter are introduced as a simple extension of finite-state automata. Operations on regular relations, and in particular composition, conclude this part (example 7). The third part of the course deals with context-free grammars, which are motivated by the inability of regular expressions to account for (and assign structure to) several phenomena in natural languages. Example 8 is the running example used throughout this part.</Paragraph>
<Paragraph position="16"> Basic notions, such as derivation and derivation trees, are presented gently, with plenty of examples. To motivate the discussion, questions of ambiguity are raised. Context-free grammars are shown to be sufficient for assigning structure to several natural language phenomena, including subject-verb agreement, verb subcategorization, etc. Finally, some mathematical properties of context-free languages are discussed.</Paragraph>
<Paragraph position="18"> Example 5: Dictionaries as finite-state automata. Many NLP applications require the use of lexicons or dictionaries, sometimes storing hundreds of thousands of entries. Finite-state automata provide an efficient means for storing dictionaries, accessing them and modifying their contents. To understand the basic organization of a dictionary as a finite-state machine, assume that an alphabet is fixed (we will use $\Sigma = \{a, b, \ldots, z\}$ in the following discussion) and consider how a single word, say go, can be represented. As we have seen above, a naive representation would be to construct an automaton with a single path whose arcs are labeled by the letters of the word go, namely g and o. To represent more than one word, we can simply add paths to our &quot;lexicon&quot;, one path for each additional word. Thus, after adding the words gone and going, we might have:</Paragraph>
<Paragraph position="20"> This automaton can then be determinized and minimized:</Paragraph>
<Paragraph position="21"> With such a representation, a lexical lookup operation amounts to checking whether a word $w$ is a member of the language generated by the automaton, which can be done by &quot;walking&quot; the automaton along the path indicated by $w$. This is an extremely efficient operation: it takes exactly one &quot;step&quot; for each letter of $w$. We say that the time required for this operation is linear in the length of $w$.</Paragraph>
<Paragraph position="22"> Example 6: Relations over languages. Consider a simple part-of-speech tagger: an application which associates with every word in some natural language a tag, drawn from a finite set of tags.</Paragraph>
<Paragraph position="23"> In terms of formal languages, such an application implements a relation over two languages. For simplicity, assume that the natural language is defined over $\Sigma_1 = \{a, b, \ldots, z\}$ and that the set of tags is</Paragraph>
<Paragraph position="25"> The part-of-speech relation might contain the following pairs, written here as word:tag (that is, a string over $\Sigma_1$ is paired with an element of $\Sigma_2$): I:PRON, know:V, some:DET, new:ADJ, tricks:N; said:V, the:DET, Cat:N, in:P, the:DET, Hat:N. As another example, assume that $\Sigma_1$ is as above, and $\Sigma_2$ is a set of part-of-speech and morphological tags, including {-PRON, -V, -DET, -ADJ, -N, -P, -1, -2, -3, -sg, -pl, -pres, -past, -def, -indef}. A morphological analyzer is basically an application defining a relation between a language over $\Sigma_1$ and a language over $\Sigma_2$. Some of the pairs in such a relation are: I:I-PRON-1-sg, know:know-V-pres; some:some-DET-indef, new:new-ADJ, tricks:trick-N-pl; said:say-V-past, the:the-DET-def, Cat:cat-N-sg. Finally, consider the relation that maps every English noun in the singular to its plural form. While the relation is highly regular (namely, adding &quot;s&quot; to the singular form), some nouns are irregular. Some instances of this relation are: cat:cats, hat:hats, ox:oxen, child:children, mouse:mice, sheep:sheep.</Paragraph>
<Paragraph position="26"> The last part of the course deals with questions of expressivity, and in particular strong and weak generative capacity of linguistic formalisms. The Chomsky hierarchy of languages is defined and explained, and substantial focus is placed on determining the location of natural languages in the hierarchy. By this time, students will have obtained a sense of the expressiveness of each of the formalisms discussed in class, so they are more likely to understand many of the issues discussed in Pullum and Gazdar (1982), on which this part of the course is based. The course ends with pointers to more expressive formalisms, in particular Tree-Adjoining Grammars and various unification-based formalisms.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Example 7 Composition of finite-state transducers </SectionTitle>
<Paragraph position="0"> Let $R_1$ be the following relation, mapping some English words to their German counterparts:</Paragraph>
<Paragraph position="2"> grapefruit:Grapefruit, grapefruit:pampelmuse, pineapple:Ananas, coconut:Koko, coconut:Kokusnuss} Let $R_2$ be a similar relation, mapping French words to their English translations:</Paragraph>
<Paragraph position="4"> pampelmousse:grapefruit, concombre:cucumber, cornichon:cucumber, noix-de-coco:coconut} Then $R_2 \circ R_1$ is a relation mapping French words to their German translations (the English translations are used to compute the mapping, but are not part of the final relation):</Paragraph>
<Paragraph position="5"> Example 8. Assume that the set of terminals is {the, cat, in, hat} and the set of non-terminals is {D, N, P, NP, PP}. Then possible rules over these two sets include: D $\rightarrow$ the; N $\rightarrow$ cat; N $\rightarrow$ hat; P $\rightarrow$ in; NP $\rightarrow$ D N; PP $\rightarrow$ P NP; NP $\rightarrow$ NP PP. Note that the terminal symbols correspond to words of English, and not to letters as was the case in the previous chapter.</Paragraph>
</Section> </Section> </Paper>