THE RECOGNITION CAPACITY OF LOCAL SYNTACTIC CONSTRAINTS 
Mori Rimon ¹
Jacky Herz ²
The Computer Science Department 
The Hebrew University of Jerusalem, 
Giv'at Ram, Jerusalem 91904, ISRAEL 
E-mail: rimon@hujics.BITNET 
Abstract 
Given a grammar for a language, it is possible to
create finite state mechanisms that approximate
its recognition capacity. These simple automata
consider only short context information, drawn
from local syntactic constraints which the
grammar imposes. While it falls short of providing
the strong generative capacity of the grammar,
such an approximation is useful for removing
most word tagging ambiguities, identifying many
cases of ill-formed input, and assisting efficiently
in other natural language processing tasks. Our
basic approach to the acquisition and usage of
local syntactic constraints was presented elsewhere;
in this paper we present some formal and
empirical results pertaining to properties of the
approximating automata.
1. Introduction 
Parsing is a process by which an input sentence
is not only recognized as belonging to the language,
but is also assigned a structure. As
[Berwick/Weinberg 84] comment, recognition
per se (i.e. a weak generative capacity analysis) is
not of much value for a theory of language
understanding, but it can be useful "as a diagnostic".
We claim that if an efficient recognition
procedure is available, it can be most valuable as
a pre-parsing reducer of lexical ambiguity (especially,
as [Milne 86] points out, for deterministic
parsers), and even more useful in applications
where full parsing is not absolutely required -
e.g. identification of ill-formed inputs in a text
critique program. Still weaker than recognition
procedures are methods which approximate the
recognition capacity. These are the methods
that we discuss in this paper.
More specifically, we analyze the recognition 
capacity of automata based on local (short 
context) considerations. In [Herz/Rimon 91] we
presented our approach to the acquisition and
usage of local syntactic constraints, focusing on 
its use for reduction of word-level ambiguity. 
After briefly reviewing this method in section 2 
below, we examine in more detail various char- 
acteristics of the approximating automata, and 
suggest several applications. 
2. Background: Local Syntactic 
Constraints 
Let S = W_1, ..., W_N be a sentence of length N,
{W_i} being the words composing the sentence.
And let t_1, ..., t_M be a tag image corresponding to
the sentence S, {t_i} belonging to the tag set T -
the set of word-class tags used as terminal
symbols in a given grammar G. Typically,
M = N, but in a more general environment we
allow M > N. This is useful when dealing with
languages where morphology allows cliticization,
concatenation of conjunctions, prepositions, or
determiners to a verb or a noun, etc.; in grammars
for Hebrew, for example, it is convenient
¹ M. Rimon's main affiliation is the IBM Scientific Center, Haifa, Israel. E-mail: rimon@haifasc3.iinusl.ibm.com
² J. Herz was partly supported by the Leibniz Center for Research in Computer Science, the Hebrew University,
and by the Rau foundation of the Open University.
to assume that a preliminary morphological 
phase separated word-forms to basic sequences 
of tags, and then state syntactic rules in terms of 
standard word classes. 
In any case, it is reasonable to assume that the
tag image t_1, ..., t_M cannot be uniquely assigned.
Even with a coarse tag set (e.g. parts of speech
with no features) many words have more than
one interpretation, thus giving rise to exponentially
many tag images for a sentence. ³
Following [Karlsson 90], we use the term cohort
to refer to the set of lexically valid readings of a
given word. We use the term path to refer to a
sequence of M tags (M ≥ N) which is a tag
image corresponding to the words W_1, ..., W_N of
a given sentence S. This is motivated by a view
of lexical ambiguity as a graph problem: we try
to reduce the number of tentative paths in
ambiguous cases by removing arcs from the Sentence
Graph (SG) - a directed graph with vertices
for all tags in all cohorts of the words in
the given sentence, and arcs connecting each tag
to all tags in the cohort which follows it.
The removal of arcs and the testing of paths for
validity as complete sentence interpretations are
done using local constraints. A local constraint
of length k on a given tag t is a rule allowing or
disallowing a sequence of k tags from being in
its right (or left) neighborhood in any tag image
of a sentence. In our approach, the local constraints
are extracted from the grammar (and this
is the major aspect distinguishing it from some
other short context methods such as [Beale 88],
[DeRose 88], [Karlsson 90], [Katz 85],
[Marcus 80], [Marshall 83], and [Milne 86]).
For technical convenience we add the symbol
"$<" at the beginning of tag images and ">$" at
the end. Given a grammar G (which for the time
being we assume to be an unrestricted context-free
phrase structure grammar), with a set T of
terminal symbols (tag set), a set V of variables
(non-terminals, among which S is the root variable
for derivations), and a set P of production
rules of the form A --> α, where A is in V and α
is in (V ∪ T)*, we define the Right Short
Context of length k of a terminal t (tag):
SCr(t,k), for t in T and for k = 0, 1, 2, 3, ..., is the set

    SCr(t,k) = { tz | z ∈ T*, |z| = k (or |z| < k if
                 ">$" is the last tag in z),
                 and there exists a derivation
                 S =>* α t z β, with α, β ∈ (V ∪ T)* }
The Left Short Context of length k of a tag t relative
to the grammar G is denoted by SCl(t,k)
and defined in a similar way.
It is sometimes useful to define Positional Short 
Contexts. The definition is similar to the above, 
with a restriction that t may start only in a given 
position in a tag image of a sentence. 
The basis for the automaton which checks a tag
stream (path) for validity as a tag image relative
to the local constraints is the function next(t),
which for any t in T defines a set, as follows:

    next(t) = { z | tz ∈ SCr(t,1) }
In [Herz/Rimon 91] we gave a procedure for
computing next(t) from a given context free
grammar, using standard practices of parsing of
formal languages (see [Aho/Ullman 72]).
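As a rough illustration of what such a computation involves, the following Python sketch derives next(t) for the toy grammar that section 4 introduces. It is not the procedure of [Herz/Rimon 91]; it is a simple first/last fixed-point computation that assumes a grammar with no empty productions and no useless symbols. The names GRAMMAR, edge_sets and compute_next are hypothetical, and the optional (det) and (adj) of the NP rule are expanded by hand.

```python
from itertools import product

# Toy grammar of section 4, with the optional (det) (adj) expanded by hand.
# "$<" and ">$" are treated as ordinary terminals (an assumption of this sketch).
GRAMMAR = {
    "S":  [["$<", "NP", "VP", ">$"]],
    "NP": [["n"], ["det", "n"], ["adj", "n"], ["det", "adj", "n"], ["NP", "PP"]],
    "PP": [["prep", "NP"]],
    "VP": [["v", "NP"], ["VP", "PP"]],
}

def edge_sets(grammar, pick):
    """Terminals that can start (pick=0) or end (pick=-1) each non-terminal.

    Computed as a fixed point; valid only when no rule has an empty right side.
    """
    sets = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                sym = rhs[pick]
                new = sets[sym] if sym in grammar else {sym}
                if not new <= sets[nt]:
                    sets[nt] |= new
                    changed = True
    return sets

def compute_next(grammar):
    """Map each tag t to the set of tags that may immediately follow it."""
    first = edge_sets(grammar, 0)
    last = edge_sets(grammar, -1)
    nxt = {}
    for rules in grammar.values():
        for rhs in rules:
            # Every pair of adjacent symbols X Y in a rule contributes
            # all pairs (last terminal of X, first terminal of Y).
            for x, y in zip(rhs, rhs[1:]):
                lasts = last[x] if x in grammar else {x}
                firsts = first[y] if y in grammar else {y}
                for a, b in product(lasts, firsts):
                    nxt.setdefault(a, set()).add(b)
    return nxt

NEXT = compute_next(GRAMMAR)
for tag in sorted(NEXT):
    print(tag, "->", sorted(NEXT[tag]))
```

For this grammar the sketch yields 15 tag pairs in total, for example next(n) = {prep, v, >$}.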
3. Local Constraints Automata 
We denote by LCA(1) the simple finite state
automaton which uses the pre-processed
{next(t)} sets to check if a given tag stream
(path) satisfies the SCr(t,1) constraints.
In a similar manner it is possible to define
LCA(k), relative to the short context of length k.
We denote by L the language generated by the 
³ Our studies of modern written Hebrew suggest that about 60% of the word-forms in running texts are ambiguous
with respect to a basic tag set, and the average number of possible readings of such word-forms is 2.4. Even
when counting only "natural readings", i.e. interpretations which are likely to occur in typical corpora, this
number is quite large, around 1.8 (it is somewhat larger for the small subset of the most common words).
underlying grammar, and by L(k) the language
accepted by the automaton LCA(k). The following
relations hold for the family of automata
{LCA(i)}:

    L(1) ⊇ L(2) ⊇ ... ⊇ L

This guarantees a security feature: if for some i,
LCA(i) does not recognize (accept) a string of
tags, then this string is sure to be illegal (i.e. not
in L). On the other hand, any LCA(k) may recognize
sentences not in L (or, from a dual point
of view, will reject only part of the illegal tag
images). The important question is how tight
the inclusion relations above are - i.e. how well
LCA(k) approximates the language L. In particular
we are interested in LCA(1).
There is no simple analytic answer to this question.
Contradictory forces play here: the nature
of the language -- e.g. a rigid word order and
constituent order yield stronger constraints; the
grain of the tag set -- better refined tags (different
languages may require different tag sets)
help express refined syntactic claims, hence more
specific constraints, but they also create a greater
level of tagging ambiguity; the size of the
grammar -- a larger grammar offers more information,
but, covering a richer set of structures, it
allows more tag-pairs to co-occur; etc.
It is interesting to note that for Hebrew, short
context methods are most needed because of the
considerable ambiguity at the lexical level, but
their effectiveness suffers from the rather free
word/constituent order.
Finally, a comment about the computational
efficiency of the LCA(k) automaton. The time
complexity of checking a tag string of length n
using LCA(k) is at most O(n × k × log|T|),
while a non-deterministic parser for a context
free grammar may require O(n^3 × |G|^2) (|T| is
the size of the tag set, |G| is the size of the
grammar). The space complexity of LCA(k) is
proportional to |T|^(k+1); this is why only truly
short contexts should be used.
Note that for a sentence of length k, the power
of LCA(k) is identical to the weak generative
capacity of the full underlying grammar. But
since the size of sentences (tag sequences) in L is
unbounded, there is no fixed k which suffices.
4. A Sample Grammar 
To illustrate claims made in the sections below, 
we will use the following toy grammar of a small 
fragment of English. Statements about the cor- 
rectness of sentences etc., are of course relative 
to this toy grammar. 
The tag set T includes: n (noun), v (verb), det
(determiner), adj (adjective) and prep (preposition).
The context free grammar G is:
    S  --> $< NP VP >$
    NP --> (det) (adj) n
    NP --> NP PP
    PP --> prep NP
    VP --> v NP
    VP --> VP PP
To extract the local constraints from this 
grammar, we first compute the function next(t) 
for every tag t in T, and from the resulting sets 
we obtain the graph below, showing valid pairs 
in the short context of length 1 (again, validity is 
relative to the given toy grammar): 
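[Figure: directed graph over the tags $<, det, adj, n, v, prep and >$, with an arc for each valid pair]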
This graph, or more conveniently the table of
"valid neighbors" below, defines the LCA(1)
automaton. The table is actually the union of
the SCr(t,1) sets for all t in T, and it is derived
directly from the graph:
    $< det       adj n        prep adj
    $< adj       v det        prep n
    $< n         v adj        n prep
    det adj      v n          n v
    det n        prep det     n >$
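The table of valid neighbors translates almost directly into a checker. The following Python sketch is one possible rendering of LCA(1), given here only as an illustration; the names VALID_PAIRS and lca1_accepts are hypothetical, and the pair set is simply the table above written out as data.

```python
# The 15 valid tag pairs of the table, i.e. the union of the SCr(t,1) sets.
VALID_PAIRS = {
    ("$<", "det"), ("$<", "adj"), ("$<", "n"),
    ("det", "adj"), ("det", "n"),
    ("adj", "n"),
    ("v", "det"), ("v", "adj"), ("v", "n"),
    ("prep", "det"), ("prep", "adj"), ("prep", "n"),
    ("n", "prep"), ("n", "v"), ("n", ">$"),
}

def lca1_accepts(tags):
    """Return True if every adjacent pair of the delimited tag image is valid."""
    image = ["$<", *tags, ">$"]
    return all(pair in VALID_PAIRS for pair in zip(image, image[1:]))

print(lca1_accepts(["det", "adj", "n", "v", "det", "n"]))  # True
print(lca1_accepts(["det", "n", "n", "v", "det", "n"]))    # False: (n, n) is not in the table
```

A set lookup per adjacent pair keeps the check linear in the length of the tag image, in the spirit of the complexity remarks of section 3.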
5. A "Lucky Bag" Experiment 
Consider the following sentence, which is in the
language generated by grammar G of section 4:
(1) The charming princess kissed a frog.
The unique tag image corresponding to this sentence
is: [ $<, det, adj, n, v, det, n, >$ ].
Now let us look at the 720 "random inputs" generated
by permutations of the six words in (1),
and the set of corresponding tag images.
Applying LCA(1), only two tag images are
recognized as valid: [ $<, det, adj, n, v, det, n,
>$ ], and [ $<, det, n, v, det, adj, n, >$ ].
These are exactly the images corresponding to
the eight syntactically correct sentences (relative
to G):
(1a-b) The/a charming princess kissed a/the frog.
(1c-d) The/a charming frog kissed a/the princess.
(1e-f) The/a princess kissed a/the charming frog.
(1g-h) The/a frog kissed a/the charming princess.
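The experiment is small enough to replay by brute force. The Python sketch below enumerates all permutations of the six words and filters them with an LCA(1) check; it is an illustration only, and the names LEXICON, VALID_PAIRS and lca1_accepts are hypothetical (the pair set is the table of section 4, the lexicon is the unambiguous tagging of sentence (1)).

```python
from itertools import permutations

# Valid tag pairs from the table of section 4.
VALID_PAIRS = {
    ("$<", "det"), ("$<", "adj"), ("$<", "n"),
    ("det", "adj"), ("det", "n"),
    ("adj", "n"),
    ("v", "det"), ("v", "adj"), ("v", "n"),
    ("prep", "det"), ("prep", "adj"), ("prep", "n"),
    ("n", "prep"), ("n", "v"), ("n", ">$"),
}

# One tag per word, as in sentence (1).
LEXICON = {"the": "det", "charming": "adj", "princess": "n",
           "kissed": "v", "a": "det", "frog": "n"}

def lca1_accepts(tags):
    """Return True if every adjacent pair of the delimited tag image is valid."""
    image = ["$<", *tags, ">$"]
    return all(pair in VALID_PAIRS for pair in zip(image, image[1:]))

words = list(LEXICON)
all_orders = list(permutations(words))
accepted = [p for p in all_orders if lca1_accepts([LEXICON[w] for w in p])]
images = {tuple(LEXICON[w] for w in p) for p in accepted}

print(len(all_orders), len(accepted), len(images))  # 720 8 2
for p in accepted:
    print(" ".join(p))
```

The eight accepted word orders are the sentences (1a-h), and they collapse to the two tag images quoted in the text.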
This result is not surprising, given the simple
sentence and toy grammar. (In general, a
grammar with a small number of rules relative to
the size of the tag set cannot produce too many
valid short contexts). It is therefore interesting
to examine another example, where each word is
associated with a cohort of several interpretations.
We borrow from [Herz/Rimon 91]:
(2) All old people like books about fish. 
Assuming the word tagging shown in section 6,
there are 256 (2 x 2 x 2 x 4 x 2 x 2 x 2) tentative
tag images (paths) for this sentence and for each
of its 5040 permutations. This generates a very
large number of rather random tag images.
Applying LCA(1), only a small number of
images are recognized as potentially valid.
Among them are syntactically correct sentences
such as:
(2a) Fish like old books about all people.
and also sentences which are locally valid but
globally incorrect (less than 0.1%), such as:
(2b) * Old fish all about books like people.
(tagged as [$<, n, v, n, prep, n, v, n, >$]).
These two examples do not suggest any kind of
proof, but they illustrate well the recognition
power of even the least powerful automaton in
the {LCA(i)} family. To get another point of
view, one may consider the simple formal language
L consisting of the strings {a^m b^m} for
1 ≤ m, which can be generated by a context-free
grammar G over T = {a, b}. LCA(1) based on
G will recognize all strings of the form {a^j b^k} for
1 ≤ j,k, but none of the very many other strings
over T. It can be shown that, given arbitrary
strings of length n over T, the probability that
LCA(1) will not reject strings not belonging to L
is proportional to n/2^n, a term which tends
rapidly to 0. This is the over-recognition margin.
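The margin can be checked by exhaustive counting for small n. The Python sketch below assumes that the grammar is S --> $< A >$, A --> a A b | a b (the text does not spell it out), which gives the five valid pairs listed in PAIRS; all names in the sketch are hypothetical.

```python
from itertools import product

# Valid pairs for the assumed grammar S --> $< A >$, A --> a A b | a b.
PAIRS = {("$<", "a"), ("a", "a"), ("a", "b"), ("b", "b"), ("b", ">$")}

def lca1_accepts(s):
    """LCA(1) check: every adjacent pair of the delimited string must be valid."""
    image = ["$<", *s, ">$"]
    return all(p in PAIRS for p in zip(image, image[1:]))

def in_language(s):
    """Membership in L = {a^m b^m}, m >= 1."""
    m = len(s) // 2
    return len(s) % 2 == 0 and m >= 1 and s == "a" * m + "b" * m

def over_recognized(n):
    """Number of strings of length n accepted by LCA(1) but not in L."""
    strings = ("".join(t) for t in product("ab", repeat=n))
    return sum(1 for s in strings if lca1_accepts(s) and not in_language(s))

for n in (4, 8, 12):
    count = over_recognized(n)
    print(n, count, count / 2 ** n)
```

For even n the count is n - 2 (the n - 1 strings of the form a^j b^k, minus the one string in L), so the fraction over all 2^n strings behaves like n/2^n.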
6. Use of LCA in Conjunction with a
Parser 
The number of potentially valid tag images 
(paths) for a given sentence can be exponential 
in the length of the sentence if all words are 
ambiguous. It is therefore desirable to filter out 
invalid tag images before (or during) parsing. 
To examine the power of LCAs as a pre-parsing
filter, we use example (2) again, demonstrating
lexical ambiguities as shown in the chart below.
The chart shows the Reduced Sentence Graph
(RSG) - the original SG from which invalid arcs
(relative to the SCr(t,1) table) were removed.
[Chart: Reduced Sentence Graph for ALL OLD PEOPLE LIKE BOOKS ABOUT FISH, including the path det --> adj --> n --> v --> n --> prep --> n]
We are left with four valid paths through the 
sentence, out of the 256 tentative paths in SG. 
Two paths represent legal syntactic interpreta- 
tions (of which one is "the intended" meaning). 
The other two are locally valid but globally 
incorrect, having either two verbs or no verb at 
all, in contrast to the grammar. SCr(t,2) would
have rejected one of the two wrong paths.
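The path reduction can be replayed with a short Python sketch. The cohorts below are an assumption: they were chosen only to match the cohort sizes 2 x 2 x 2 x 4 x 2 x 2 x 2 quoted in section 5 and are not taken from the chart. The sketch also enumerates all tentative paths by brute force, for illustration; a practical filter would remove arcs from the sentence graph instead. All names are hypothetical.

```python
from itertools import product
from math import prod

# Valid tag pairs from the table of section 4.
VALID_PAIRS = {
    ("$<", "det"), ("$<", "adj"), ("$<", "n"),
    ("det", "adj"), ("det", "n"),
    ("adj", "n"),
    ("v", "det"), ("v", "adj"), ("v", "n"),
    ("prep", "det"), ("prep", "adj"), ("prep", "n"),
    ("n", "prep"), ("n", "v"), ("n", ">$"),
}

# Hypothetical cohorts for sentence (2); only their sizes follow the text.
COHORTS = [
    ("all", ["det", "n"]),
    ("old", ["adj", "n"]),
    ("people", ["n", "v"]),
    ("like", ["v", "prep", "n", "adj"]),
    ("books", ["n", "v"]),
    ("about", ["prep", "adj"]),
    ("fish", ["n", "v"]),
]

def lca1_accepts(tags):
    """Return True if every adjacent pair of the delimited tag image is valid."""
    image = ["$<", *tags, ">$"]
    return all(pair in VALID_PAIRS for pair in zip(image, image[1:]))

cohort_tags = [tags for _, tags in COHORTS]
tentative = prod(len(tags) for tags in cohort_tags)
valid_paths = [p for p in product(*cohort_tags) if lca1_accepts(p)]

# Tags that still occur on some valid path, word by word.
surviving = [sorted({p[i] for p in valid_paths}) for i in range(len(COHORTS))]

print(tentative, len(valid_paths))
for path in valid_paths:
    print(path)
for (word, _), tags in zip(COHORTS, surviving):
    print(word, tags)
```

Under these assumed cohorts, four of the 256 tentative paths survive, one with two verbs and one with no verb, while most words keep all of their readings; this matches the distinction drawn in the text between path reduction and word-level disambiguation.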
Note that in this particular example the method
was quite effective in reducing sentence-wide
interpretations (leaving an easy job even for a
deterministic parser), but it was not very good in
individual word tagging disambiguation. These
two sub-goals of tagging disambiguation -
reducing the number of paths and reducing
word-level possibilities - are not identical. It is
possible to construct sentences in which all
words are two-way ambiguous and only two disjoint
paths out of the 2^N possible paths are legal,
thus preserving all word-level ambiguity.
We demonstrated the potential of efficient path 
reduction for a pre-parsing filter. But short-con- 
text techniques can also be integrated into the 
parsing process itself. In this mode, when the 
parser hypothesizes the existence of a constit- 
uent, it will first check if local constraints do not 
rule out that hypothesis. In the example above, 
a more sophisticated method could have used 
the fact that our grammar does not allow verbs 
in constituents other than VP, or that it requires 
one and only one verb in the whole sentence. 
The motivation for this method, and its principles
of operation, are similar to those behind different
techniques combining top-down and
bottom-up considerations. The performance
gains depend on the parsing technique; in
general, allowing early decisions regarding inconsistent
tag assignments, based on information
which may be only implicit in the grammar,
offers considerable savings.
7. Educated Guess of Unknown Words 
Another interesting aid which local syntactic
constraints can provide for practical parsers is
"an oracle" which makes "educated guesses"
about unknown words. It is typical for language
analysis systems to assume a noun whenever an
unknown word is encountered. There is sense in
this strategy, but the use of LCA, even LCA(1),
can do much better.
To illustrate this feature, we go back to the princess
and the frog. Suppose that an adjective
unknown to the system, say "Transylvanian", was
used rather than "charming" in example (1),
yielding the input sentence:
(3) The Transylvanian princess kissed a frog. 
Checking out all tags in T in the second position 
of the tag image of this sentence, the only tag 
that satisfies the constraints of LCA(1) is adj. 
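A minimal Python sketch of such an oracle follows: it keeps every tag that fits between the tags of the two neighboring words. The function name guess_tags and the other names are hypothetical, the pair set is the table of section 4, and the sketch assumes that both neighbors are unambiguous.

```python
# Valid tag pairs from the table of section 4.
VALID_PAIRS = {
    ("$<", "det"), ("$<", "adj"), ("$<", "n"),
    ("det", "adj"), ("det", "n"),
    ("adj", "n"),
    ("v", "det"), ("v", "adj"), ("v", "n"),
    ("prep", "det"), ("prep", "adj"), ("prep", "n"),
    ("n", "prep"), ("n", "v"), ("n", ">$"),
}

TAGS = ["n", "v", "det", "adj", "prep"]

def guess_tags(left, right):
    """Tags allowed between a known left neighbor and a known right neighbor."""
    return [t for t in TAGS
            if (left, t) in VALID_PAIRS and (t, right) in VALID_PAIRS]

# "The ??? princess ...": the unknown word sits between det and n.
print(guess_tags("det", "n"))  # ['adj']
```

With ambiguous neighbors the same test would be applied to each combination of neighboring readings, and the union of the results would form the cohort of the unknown word.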
8. "Context Sensitive" Spelling 
Verification 
A related application of local syntactic con- 
straints is spelling verification beyond the basic 
word level (which is, in fact, SCr(t,0) ). 
Suppose that while typing sentence (1), a user
made a typing error and instead of the adjective
"charming" wrote "charm" (or "arming", or any
other legal word which is interpreted as a noun):
(4) The charm princess kissed a frog. 
This is the kind of error ⁴ that a full parser
would recognize but a word-based spell-checker
would not. But in many such cases there is no
need for the full power (and complexity) of a
parser; even LCA(1) can detect the error. In
general, an LCA which is based on a detailed
grammar offers a cheap and effective means for
invalidation of a large set of ill-formed inputs.
Here too, one may want to get another point of
view by considering the simple formal language
L = {a^m b^m}. A single typo results in a string
with one "a" changed for a "b", or vice versa.
Since LCA(1) recognizes strings of the form
{a^j b^k} for 1 ≤ j,k, given arbitrary strings of length
n over T = {a, b}, LCA(1) will detect all but
two of the n single typos possible - those on the
borderline between the a's and b's.
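This claim, too, is easy to check mechanically. The Python sketch below flips each symbol of a^m b^m in turn and reports the typos that LCA(1) lets through. As before, the pair set assumes the grammar S --> $< A >$, A --> a A b | a b, and all names are hypothetical.

```python
# Valid pairs for the assumed grammar S --> $< A >$, A --> a A b | a b.
PAIRS = {("$<", "a"), ("a", "a"), ("a", "b"), ("b", "b"), ("b", ">$")}

def lca1_accepts(s):
    """LCA(1) check: every adjacent pair of the delimited string must be valid."""
    image = ["$<", *s, ">$"]
    return all(p in PAIRS for p in zip(image, image[1:]))

def single_typos(s):
    """All strings obtained from s by replacing one 'a' with 'b' or vice versa."""
    flip = {"a": "b", "b": "a"}
    return [s[:i] + flip[c] + s[i + 1:] for i, c in enumerate(s)]

m = 5
original = "a" * m + "b" * m
typos = single_typos(original)
undetected = [t for t in typos if lca1_accepts(t)]

print(len(typos), undetected)  # 10 ['aaaabbbbbb', 'aaaaaabbbb']
```

The two undetected strings are exactly the typos at the border between the a's and the b's; for m = 1 both typos are detected, so the count of two holds for m ≥ 2.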
⁴ Remember that everything is relative to the toy grammar used throughout this paper. Hence, although "the
charm princess" may be a perfect noun phrase, it is illegal relative to our grammar.
9. Assistance to Tagging Systems 
Tagged corpora are important resources for
many applications. Since manual tagging is a
slow and expensive process, it is a common
approach to try automatic heuristics and resort
to user interaction only when there is no decisive
information. A well-built tagging system can
"learn" and improve its performance as more
text is processed (e.g. by using the already tagged
corpus as a statistical knowledge base).
Arguments such as those given in sections 7 and
8 above suggest that the use of local constraints
can resolve many tagging ambiguities, thus
increasing the "speed of convergence" of an automatic
tagging system. This seems to be true even
for the rather simple and inexpensive LCA(1) for
languages with a relatively rigid word order. For
related work cf. [Greene/Rubin 71], [Church
88], [DeRose 88], and [Marshall 83].
10. Final Remarks 
To make our presentation simpler, we have
limited the discussion to straightforward context
free grammars. But the method is more general.
It can, for example, be extended to CFGs augmented
with conditional equations on features
(such as agreement) - either by translating such
grammars to equivalent CFGs with a more
detailed tag set (assuming a finite range of
feature values), or by augmenting our automata
with conditions on arcs. It can also be extended
for a probabilistic language model, generating
probabilistic constraints on tag sequences from a
probabilistic CFG (such as that of [Fujisaki et al.
89]).
Perhaps more interestingly, the method can be
used even without an underlying grammar, if a
large corpus and a lexical analyzer (which suggests
pre-disambiguated cohorts) are available.
This variant is based on a technique of invalidation
of tag pairs (or longer sequences) which
satisfy certain conditions over the whole language
L, and the fact that L can be approximated
by a large corpus. We cannot elaborate
on this extension here.
References 
[Aho/Ullman 72] Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation and
Compiling. Prentice-Hall, 1972-3.
[Beale 88] Andrew David Beale. Lexicon and Grammar in Probabilistic Tagging of Written
English. Proc. of the 26th Annual Meeting of the ACL, Buffalo NY, 1988.
[Berwick/Weinberg 84] Robert C. Berwick and Amy S. Weinberg. The Grammatical Basis of
Linguistic Performance. The MIT Press, 1984.
[Church 88] Kenneth W. Church. A Stochastic Parts Program and Noun Phrase Parser
for Running Text. Proc. of the 2nd ACL conf. on Applied Natural Language Processing, 1988.
[DeRose 88] Steven J. DeRose. Grammatical Category Disambiguation by Statistical
Optimization. Computational Linguistics, vol. 14, no. 1, 1988.
[Fujisaki et al. 89] T. Fujisaki, F. Jelinek, J. Cocke, E. Black, T. Nishimo. A Probabilistic
Parsing Method for Sentence Disambiguation. Proc. of the 1st International Parsing Workshop,
Pittsburgh, June 1989.
[Greene/Rubin 71] Barbara Greene and Gerald Rubin. Automated Grammatical Tagging of
English. Technical Report, Brown University.
[Herz/Rimon 91] Jacky Herz and Mori Rimon. Local Syntactic Constraints. Proc. of the 2nd
International Workshop on Parsing Technologies, Cancun, February 1991.
[Karlsson 90] Fred Karlsson. Constraint Grammar as a Framework for Parsing Running
Text. The 13th COLING Conference, Helsinki, 1990.
[Katz 85] Slava Katz. Recursive M-gram Language Model Via Smoothing of Turing Formula.
IBM Technical Disclosure Bulletin, 1985.
[Marcus 80] Mitchell P. Marcus. A Theory of Syntactic Recognition for Natural Language.
The MIT Press, 1980.
[Marshall 83] Ian Marshall. Choice of Grammatical Word-Class Without Global Syntactic
Analysis: Tagging Words in the LOB Corpus. Computers in the Humanities, vol. 17, pp.
139-150, 1983.
[Milne 86] Robert Milne. Resolving Lexical Ambiguity in a Deterministic Parser.
Computational Linguistics, vol. 12, no. 1, pp. 1-12, 1986.
